* Alerting: Fix simplified routes '...' groupBy creating invalid routes
There were a few ways to go about this fix:
1. Modifying our copy of upstream validation to allow this
2. Modify our notification settings validation to prevent this
3. Normalize group by on save
4. Normalized group by on generate
Option 4. was chosen as the others have a mix of the following cons:
- Generated routes risk being incompatible with upstream/remote AM
- Awkward FE UX when using '...'
- Rule definition changing after save and potential pitfalls with TF
With option 4. generated routes stay compatible with external/remote AMs, FE
doesn't need to change as we allow mixed '...' and custom label groupBys, and
settings we save to db are the same ones requested.
In addition, it has the slight benefit of allowing us to hide the internal
implementation details of `alertname, grafana_folder` from the user in the
future, since we don't need to send them with every FE or TF request.
* Safer use of DefaultNotificationSettingsGroupBy
* Fix missed API tests
* Implement keep last state for state transitions
* Respect For duration when keeping state
* Only keep transition from recording an annotation
* Add keep last state option for nodata/error in UI
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated
* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.
* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings
* update GET API so only admins can see auto-gen
* update GetUserVisibleNamespaces to use FolderSeriver
* update GetNamespaceByUID to use FolderService.GetFolders
* update GetAlertRulesForScheduling to use FolderService.GetFolders
* Update API and GetAlertRulesForScheduling to use the folder's full path
* get full path of folder in RouteTestGrafanaRuleConfig
* fix escaping of titles for MySQL
* Add single receiver method
* Add receiver permissions
* Add single/multi GET endpoints for receivers
* Remove stable tag from time intervals
See end of PR description here: https://github.com/grafana/grafana/pull/81672
* Create locking config store that mimics existing provisioning store
* Rename existing receivers(_test).go
* Introduce shared receiver group service
* Fix test
* Move query model to models package
* ReceiverGroup -> Receiver
* Remove locking config store
* Move convert methods to compat.go
* Cleanup
* Alerting: During legacy migration reduce the number of created silences
During legacy migration every migrated rule was given a label rule_uid=<uid>.
This was used to silence DatasourceError/DatasourceNoData alerts for
migrated rules that had either ExecutionErrorState/NoDataState set to
keep_state, respectively.
This could potentially create a large amount of silences and a high cardinality
label. Both of these scenarios have poor outcomes for CPU load and latency in
unified alerting.
Instead, this change creates one label per ExecutionErrorState/NoDataState when
they are set to keep_state as well as two silence rules, if rules with said
labels were created during migration. These silence rules are:
- __legacy_silence_error_keep_state__ = true
- __legacy_silence_nodata_keep_state__ = true
This will drastically reduce the number of created silence rules in most cases
as well as not create the potentially high cardinality label `rule_uid`.
* Simple, per-base-interval jitter
* Add log just for test purposes
* Add strategy approach, allow choosing between group or rule
* Add flag to jitter rules
* Add second toggle for jittering within a group
* Wire up toggles to strategy
* Slightly improve comment ordering
* Add tests for offset generation
* Rename JitterStrategyFrom
* Improve debug log message
* Use grafana SDK labels rather than prometheus labels
* Change ruler API to expect the folder UID as namespace
* Update example requests
* Fix tests
* Update swagger
* Modify FIle field in /api/prometheus/grafana/api/v1/rules
* Fix ruler export
* Modify folder in responses to be formatted as <parent UID>/<title>
* Add alerting test with nested folders
* Apply suggestion from code review
* Alerting: use folder UID instead of title in rule API (#77166)
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
* Drop a few more latent uses of namespace_id
* move getNamespaceKey to models package
* switch GetAlertRulesForScheduling to use folder table
* update GetAlertRulesForScheduling to return folder titles in format `parent_uid/title`.
* fi tests
* add tests for GetAlertRulesForScheduling when parent uid
* fix integration tests after merge
* fix test after merge
* change format of the namespace to JSON array
this is needed for forward compatibility, when we migrate to full paths
* update EF code to decode nested folder
---------
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
Co-authored-by: Virginia Cepeda <virginia.cepeda@grafana.com>
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
Co-authored-by: Alex Weaver <weaver.alex.d@gmail.com>
Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
Backend:
* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
- add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
- add rule_fingerprint column to alert_instance
- update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
- add AlertingResultsFromRuleState that implements the new interface in eval package
- update getExprRequest to patch the hysteresis command.
* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.
Frontend:
* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions
* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions
---------
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
* In migration, create one label per channel
This PR changes how routing is done by the legacy alerting migration.
Previously, we created a single label on each alert rule that contained an array of contact point names. Ex: __contact__="slack legacy testing","slack legacy testing2"
This label was then routed against a series of regex-matching policies with continue=true. Ex: __contacts__ =~ .*"slack legacy testing".*
In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.
This PR replaces the single __contact__ label with multiple __legacy_c_{contactname}__ labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against __legacy_use_channels__ = true for ease of organization.
This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.
* Alerting: Apply query optimization to eval endpoints
Previously, query optimization was applied to alert queries when scheduled but
not when ran through `api/v1/eval` or `/api/v1/rule/test/grafana`. This could
lead to discrepancies between preview and scheduled alert results.
* update storage's method InstertRules to return ids of added rules as slice to keep the same order as rules in the argument
* schematize response of update rule group endpoint, add created, updated, deleted fields that contain UID of affected rules.
* update integration tests to use the new fields
* extend RuleStore interface to get namespace by UID
* add new export API endpoints
* implement request handlers
* update authorization and wire handlers to paths
* add folder error matchers to errorToResponse
* add tests for export methods
* calculate cacheID instead of literals
* use mocked clocks
* advance clocks with the eval results
* use clearer timestamp aliases
* make expected state labels be more clear to read
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
* Add limit query parameter
* Drop copy paste comment
* Extend history query limit to 30 days and 250 entries
* Fix history log entries ordering
* Update no history message, add empty history test
---------
Co-authored-by: Konrad Lalik <konrad.lalik@grafana.com>
* add NodeTypeFromDatasourceUID and DataSourceModelFromNodeType()
* deprecate expr.DataSourceModel
* replace usages of IsDataSource to NodeTypeFromDatasourceUID
* replace usages of DataSourceModel to DataSourceModelFromNodeType()
This commit adds support for limits and filters to the Prometheus Rules
API.
Limits:
It adds a number of limits to the Grafana flavour of the Prometheus Rules
API:
- `limit` limits the maximum number of Rule Groups returned
- `limit_rules` limits the maximum number of rules per Rule Group
- `limit_alerts` limits the maximum number of alerts per rule
It sorts Rule Groups and rules within Rule Groups such that data in the
response is stable across requests. It also returns summaries (totals)
for all Rule Groups, individual Rule Groups and rules.
Filters:
Alerts can be filtered by state with the `state` query string. An example
of an HTTP request asking for just firing alerts might be
`/api/prometheus/grafana/api/v1/rules?state=alerting`.
A request can filter by two or more states by adding additional `state`
query strings to the URL. For example `?state=alerting&state=normal`.
Like the alert list panel, the `firing`, `pending` and `normal` state are
first compared against the state of each alert rule. All other states are
ignored. If the alert rule matches then its alert instances are filtered
against states once more.
Alerts can also be filtered by labels using the `matcher` query string.
Like `state`, multiple matchers can be provided by adding additional
`matcher` query strings to the URL.
The match expression should be parsed using existing regular expression
and sent to the API as URL-encoded JSON in the format:
{
"name": "test",
"value": "value1",
"isRegex": false,
"isEqual": true
}
The `isRegex` and `isEqual` options work as follows:
| IsEqual | IsRegex | Operator |
| ------- | -------- | -------- |
| true | false | = |
| true | true | =~ |
| false | true | !~ |
| false | false | != |
* Alerting: Tiny refactor on the eval and schedule packages
two very small things:
- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
---------
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
This commit adds a number of limits to the Grafana flavor of the
Prometheus Rules API:
1. `limit` limits the maximum number of Rule Groups returned
2. `limit_rules` limits the maximum number of rules per Rule Group
3. `limit_alerts` limits the maximum number of alerts per rule
It sorts Rule Groups and rules within Rule Groups such that data in the
response is stable across requests. It also returns summaries (totals) for
all Rule Groups, individual Rule Groups and rules.
* WIP
* skip invalid historic configurations instead of erroring
* add warning log when bad historic config is found
* remove unused custom marshaller for GettableHistoricUserConfig
* add id to historic user config, move limit check to store, fix typo
* swagger spec
* move export rules to definitions package
* move provisioning contact point methods to provisioning package
* move AlertRuleGroupWithFolderTitle to ngalert models and adapter functions to api's compat
* move rule_types files back to where they were before.
* Mark AM configuration as applied
* add missing checks, make linter happy
* fix deadlock, mark as valid on save and on load
* mark configurations only if needed
* check error after applyConfig()
* code review comments
* code review changes
* more code review changes
* clean HistoricConfigFromAlertConfig function
* Define endpoint and generate
* Wire up and register endpoint
* Cleanup, define authorization
* Forgot the leading slash
* Wire up query and SignedInUser
* Wire up timerange query params
* Add todo for label queries
* Drop comment
* Update path to rules subtree
* Add is_paused attr to the POST alert rule group endpoint
* Add is_paused to alerting API POST alert rule group
* Fixed tests
* Add is_paused to alerting gettable endpoints
* Fix integration tests
* Alerting: allow to pause existing rules (#62401)
* Display Pause Rule switch in Editing Rule form
* add isPaused property to form interface and dto
* map isPaused prop with is_paused value from DTO
Also update test snapshots
* Append '(Paused)' text on alert list state column when appropriate
* Change Switch styles according to discussion with UX
Also adding a tooltip with info what this means
* Adjust styles
* Fix alignment and isPaused type definition
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* Fix test
* Fix test
* Fix RuleList test
---------
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* wip
* Fix tests and add comments to clarify AlertRuleWithOptionals
* Fix one more test
* Fix tests
* Fix typo in comment
* Fix alert rule(s) cannot be paused via API
* Add integration tests for alerting api pausing flow
* Remove duplicated integration test
---------
Co-authored-by: Virginia Cepeda <virginia.cepeda@grafana.com>
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Use suggested value for uid
* update the snapshot
* use __expr__
* replace all -100 with __expr__
* update snapshot
* more changes
* revert redundant change
* Use expr.DatasourceUID where it's possible
* generate files
* update Delete and Reset methods to return state transitions
this will be used by notifier code to decide whether alert needs to be sent or not.
* update scheduler to provide reason to delete states and use transitions
* update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state
* fixup
* fix tests
* Add field in alert_rule model, add state to alert_instance model, and state to eval
* Remove paused state from eval package
* Skip paused alert rules in scheduler
* Add migration to add is_paused field to alert_rule table
* Convert to postable alerts only if not normal, pernding, or paused
* Handle paused eval results in state manager
* Add Paused state to eval package
* Add paused alerts logic in scheduler
* Skip alert on scheduler
* Remove paused status from eval package
* Apply suggestions from code review
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Remove state
* Rethink schedule and manager for paused alerts
* Change return to continue
* Remove unused var
* Rethink alert pausing
* Paused alerts storing annotations
* Only add one state transition
* Revert boolean method renaming refactor
* Revert take image refactor
* Make registry errors public
* Revert method extraction for getting a folder title
* Revert variable renaming refactor
* Undo unnecessary changes
* Revert changes in test
* Remove IsPause check in PatchPartiLAlertRule function
* Use SetNormal to set state
* Fix text by returning to old behaviour on alert rule deletion
* Add test in schedule_unit_test.go to test ticks with paused alerts
* Add coment to clarify usage of context.Background()
* Add comment to clarify resetStateByRuleUID method usage
* Move rule get to a more limited scope
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* rum gofmt on pkg/services/ngalert/schedule/schedule.go
* Remove defer cancel for context
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/testing.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* skip scheduler rule state clean up on paused alert rule
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Fix mock in test
* Add (hopefully) final suggestions
* Use error channel from recordAnnotationsSync to cancel context
* Run make gen-cue
* Place pause alert check in channel update after version check
* Reduce branching un update channel select
* Add if for error and move code inside if in state manager ResetStateByRuleUID
* Add reason to logs
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Do not delete alert rule routine, just exit on eval if is paused
* Reduce branching and create-close a channel to avoid deadlocks
* Separate state deletion and state reset (includes history saving)
* Add current pause state in rule route in scheduler
* Split clearState and bring errCh closer to RecordStatesAsync call
* Change rule to ruleMeta in RecordStatesAsync
* copy state to be able to modify it
* Add timeout to context creation
* Shorten the timeout
* Use resetState is rule is paused and deleteState if rule is not paused
* Remove Empty state reason
* Save every rule change in historian
* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID
* Remove useless line
* Remove outdated comment
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>