* add metrics and tracing to state manager
* propagate tracer to state manager
* add scheduler metrics
* fix backtesting
* add test for state metrics
* remove StateUpdateCount
* update docs
* metrics can be null
* add tracer to new tests
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
* Alerting: Repurpose rule testing endpoint to return potential alerts
This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations.
The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent.
This means that the endpoint will, among other things:
- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
* use tokens or urls in image annotations
* improve tests, fix some comments
* fix empty tokens
* code review changes, check for url before checking for token (support old token formats)
Takes a specific code path for data that identifies itself as dataplane instead of "guessing" what the data is.
The data must identify itself by being in the dataplane by having both the following frame metadata properties:
- TypeVersion property that is greater than 0.0
- 'Type' property
The flag is disableSSEDataplane and disables this functionality and uses the old code for all queries regardless.
See https://github.com/grafana/grafana-plugin-sdk-go/blob/main/data/contract_docs/contract.md for dataplane details.
* Alerting: Tiny refactor on the eval and schedule packages
two very small things:
- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
---------
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* stop using the scheduler's Update and Delete methods all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
The `rule_groups_rules` metric is currently defined and computed by `State`.
It makes more sense for this metric to be computed off of the configured rule
set, not based on the rule evaluation state. There could be an edge condition
where a rule does not have a state yet, and so is uncounted.
Additionally, we would like this metric (and others), to have a `rule_group`
label, and this is much easier to achieve if the metric is produced from the
`Scheduler` package.
* Use suggested value for uid
* update the snapshot
* use __expr__
* replace all -100 with __expr__
* update snapshot
* more changes
* revert redundant change
* Use expr.DatasourceUID where it's possible
* generate files
* Allow pausing alerts from provisioning
* Update swagger
* Add IsPaused to provision export endpoints
* Add pause field in sample.yml
* Add exception for reset state in first loop iteration of scheduler if rule is paused
* Update provision definition and swagger docs
* Fix provisioning export tests
* Suggestion: Simplify if condition
* Add more context to a comment
* update Delete and Reset methods to return state transitions
this will be used by notifier code to decide whether alert needs to be sent or not.
* update scheduler to provide reason to delete states and use transitions
* update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state
* fixup
* fix tests
* Add field in alert_rule model, add state to alert_instance model, and state to eval
* Remove paused state from eval package
* Skip paused alert rules in scheduler
* Add migration to add is_paused field to alert_rule table
* Convert to postable alerts only if not normal, pernding, or paused
* Handle paused eval results in state manager
* Add Paused state to eval package
* Add paused alerts logic in scheduler
* Skip alert on scheduler
* Remove paused status from eval package
* Apply suggestions from code review
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Remove state
* Rethink schedule and manager for paused alerts
* Change return to continue
* Remove unused var
* Rethink alert pausing
* Paused alerts storing annotations
* Only add one state transition
* Revert boolean method renaming refactor
* Revert take image refactor
* Make registry errors public
* Revert method extraction for getting a folder title
* Revert variable renaming refactor
* Undo unnecessary changes
* Revert changes in test
* Remove IsPause check in PatchPartiLAlertRule function
* Use SetNormal to set state
* Fix text by returning to old behaviour on alert rule deletion
* Add test in schedule_unit_test.go to test ticks with paused alerts
* Add coment to clarify usage of context.Background()
* Add comment to clarify resetStateByRuleUID method usage
* Move rule get to a more limited scope
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* rum gofmt on pkg/services/ngalert/schedule/schedule.go
* Remove defer cancel for context
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/testing.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* skip scheduler rule state clean up on paused alert rule
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Fix mock in test
* Add (hopefully) final suggestions
* Use error channel from recordAnnotationsSync to cancel context
* Run make gen-cue
* Place pause alert check in channel update after version check
* Reduce branching un update channel select
* Add if for error and move code inside if in state manager ResetStateByRuleUID
* Add reason to logs
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Do not delete alert rule routine, just exit on eval if is paused
* Reduce branching and create-close a channel to avoid deadlocks
* Separate state deletion and state reset (includes history saving)
* Add current pause state in rule route in scheduler
* Split clearState and bring errCh closer to RecordStatesAsync call
* Change rule to ruleMeta in RecordStatesAsync
* copy state to be able to modify it
* Add timeout to context creation
* Shorten the timeout
* Use resetState is rule is paused and deleteState if rule is not paused
* Remove Empty state reason
* Save every rule change in historian
* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID
* Remove useless line
* Remove outdated comment
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
* Refactor state and manager to not depend directly on image interface
* Move generic errors to models package
* Move NotAvailableImageService to state as its only references are in state tests
* Move NoopImageService to state package
* Move mock to state package
* Fix linter error
* Fix comment styling
* Fix a couple added references introduced by rebase
* Empty commit to kick build
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
* add: IsServiceAccount to SignedInUser and IsRealUser
* fix: linting error
* refactor: add function IsServiceAccountUser()
By adding the function IsServiceAccountUser() we use it to identify for
ServiceAccounts in the HasUniqueID() since caching is built up on having
a uniqueID, see comment: https://github.com/grafana/grafana/pull/58015#discussion_r1011361880
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
* Touch up log statements, fix casing, add and normalize contexts
* Dedicated logger for dashboard resolver
* Avoid injecting logger to historian
* More minor log touch-ups
* Dedicated logger for state manager
* Use rule context in annotation creator
* Rename base logger and avoid redundant contextual loggers