Commit Graph

103 Commits

Author SHA1 Message Date
Alexander Weaver
99fa064576
Alerting: Emit warning when creating or updating unusually large groups (#82279)
* Add config for limit of rules per rule group

* Warn when editing big groups through normal API

* Warn on prov api writes for groups

* Wire up comp root, tests

* Also add warning to state manager warm

* Drop unnecessary conversion
2024-02-13 08:29:03 -06:00
Jean-Philippe Quéméner
aa25776f81
Alerting: Add a feature flag to periodically save states (#80987) 2024-01-23 17:03:30 +01:00
Jean-Philippe Quéméner
82638d059f
feat(alerting): add state persister interface (#80384) 2024-01-17 13:33:13 +01:00
Matthew Jacobson
1d4419fbe4
Alerting: Fix NoData & Error alerts not resolving when rule is reset (#80184)
* Alerting: Fix NoData & Error alerts not resolving when rule is reset

On rule reset, when creating the PostableAlerts StateToPostableAlert did not
attach the correct NoData/Error alertname and rulename labels to expire/resolve
the active alerts when the previous cached state was NoData/Error.
2024-01-09 14:47:19 -05:00
Yuri Tseretyan
f6a46744a6
Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
Marcus Efraimsson
e4c1a7a141
Tracing: Standardize on otel tracing (#75528) 2023-10-03 14:54:20 +02:00
gotjosh
e877174501
Alerting: Expose metrics for Alertmanager Alerts - grafana_alerting_alertmanager_alerts (#75802)
* Alerting: Expose metrics for Alertmanager Alerts

In Grafana, the alert evaluation and alert delivery are combined. We're always used a metric named `grafana_alerting_alerts` to get a sense of what are the alerts that are currently firing (these come from the evaluation side) and opted to not map the alertmanager alerts metric directly.

I think it's important that we make a disction between alerts that happen at evaluation vs alerts that are received for delivery by the internal Alertmanager as we have options to skip the delivery of these alerts to the internal alertmanager altogether.
2023-10-02 16:36:23 +01:00
gotjosh
59694fb2be
Alerting: Don't use a separate collection system for metrics (#75296)
* Alerting: Don't use a separate collection system for metrics

The state package had a metric collection system that ran every 15s updating the values of the metrics - there is a common pattern for this in the Prometheus ecosystem called "collectors".

I have removed the behaviour of using a time-based interval to "set" the metrics in favour of a set of functions as the "value" that get called at scrape time.
2023-09-25 10:27:30 +01:00
Steve Simpson
894f420014
Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158) 2023-09-20 15:07:02 +02:00
Yuri Tseretyan
938e26b59f
Alerting: Add new metrics and tracings to state manager and scheduler (#71398)
* add metrics and tracing to state manager

* propagate tracer to state manager

* add scheduler metrics

* fix backtesting

* add test for state metrics

* remove StateUpdateCount

* update docs

* metrics can be null

* add tracer to new tests
2023-08-16 09:04:18 +02:00
Yuri Tseretyan
0717ec11d6
Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (#68142) 2023-08-15 10:27:15 -04:00
Yuri Tseretyan
78fc3bcdf4
Alerting: Fix state manager to not keep datasource_uid and ref_id labels in state after Error (#72216) 2023-07-26 11:41:46 -04:00
George Robinson
594c851d4b
Alerting: Add duration to saving alert states done (#70844) 2023-06-28 15:19:21 +01:00
George Robinson
7edbe72483
Alerting: Support concurrent queries for saving alert instances (#70525)
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
2023-06-23 11:36:07 +01:00
George Robinson
8a13ee3cd4
Alerting: Add debug logs when saving instances is finished (#70447) 2023-06-21 14:19:04 +02:00
Santiago
b0881daf23
Alerting: Use URLs in image annotations (#66804)
* use tokens or urls in image annotations

* improve tests, fix some comments

* fix empty tokens

* code review changes, check for url before checking for token (support old token formats)
2023-04-26 13:06:18 -03:00
Matthew Jacobson
63187fae0c
Alerting: Remove and revert flag alertingBigTransactions (#65976)
* Alerting: Remove and revert flag alertingBigTransactions

This is a partial revert of #56575 and a removal of the `alertingBigTransactions` flag.

Real-word use has seen no clear performance incentive to maintain this flag. Lowered db connection count
came at the cost of significant increase in CPU usage and query latency.

* Fix lint backend

* Removed last bits of alertingBigTransactions

---------

Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
2023-04-06 18:06:25 +02:00
Yuri Tseretyan
622c23716a
Alerting: Use logger with context in the state cache (#65663) 2023-03-31 10:11:30 -04:00
Serge Zaitsev
0beb768427
Chore: Remove result fields from ngalert (#65410)
* remove result fields from ngalert

* remove duplicate imports
2023-03-28 10:34:35 +02:00
Alexander Weaver
a31672fa40
Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774)
* Rename RecordStatesAsync to Record

* Rename QueryStates to Query

* Implement fanout writes

* Implement primary queries

* Simplify error joining

* Add test for query path

* Add tests for writes and error propagation

* Allow fanout backend to be configured

* Touch up log messages and config validation

* Consistent documentation for all backend structs

* Parse and normalize backend names more consistently against an enum

* Touch-ups to documentation

* Improve clarity around multi-record blocking

* Keep primary and secondaries more distinct

* Rename fanout backend to multiple backend

* Simplify config keys for multi backend mode
2023-03-17 12:41:18 -05:00
Yuri Tseretyan
05bf241952
Alerting: Update state manager to return StateTransitions when Delete or Reset (#62264)
* update Delete and Reset methods to return state transitions

this will be used by notifier code to decide whether alert needs to be sent or not.

* update scheduler to provide reason to delete states and use transitions

* update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state

* fixup

* fix tests
2023-01-27 09:46:21 +01:00
Alex Moreno
531b439cf1
Alerting: Add alert pausing feature (#60734)
* Add field in alert_rule model, add state to alert_instance model, and state to eval

* Remove paused state from eval package

* Skip paused alert rules in scheduler

* Add migration to add is_paused field to alert_rule table

* Convert to postable alerts only if not normal, pernding, or paused

* Handle paused eval results in state manager

* Add Paused state to eval package

* Add paused alerts logic in scheduler

* Skip alert on scheduler

* Remove paused status from eval package

* Apply suggestions from code review

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Remove state

* Rethink schedule and manager for paused alerts

* Change return to continue

* Remove unused var

* Rethink alert pausing

* Paused alerts storing annotations

* Only add one state transition

* Revert boolean method renaming refactor

* Revert take image refactor

* Make registry errors public

* Revert method extraction for getting a folder title

* Revert variable renaming refactor

* Undo unnecessary changes

* Revert changes in test

* Remove IsPause check in PatchPartiLAlertRule function

* Use SetNormal to set state

* Fix text by returning to old behaviour on alert rule deletion

* Add test in schedule_unit_test.go to test ticks with paused alerts

* Add coment to clarify usage of context.Background()

* Add comment to clarify resetStateByRuleUID method usage

* Move rule get to a more limited scope

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* rum gofmt on pkg/services/ngalert/schedule/schedule.go

* Remove defer cancel for context

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/testing.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* skip scheduler rule state clean up on paused alert rule

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Fix mock in test

* Add (hopefully) final suggestions

* Use error channel from recordAnnotationsSync to cancel context

* Run make gen-cue

* Place pause alert check in channel update after version check

* Reduce branching un update channel select

* Add if for error and move code inside if in state manager ResetStateByRuleUID

* Add reason to logs

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Do not delete alert rule routine, just exit on eval if is paused

* Reduce branching and create-close a channel to avoid deadlocks

* Separate state deletion and state reset (includes history saving)

* Add current pause state in rule route in scheduler

* Split clearState and bring errCh closer to RecordStatesAsync call

* Change rule to ruleMeta in RecordStatesAsync

* copy state to be able to modify it

* Add timeout to context creation

* Shorten the timeout

* Use resetState is rule is paused and deleteState if rule is not paused

* Remove Empty state reason

* Save every rule change in historian

* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID

* Remove useless line

* Remove outdated comment

Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
2023-01-26 18:29:10 +01:00
Alexander Weaver
046a9bb7c1
Alerting: Copy rule definitions into state history (#62032)
* Copy rules instead of accepting pointer

* Deep-copy the rule, for even more guarantees

* Create struct just for needed fields

* Move RuleMeta to historian/model package, iron out package dependencies

* Move tests for dash ID parsing to model package along with code
2023-01-25 11:29:57 -06:00
Denis Limarev
e6dee8a723
Perfomance: Preallocate slices (#61580) 2023-01-17 11:50:17 +00:00
Yuri Tseretyan
9d57b1c72e
Alerting: Do not persist noop transition from Normal state. (#61201)
* add feature flag `alertingNoNormalState`
* update instance database to support exclusion of state in list operation
* do not save normal state and delete transitions to normal
* update get methods to filter out normal state
2023-01-13 18:29:29 -05:00
Yuri Tseretyan
86b5fbbf60
Alerting: Introduce state manager config structure (#61249) 2023-01-10 16:26:15 -05:00
Yuri Tseretyan
4d989860fb
Alerting: Fix conversion of alert state from db state during manager warmup (#60933) 2023-01-04 09:40:04 -05:00
George Robinson
6359dab040
Alerting: Change resultError in preparation for supporting ForError duration (#59894) 2022-12-07 10:45:56 +00:00
Yuri Tseretyan
abb49d96b5
Alerting: update state manager to return StateTransition instead of State (#58867)
* improve test for stale states
* update state manager return StateTransition
* update scheduler to accept state transitions
2022-12-06 13:07:39 -05:00
Yuri Tseretyan
a85adeed96
Alerting: Update state history service to filter states transitions (#58863)
* rename the method to better reflect its behavior
* make historian filter transition on itself
* call historian with all changes
2022-12-06 12:33:15 -05:00
Yuri Tseretyan
28d39d35fd
Alerting: Update state manager to save state transitions in one batch (#58358)
* change stale results handler to not update database but return transitions
* save all transitions in one call
2022-11-14 10:57:51 -05:00
George Robinson
c5ae1bcfe0
Alerting: Fix logging pointer address of DashboardUID and PanelID variables (#58539) 2022-11-10 09:58:38 +00:00
Alexander Weaver
2bfdda5b68
Alerting: Break dependency between state and image packages (#58381)
* Refactor state and manager to not depend directly on image interface

* Move generic errors to models package

* Move NotAvailableImageService to state as its only references are in state tests

* Move NoopImageService to state package

* Move mock to state package

* Fix linter error

* Fix comment styling

* Fix a couple added references introduced by rebase

* Empty commit to kick build
2022-11-09 15:06:49 -06:00
George Robinson
1290951b65
Alerting: Small improvements to staleResultsHandler (#58007) 2022-11-09 11:08:32 +00:00
Yuri Tseretyan
3621cf5a12
Alerting: Update handling of stale state (#58276)
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
2022-11-07 11:03:53 -05:00
Yuri Tseretyan
623de12e35
Alerting: Create AlertInstanceKey in one place (#58278)
* use method GetAlertInstanceKey
* do not add key if error
2022-11-07 09:35:29 -05:00
Yuri Tseretyan
f9c88e72ae
Alerting: Update saveAlertStates in state manager to not return results (#58279) 2022-11-07 09:09:19 -05:00
Yuri Tseretyan
978f1119d7
Alerting: Run state manager as regular sub-service (#58246) 2022-11-04 17:06:47 -04:00
Yuri Tseretyan
dce8879145
Alerting: Update state manager to accept rule store as Warm method argument (#58244) 2022-11-04 14:23:08 -04:00
Alexander Weaver
cc8c1380e2
Alerting: Persist annotations from multidimensional rules in batches (#56575)
* Reduce piecemeal state fields

* Read data directly off state instead of rule

* Unify state and context into single struct

* Expose contextual information to layer above setNextState

* Work in terms of ContextualState and call historian in batches

* Call annotations service in batches

* Export format state and reason and remove workaround in unrelated test package

* Add new method to annotation service for batch inserting

* Fix loop variable aliasing bug caught by linter, didn't change behavior

* Incl timerange on annotation tests

* Insert one at a time if tags are present

* Point to rule from ContextualState rather than copy fields

* Build annotations and copy data prior to starting goroutine

* Rename to StateTransition

* Use new bulk-insert utility

* Remove rule from StateTransition and pass in directly to historian

* Simplify annotations logic since we have only one rule

* Fix logs and context, nilcheck, simplify method name

* Regenerate mock
2022-11-04 10:39:26 -05:00
Alex Moreno
ba15d675e7
Alerting: Add values to annotations (#57738)
* Add values to annotations

* Fix imports

* Use State attrs instead of Result attrs

* Remove unnecessary variable
2022-11-03 10:35:34 +01:00
George Robinson
215ffee437
Alerting: Fix screenshot is not taken for stale series (#57982) 2022-11-02 22:14:22 +00:00
Yuriy Tseretyan
3294918e9f
Alerting: Update state manager to support nil stores and metrics (#57791) 2022-10-28 13:10:28 -04:00
Yuriy Tseretyan
0a4121cef8
Alerting: Contextual log provider for rule key (#57476)
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
2022-10-26 19:16:02 -04:00
Alexander Weaver
de46c1b002
Alerting: Improve logs in state manager and historian (#57374)
* Touch up log statements, fix casing, add and normalize contexts

* Dedicated logger for dashboard resolver

* Avoid injecting logger to historian

* More minor log touch-ups

* Dedicated logger for state manager

* Use rule context in annotation creator

* Rename base logger and avoid redundant contextual loggers
2022-10-21 16:16:51 -05:00
Alexander Weaver
3ddb28bad9
Find-and-replace 'err' logs to 'error' to match log search conventions (#57309) 2022-10-19 17:36:54 -04:00
George Robinson
52965de369
Alerting: Add doc comments to state struct and normalize fields (#56647) 2022-10-11 09:30:33 +01:00
Yuriy Tseretyan
7b6437402a
Alerting: Refactor state manager's cache (#56197)
* remove ResetAllStates because it's not used
* refactor cache to accept logs, metrics and url as method args
* update manager Warm method to set the entire state at once
* remove unused reset method
* introduce ruleStates
* change getOrCreate to belong to ruleStates
* update Get to not return error
2022-10-06 15:30:12 -04:00
Joe Blubaugh
b476ae62fb
Alerting: Write and Delete multiple alert instances. (#55350)
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.

These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions"

Before:

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8           398           2991381 ns/op         1133537 B/op      27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.619s
```

After:

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8          1440            816484 ns/op          352297 B/op       6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.383s
```

So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
2022-10-06 14:22:58 +08:00
Alexander Weaver
8df830557a
Alerting: Move annotation functionality behind a history persistence interface (#56133)
* Move annotation functionality behind a history persistence interface

* Rename to RecordState

* Fix lint error in import aliasing

* One more import linter error
2022-10-05 15:32:20 -05:00