This commit better defines how we set states in resultNormal,
resultAlerting, resultError and resultNoData. It changes the existing
code to call methods such as SetAlerting, SetPending, SetNormal,
SetError and NoData instead of assigning values to each individual field
whenever the state is changed. This should make it easier to understand
what fields should be set for which states and avoid cases where states are
missing, or have additional unexpected fields.
* Refactor state and manager to not depend directly on image interface
* Move generic errors to models package
* Move NotAvailableImageService to state as its only references are in state tests
* Move NoopImageService to state package
* Move mock to state package
* Fix linter error
* Fix comment styling
* Fix a couple added references introduced by rebase
* Empty commit to kick build
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
* Reduce piecemeal state fields
* Read data directly off state instead of rule
* Unify state and context into single struct
* Expose contextual information to layer above setNextState
* Work in terms of ContextualState and call historian in batches
* Call annotations service in batches
* Export format state and reason and remove workaround in unrelated test package
* Add new method to annotation service for batch inserting
* Fix loop variable aliasing bug caught by linter, didn't change behavior
* Incl timerange on annotation tests
* Insert one at a time if tags are present
* Point to rule from ContextualState rather than copy fields
* Build annotations and copy data prior to starting goroutine
* Rename to StateTransition
* Use new bulk-insert utility
* Remove rule from StateTransition and pass in directly to historian
* Simplify annotations logic since we have only one rule
* Fix logs and context, nilcheck, simplify method name
* Regenerate mock
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
* Touch up log statements, fix casing, add and normalize contexts
* Dedicated logger for dashboard resolver
* Avoid injecting logger to historian
* More minor log touch-ups
* Dedicated logger for state manager
* Use rule context in annotation creator
* Rename base logger and avoid redundant contextual loggers
* Create caching dashboard resolver
* A couple tests for dashboard resolving
* Log warning on not found
* Additional polish + review nits
* Move to singleflight instead of a plain mutex
* Store errors instead of -1 in cache and use reflection when reading
* Address linter error
* One more linter error
We have received a lot of feedback regarding the ValueString in alert notifications. Perhaps one of the most frequent complaints about ValueString is that it is difficult to read because it contains a lot of information, and the information is shown as a JSON-like string. Users have often asked how it can be templated and the answer is that it can't.
Until now users have been able to add custom annotations to their alert rules which contains values via the $values variable added in previous versions of Grafana. However, these custom annotations must be added for each of the user's alert rule, instead of once in a template that all of their alerts can be notified via.
This commit adds then the much requested feature to support values in notification templates. Users can then create a single template that prints the annotations, labels and values of their alerts in a format of their choice!
* remove ResetAllStates because it's not used
* refactor cache to accept logs, metrics and url as method args
* update manager Warm method to set the entire state at once
* remove unused reset method
* introduce ruleStates
* change getOrCreate to belong to ruleStates
* update Get to not return error
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions"
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
* Move annotation functionality behind a history persistence interface
* Rename to RecordState
* Fix lint error in import aliasing
* One more import linter error
* Refactor state manager to not depend on rule store interface
* Refactor grafana and proxied ruler APIs to not depend on store.RuleStore
* Refactor folder subscription logic to not use store.RuleStore
* Delete dead code
* Delete store.RuleStore
* Add consumer-side store interface to state manager
* Remove dead dependency
* Delete dead dependency in API struct
* Delete store-layer InstanceStore interface
* Move fake for state's InstanceStore interface to state package
This commit fixes a bug where we did not send resolved alerts to Alertmanager for resolved alert instances. This meant that resolved notifications did not have the annotations from the resolved state, and a result did not also have the resolved screenshot.
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
This change also updates some of our tests so that they run successfully against postgreSQL - we were using random Int64s, but postgres integers, which our tables use, max out at 2^31-1
* move saving the state to state manager when scheduler stops
* move saving state to ProcessEvalResults
* add GetRuleKey to State
* add LogContext to AlertRuleKey
* add tests for cache getOrCreate
* update ProcessEvalResults to accept extra lables
* extract to getRuleExtraLabels
* move populating of constant rule labels to extra labels
This commit fixes a bug where the state did not change from Alerting to Error if the evaluation result returned an error, or from Error to Alerting if evaluations stopped returning errors.
* Introduce AlertsRouter in the sender package, and move all fields and methods related to notifications out of the scheduler to this router.
* Introduce a new interface AlertsSender in the schedule package and replace calls of anonymous function `notify` inside the ruleRoutine to calling methods of that interface.
* Rename interface Scheduler in api package to ExternalAlertmanagerProvider, and replace scheduler with AlertRouter as struct that implements the interface.
* Alerting: decapitalize log lines and use "err" as the key for errors
Found using (logger|log).(Warn|Debug|Info|Error)\([A-Z] and (logger|log).(Warn|Debug|Info|Error)\(.+"error"