Commit Graph

100 Commits

Author SHA1 Message Date
Alexander Weaver
d17ab82b98
Alerting: Break up store.RuleStore interface, delete dead code (#55776)
* Refactor state manager to not depend on rule store interface

* Refactor grafana and proxied ruler APIs to not depend on store.RuleStore

* Refactor folder subscription logic to not use store.RuleStore

* Delete dead code

* Delete store.RuleStore
2022-09-27 08:56:30 -05:00
Alexander Weaver
a00879ae21
Alerting: Refactor store to not export its own interface for InstanceStore, delete dead dependency injection (#55772)
* Add consumer-side store interface to state manager

* Remove dead dependency

* Delete dead dependency in API struct

* Delete store-layer InstanceStore interface

* Move fake for state's InstanceStore interface to state package
2022-09-26 13:55:05 -05:00
Yuriy Tseretyan
879241a48f
Alerting: Fix state manager tests (#55593) 2022-09-21 13:57:18 -05:00
Yuriy Tseretyan
199996cbf9
Alerting: Resolve stale state + add state reason to notifications (#49352)
* adds a new reserved annotation `grafana_state_reason`
* explicitly resolve stale states
2022-09-21 13:24:47 -04:00
Yuriy Tseretyan
0629d3922a
stop flushing state when Grafana stops (#55504) 2022-09-21 10:10:17 -04:00
Sofia Papagiannaki
754eea20b3
Chore: SQL store split for annotations (#55089)
* Chore: SQL store split for annotations

* Apply suggestion from code review
2022-09-19 10:54:37 +03:00
George Robinson
5561f935e6
Alerting: Fix send resolved notifications (#54793)
This commit fixes a bug where we did not send resolved alerts to Alertmanager for resolved alert instances. This meant that resolved notifications did not have the annotations from the resolved state, and a result did not also have the resolved screenshot.
2022-09-15 17:25:05 +01:00
Joe Blubaugh
22c937340e
Revert "Alerting: Write and Delete multiple alert instances. (#54072)" (#54885)
This reverts commit 5e4fd94413.
2022-09-09 17:44:06 +02:00
Joe Blubaugh
5e4fd94413
Alerting: Write and Delete multiple alert instances. (#54072)
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.

Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8           398           2991381 ns/op         1133537 B/op      27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.619s
```

After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8          1440            816484 ns/op          352297 B/op       6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.383s
```

So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.

This change also updates some of our tests so that they run successfully against postgreSQL - we were using random Int64s, but postgres integers, which our tables use, max out at 2^31-1
2022-09-02 11:17:20 +08:00
Yuriy Tseretyan
03e746d9df
Alerting: Delete state from the database on reset (#53919)
* make ResetStatesByRuleUID return states
* delete rule states when reset
* rule eval routine to clean up the state only when rule is deleted
2022-08-25 14:12:22 -04:00
Yuriy Tseretyan
9f90a7b54d
Alerting: State manager to use InstanceStore (#53852)
* move saving the state to state manager when scheduler stops
* move saving state to ProcessEvalResults

* add GetRuleKey to State
* add LogContext to AlertRuleKey
2022-08-18 09:40:33 -04:00
Yuriy Tseretyan
e5e8747ee9
Alerting: Update state manager to accept reserved labels (#52189)
* add tests for cache getOrCreate
* update ProcessEvalResults to accept extra lables
* extract to getRuleExtraLabels
* move populating of constant rule labels to extra labels
2022-07-14 15:59:59 -04:00
George Robinson
34d45977ca
Alerting: Fix bug where state did not change between Alerting and Error (#52204)
This commit fixes a bug where the state did not change from Alerting to Error if the evaluation result returned an error, or from Error to Alerting if evaluations stopped returning errors.
2022-07-14 10:53:39 +01:00
Yuriy Tseretyan
a6b1090879
Alerting: refactor scheduler and separate notification logic (#48144)
* Introduce AlertsRouter in the sender package, and move all fields and methods related to notifications out of the scheduler to this router.
* Introduce a new interface AlertsSender in the schedule package and replace calls of anonymous function `notify` inside the ruleRoutine to calling methods of that interface.
* Rename interface Scheduler in api package to ExternalAlertmanagerProvider, and replace scheduler with AlertRouter as struct that implements the interface.
2022-07-12 15:13:04 -04:00
Yuriy Tseretyan
4b42cd3c1d
Alerting: State manager to use clock (#51219)
* manager to use clock, to be able to mock real time
2022-06-22 12:18:42 -04:00
Yuriy Tseretyan
157c12211d
Alerting: State manager to use tick time to determine stale states (#50991)
* use correct stale timestamp
* calculate stale using tick time instead of time.now

* remove unused dependency on sql store
2022-06-22 00:16:53 +02:00
gotjosh
0cde283505
Alerting: Logs should not be capitalized and the errors key should be "err" (#50333)
* Alerting: decapitalize log lines and use "err" as the key for errors

Found using (logger|log).(Warn|Debug|Info|Error)\([A-Z] and (logger|log).(Warn|Debug|Info|Error)\(.+"error"
2022-06-07 19:54:23 +02:00
Joe Blubaugh
56f40bd413
Alerting: Add Go error message to warning log for screenshots. (#49870)
Makes debugging problems with alert screenshotting easier.
2022-05-31 20:56:22 +08:00
Joe Blubaugh
9e8efaa459
Alerting: Add stored screenshot utilities to the channels package. (#49470)
Adds three functions:
`withStoredImages` iterates over a list of models.Alerts, extracting a stored image's data from storage, if available, and executing a user-provided function.
`withStoredImage` does this for an image attached to a specific alert.
`openImage` finds and opens an image file on disk.

Moves `store.Image` to `models.Image`
Simplifies `channels.ImageStore` interface and updates notifiers that use it to use the simpler methods.
Updates all pkg/alert/notifier/channels to use withStoredImage routines.
2022-05-26 13:29:56 +08:00
Joe Blubaugh
1cc034d960
Alerting: Add a "Reason" to Alert Instances to show underlying cause of state. (#49259)
This change adds a field to state.State and models.AlertInstance
that indicate the "Reason" that an instance has its current state. This
helps us account for cases where the state is "Normal" but the
underlying evaluation returned "NoData" or "Error", for example.

Fixes #42606

Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
2022-05-23 16:49:49 +08:00
Joe Blubaugh
1d724810de
Alerting: State Manager takes screenshots. (#49338)
The State Manager will now take screenshots when an alert instance
switches to an Alerting or Resolved state.

Signed-off-by: Joe Blubaugh joe.blubaugh@grafana.com
2022-05-23 10:53:41 +08:00
Joe Blubaugh
687e79538b
Alerting: Add a general screenshot service and alerting-specific image service. (#49293)
This commit adds a pkg/services/screenshot package for taking and uploading screenshots of Grafana dashboards. It supports taking screenshots of both dashboards and individual panels within a dashboard, using the rendering service.

The screenshot package has the following services, most of which can be composed:

BrowserScreenshotService (Takes screenshots with headless Chrome)
CachableScreenshotService (Caches screenshots taken with another service such as BrowserScreenshotService)
NoopScreenshotService (A no-op screenshot service for tests)
SingleFlightScreenshotService (Prevents duplicate screenshots when taking screenshots of the same dashboard or panel in parallel)
ScreenshotUnavailableService (A screenshot service that returns ErrScreenshotsUnavailable)
UploadingScreenshotService (A screenshot service that uploads taken screenshots)

The screenshot package does not support wire dependency injection yet. ngalert constructs its own version of the service. See https://github.com/grafana/grafana/issues/49296

This PR also adds an ImageScreenshotService to ngAlert. This is used to take screenshots with a screenshotservice and then store their location reference for use by alert instances and notifiers.
2022-05-22 22:33:49 +08:00
George Robinson
43358c7248
Alerting: Keep private annotations across evaluations (#49080) 2022-05-18 11:21:18 +02:00
Kristin Laemmert
1df340ff28
backend/services: Move GetDashboard from sqlstore to dashboard service (#48971)
* rename folder to match package name
* backend/sqlstore: move GetDashboard into DashboardService

This is a stepping-stone commit which copies the GetDashboard function - which lets us remove the sqlstore from the interfaces in dashboards - without changing any other callers.
* checkpoint: moving GetDashboard calls into dashboard service
* finish refactoring api tests for dashboardService.GetDashboard
2022-05-17 14:52:22 -04:00
Yuriy Tseretyan
4b417c8f3e
use NaN if condition value is nil (#48370) 2022-04-27 15:59:13 -03:00
George Robinson
c5547123bc
Remove redundant queries in GetAlertRules and GetOrgAlertRules and replace with ListAlertRules (#48108) 2022-04-25 11:42:42 +01:00
Yuriy Tseretyan
884c885289
Alerting: Support OK option for Error state (#47670)
* support OK state for Error
2022-04-13 14:45:29 -04:00
gotjosh
cb6124c921
Alerting: Accurately set value for prom-compatible APIs (#47216)
* Alerting: Accurately set value for prom-compatible APIs

Sets the value fields for the prometheus compatible API based on a combination of condition `refID` and the values extracted from the different frames.

* Fix an extra test

* Ensure a consitent ordering

* Address review comments

* address review comments
2022-04-05 19:36:42 +01:00
George Robinson
79769132c0
Alerting: Alert rule should wait For duration when execution error state is Alerting (#47052)
Alerting: Alert rule should wait For duration when execution error state is Alerting
2022-03-31 09:57:58 +01:00
gotjosh
84e5f336fe
Alerting: Classic conditions can now display multiple values (#46971)
* Alerting: Extract classic condition values by RefID

* uncapitalise function

* update documentation

* Update pkg/services/ngalert/eval/extract_md.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Update pkg/services/ngalert/state/state.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Update pkg/services/ngalert/state/state.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Update pkg/services/ngalert/eval/extract_md.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Update docs/sources/alerting/unified-alerting/alerting-rules/alert-annotation-label.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Update pkg/services/ngalert/eval/extract_md.go

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Run prettier

Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
2022-03-29 20:33:03 +01:00
gotjosh
a338c78ca8
Alerting: Remove internal labels from prometheus compatible API responses (#46548)
* Alerting: Remove internal labels from prometheus compatible API responses

* Appease the linter

* Fix integration tests

* Fix API documentation & linter

* move removal of internal labels to the models
2022-03-16 16:04:19 +00:00
gotjosh
8d4a0a0396
Alerting: Include annotations in prometheus Alert response. (#45970)
* Alerting: Include annotations in prometheus Alert response.

* add tests

* re-order depedencies
2022-03-09 18:20:29 +00:00
George Robinson
789cfc31e3
Alerting: Fix use of > instead of >= when checking the For duration (#46011) 2022-03-01 17:06:42 +00:00
George Robinson
feae959c9d
Alerting: Create annotation if Firing alert is removed (#45703)
This commit changes staleResultsHandler to create an annotation if the current state is Alerting and the result is being removed from the state cache as it has not been updated since 2x the evaluation interval.
2022-02-24 16:25:28 +00:00
George Robinson
8d57318941
Alerting: Use expanded labels in dashboard annotations (#45726) 2022-02-24 10:58:54 +00:00
Yuriy Tseretyan
02f8e99ca1
Alerting: move fake stores to store package (#45428)
* make fake storage public
* move fake storages to store package
2022-02-15 17:24:39 -05:00
George Robinson
67a3e1d6fd
Add context.Context to InstanceStore (#45049) 2022-02-08 13:49:04 +00:00
George Robinson
a9399ab3cd
Alerting: Add context.Context to RuleStore (#45004)
Alerting: Add context.Context to RuleStore
2022-02-08 08:52:03 +00:00
idafurjes
7a23700e1a
Remove unused GetDashboard method (#44890)
* Remove unused GetDashboard method

* Uncomment test

* Fix dashboard service integration test

* Remove comment
2022-02-04 17:21:06 +01:00
Yuriy Tseretyan
984c95de63
Do not store EvaluationString in Evaluation. (#44606)
* do not store evaluation string in Evaluation.
* reduce number of buckets to store for a single state
2022-02-02 19:18:20 +01:00
George Robinson
5e2280ceee
Add metrics to ngalert scheduler (#44602)
This pull request adds metrics to the ngalert scheduler so we can see how long it takes to evaluate a tick.
2022-01-31 16:56:43 +00:00
idafurjes
56c3875bb9
Chore: Remove context.TODO (#43458)
* Remove context.TODO() from services

* Fix live test
2021-12-28 10:26:18 +01:00
George Robinson
c932dc959c
Alerting: Add Ref ID to DatasourceNoData and DatasourceError alerts (#42630) 2021-12-03 09:55:16 +00:00
gotjosh
357e9ed1ea
Alerting: Fix Annotation Creation when the alerting state changes (#42479)
* Fix Annotation creation
- Remove validation of panelID, now annotations are created irrespective on whether they're attached to a panel or not.
- Alwasy attach the annotation to an AlertID

* Fix annotation creation

* fix tests
2021-12-01 11:04:54 +00:00
Santiago
a21d1e50f1
avoid template execution errors on missing values (#41617) 2021-11-29 15:26:51 -03:00
gotjosh
dd5a2e5128
Alerting: Clear alerting rule evaluation errors after intermittent failures (#42386)
* Alerting: Clear alerting rule evaluation errors after intermittent failures

When an alert transitioned in a way that `alerting -> error -> (alerting|nodata)`, the error provided by the `error` state would never be cleared thus the API and UI would show the health as an error.
2021-11-26 17:58:19 +00:00
George Robinson
1b26d4d88e
Alerting: Create DatasourceError alert if evaluation returns error (#41869)
* Alerting: Create DatasourceError alert if evaluation returns error

* Alerting: Add docs for DatasourceError alert

* Alerting: Fix DatasourceError alert does not have dashboard_uid label

* Alerting: Add break when datasource_uid found

* Alerting: Update TestProcessEvalResults
2021-11-25 11:46:47 +01:00
Santiago
a45e4ff73f
graphLink and tableLink template functions (#41369)
* graphLink and tableLink functions, docs updated

* Code review changes

* extract query struct outside of graphLink and tableLink functions

* Fix docs

* Update docs/sources/alerting/unified-alerting/alerting-rules/alert-annotation-label.md

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>

* Update docs/sources/alerting/unified-alerting/alerting-rules/alert-annotation-label.md

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>

* Update docs/sources/alerting/unified-alerting/alerting-rules/alert-annotation-label.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Update docs/sources/alerting/unified-alerting/alerting-rules/alert-annotation-label.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Fix linting errors

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
2021-11-10 10:36:03 -03:00
Yuriy Tseretyan
610643a668
Alerting: Special alert instance if rule is in state NoData (#40540)
* do not suppress NoData state
* extract conversion of state to postable alert + tests
* create a special alert instance if nodata 
* use NoData when converting from Keep Last State instead of Alerting
* add silence during migration if NoData is mapped to KeepLastState.
2021-11-04 16:42:34 -04:00
Yuriy Tseretyan
1b5b747885
Alerting: Additional Tests for State Manager (#41291)
* rename fakeInstanceStore to FakeInstanceStore

* update test for state manager to initialize instance store with FakeInstanceStore
2021-11-04 15:15:56 -04:00