Prior to this change, all alert instance writes and deletes happened
individually, each in its own database transaction. This change batches the
writes and deletes from a given rule's evaluation loop into a single
transaction before applying them.
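The shape of the change, as a minimal sketch using database/sql directly; the table layout, column names, and helper names here are illustrative assumptions, not Grafana's actual store code:
```go
package store

import (
	"context"
	"database/sql"
)

// alertInstance carries only the fields this sketch needs.
type alertInstance struct {
	RuleOrgID    int64
	RuleUID      string
	LabelsHash   string
	CurrentState string
}

// saveAlertInstances writes every instance produced by one evaluation loop
// in a single transaction, instead of one transaction per instance.
func saveAlertInstances(ctx context.Context, db *sql.DB, instances []alertInstance) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	stmt, err := tx.PrepareContext(ctx,
		`INSERT INTO alert_instance (rule_org_id, rule_uid, labels_hash, current_state)
		 VALUES (?, ?, ?, ?)`)
	if err != nil {
		return err
	}
	defer stmt.Close()

	for _, in := range instances {
		if _, err := stmt.ExecContext(ctx, in.RuleOrgID, in.RuleUID, in.LabelsHash, in.CurrentState); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```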
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time per operation by about 73%, bytes allocated by about 69%, and
allocations by about 76% when storing and deleting 100 instances.
This change also updates some of our tests so that they run successfully against PostgreSQL. We were generating random int64 IDs, but the integer columns our tables use max out at 2^31-1 in Postgres.
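A hedged sketch of the fix, assuming a test helper generates the IDs (the helper name is hypothetical):
```go
package store_test

import "math/rand"

// randomID generates IDs that fit a Postgres "integer" column: rand.Int31
// never exceeds 2^31-1, unlike rand.Int63, which can overflow the column.
func randomID() int64 {
	return int64(rand.Int31())
}
```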
* move saving the state to state manager when scheduler stops
* move saving state to ProcessEvalResults
* add GetRuleKey to State
* add LogContext to AlertRuleKey
* add tests for cache getOrCreate
* update ProcessEvalResults to accept extra labels
* extract to getRuleExtraLabels
* move populating of constant rule labels to extra labels
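A sketch of what getRuleExtraLabels might look like; the label names, field names, and stub type are assumptions for illustration:
```go
package schedule

// alertRule is a stub with just the fields this sketch needs.
type alertRule struct {
	UID          string
	NamespaceUID string
	Title        string
}

// getRuleExtraLabels builds the constant per-rule labels once per evaluation
// tick, so ProcessEvalResults no longer re-populates them per result.
func getRuleExtraLabels(rule alertRule) map[string]string {
	return map[string]string{
		"__alert_rule_uid__":       rule.UID,
		"__alert_rule_namespace__": rule.NamespaceUID,
		"alertname":                rule.Title,
	}
}
```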
* Introduce AlertsRouter in the sender package, and move all fields and methods related to notifications out of the scheduler to this router.
* Introduce a new interface AlertsSender in the schedule package, and replace calls to the anonymous `notify` function inside the ruleRoutine with calls to methods of that interface (sketched below).
* Rename the Scheduler interface in the api package to ExternalAlertmanagerProvider, and replace the scheduler with AlertsRouter as the struct that implements the interface.
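A sketch of the new boundary; the method set and the stand-in types are assumptions based on the description above:
```go
package schedule

// AlertRuleKey identifies a rule; PostableAlert is a stand-in for the
// payload handed to Alertmanagers.
type AlertRuleKey struct {
	OrgID int64
	UID   string
}

type PostableAlert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// AlertsSender replaces the anonymous notify function: ruleRoutine hands
// alerts to this interface, and sender.AlertsRouter implements it.
type AlertsSender interface {
	Send(key AlertRuleKey, alerts []PostableAlert)
}
```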
* Alerting: decapitalize log lines and use "err" as the key for errors
Found using `(logger|log).(Warn|Debug|Info|Error)\([A-Z]` and `(logger|log).(Warn|Debug|Info|Error)\(.+"error"`
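For illustration, the convention before and after; the message and call site are made up, and the logger is stubbed rather than Grafana's real type:
```go
package example

// logger matches Grafana's key/value-style logging methods (stubbed here).
type logger interface {
	Error(msg string, ctx ...interface{})
}

func notifyFailure(log logger, err error) {
	// Before this change: capitalized message, "error" as the key.
	// log.Error("Failed to send alerts", "error", err)

	// After: lower-case message, "err" as the key.
	log.Error("failed to send alerts", "err", err)
}
```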
This change adds a field to state.State and models.AlertInstance
that indicates the "Reason" an instance has its current state. This
helps us account for cases where the state is "Normal" but the
underlying evaluation returned "NoData" or "Error", for example.
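A sketch of the addition on the state side; the field names follow the description above, the surrounding fields are elided, and string stands in for the real state type:
```go
package state

// State carries a StateReason alongside the state itself;
// models.AlertInstance gains an equivalent field.
type State struct {
	State       string // e.g. "Normal", "Alerting"
	StateReason string // e.g. "NoData" or "Error" when those map to Normal
	// ... other fields elided
}
```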
Fixes #42606
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
The State Manager will now take screenshots when an alert instance
switches to an Alerting or Resolved state.
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
This commit adds a pkg/services/screenshot package for taking and uploading screenshots of Grafana dashboards. It supports taking screenshots of both dashboards and individual panels within a dashboard, using the rendering service.
The screenshot package has the following services, most of which can be composed:
* BrowserScreenshotService (takes screenshots with headless Chrome)
* CachableScreenshotService (caches screenshots taken with another service, such as BrowserScreenshotService)
* NoopScreenshotService (a no-op screenshot service for tests)
* SingleFlightScreenshotService (prevents duplicate screenshots when screenshots of the same dashboard or panel are requested in parallel)
* ScreenshotUnavailableService (a screenshot service that returns ErrScreenshotsUnavailable)
* UploadingScreenshotService (a screenshot service that uploads taken screenshots)
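To illustrate the composition, here is a sketch of how the single-flight decorator could wrap another service; the interface, stand-in types, and key format are assumptions, not the package's exact API:
```go
package screenshot

import (
	"context"
	"fmt"

	"golang.org/x/sync/singleflight"
)

// ScreenshotService is the shared interface that lets the services above
// wrap one another as decorators.
type ScreenshotService interface {
	Take(ctx context.Context, opts ScreenshotOptions) (*Screenshot, error)
}

// Stand-in types to keep the sketch self-contained.
type ScreenshotOptions struct {
	DashboardUID string
	PanelID      int64
}

type Screenshot struct{ Path string }

// SingleFlightScreenshotService collapses concurrent requests for the same
// dashboard and panel into a single call to the wrapped service.
type SingleFlightScreenshotService struct {
	next ScreenshotService
	sf   singleflight.Group
}

func (s *SingleFlightScreenshotService) Take(ctx context.Context, opts ScreenshotOptions) (*Screenshot, error) {
	key := fmt.Sprintf("%s-%d", opts.DashboardUID, opts.PanelID)
	v, err, _ := s.sf.Do(key, func() (interface{}, error) {
		return s.next.Take(ctx, opts)
	})
	if err != nil {
		return nil, err
	}
	return v.(*Screenshot), nil
}
```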
The screenshot package does not support wire dependency injection yet. ngalert constructs its own version of the service. See https://github.com/grafana/grafana/issues/49296
This PR also adds an ImageScreenshotService to ngalert. It is used to take screenshots with a screenshot service and then store a location reference to them for use by alert instances and notifiers.
* rename folder to match package name
* backend/sqlstore: move GetDashboard into DashboardService
This is a stepping-stone commit that copies the GetDashboard function - letting us remove the sqlstore from the dashboards interfaces - without changing any other callers.
* checkpoint: moving GetDashboard calls into dashboard service
* finish refactoring api tests for dashboardService.GetDashboard
* Alerting: Accurately set value for prom-compatible APIs
Sets the value fields for the Prometheus-compatible API based on a combination of the condition `refID` and the values extracted from the different frames.
* Fix an extra test
* Ensure a consistent ordering
* Address review comments
* address review comments
This commit changes staleResultsHandler to create an annotation when the current state is Alerting and the result is being removed from the state cache because it has not been updated for 2x the evaluation interval.
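The staleness check itself is small; a sketch, with the function name assumed:
```go
package state

import "time"

// isStale sketches the rule described above: an entry is stale once it has
// gone unseen for two evaluation intervals. When a stale entry's state is
// Alerting, the handler also records a closing annotation.
func isStale(now, lastEval time.Time, intervalSeconds int64) bool {
	interval := time.Duration(intervalSeconds) * time.Second
	return lastEval.Add(2 * interval).Before(now)
}
```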
* Fix Annotation creation
- Remove validation of panelID; annotations are now created regardless of whether they are attached to a panel.
- Always attach the annotation to an AlertID
* Fix annotation creation
* fix tests
* Alerting: (wip) add template funcs
* Alerting: (wip) numeric template functions
* Alerting: (wip) template functions
* Test for the "args" function
* Alerting: (wip) Documentation for template functions
* Alerting: template functions - refactor
* code review changes
* disable linter error
* Use Prometheus implementation of TemplateExpander
* Update docs/sources/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule.md
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
* change templateCaptureValue to support using template functions
* Update pkg/services/ngalert/state/template.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Test and documentation added for reReplaceAll template function
* complete missing functions, documentation and tests
* Use the alert instance's evaluation time for expanding the template
* strvalue, graphlink, and tablelink functions
* delete duplicate test
* make strvalue return an empty string
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Chore: GetDashboardQuery should be dispatched using DispatchCtx
* Fix after merge
* Changes after review
* Various fixes
* Use GetDashboardCtx function instead of GetDashboard
* Alerting: Refactor & fix unified alerting metrics structure
Fixes and refactors the metrics structure we have for the ngalert service. Now, each component has its own metrics struct that includes just the metrics it uses. Additionally, I have fixed the configuration metrics and added new metrics to determine whether we have discovered and started all the necessary configurations of an instance.
This allows us to alert on `grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0` to know whether an Alertmanager instance did not start successfully.
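A sketch of the per-component layout with the two configuration gauges mentioned above; the help texts and the constructor shape are illustrative, not the full set Grafana registers:
```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// MultiOrgAlertmanagerMetrics holds just the metrics this component uses.
type MultiOrgAlertmanagerMetrics struct {
	DiscoveredConfigurations prometheus.Gauge
	ActiveConfigurations     prometheus.Gauge
}

func NewMultiOrgAlertmanagerMetrics(r prometheus.Registerer) *MultiOrgAlertmanagerMetrics {
	m := &MultiOrgAlertmanagerMetrics{
		DiscoveredConfigurations: prometheus.NewGauge(prometheus.GaugeOpts{
			Namespace: "grafana",
			Subsystem: "alerting",
			Name:      "discovered_configurations",
			Help:      "The number of organizations whose alerting configuration was discovered.",
		}),
		ActiveConfigurations: prometheus.NewGauge(prometheus.GaugeOpts{
			Namespace: "grafana",
			Subsystem: "alerting",
			Name:      "active_configurations",
			Help:      "The number of organizations with a running Alertmanager configuration.",
		}),
	}
	r.MustRegister(m.DiscoveredConfigurations, m.ActiveConfigurations)
	return m
}
```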
* Alerting: Fix alert flapping in the Alertmanager
Fixes a bug that caused alerts evaluated at low intervals (sub 1 minute) to flap in the Alertmanager,
mostly due to a combination of `EndsAt` and the resend delay.
The Alertmanager uses `EndsAt` as a heuristic to know when it should resolve a firing alert, in case it hasn't heard
back from the alert generation system.
Because Grafana sent the alert with an `EndsAt` equal to the `For` of the alert itself,
and we had a hard-coded 1 minute resend delay (only applicable to firing alerts), a firing alert would resolve in the Alertmanager before we re-notified that it was still firing.
This commit increases `EndsAt` to 3x the resend delay or alert interval (whichever is higher). The resend delay has been decreased to 30 seconds.
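A sketch of the resulting EndsAt calculation; the function and constant names are assumed:
```go
package sender

import "time"

// resendDelay is the new, lower delay between re-sends of firing alerts.
const resendDelay = 30 * time.Second

// nextEndsAt pads EndsAt to three times the larger of the resend delay and
// the rule's evaluation interval, so the Alertmanager never auto-resolves a
// firing alert between re-sends.
func nextEndsAt(now time.Time, evalInterval time.Duration) time.Time {
	pad := resendDelay
	if evalInterval > pad {
		pad = evalInterval
	}
	return now.Add(3 * pad)
}
```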
* initial attempt at automatic removal of stale states
* test case, need expected states
* finish unit test
* PR feedback
* still multiply by time.Second
* pr feedback
* Expand the value of math and reduce expressions in annotations and labels
This commit makes it possible to use the values of reduce and math
expressions in annotations and labels via their RefIDs. It uses the
Stringer interface to ensure that "{{ $values.A }}" still prints the
value in decimal format, while also making the labels for each RefID
available as "{{ $values.A.Labels }}" and the float64 value as
"{{ $values.A.Value }}"
* Alerting: Refactor state manager as a dependency
Within the scheduler, the state manager was being passed around to a
number of functions. I've introduced it as a dependency to keep the
"service" interfaces as clean and homogeneous as possible.
This is relevant, because I'm going to introduce live reload of these
components as part of my next PR and it is better if dependencies are
self-contained.
* remove unused functions
* Fix a few more tests
* Make sure the `stateManager` is declared before the schedule
When using a classic condition (and currently only then), evaluation information is added, similar to the EvalMatches from dashboard alerting.
This is returned via the API and can be included in notifications by reading the `__value__` label attached to `.Alerts` in the template. It is a string.
* nest cache by orgID, ruleUID, stateID
* update accessors to use new cache structure
* test and linter fixup
* fix panic
Co-authored-by: Kyle Brandt <kyle@grafana.com>
* add comment to identify what's going on with nested maps in cache
Co-authored-by: Kyle Brandt <kyle@grafana.com>
* set processing time
* merge labels and set on response
* use state cache for adding alerts to rules
* minor cleanup
* add support for NoData and Error results
* rename test
* bring in changes from other PRs that have been merged
* pr feedback
* add integration test
* close state tracker cleanup on context.Done
* fixup test
* rename state tracker
* set EvaluationDuration on Result
* default labels set as constants
* separate cache and state from manager
* use RWMutex in cache
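A sketch of the nested cache from the bullets above, with getOrCreate and a read path; the exact types and signatures are assumptions:
```go
package state

import "sync"

// State is stubbed to keep the sketch self-contained.
type State struct{}

// cache nests states orgID -> ruleUID -> stateID, guarded by an RWMutex so
// reads can run concurrently while writes take the exclusive lock.
type cache struct {
	mu     sync.RWMutex
	states map[int64]map[string]map[string]*State
}

func (c *cache) getOrCreate(orgID int64, ruleUID, stateID string, newState func() *State) *State {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.states == nil {
		c.states = map[int64]map[string]map[string]*State{}
	}
	if c.states[orgID] == nil {
		c.states[orgID] = map[string]map[string]*State{}
	}
	if c.states[orgID][ruleUID] == nil {
		c.states[orgID][ruleUID] = map[string]*State{}
	}
	if s, ok := c.states[orgID][ruleUID][stateID]; ok {
		return s
	}
	s := newState()
	c.states[orgID][ruleUID][stateID] = s
	return s
}

func (c *cache) get(orgID int64, ruleUID, stateID string) (*State, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.states[orgID][ruleUID][stateID]
	return s, ok
}
```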