Commit Graph

135 Commits

Author SHA1 Message Date
Alexander Weaver
7a171fd14a Regenerate openapidocs at 1.21.8 to match ci (#84037)
* Regenerate openapidocs at 1.21.8 to match ci

* Adjust trigger to work on the actual outputted files

* Also put go.mod and go.sum in the triggers

* manually fix

* Make an arbitrary change rather than touching the trigger to force a run

* Drop all triggers - run all the time

* Print diff - taken from @papagian's PR

* Manual fixes to swagger doc

---------

Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
2024-03-06 16:08:45 -06:00
Alexander Weaver
d5fda06147 Alerting: Decouple rule routine from scheduler (#84018)
* create rule factory for more complicated dep injection into rules

* Rules get direct access to metrics, logs, traces utilities, use factory in tests

* Use clock internal to rule

* Use sender, statemanager, evalfactory directly

* evalApplied and stopApplied

* use schedulableAlertRules behind interface

* loaded metrics reader

* 3 relevant config options

* Drop unused scheduler parameter

* Rename ruleRoutine to run

* Update READMED

* Handle long parameter lists

* remove dead branch
2024-03-06 13:44:53 -06:00
Alexander Weaver
1bb38e8f95 Alerting: Move ruleRoutine to be a method on ruleInfo (#83866)
* Move ruleRoutine to ruleInfo file

* Move tests as well

* swap ruleInfo and scheduler parameters on ruleRoutine

* Fix linter complaint, receiver name
2024-03-04 17:15:55 -06:00
Alexander Weaver
f2a9d0a89d Alerting: Refactor ruleRoutine to take an entire ruleInfo instance (#83858)
* Make stop a real method

* ruleRoutine takes a ruleInfo reference directly rather than pieces of it

* Fix whitespace
2024-03-04 15:15:01 -06:00
Yuri Tseretyan
1eebd2a4de Alerting: Support for simplified notification settings in rule API (#81011)
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated

* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.

* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings

* update GET API so only admins can see auto-gen
2024-02-15 09:45:10 -05:00
Alexander Weaver
d4ae10ecc6 Alerting: Small refactor, move unrelated functions out of fetcher (#82459)
Move unrelated functions out of fetcher
2024-02-14 20:01:32 +02:00
Alexander Weaver
843c477899 Alerting: Add exported API to scheduler to access currently loaded rules (#82031)
* Add exported API to fetch rule definitions from scheduler

* Add comment
2024-02-07 09:31:22 -06:00
Yuri Tseretyan
131c72d655 Alerting: Fix scheduler to group folders by the unique key (orgID and UID) (#81303) 2024-01-30 17:14:11 -05:00
Alexander Weaver
00a260effa Alerting: Add setting to distribute rule group evaluations over time (#80766)
* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
2024-01-18 12:48:11 -06:00
Alexander Weaver
3c796ecc8f Alerting: Add metric counting rule groups per org (#80669)
* Refactor, fix bad map hint

* Count groups per org
2024-01-16 16:35:56 -06:00
Alexander Weaver
542741f748 Alerting: Log scheduler maxAttempts, guard against invalid retry counts, log retry errors (#80234)
* Log maxAttempts, add guard, log retry errors

* fix whitespace

* Initialize evaluator in TestProcessTicks
2024-01-09 13:19:37 -06:00
Yuri Tseretyan
f6a46744a6 Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
gotjosh
c631261681 Alerting: Attempt to retry retryable errors (#79161)
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 20:45:08 +00:00
gotjosh
07915703fe Revert "Alerting: Attempt to retry retryable errors" (#79158)
Revert "Alerting: Attempt to retry retryable errors (#79037)"

This reverts commit 3e51cf0949.
2023-12-06 19:12:01 +00:00
gotjosh
3e51cf0949 Alerting: Attempt to retry retryable errors (#79037)
* Alerting: Attempt to retry retryable errors

Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 16:35:22 +00:00
Santiago
61cb26711e Alerting: Fetch alerts from a remote Alertmanager (#75844)
* Alerting: post alerts to the remote Alertmanager and fetch them

* fix broken tests

* Alerting: Add Mimir Backend image to devenv (blocks)

* add alerting as code owner for mimir_backend block

* Alerting: Use Mimir image to run integration tests for the remote Alertmanager

* skip integration test when running all tests

* skipping integration test when no Alertmanager URL is provided

* fix bad host for mimir_backend

* remove basic auth testing until we have an nginx image in our CI

* add integration tests for alerts

* fix tests

* change SendCtx -> Send, add context.Context to Send, fix CI

* add reover() for functions from the Prometheus Alertmanager HTTP client that could panic

* add TODO to implement PutAlerts in a way that mimicks what Prometheus does

* fix log format
2023-10-19 11:27:37 +02:00
Marcus Efraimsson
e4c1a7a141 Tracing: Standardize on otel tracing (#75528) 2023-10-03 14:54:20 +02:00
Steve Simpson
894f420014 Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158) 2023-09-20 15:07:02 +02:00
Yuri Tseretyan
938e26b59f Alerting: Add new metrics and tracings to state manager and scheduler (#71398)
* add metrics and tracing to state manager

* propagate tracer to state manager

* add scheduler metrics

* fix backtesting

* add test for state metrics

* remove StateUpdateCount

* update docs

* metrics can be null

* add tracer to new tests
2023-08-16 09:04:18 +02:00
Yuri Tseretyan
c7598cc6fb Alerting: Add ability to control scheduler tick interval via config (#71980)
* add ability to control scheduler interval via config
* add feature flag `configurableSchedulerTick`
2023-07-26 12:44:12 -04:00
Kyle Brandt
f6a28cadbc Alerting: (Chore/Instrumentation) Add traceID to logs with contextual logger (#71289)
Alerting: (Chore) Add traceID to logs with contextual logger
2023-07-11 10:59:52 +02:00
SatVeer Singh
1bfa3a0f1e Chore: Replace go-multierror with errors package (#66432)
* code refactor and type assertions added to tests

* no-lint rule added for specific line
2023-06-19 12:29:45 +03:00
Matthew Jacobson
ba3994d338 Alerting: Repurpose rule testing endpoint to return potential alerts (#69755)
* Alerting: Repurpose rule testing endpoint to return potential alerts

This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations.

The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent.

This means that the endpoint will, among other things:

- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
2023-06-08 18:59:54 -04:00
Yuri Tseretyan
9eb10bee1f Alerting: Scheduler use rule fingerprint instead of version (#66531)
* implement calculation of fingerprint for ruleWithFolder
* update scheduler to use fingerprint instead of rule's version
2023-04-28 10:42:16 -04:00
gotjosh
1c3ce0735f Alerting: Tiny refactor on the eval and schedule packages (#66130)
* Alerting: Tiny refactor on the eval and schedule packages

two very small things:

- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2023-04-06 16:02:28 +01:00
Alexander Weaver
9bcf8819d3 Alerting: Handful of small adjustments to log levels and parameters (#64572)
Calculate duration earlier in scheduler
2023-03-17 12:15:49 +00:00
Yuri Tseretyan
85a954cd81 Alerting: Update scheduler to get updates only from database (#64635)
* stop using the scheduler's Update and Delete methods all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
2023-03-14 18:02:51 -04:00
Alex Moreno
f60dc4441f Alerting: Add status label to GroupRules metric (#63454)
* Add status label to GroupRules metric

* Add state (active and paused) label to GrouRules

* Add active/paused metrics tests
2023-02-23 12:38:27 +01:00
Steve Simpson
4d1a2c3370 Alerting: Move rule_groups_rules metric from State to Scheduler. (#63144)
The `rule_groups_rules` metric is currently defined and computed by `State`.
It makes more sense for this metric to be computed off of the configured rule
set, not based on the rule evaluation state. There could be an edge condition
where a rule does not have a state yet, and so is uncounted.

Additionally, we would like this metric (and others), to have a `rule_group`
label, and this is much easier to achieve if the metric is produced from the
`Scheduler` package.
2023-02-09 17:05:19 +01:00
Yuri Tseretyan
f066e8cdcd Alerting: Update to alerting 20230203015918-0e4e2675d7aa (after refactoring) (#62823)
* add alerting prefix to some packages from alerting that have similar names in prometheus alertmanager
2023-02-03 11:36:49 -05:00
Alex Moreno
7a465f42a6 Alerting: Allow pausing alerts from provisioning (#62263)
* Allow pausing alerts from provisioning

* Update swagger

* Add IsPaused to provision export endpoints

* Add pause field in sample.yml

* Add exception for reset state in first loop iteration of scheduler if rule is paused

* Update provision definition and swagger docs

* Fix provisioning export tests

* Suggestion: Simplify if condition

* Add more context to a comment
2023-01-30 16:29:05 +01:00
Serge Zaitsev
d6d4097567 Chore: Fix goimports grouping in alerting (#62424)
* fix goimports

* fix goimports order
2023-01-30 09:55:35 +01:00
Yuri Tseretyan
05bf241952 Alerting: Update state manager to return StateTransitions when Delete or Reset (#62264)
* update Delete and Reset methods to return state transitions

this will be used by notifier code to decide whether alert needs to be sent or not.

* update scheduler to provide reason to delete states and use transitions

* update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state

* fixup

* fix tests
2023-01-27 09:46:21 +01:00
Alex Moreno
531b439cf1 Alerting: Add alert pausing feature (#60734)
* Add field in alert_rule model, add state to alert_instance model, and state to eval

* Remove paused state from eval package

* Skip paused alert rules in scheduler

* Add migration to add is_paused field to alert_rule table

* Convert to postable alerts only if not normal, pernding, or paused

* Handle paused eval results in state manager

* Add Paused state to eval package

* Add paused alerts logic in scheduler

* Skip alert on scheduler

* Remove paused status from eval package

* Apply suggestions from code review

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Remove state

* Rethink schedule and manager for paused alerts

* Change return to continue

* Remove unused var

* Rethink alert pausing

* Paused alerts storing annotations

* Only add one state transition

* Revert boolean method renaming refactor

* Revert take image refactor

* Make registry errors public

* Revert method extraction for getting a folder title

* Revert variable renaming refactor

* Undo unnecessary changes

* Revert changes in test

* Remove IsPause check in PatchPartiLAlertRule function

* Use SetNormal to set state

* Fix text by returning to old behaviour on alert rule deletion

* Add test in schedule_unit_test.go to test ticks with paused alerts

* Add coment to clarify usage of context.Background()

* Add comment to clarify resetStateByRuleUID method usage

* Move rule get to a more limited scope

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* rum gofmt on pkg/services/ngalert/schedule/schedule.go

* Remove defer cancel for context

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/testing.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* skip scheduler rule state clean up on paused alert rule

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Fix mock in test

* Add (hopefully) final suggestions

* Use error channel from recordAnnotationsSync to cancel context

* Run make gen-cue

* Place pause alert check in channel update after version check

* Reduce branching un update channel select

* Add if for error and move code inside if in state manager ResetStateByRuleUID

* Add reason to logs

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Do not delete alert rule routine, just exit on eval if is paused

* Reduce branching and create-close a channel to avoid deadlocks

* Separate state deletion and state reset (includes history saving)

* Add current pause state in rule route in scheduler

* Split clearState and bring errCh closer to RecordStatesAsync call

* Change rule to ruleMeta in RecordStatesAsync

* copy state to be able to modify it

* Add timeout to context creation

* Shorten the timeout

* Use resetState is rule is paused and deleteState if rule is not paused

* Remove Empty state reason

* Save every rule change in historian

* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID

* Remove useless line

* Remove outdated comment

Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
2023-01-26 18:29:10 +01:00
George Robinson
2a291afbae Alerting: Use consts from alerting package (#61241) 2023-01-10 19:59:13 +00:00
Yuri Tseretyan
da18c89e91 Alerting: Scheduler to call DeleteAlertRule once when it stops deleted rules (#61189)
scheduler to call DeleteAlertRule once when it stops deleted rules
2023-01-09 14:39:32 -05:00
Yuri Tseretyan
48f1db63ff Alerting: Add support for tracing to alerting scheduler (#61057) 2023-01-06 21:21:43 -05:00
Yuri Tseretyan
abb49d96b5 Alerting: update state manager to return StateTransition instead of State (#58867)
* improve test for stale states
* update state manager return StateTransition
* update scheduler to accept state transitions
2022-12-06 13:07:39 -05:00
Alexander Weaver
9977c7ea43 Alerting: Simplify scheduler configuration and remove dependency on Grafana-wide settings (#59735)
* Make scheduler not depend directly on grafana-wide settings

* Re-add missing interval
2022-12-02 16:02:07 -06:00
Yuri Tseretyan
bad4f28d0d Alerting: update test TestAlertingTicker to not rely on clock (#58544)
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
2022-11-09 15:08:57 -05:00
Yuri Tseretyan
978f1119d7 Alerting: Run state manager as regular sub-service (#58246) 2022-11-04 17:06:47 -04:00
Ryan McKinley
e6a9fa1cf9 ServiceAccounts: enable service accounts after IsRealUser change (#58263)
* suppor service accounts

* add: IsServiceAccount to scheduleUser in scheduler

Co-authored-by: eleijonmarck <eric.leijonmarck@gmail.com>
2022-11-04 15:53:35 -04:00
Eric Leijonmarck
72d0c6b428 Auth: add IsServiceAccount to IsRealUser (#58015)
* add: IsServiceAccount to SignedInUser and IsRealUser

* fix: linting error

* refactor: add function IsServiceAccountUser()

By adding the function IsServiceAccountUser() we use it to identify for
ServiceAccounts in the HasUniqueID() since caching is built up on having
a uniqueID, see comment: https://github.com/grafana/grafana/pull/58015#discussion_r1011361880
2022-11-04 12:39:54 +00:00
Yuriy Tseretyan
e3a4bde622 Alerting: Condition evaluator with cached pipeline (#57479)
* create rule evaluator
* load header from the context
* init one factory
* update scheduler
2022-11-02 10:13:39 -04:00
Yuriy Tseretyan
0a4121cef8 Alerting: Contextual log provider for rule key (#57476)
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
2022-10-26 19:16:02 -04:00
Yuriy Tseretyan
f3c219a980 Alerting: update format of logs in scheduler (#57302)
* Change the severity level of the log messages
2022-10-20 13:43:48 -04:00
Alexander Weaver
3ddb28bad9 Find-and-replace 'err' logs to 'error' to match log search conventions (#57309) 2022-10-19 17:36:54 -04:00
Alexander Weaver
4eb8e4ff66 Alerting: Add traceability headers for alert queries (#57127)
* Define EvaluationContext

* Refactor ConditionEval to use new context struct

* Refactor QueriesAndExpressionsEval to use EvaluationContext

* Remove dead field from AlertExecCtx

* Refactor Validate to use EvaluationContext

* Get rid of privately used AlertExecCtx

* Move EvaluationContext to new file and add helper

* Add builder pattern and bind rule info to context

* Extract header logic and add rule UID header

* Fix missing call
2022-10-19 14:19:43 -05:00
Yuriy Tseretyan
ad2a1dd680 Alerting: Start ticker only when scheduler starts (#56339) 2022-10-05 09:35:02 -04:00
Alexander Weaver
a00879ae21 Alerting: Refactor store to not export its own interface for InstanceStore, delete dead dependency injection (#55772)
* Add consumer-side store interface to state manager

* Remove dead dependency

* Delete dead dependency in API struct

* Delete store-layer InstanceStore interface

* Move fake for state's InstanceStore interface to state package
2022-09-26 13:55:05 -05:00