grafana

mirror of https://github.com/grafana/grafana.git synced 2024-11-25 18:30:41 -06:00

Author	SHA1	Message	Date
Yuri Tseretyan	131c72d655	Alerting: Fix scheduler to group folders by the unique key (orgID and UID) (#81303 )	2024-01-30 17:14:11 -05:00
Alexander Weaver	18b9c8fd5f	Alerting: Nilcheck JitterStrategyFrom so it can be used in contexts without feature toggles (#80841 ) Nilcheck so tests can have a nil feature toggles	2024-01-18 15:43:41 -06:00
Alexander Weaver	00a260effa	Alerting: Add setting to distribute rule group evaluations over time (#80766 ) * Simple, per-base-interval jitter * Add log just for test purposes * Add strategy approach, allow choosing between group or rule * Add flag to jitter rules * Add second toggle for jittering within a group * Wire up toggles to strategy * Slightly improve comment ordering * Add tests for offset generation * Rename JitterStrategyFrom * Improve debug log message * Use grafana SDK labels rather than prometheus labels	2024-01-18 12:48:11 -06:00
Jean-Philippe Quéméner	82638d059f	feat(alerting): add state persister interface (#80384 )	2024-01-17 13:33:13 +01:00
Alexander Weaver	3c796ecc8f	Alerting: Add metric counting rule groups per org (#80669 ) * Refactor, fix bad map hint * Count groups per org	2024-01-16 16:35:56 -06:00
Alexander Weaver	542741f748	Alerting: Log scheduler maxAttempts, guard against invalid retry counts, log retry errors (#80234 ) * Log maxAttempts, add guard, log retry errors * fix whitespace * Initialize evaluator in TestProcessTicks	2024-01-09 13:19:37 -06:00
Yuri Tseretyan	f6a46744a6	Alerting: Support hysteresis command expression (#75189 ) Backend: * Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command. - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels. - add rule_fingerprint column to alert_instance - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it. - add AlertingResultsFromRuleState that implements the new interface in eval package - update getExprRequest to patch the hysteresis command. * Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition. Frontend: * Add hysteresis option to Threshold in UI. It's called "Recovery Threshold" * Add test for getUnloadEvaluatorTypeFromCondition * Hide hysteresis in panel expressions * Refactor isInvalid and add test for it * Remove unnecesary React.memo * Add tests for updateEvaluatorConditions --------- Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>	2024-01-04 11:47:13 -05:00
gotjosh	c631261681	Alerting: Attempt to retry retryable errors (#79161 ) * Alerting: Attempt to retry retryable errors Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible. I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected. There's two small differences between how retries work now and how they used to work in legacy alerting. Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying. We have added a constant backoff of 1s in between retries. --------- Signed-off-by: gotjosh <josue.abreu@gmail.com>	2023-12-06 20:45:08 +00:00
gotjosh	07915703fe	Revert "Alerting: Attempt to retry retryable errors" (#79158 ) Revert "Alerting: Attempt to retry retryable errors (#79037)" This reverts commit `3e51cf0949`.	2023-12-06 19:12:01 +00:00
gotjosh	3e51cf0949	Alerting: Attempt to retry retryable errors (#79037 ) * Alerting: Attempt to retry retryable errors Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this. Signed-off-by: gotjosh <josue.abreu@gmail.com> --------- Signed-off-by: gotjosh <josue.abreu@gmail.com>	2023-12-06 16:35:22 +00:00
Santiago	61cb26711e	Alerting: Fetch alerts from a remote Alertmanager (#75844 ) * Alerting: post alerts to the remote Alertmanager and fetch them * fix broken tests * Alerting: Add Mimir Backend image to devenv (blocks) * add alerting as code owner for mimir_backend block * Alerting: Use Mimir image to run integration tests for the remote Alertmanager * skip integration test when running all tests * skipping integration test when no Alertmanager URL is provided * fix bad host for mimir_backend * remove basic auth testing until we have an nginx image in our CI * add integration tests for alerts * fix tests * change SendCtx -> Send, add context.Context to Send, fix CI * add reover() for functions from the Prometheus Alertmanager HTTP client that could panic * add TODO to implement PutAlerts in a way that mimicks what Prometheus does * fix log format	2023-10-19 11:27:37 +02:00
Marcus Efraimsson	e4c1a7a141	Tracing: Standardize on otel tracing (#75528 )	2023-10-03 14:54:20 +02:00
Steve Simpson	894f420014	Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158 )	2023-09-20 15:07:02 +02:00
Will Browne	e855efb13d	Plugins: Move store and plugin dto to pluginsintegration (#74655 ) move store and plugin dto	2023-09-11 13:59:24 +02:00
Ryan McKinley	025b2f3011	Chore: use any rather than interface{} (#74066 )	2023-08-30 18:46:47 +03:00
Yuri Tseretyan	938e26b59f	Alerting: Add new metrics and tracings to state manager and scheduler (#71398 ) * add metrics and tracing to state manager * propagate tracer to state manager * add scheduler metrics * fix backtesting * add test for state metrics * remove StateUpdateCount * update docs * metrics can be null * add tracer to new tests	2023-08-16 09:04:18 +02:00
Yuri Tseretyan	c7598cc6fb	Alerting: Add ability to control scheduler tick interval via config (#71980 ) * add ability to control scheduler interval via config * add feature flag `configurableSchedulerTick`	2023-07-26 12:44:12 -04:00
Will Browne	a8577c21ba	Plugins: Migrate PluginStore mock to pre-existing fakes package (#71664 ) * migrate to existing fakes package * fix imports	2023-07-17 10:21:44 +00:00
Kyle Brandt	f6a28cadbc	Alerting: (Chore/Instrumentation) Add traceID to logs with contextual logger (#71289 ) Alerting: (Chore) Add traceID to logs with contextual logger	2023-07-11 10:59:52 +02:00
Yuri Tseretyan	ada325de2a	Alerting: Use unsafe.Slice for hashing a string during rule fingerprint calculation (#71000 )	2023-06-30 14:58:23 -04:00
George Robinson	7edbe72483	Alerting: Support concurrent queries for saving alert instances (#70525 ) This commit adds support for concurrent queries when saving alert instances to the database. This is an experimental feature in response to some customers experiencing delays between rule evaluation and sending alerts to Alertmanager, resulting in flapping. It is disabled by default.	2023-06-23 11:36:07 +01:00
SatVeer Singh	1bfa3a0f1e	Chore: Replace go-multierror with errors package (#66432 ) * code refactor and type assertions added to tests * no-lint rule added for specific line	2023-06-19 12:29:45 +03:00
Matthew Jacobson	ba3994d338	Alerting: Repurpose rule testing endpoint to return potential alerts (#69755 ) * Alerting: Repurpose rule testing endpoint to return potential alerts This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations. The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent. This means that the endpoint will, among other things: - Attach static annotations and labels from the rule configuration to the alert instances. - Attach dynamic annotations from the datasource to the alert instances. - Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances. - Interpolate templated annotations / labels and accept allowed template functions.	2023-06-08 18:59:54 -04:00
Yuri Tseretyan	9eb10bee1f	Alerting: Scheduler use rule fingerprint instead of version (#66531 ) * implement calculation of fingerprint for ruleWithFolder * update scheduler to use fingerprint instead of rule's version	2023-04-28 10:42:16 -04:00
Santiago	b0881daf23	Alerting: Use URLs in image annotations (#66804 ) * use tokens or urls in image annotations * improve tests, fix some comments * fix empty tokens * code review changes, check for url before checking for token (support old token formats)	2023-04-26 13:06:18 -03:00
Kyle Brandt	840fb32ad8	SSE: (Instrumentation) Add Tracing (#66700 ) spans are prefixed `SSE.`	2023-04-18 08:04:51 -04:00
Kyle Brandt	2f13c851e4	SSE: (Chore/Instrumentation) Add ds_queries_total metric and move met… (#66695 ) * SSE: (Chore/Instrumentation) Add ds_queries_total metric and move metrics to service	2023-04-17 16:12:44 -07:00
Kyle Brandt	e78be44e1a	SSE: Dataplane Compliance (#65927 ) Takes a specific code path for data that identifies itself as dataplane instead of "guessing" what the data is. The data must identify itself by being in the dataplane by having both the following frame metadata properties: - TypeVersion property that is greater than 0.0 - 'Type' property The flag is disableSSEDataplane and disables this functionality and uses the old code for all queries regardless. See https://github.com/grafana/grafana-plugin-sdk-go/blob/main/data/contract_docs/contract.md for dataplane details.	2023-04-12 12:24:34 -04:00
gotjosh	1c3ce0735f	Alerting: Tiny refactor on the eval and schedule packages (#66130 ) * Alerting: Tiny refactor on the eval and schedule packages two very small things: - We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext` - The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary. * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com> * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com> --------- Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>	2023-04-06 16:02:28 +01:00
Alexander Weaver	9bcf8819d3	Alerting: Handful of small adjustments to log levels and parameters (#64572 ) Calculate duration earlier in scheduler	2023-03-17 12:15:49 +00:00
Yuri Tseretyan	85a954cd81	Alerting: Update scheduler to get updates only from database (#64635 ) * stop using the scheduler's Update and Delete methods all communication must be via the database * update scheduler's registry to calculate diff before re-setting the cache * update fetcher to return the diff generated by registry * update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick. * remove references to the scheduler from api package * remove unused methods in the scheduler	2023-03-14 18:02:51 -04:00
Alex Moreno	f60dc4441f	Alerting: Add status label to GroupRules metric (#63454 ) * Add status label to GroupRules metric * Add state (active and paused) label to GrouRules * Add active/paused metrics tests	2023-02-23 12:38:27 +01:00
Steve Simpson	4d1a2c3370	Alerting: Move `rule_groups_rules` metric from State to Scheduler. (#63144 ) The `rule_groups_rules` metric is currently defined and computed by `State`. It makes more sense for this metric to be computed off of the configured rule set, not based on the rule evaluation state. There could be an edge condition where a rule does not have a state yet, and so is uncounted. Additionally, we would like this metric (and others), to have a `rule_group` label, and this is much easier to achieve if the metric is produced from the `Scheduler` package.	2023-02-09 17:05:19 +01:00
Yuri Tseretyan	f066e8cdcd	Alerting: Update to alerting 20230203015918-0e4e2675d7aa (after refactoring) (#62823 ) * add alerting prefix to some packages from alerting that have similar names in prometheus alertmanager	2023-02-03 11:36:49 -05:00
ismail simsek	91221bc436	Expressions: Fixes the issue showing expressions editor (#62510 ) * Use suggested value for uid * update the snapshot * use __expr__ * replace all -100 with __expr__ * update snapshot * more changes * revert redundant change * Use expr.DatasourceUID where it's possible * generate files	2023-01-31 18:50:10 +01:00
Alex Moreno	7a465f42a6	Alerting: Allow pausing alerts from provisioning (#62263 ) * Allow pausing alerts from provisioning * Update swagger * Add IsPaused to provision export endpoints * Add pause field in sample.yml * Add exception for reset state in first loop iteration of scheduler if rule is paused * Update provision definition and swagger docs * Fix provisioning export tests * Suggestion: Simplify if condition * Add more context to a comment	2023-01-30 16:29:05 +01:00
Serge Zaitsev	d6d4097567	Chore: Fix goimports grouping in alerting (#62424 ) * fix goimports * fix goimports order	2023-01-30 09:55:35 +01:00
Yuri Tseretyan	05bf241952	Alerting: Update state manager to return StateTransitions when Delete or Reset (#62264 ) * update Delete and Reset methods to return state transitions this will be used by notifier code to decide whether alert needs to be sent or not. * update scheduler to provide reason to delete states and use transitions * update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state * fixup * fix tests	2023-01-27 09:46:21 +01:00
Alex Moreno	531b439cf1	Alerting: Add alert pausing feature (#60734 ) * Add field in alert_rule model, add state to alert_instance model, and state to eval * Remove paused state from eval package * Skip paused alert rules in scheduler * Add migration to add is_paused field to alert_rule table * Convert to postable alerts only if not normal, pernding, or paused * Handle paused eval results in state manager * Add Paused state to eval package * Add paused alerts logic in scheduler * Skip alert on scheduler * Remove paused status from eval package * Apply suggestions from code review Co-authored-by: George Robinson <george.robinson@grafana.com> * Remove state * Rethink schedule and manager for paused alerts * Change return to continue * Remove unused var * Rethink alert pausing * Paused alerts storing annotations * Only add one state transition * Revert boolean method renaming refactor * Revert take image refactor * Make registry errors public * Revert method extraction for getting a folder title * Revert variable renaming refactor * Undo unnecessary changes * Revert changes in test * Remove IsPause check in PatchPartiLAlertRule function * Use SetNormal to set state * Fix text by returning to old behaviour on alert rule deletion * Add test in schedule_unit_test.go to test ticks with paused alerts * Add coment to clarify usage of context.Background() * Add comment to clarify resetStateByRuleUID method usage * Move rule get to a more limited scope * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: George Robinson <george.robinson@grafana.com> * rum gofmt on pkg/services/ngalert/schedule/schedule.go * Remove defer cancel for context * Update pkg/services/ngalert/models/instance_test.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * Update pkg/services/ngalert/models/testing.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * Update pkg/services/ngalert/schedule/schedule_unit_test.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * Update pkg/services/ngalert/schedule/schedule_unit_test.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * Update pkg/services/ngalert/models/instance_test.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * skip scheduler rule state clean up on paused alert rule * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> * Fix mock in test * Add (hopefully) final suggestions * Use error channel from recordAnnotationsSync to cancel context * Run make gen-cue * Place pause alert check in channel update after version check * Reduce branching un update channel select * Add if for error and move code inside if in state manager ResetStateByRuleUID * Add reason to logs * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: George Robinson <george.robinson@grafana.com> * Do not delete alert rule routine, just exit on eval if is paused * Reduce branching and create-close a channel to avoid deadlocks * Separate state deletion and state reset (includes history saving) * Add current pause state in rule route in scheduler * Split clearState and bring errCh closer to RecordStatesAsync call * Change rule to ruleMeta in RecordStatesAsync * copy state to be able to modify it * Add timeout to context creation * Shorten the timeout * Use resetState is rule is paused and deleteState if rule is not paused * Remove Empty state reason * Save every rule change in historian * Add tests for DeleteStateByRuleUID and ResetStateByRuleUID * Remove useless line * Remove outdated comment Co-authored-by: George Robinson <george.robinson@grafana.com> Co-authored-by: Santiago <santiagohernandez.1997@gmail.com> Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>	2023-01-26 18:29:10 +01:00
Santiago	b5fa9e3501	Chore: Fix "manger" typo (#61649 ) fix mangers -> managers	2023-01-17 23:13:27 +00:00
Yuri Tseretyan	86b5fbbf60	Alerting: Introduce state manager config structure (#61249 )	2023-01-10 16:26:15 -05:00
George Robinson	2a291afbae	Alerting: Use consts from alerting package (#61241 )	2023-01-10 19:59:13 +00:00
Yuri Tseretyan	da18c89e91	Alerting: Scheduler to call DeleteAlertRule once when it stops deleted rules (#61189 ) scheduler to call DeleteAlertRule once when it stops deleted rules	2023-01-09 14:39:32 -05:00
Yuri Tseretyan	48f1db63ff	Alerting: Add support for tracing to alerting scheduler (#61057 )	2023-01-06 21:21:43 -05:00
Yuri Tseretyan	c5ee4e4ae1	Alerting: Improve rule validation to check if rule uses backend datasources (#58986 ) * validate if rule uses backend datasources * add backend datasource to test * fix tests * another forgotten import * remove unused var	2022-12-08 10:44:02 +01:00
Yuri Tseretyan	abb49d96b5	Alerting: update state manager to return StateTransition instead of State (#58867 ) * improve test for stale states * update state manager return StateTransition * update scheduler to accept state transitions	2022-12-06 13:07:39 -05:00
Alexander Weaver	9977c7ea43	Alerting: Simplify scheduler configuration and remove dependency on Grafana-wide settings (#59735 ) * Make scheduler not depend directly on grafana-wide settings * Re-add missing interval	2022-12-02 16:02:07 -06:00
Alexander Weaver	2bfdda5b68	Alerting: Break dependency between state and image packages (#58381 ) * Refactor state and manager to not depend directly on image interface * Move generic errors to models package * Move NotAvailableImageService to state as its only references are in state tests * Move NoopImageService to state package * Move mock to state package * Fix linter error * Fix comment styling * Fix a couple added references introduced by rebase * Empty commit to kick build	2022-11-09 15:06:49 -06:00
Yuri Tseretyan	bad4f28d0d	Alerting: update test TestAlertingTicker to not rely on clock (#58544 ) * extract method processTick * make processTick return scheduled rules * move state manager tests to state manager * update test * move all tests into one file * remove unused fields	2022-11-09 15:08:57 -05:00
Yuri Tseretyan	3621cf5a12	Alerting: Update handling of stale state (#58276 ) * delete all stale states in one lock * do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly * fix tests to use correct eval time * delete unused method	2022-11-07 11:03:53 -05:00

197 Commits