* Add notification settings to storage/domain and API models. Settings are a slice to work around XORM mapping
* Support validation of notification settings when rules are updated
* Implement a route generator for the Alertmanager configuration that fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.
* Add notification settings labels to state calculation
* Update the multi-tenant Alertmanager to provide validation for notification settings
* Update the GET API so only admins can see the auto-generated configuration
* Add config for limit of rules per rule group
* Warn when editing big groups through normal API
* Warn on prov api writes for groups
* Wire up comp root, tests
* Also add warning to state manager warm
* Drop unnecessary conversion
This pull request updates our fork of Alertmanager to commit 65bdab0, which is based on commit 5658f8c in Prometheus Alertmanager.
It applies the changes from grafana/alerting#155, which removes the overrides for validation of alerts, labels, and silences that we had put in place to allow alerts and silences to work for non-Prometheus datasources. However, as this is now supported in Alertmanager with the UTF-8 work, we can use the new upstream functions and remove these overrides.
The compat package is a package in Alertmanager that takes care of backwards compatibility when parsing matchers and validating alerts, labels, and silences. It has three modes: classic mode, UTF-8 strict mode, and fallback mode. These modes are controlled via compat.InitFromFlags. Grafana initializes the compat package without any feature flags, which is the equivalent of fallback mode. Classic and UTF-8 strict mode are used in Mimir.
While Grafana Managed Alerts have no need for fallback mode, Grafana can still be used as an interface to manage the configurations of Mimir Alertmanagers and to view configurations of the Prometheus Alertmanager, and those installations might not have migrated yet or might be running older versions. Such installations behave as if in classic mode, and Grafana must be able to parse their configurations to interact with them for some period of time. As such, Grafana uses fallback mode until we are ready to drop support for outdated installations of Mimir and the Prometheus Alertmanager.
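For reference, a minimal sketch of how the compat modes are selected via featurecontrol flags; the wiring below is an assumption based on the upstream packages, not the actual Grafana or Mimir initialization code:

```go
package main

import (
	"os"

	"github.com/go-kit/log"
	"github.com/prometheus/alertmanager/featurecontrol"
	"github.com/prometheus/alertmanager/matchers/compat"
)

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)

	// Passing no feature names yields the default flags; initializing the
	// compat package with them is the equivalent of fallback mode, which is
	// what Grafana does today.
	flags, err := featurecontrol.NewFlags(logger, "")
	if err != nil {
		panic(err)
	}
	compat.InitFromFlags(logger, flags)

	// Passing the classic-mode or UTF-8-strict-mode feature names instead
	// would select the stricter modes used by Mimir.
}
```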
* Create locking config store that mimics existing provisioning store
* Rename existing receivers(_test).go
* Introduce shared receiver group service
* Fix test
* Move query model to models package
* ReceiverGroup -> Receiver
* Remove locking config store
* Move convert methods to compat.go
* Cleanup
* Simple, per-base-interval jitter
* Add log just for test purposes
* Add strategy approach, allow choosing between group or rule
* Add flag to jitter rules
* Add second toggle for jittering within a group
* Wire up toggles to strategy
* Slightly improve comment ordering
* Add tests for offset generation
* Rename JitterStrategyFrom
* Improve debug log message
* Use grafana SDK labels rather than prometheus labels
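A rough sketch of the per-base-interval offset idea from the bullets above; the hashing scheme and names are illustrative assumptions rather than the actual implementation:

```go
package jitter

import "hash/fnv"

// evalOffset returns a deterministic offset, measured in base intervals, for a
// rule (or, with the group strategy, for its rule group). The same key always
// maps to the same tick, and rules are spread across the ticks that make up one
// evaluation interval instead of all firing on the first tick.
func evalOffset(key string, ruleInterval, baseInterval int64) int64 {
	if baseInterval <= 0 || ruleInterval < baseInterval {
		return 0
	}
	buckets := ruleInterval / baseInterval // ticks available within one interval
	h := fnv.New64a()
	h.Write([]byte(key))
	return int64(h.Sum64() % uint64(buckets))
}
```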
* Alerting: Add metrics to the remote Alertmanager struct
* rephrase http_requests_failed description
* make linter happy
* remove unnecessary metrics
* extract timed client to separate package
* use histogram collector from dskit
* remove weaveworks dependency
* capture metrics for all requests to the remote Alertmanager (both clients)
* use the timed client in the MimirAuthRoundTripper
* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function
* refactor
* less git diff
* gauge for last readiness check in seconds
* initialize LastReadinessCheck to 0, tweak metric names and descriptions
* add counters for sync attempts/errors
* last config sync and last state sync timestamps (gauges)
* change latency metric name
* metric for remote Alertmanager mode
* code review comments
* move label constants to metrics package
This PR has two steps that together create a functional dry-run capability for the migration.
Enabling the feature flag alertingPreviewUpgrade while on legacy alerting will:
a. Allow all Grafana Alerting background services except for the scheduler to start (multiorg alertmanager, state manager, routes, …).
b. Allow the UI to show Grafana Alerting pages alongside legacy ones (with appropriate in-app warnings that UA is not actually running).
c. Show a new “Alerting Upgrade” page and register associated /api/v1/upgrade endpoints that will allow the user to upgrade their organization live without restart and present a summary of the upgrade in a table.
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode
* fall back to using internal AM in case of error
* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct
* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only
* extract logic to decide remote Alertmanager mode to a separate function, switch on mode
* tests
* make linter happy
* remove func to decide remote Alertmanager mode
* refactor factory function and options
* add default case to switch statement
* remove ineffectual assignment
* Add Loki historian store stub
* Add composite store
* Use composite store if Loki historian enabled
* Split store interface into read/write
* Make composite + historian stores read only
* Use variadic constructor for composite
* Modify Loki store enable logic
* Use dskit.concurrency.ForEachJob for parallelism
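A minimal sketch of the read-only composite store described above, fanning a query out to each backend with dskit's concurrency helper; the interface and type names are assumptions for illustration:

```go
package historian

import (
	"context"

	"github.com/grafana/dskit/concurrency"
)

// reader is an assumed read-only slice of the state-history store interface.
type reader interface {
	Query(ctx context.Context, query string) ([]string, error)
}

// compositeStore queries every backend and merges the results.
type compositeStore struct {
	backends []reader
}

// newCompositeStore takes its backends variadically, as in the bullet above.
func newCompositeStore(backends ...reader) *compositeStore {
	return &compositeStore{backends: backends}
}

func (c *compositeStore) Query(ctx context.Context, query string) ([]string, error) {
	results := make([][]string, len(c.backends))
	// Run one job per backend; ForEachJob cancels the remaining jobs on error.
	err := concurrency.ForEachJob(ctx, len(c.backends), len(c.backends), func(ctx context.Context, i int) error {
		res, err := c.backends[i].Query(ctx, query)
		if err != nil {
			return err
		}
		results[i] = res
		return nil
	})
	if err != nil {
		return nil, err
	}
	var merged []string
	for _, r := range results {
		merged = append(merged, r...)
	}
	return merged, nil
}
```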
* Alerting: Only warm alert state cache if execute_alerts=true.
If the Grafana instance is not executing alerts, then Warm()-ing the state
manager is wasteful and could lead to misleading rule status queries, as the
status returned will always be based on the state loaded from the database at
startup, and not the most recent evaluation state.
* Move Warm() down to shared conditional.
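In rough terms the guard amounts to the following; the names are illustrative, not the exact Grafana wiring:

```go
package ngalert

import "context"

// warmer is the minimal slice of the state manager used here (assumed name).
type warmer interface {
	Warm(ctx context.Context)
}

// maybeWarm pre-populates the alert state cache only when this instance is
// configured to execute alerts. With execute_alerts=false the cache would
// never be refreshed by evaluations, so serving rule status from it would be
// misleading.
func maybeWarm(ctx context.Context, executeAlerts bool, sm warmer) {
	if !executeAlerts {
		return
	}
	sm.Warm(ctx)
}
```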
* Alerting: Add GetFullState method to FileStore
* make tests compile, create stateStore in NewAlertmanager
* return errors instead of logging, accept an arbitrary number of strings
* make NewAlertmanager() accept a stateStore
* Alerting: Move `ExternalAlertmanager` to its own package
We'll avoid import cycles when using components from other packages. In addition to that, I've created an `Options` approach for the multiorg alertmanager to allow us to override how per-tenant alertmanagers are created.
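A generic sketch of what such an `Options` approach looks like, with simplified names and signatures rather than the real constructor:

```go
package alertmanager

// Alertmanager stands in for the per-tenant Alertmanager interface.
type Alertmanager interface{ Run() error }

// orgAlertmanagerFactory creates the Alertmanager for a single organization
// (hypothetical signature for illustration).
type orgAlertmanagerFactory func(orgID int64) (Alertmanager, error)

// Option mutates the MultiOrgAlertmanager during construction.
type Option func(*MultiOrgAlertmanager)

// WithAlertmanagerFactory overrides how per-tenant Alertmanagers are created,
// e.g. to substitute a forked or remote implementation.
func WithAlertmanagerFactory(factory orgAlertmanagerFactory) Option {
	return func(m *MultiOrgAlertmanager) { m.factory = factory }
}

type MultiOrgAlertmanager struct {
	factory orgAlertmanagerFactory
}

func NewMultiOrgAlertmanager(defaultFactory orgAlertmanagerFactory, opts ...Option) *MultiOrgAlertmanager {
	m := &MultiOrgAlertmanager{factory: defaultFactory}
	for _, opt := range opts {
		opt(m)
	}
	return m
}
```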
* switch things around
* address review comments
* fix references and warnings
* Alerting: Move migration from background service run to ngalert init
SQLite database write contention between the migration's single transaction and
dashboard provisioning's frequent commits was causing the migration to
fail with SQLITE_BUSY/SQLITE_BUSY_SNAPSHOT on all retries.
This is not a new issue for SQLite + Grafana, but the discrepancy between the
length of the transactions was making the failure very consistent. In addition,
since a failed migration has implications on the assumed correctness of the
alertmanager and alert rule definition state, we cause a server shutdown on
error. This can make e2e tests as well as some high-load provisioned
sqlite installations flaky on startup.
The correct fix for this is better transaction management across various
services and is out of scope for this change as we're primarily interested in
mitigating the current bout of server failures in e2e tests when using sqlite.
This PR replaces the vendored models in the migration with their equivalent ngalert models. It also replaces the raw SQL selects and inserts with service calls.
It also fills in some gaps in the testing suite around:
- Migration of alert rules: verifying that the actual data model (queries, conditions) is correct 9a7cfa9
- Secure settings migration: verifying that secure fields remain encrypted for all available notifiers and that certain fields migrate from plain text to encrypted secure settings correctly e7d3993
The checks for custom dashboard ACLs will be replaced in a separate targeted PR, as that change is complex enough on its own.
* Alerting: Don't use a separate collection system for metrics
The state package had a metric collection system that ran every 15s to update the values of the metrics. There is a common pattern for this in the Prometheus ecosystem called "collectors".
I have removed the behaviour of using a time-based interval to "set" the metrics in favour of a set of functions that provide the values and are called at scrape time.
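The collector pattern referred to above, in its simplest Prometheus client form: register a function that is evaluated at scrape time instead of periodically setting a gauge. A minimal sketch, not the Grafana metrics code:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Stand-in for reading the state manager's in-memory cache.
	countFiringAlerts := func() float64 { return 0 }

	reg := prometheus.NewRegistry()
	// NewGaugeFunc calls the supplied function every time /metrics is scraped,
	// so there is no background loop "setting" the value every 15s.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Namespace: "grafana",
		Subsystem: "alerting",
		Name:      "alerts_firing", // illustrative name
		Help:      "Number of alerts currently in the firing state.",
	}, countFiringAlerts))

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":8080", nil)
}
```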
* add metrics and tracing to state manager
* propagate tracer to state manager
* add scheduler metrics
* fix backtesting
* add test for state metrics
* remove StateUpdateCount
* update docs
* metrics can be null
* add tracer to new tests
* introduce a new action "alert.provisioning.secrets:read" and role "fixed:alerting.provisioning.secrets:reader"
* update alerting API authorization layer to let the user read provisioning with the new action
* let new action use decrypt flag
* add action and role to docs
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
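A rough sketch of what concurrent instance writes can look like, assuming a hypothetical batch-writing store interface and using errgroup for bounded concurrency; this is not the actual implementation:

```go
package store

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// AlertInstance and instanceWriter are simplified stand-ins for illustration.
type AlertInstance struct{ RuleUID, LabelsHash string }

type instanceWriter interface {
	SaveBatch(ctx context.Context, batch []AlertInstance) error
}

// saveInstancesConcurrently splits the instances into batches and writes them
// with a bounded number of concurrent queries instead of one large sequential
// write, trading extra database connections for lower end-to-end latency.
func saveInstancesConcurrently(ctx context.Context, w instanceWriter, instances []AlertInstance, batchSize, maxConcurrency int) error {
	if batchSize <= 0 {
		batchSize = len(instances)
	}
	g, ctx := errgroup.WithContext(ctx)
	if maxConcurrency > 0 {
		g.SetLimit(maxConcurrency)
	}
	for start := 0; start < len(instances); start += batchSize {
		end := start + batchSize
		if end > len(instances) {
			end = len(instances)
		}
		batch := instances[start:end]
		g.Go(func() error { return w.SaveBatch(ctx, batch) })
	}
	return g.Wait()
}
```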
* Let alert rule service implement registry service
* Add count method to RuleStore interface
* Add implementation for deletion of alert rules
* Rename uid to folderUID in registry methods
* Check forceDeleteRule value for registry deletion
* Register alerting store with folder service
* Move folder test functions to separate package
* Add testing for alert rule counting, deletion
* Remove redundant count method
* Fix deleteChildrenInFolder signature
* Update pkg/services/ngalert/store/alert_rule.go
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
* Add tests for nested folder deletion
* Refactor TestIntegrationNestedFolderService
* Add rules store as parameter for alertng provider
---------
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
* Alerting: Allow hooking into request handler functions.
Adds a facility to AlertNG for hooking into API handlers, allowing the
replacement of request handlers for specific paths. One of the goals of this
approach was to allow hooking as late as possible in the request, e.g.
after all middleware has been applied, to simplify usage.
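Conceptually the facility is a small registry keyed by request path that is consulted when handlers are resolved; the sketch below uses hypothetical names and the standard library handler shape to show the idea, not the actual hooks API:

```go
package api

import "net/http"

// requestHandler is the shape of the handlers being replaced (simplified to
// the standard library form for illustration).
type requestHandler func(w http.ResponseWriter, r *http.Request)

// Hooks lets callers swap out the handler for specific API paths after all
// middleware has already been applied.
type Hooks struct {
	overrides map[string]requestHandler
}

func NewHooks() *Hooks {
	return &Hooks{overrides: map[string]requestHandler{}}
}

// Set registers a replacement handler for the given path.
func (h *Hooks) Set(path string, handler requestHandler) {
	h.overrides[path] = handler
}

// Wrap returns a handler that serves the override for the path if one was
// registered and falls back to the original handler otherwise.
func (h *Hooks) Wrap(path string, original requestHandler) requestHandler {
	return func(w http.ResponseWriter, r *http.Request) {
		if override, ok := h.overrides[path]; ok {
			override(w, r)
			return
		}
		original(w, r)
	}
}
```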
* Update pkg/services/ngalert/api/hooks.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Update pkg/services/ngalert/api/hooks.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Update pkg/services/ngalert/ngalert.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Fixes to review comments
* Fix passing logger in
---------
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Define 3 feature toggles for rollout phases
* Pass feature toggles along
* Implement first feature toggle
* Try a different strategy with fall-throughs to specific configurations
* Apply toggle overrides once outside of backend composition
* Emit log messages when we coerce backends
* Run code generator for feature toggle files
* Improve wording in flag descs
* Re-run generator
* Use code-generated constants instead of plain strings
* Use converted enum values rather than strings for pre-parsing
* Rename RecordStatesAsync to Record
* Rename QueryStates to Query
* Implement fanout writes
* Implement primary queries
* Simplify error joining
* Add test for query path
* Add tests for writes and error propagation
* Allow fanout backend to be configured
* Touch up log messages and config validation
* Consistent documentation for all backend structs
* Parse and normalize backend names more consistently against an enum
* Touch-ups to documentation
* Improve clarity around multi-record blocking
* Keep primary and secondaries more distinct
* Rename fanout backend to multiple backend
* Simplify config keys for multi backend mode
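The bullets above describe a backend that fans writes out to every configured backend while serving queries from the primary only; a minimal sketch with assumed interface names:

```go
package historian

import (
	"context"
	"errors"
)

// Backend is an assumed interface combining the renamed Record and Query methods.
type Backend interface {
	Record(ctx context.Context, states []string) error
	Query(ctx context.Context, query string) ([]string, error)
}

// multipleBackend writes to the primary and all secondaries but reads only
// from the primary.
type multipleBackend struct {
	primary     Backend
	secondaries []Backend
}

func newMultipleBackend(primary Backend, secondaries ...Backend) *multipleBackend {
	return &multipleBackend{primary: primary, secondaries: secondaries}
}

// Record fans the write out to every backend and joins the errors, so a
// failing secondary cannot mask the result of the primary write.
func (m *multipleBackend) Record(ctx context.Context, states []string) error {
	errs := []error{m.primary.Record(ctx, states)}
	for _, b := range m.secondaries {
		errs = append(errs, b.Record(ctx, states))
	}
	return errors.Join(errs...)
}

// Query is served by the primary backend only.
func (m *multipleBackend) Query(ctx context.Context, query string) ([]string, error) {
	return m.primary.Query(ctx, query)
}
```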
* Stop using the scheduler's Update and Delete methods; all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
* Create historian metrics and dependency inject
* Record counter for total number of state transitions logged
* Track write failures
* Track current number of active write goroutines
* Record histogram of how long it takes to write history data
* Don't copy the registerer
* Adjust naming of write failures metric
* Introduce WritesTotal to complement WritesFailedTotal
* Measure TransitionsFailedTotal to complement TransitionsTotal
* Rename all to state_history
* Remove redundant Total suffix
* Increment totals all the time, not just on success
* Drop ActiveWriteGoroutines
* Drop PersistDuration in favor of WriteDuration
* Drop unused gauge
* Make writes and writesFailed per org
* Add metric indicating backend and a spot for future metadata
* Drop _batch_ from names and update help
* Add metric for bytes written
* Better pairing of total + failure metric updates
* Few tweaks to wording and naming
* Record info metric during composition
* Create fakeRequester and simple happy path test using it
* Blocking test for the full historian and test for happy path metrics
* Add tests for failure case metrics
* Smoke test for full annotation persistence
* Create test for metrics on annotation persistence, both happy and failing paths
* Address linter complaints
* More linter complaints
* Remove unnecessary whitespace
* Consistency improvements to help texts
* Update tests to match new descs
* Loki backend and client depend on a requester
* Instrument all requests to loki using weaveworks TimedClient
* Construct collector in metrics package
* Define endpoint and generate
* Wire up and register endpoint
* Cleanup, define authorization
* Forgot the leading slash
* Wire up query and SignedInUser
* Wire up timerange query params
* Add todo for label queries
* Drop comment
* Update path to rules subtree
This adds provisioning endpoints for downloading alert rules and alert rule groups in a
format that is compatible with file provisioning. Each endpoint supports both JSON and
YAML response types via the Accept header, as well as a query parameter
download=true/false that sets Content-Disposition to recommend either initiating a
download or displaying the content inline.
This also makes some package changes to keep structs that have the potential to drift
out of sync closer together. Eventually, other alerting file structs should also move
into this new file package, but the rest require some refactoring that is out of scope for this PR.
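A sketch of the response shaping described above, choosing JSON or YAML from the Accept header and setting Content-Disposition when download=true; the handler name and types are hypothetical:

```go
package provisioning

import (
	"encoding/json"
	"net/http"
	"strings"

	"gopkg.in/yaml.v3"
)

// writeExport renders v as YAML or JSON depending on the Accept header and,
// when download=true is passed, asks the client to save it as a file.
func writeExport(w http.ResponseWriter, r *http.Request, v any, filename string) {
	if r.URL.Query().Get("download") == "true" {
		w.Header().Set("Content-Disposition", `attachment; filename="`+filename+`"`)
	} else {
		w.Header().Set("Content-Disposition", "inline")
	}

	if strings.Contains(r.Header.Get("Accept"), "yaml") {
		w.Header().Set("Content-Type", "application/yaml")
		_ = yaml.NewEncoder(w).Encode(v)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(v)
}
```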
* Add field in alert_rule model, add state to alert_instance model, and state to eval
* Remove paused state from eval package
* Skip paused alert rules in scheduler
* Add migration to add is_paused field to alert_rule table
* Convert to postable alerts only if not normal, pending, or paused
* Handle paused eval results in state manager
* Add Paused state to eval package
* Add paused alerts logic in scheduler
* Skip alert on scheduler
* Remove paused status from eval package
* Apply suggestions from code review
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Remove state
* Rethink schedule and manager for paused alerts
* Change return to continue
* Remove unused var
* Rethink alert pausing
* Paused alerts storing annotations
* Only add one state transition
* Revert boolean method renaming refactor
* Revert take image refactor
* Make registry errors public
* Revert method extraction for getting a folder title
* Revert variable renaming refactor
* Undo unnecessary changes
* Revert changes in test
* Remove IsPaused check in PatchPartialAlertRule function
* Use SetNormal to set state
* Fix test by returning to old behaviour on alert rule deletion
* Add test in schedule_unit_test.go to test ticks with paused alerts
* Add comment to clarify usage of context.Background()
* Add comment to clarify resetStateByRuleUID method usage
* Move rule get to a more limited scope
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Run gofmt on pkg/services/ngalert/schedule/schedule.go
* Remove defer cancel for context
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/testing.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* skip scheduler rule state clean up on paused alert rule
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Fix mock in test
* Add (hopefully) final suggestions
* Use error channel from recordAnnotationsSync to cancel context
* Run make gen-cue
* Place paused alert check in channel update after version check
* Reduce branching in update channel select
* Add if for error and move code inside if in state manager ResetStateByRuleUID
* Add reason to logs
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Do not delete alert rule routine, just exit on eval if it is paused
* Reduce branching and create-close a channel to avoid deadlocks
* Separate state deletion and state reset (includes history saving)
* Add current pause state in rule route in scheduler
* Split clearState and bring errCh closer to RecordStatesAsync call
* Change rule to ruleMeta in RecordStatesAsync
* copy state to be able to modify it
* Add timeout to context creation
* Shorten the timeout
* Use resetState if rule is paused and deleteState if rule is not paused
* Remove Empty state reason
* Save every rule change in historian
* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID
* Remove useless line
* Remove outdated comment
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>