Commit Graph

188 Commits

Author SHA1 Message Date
Santiago
5f4d07bb75
Alerting: Enable remote primary mode using feature toggles (#88976) 2024-06-10 17:07:13 +02:00
Santiago
e15e40fbd3
Alerting: Skip setting up clustering in remote primary/only modes (#88968)
* Alerting: Skip setting up clustering in remote primary mode

* Update pkg/services/ngalert/notifier/multiorg_alertmanager.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-06-10 13:51:11 +02:00
William Wernert
63e9969c1b
Alerting: Recording rule mapping logic for data frames to Prometheus metrics (#88550)
* Add stub Prometheus writer with mapping logic

* Add tests
2024-06-07 20:00:22 +03:00
Sofia Papagiannaki
17ca61d7f8
Alerting: Export and provisioning rules into subfolders (#77450)
* Folders: Optionally include fullpath in service responses
* Alerting: Export folder fullpath instead of title
* Escape separator in folder title
* Add support for provisiong alret rules into subfolders
* Use FolderService for creating folders during provisioning
* Export WithFullpath() folder service function

---------

Co-authored-by: Tania B <yalyna.ts@gmail.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-05-31 11:09:20 +03:00
William Wernert
5de7d4d06d
Alerting: Create writer interface for recording rules (#88459)
* Create writer interface for recording rules

Also create fake impl + use it for stub in scheduler
2024-05-29 22:38:33 +03:00
Alexander Weaver
b926b6336d
Alerting: Scheduled recording rules execute their queries (#88309)
* Basic eval flow

* Wiring-up

* fix

* Extend todo

* Start with tests

* Include some relevant tests, skip ones that seem to have timing-based race conditions

* Some tests, touch up linter and todo

* Solve TODO

* Add tracing

* Tests to make sure an eval went through

* Wire up feature toggles

* Update pkg/services/ngalert/schedule/recording_rule.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-05-28 10:59:21 -05:00
Steve Simpson
08b18113d2
Alerting: Wire up alertmanagerRemoteOnly feature toggle. (#88329)
* Alerting: Wire up alertmanagerRemoteOnly feature toggle.

Though the mode isn't feature complete yet, it will be useful to have the
feature toggle wired up in order to start testing.

* Apply suggestions from code review

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Formatting

---------

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
2024-05-27 16:18:46 +02:00
Steve Simpson
8421919cb5
Alerting: Feature toggle to disallow sending alerts externally (#87982)
* Define feature toggle

* Implement feature toggle
2024-05-23 14:29:19 +02:00
Santiago
e41434c332
Alerting: Promote configuration in the remote Alertmanager (#87388) 2024-05-16 12:06:03 +02:00
Steve Simpson
67fa96f88d
Alerting: Pass logger into NewAnnotationBackend. (#87812)
* Alerting: Pass logger into NewAnnotationBackend.

Make it possible to pass loggers into more places for code reuse.

* Mistake in passing logger
2024-05-14 15:51:27 +02:00
Steve Simpson
fbaa847a3c
Alerting: Pass logger into NewRemoteLokiBackend. (#87029)
Tiny refactor to allow a logger to be passed into NewRemoteLokiBackend.
2024-04-29 12:10:23 +02:00
Santiago
309a7e7684
Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct (#85005)
* Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct

* send the hash of the encrypted configuration

* tests, default config hash in AM struct

* add missing default config to test

* restore build directory

* go work file...

* fix broken test

* remove unnecessary conversion to []byte

* go work again...

* make things work again with latest main branch changes

* update error messages in tests for decrypting config
2024-04-19 15:11:07 +02:00
Jean-Philippe Quéméner
7cfd470c91
fix(alerting): only expose metrics if executing alerts (#85512) 2024-04-03 17:18:02 +02:00
Matthew Jacobson
0c3c5c5607
Alerting: Stop persisting silences and nflog to disk (#84706)
With this change, we no longer need to persist silence/nflog states to disk in addition to the kvstore
2024-03-23 00:37:33 +02:00
Yuri Tseretyan
b9abb8cabb
Alerting: Update provisioning API to support regular permissions (#77007)
* allow users with regular actions access provisioning API paths
* update methods that read rules
skip new authorization logic if user CanReadAllRules to avoid performance impact on file-provisioning
update all methods to accept identity.Requester that contains all permissions and is required by access control.

* create deltas for single rul e 

* update modify methods
skip new authorization logic if user CanWriteAllRules to avoid performance impact on file-provisioning
update all methods to accept identity.Requester that contains all permissions and is required by access control.

* implement RuleAccessControlService in provisioning

* update file provisioning user to have all permissions to bypass authz

* update provisioning API to return errutil errors correctly

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2024-03-22 15:37:10 -04:00
Santiago
c9bb18101c
Alerting: Decrypt secrets before sending configuration to the remote Alertmanager (#83640)
* (WIP) Alerting: Decrypt secrets before sending configuration to the remote Alertmanager

* refactor, fix tests

* test decrypting secrets

* tidy up

* test SendConfiguration, quote keys, refactor tests

* make linter happy

* decrypt configuration before comparing

* copy configuration struct before decrypting

* reduce diff in TestCompareAndSendConfiguration

* clean up remote/alertmanager.go

* make linter happy

* avoid serializing into JSON to copy struct

* codeowners
2024-03-19 12:12:03 +01:00
Gilles De Mey
8765c48389
Alerting: Remove legacy alerting (#83671)
Removes legacy alerting, so long and thanks for all the fish! 🐟

---------

Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com>
Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com>
Co-authored-by: William Wernert <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-03-14 15:36:35 +01:00
William Wernert
8690a42e33
Alerting: Disallow invalid rule namespace UIDs in provisioning API (#83938)
* Disallow invalid rule namespace UIDs in provisioning

Reject requests with rules that reference a nonexistent folder or have an empty folder uid
2024-03-14 09:58:25 -04:00
Yuri Tseretyan
1eebd2a4de
Alerting: Support for simplified notification settings in rule API (#81011)
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated

* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.

* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings

* update GET API so only admins can see auto-gen
2024-02-15 09:45:10 -05:00
Alexander Weaver
99fa064576
Alerting: Emit warning when creating or updating unusually large groups (#82279)
* Add config for limit of rules per rule group

* Warn when editing big groups through normal API

* Warn on prov api writes for groups

* Wire up comp root, tests

* Also add warning to state manager warm

* Drop unnecessary conversion
2024-02-13 08:29:03 -06:00
Alexander Weaver
5bbe9c6e61
Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212)
* remove jitter feature flag

* Add an out so users can manually disable jitter

* Pass in cfg

* Add TODO to remove knob in future
2024-02-09 15:53:58 -06:00
George Robinson
90a26e18db
Alerting: Update Alertmanager to e82436c (#82145)
This commit updates Alertmanager to commit e82436c, which is based
on commit f69a508 from Prometheus Alertmanager.
2024-02-08 11:25:27 +00:00
William Wernert
2ea82af6e7
Alerting: Pass in receiver service to API struct (#81978) 2024-02-06 16:49:47 +02:00
George Robinson
c8ccc4649c
Alerting: Support UTF-8 (#81512)
This pull request updates our fork of Alertmanager to commit 65bdab0, which is based on commit 5658f8c in Prometheus Alertmanager.

It applies the changes from grafana/alerting#155 which removes the overrides for validation of alerts, labels and silences that we had put in place to allow alerts and silences to work for non-Prometheus datasources. However, as this is now supported in Alertmanager with the UTF-8 work, we can use the new upstream functions and remove these overrides.

The compat package is a package in Alertmanager that takes care of backwards compatibility when parsing matchers, validating alerts, labels and silences. It has three modes: classic mode, UTF-8 strict mode, fallback mode. These modes are controlled via compat.InitFromFlags. Grafana initializes the compat package without any feature flags, which is the equivalent of fallback mode. Classic and UTF-8 strict mode are used in Mimir.

While Grafana Managed Alerts have no need for fallback mode, Grafana can still be used as an interface to manage the configurations of Mimir Alertmanagers and view configurations of Prometheus Alertmanager, and those installations might not have migrated or being running on older versions. Such installations behave as if in classic mode, and Grafana must be able to parse their configurations to interact with them for some period of time. As such, Grafana uses fallback mode until we are ready to drop support for outdated installations of Mimir and the Prometheus Alertmanager.
2024-02-06 08:33:47 +00:00
William Wernert
7e939401dc
Alerting: Introduce initial common receiver service (#81211)
* Create locking config store that mimics existing provisioning store

* Rename existing receivers(_test).go

* Introduce shared receiver group service

* Fix test

* Move query model to models package

* ReceiverGroup -> Receiver

* Remove locking config store

* Move convert methods to compat.go

* Cleanup
2024-02-01 14:42:59 -05:00
Jean-Philippe Quéméner
aa25776f81
Alerting: Add a feature flag to periodically save states (#80987) 2024-01-23 17:03:30 +01:00
Alexander Weaver
00a260effa
Alerting: Add setting to distribute rule group evaluations over time (#80766)
* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
2024-01-18 12:48:11 -06:00
Jean-Philippe Quéméner
82638d059f
feat(alerting): add state persister interface (#80384) 2024-01-17 13:33:13 +01:00
Santiago
9e78faa7ba
Alerting: Add metrics to the remote Alertmanager struct (#79835)
* Alerting: Add metrics to the remote Alertmanager struct

* rephrase http_requests_failed description

* make linter happy

* remove unnecessary metrics

* extract timed client to separate package

* use histogram collector from dskit

* remove weaveworks dependency

* capture metrics for all requests to the remote Alertmanager (both clients)

* use the timed client in the MimirAuthRoundTripper

* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function

* refactor

* less git diff

* gauge for last readiness check in seconds

* initialize LastReadinesCheck to 0, tweak metric names and descriptions

* add counters for sync attempts/errors

* last config sync and last state sync timestamps (gauges)

* change latency metric name

* metric for remote Alertmanager mode

* code review comments

* move label constants to metrics package
2024-01-10 11:18:24 +01:00
Matthew Jacobson
aa03b8f8a7
Alerting: Guided legacy alerting upgrade dry-run (#80071)
This PR has two steps that together create a functional dry-run capability for the migration.

By enabling the feature flag alertingPreviewUpgrade when on legacy alerting it will:
    a. Allow all Grafana Alerting background services except for the scheduler to start (multiorg alertmanager, state manager, routes, …).
    b. Allow the UI to show Grafana Alerting pages alongside legacy ones (with appropriate in-app warnings that UA is not actually running).
    c. Show a new “Alerting Upgrade” page and register associated /api/v1/upgrade endpoints that will allow the user to upgrade their organization live without restart and present a summary of the upgrade in a table.
2024-01-05 18:19:12 -05:00
Santiago
a77ba40ed4
Alerting: Use the forked Alertmanager for remote secondary mode (#79646)
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode

* fall back to using internal AM in case of error

* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct

* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only

* extract logic to decide remote Alertmanager mode to a separate function, switch on mode

* tests

* make linter happy

* remove func to decide remote Alertmanager mode

* refactor factory function and options

* add default case to switch statement

* remove ineffectual assignment
2023-12-21 15:26:31 +01:00
Santiago
9945514baa
Alerting: Validate configuration for the remote Alertmanager struct (#79691)
* Alerting: Validate configuration for the remote Alertmanager struct

* add TenantID to test

* add OrgID to config struct in tests
2023-12-19 18:41:48 +01:00
William Wernert
62bdbe5b44
Annotations/Alerting: Add Loki historian store stub (#78363)
* Add Loki historian store stub

* Add composite store

* Use composite store if Loki historian enabled

* Split store interface into read/write

* Make composite + historian stores read only

* Use variadic constructor for composite

* Modify Loki store enable logic

* Use dskit.concurrency.ForEachJob for parallelism
2023-12-12 17:43:09 -05:00
Alexander Weaver
ab0ef5276f
Alerting: Decouple quota configuration logic from API interfaces and add tests (#78930)
* Separate usage reporter from API

* Extract quota registration

* Decouple from API store interface

* Move to ngalert package and add tests

* linter
2023-12-01 10:47:19 -06:00
Steve Simpson
520c927931
Alerting: Only warm alert state cache if execute_alerts=true. (#78895)
* Alerting: Only warm alert state cache if execute_alerts=true.

If the Grafana instance is not executing alerts, then Warm()-ing the state
manager is wasteful and could lead to misleading rule status queries, as the
status returned will be always based on the state loaded from the database at
startup, and not the most recent evaluation state.

* Move Warm() down to shared conditional.
2023-12-01 10:17:32 +01:00
Santiago
01d274852c
Alerting: Add GetFullState method to FileStore (#78701)
* Alerting: Add GetFullState method to FileStore

* make tests compile, create stateStore in NewAlertmanager

* return errors instead of logging, accept an arbitrary number of strings

* make NewAlertmanager() accept a stateStore
2023-11-28 15:34:45 +01:00
Tania
39754ba2d6
Nested Folders: Wrap create/update operations with transactions (#78000)
* Nested Folders: Add transaction to create and update methods

* Update tests

* Make IncreaseVersionForAllRulesInNamespace synchronous

* Resolve merge conflicts
2023-11-21 23:06:20 +02:00
Ryan McKinley
f69fd3726b
FeatureToggles: Add context and and an explicit global check (#78081) 2023-11-14 12:50:27 -08:00
Santiago
488a60aee6
Alerting: Rename remote.ExternalAlertmanager to remote.Alertmanager (#76956) 2023-10-23 15:37:14 +02:00
gotjosh
866acbd5ac
Alerting: Move ExternalAlertmanager to its own package (#76854)
* Alerting: Move `ExternalAlertmanager` to its own package

We'll avoid import cycles when using components from other packages. In addition to that, I've created an `Options` approach for the multiorg alertmanger to allow us to override how per tenant alertmanagers are created.

* switch things around

* address review comments

* fix references and warnings
2023-10-20 14:08:13 +02:00
Matthew Jacobson
c2efcdde09
Alerting: Fix flaky SQLITE_BUSY when migrating with provisioned dashboards (#76658)
* Alerting: Move migration from background service run to ngalert init

sqlite database write contention between the migration's single transaction and
dashboard provisioning's frequent commits was causing the migration to
 fail with SQLITE_BUSY/SQLITE_BUSY_SNAPSHOT on all retries.

 This is not a new issue for sqlite+grafana, but the discrepancy between the
 length of  the transactions was causing it to be very consistent. In addition,
 since a failed migration has implications on the assumed correctness of the
 alertmanager and alert rule definition state, we cause a server shutdown on
 error. This can make e2e tests as well as some high-load provisioned
 sqlite installations flaky on startup.

 The correct fix for this is better transaction management across various
 services and is out of scope for this change as we're primarily interested in
 mitigating the current bout of server failures in e2e tests when using sqlite.
2023-10-19 10:03:00 -04:00
Matthew Jacobson
82f3127e23
Alerting: Move legacy alert migration from sqlstore migration to service (#72702) 2023-10-12 13:43:10 +01:00
Alexander Weaver
f6649d7a97
Revert "Alerting: Remove vendored models in migration service" (#76387)
Revert "Alerting: Remove vendored models in migration service (#74503)"

This reverts commit 6a8649d544.
2023-10-11 14:21:21 -05:00
Matthew Jacobson
6a8649d544
Alerting: Remove vendored models in migration service (#74503)
This PR replaces the vendored models in the migration with their equivalent ngalert models. It also replaces the raw SQL selects and inserts with service calls.

It also fills in some gaps in the testing suite around:

    - Migration of alert rules: verifying that the actual data model (queries, conditions) are correct 9a7cfa9
    - Secure settings migration: verifying that secure fields remain encrypted for all available notifiers and certain fields migrate from plain text to encrypted secure settings correctly e7d3993

Replacing the checks for custom dashboard ACLs will be replaced in a separate targeted PR as it will be complex enough alone.
2023-10-11 17:22:09 +01:00
gotjosh
59694fb2be
Alerting: Don't use a separate collection system for metrics (#75296)
* Alerting: Don't use a separate collection system for metrics

The state package had a metric collection system that ran every 15s updating the values of the metrics - there is a common pattern for this in the Prometheus ecosystem called "collectors".

I have removed the behaviour of using a time-based interval to "set" the metrics in favour of a set of functions as the "value" that get called at scrape time.
2023-09-25 10:27:30 +01:00
Steve Simpson
894f420014
Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158) 2023-09-20 15:07:02 +02:00
Will Browne
e855efb13d
Plugins: Move store and plugin dto to pluginsintegration (#74655)
move store and plugin dto
2023-09-11 13:59:24 +02:00
Yuri Tseretyan
938e26b59f
Alerting: Add new metrics and tracings to state manager and scheduler (#71398)
* add metrics and tracing to state manager

* propagate tracer to state manager

* add scheduler metrics

* fix backtesting

* add test for state metrics

* remove StateUpdateCount

* update docs

* metrics can be null

* add tracer to new tests
2023-08-16 09:04:18 +02:00
Yuri Tseretyan
0717ec11d6
Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (#68142) 2023-08-15 10:27:15 -04:00
Yuri Tseretyan
6b4a9d73d7
Alerting: Export contact points to check access control action instead legacy role (#71990)
* introduce a new action "alert.provisioning.secrets:read" and role "fixed:alerting.provisioning.secrets:reader"
* update alerting API authorization layer to let the user read provisioning with the new action
* let new action use decrypt flag
* add action and role to docs
2023-08-08 19:29:34 +03:00