* (WIP) Alerting: Decrypt secrets before sending configuration to the remote Alertmanager
* refactor, fix tests
* test decrypting secrets
* tidy up
* test SendConfiguration, quote keys, refactor tests
* make linter happy
* decrypt configuration before comparing
* copy configuration struct before decrypting
* reduce diff in TestCompareAndSendConfiguration
* clean up remote/alertmanager.go
* make linter happy
* avoid serializing into JSON to copy struct
* codeowners
Removes legacy alerting, so long and thanks for all the fish! 🐟
---------
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Sonia Aguilar <soniaAguilarPeiron@users.noreply.github.com>
Co-authored-by: Armand Grillet <armandgrillet@users.noreply.github.com>
Co-authored-by: William Wernert <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* Implement keep last state for state transitions
* Respect For duration when keeping state
* Only keep transition from recording an annotation
* Add keep last state option for nodata/error in UI
* export Evaluation
* Export Evaluation
* Export RuleVersionAndPauseStatus
* export Eval, create interface
* Export update and add to interface
* Export Stop and Run and add to interface
* Registry and scheduler use rule by interface and not concrete type
* Update factory to use interface, update tests to work over public API rather than writing to channels directly
* Rename map in registry
* Rename getOrCreateInfo to not reference a specific implementation
* Genericize alertRuleInfoRegistry into ruleRegistry
* Rename alertRuleInfo to alertRule
* Comments on interface
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
---------
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
* Regenerate openapidocs at 1.21.8 to match ci
* Adjust trigger to work on the actual outputted files
* Also put go.mod and go.sum in the triggers
* manually fix
* Make an arbitrary change rather than touching the trigger to force a run
* Drop all triggers - run all the time
* Print diff - taken from @papagian's PR
* Manual fixes to swagger doc
---------
Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
* Alerting: Use Alertmanager types extracted into grafana/alerting
We're in the process of exporting all Alertmanager types into grafana/alerting so that they can be imported in the Mimir Alertmanager, without a neeed to import Grafana directly.
This change introduces type aliasing for all Alertmanager types based on their 1:1 copy that now live in grafana/alerting.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* create rule factory for more complicated dep injection into rules
* Rules get direct access to metrics, logs, traces utilities, use factory in tests
* Use clock internal to rule
* Use sender, statemanager, evalfactory directly
* evalApplied and stopApplied
* use schedulableAlertRules behind interface
* loaded metrics reader
* 3 relevant config options
* Drop unused scheduler parameter
* Rename ruleRoutine to run
* Update READMED
* Handle long parameter lists
* remove dead branch
Updates Grafana Alertmanager to work with new interface from grafana/alerting#161. This change stops passing user-defined templates to the Grafana Alertmanager by persisting them to disk and instead passes them by string.
* ValidateInterval doesn't need the entire config
* Validation no longer depends on entire folder now that we've dropped foldertitle from api
* Don't depend on entire config struct
* Export validate group
* Alerting: feat: support deleting rule groups in the provisioning API
Adds support for DELETE to the provisioning API's alert rule groups route, which allows deleting the rule group with a
single API call. Previously, groups were deleted by deleting rules one-by-one.
Fixes#81860
This change doesn't add any new paths to the API, only new methods.
---------
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* Chore: Replace response status with const var
* Apply suggestions from code review
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
* Add net/http import
---------
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
This commit adds basic support for time_intervals, as mute_time_intervals
is deprecated in Alertmanager and scheduled to be removed before 1.0.
It does not add support for time_intervals in API or file provisioning,
nor does it support exporting time intervals. This will be added in
later commits to keep the changes as simple as possible.
Previously receivers were only validated before saving the alertmanager
configuration. This is a suboptimal experience for those upgrading with preview
as the failed channel upgrade will return an API error instead of being
summarized in the table.
Adds a feature flag (alertingUpgradeDryrunOnStart) that will dry-run the legacy
alert upgrade on startup. It is enabled by default.
When on legacy alerting, this feature flag will log the results of the legacy
alerting upgrade on startup and draw attention to anything in the current legacy
alerting configuration that will cause issues when the upgrade is eventually
performed. It acts as a log warning for those where action is required before
upgrading to Grafana v11 where legacy alerting will be removed.
If the db already has an entry in the kvstore for the silences of an
alertmanager before the migration has taken place, then it's possible that the
active alertmanager will overwrite the silence file created by the migration
before it has a chance to load it into memory.
This should not happen normally but is possible in edge-cases.
This change opts to bypass the unnecessary step of writing the silences to disk
during the migration and instead write them directly to the kvstore. This avoids
the race condition entirely and is more correct as we treat the database as the
source of truth for AM state.
Alerting: Remove start page of upgrade preview
Alerting upgrade page will now always show the summary table even before
upgrading any alerts or notification channels. There a few reasons for this:
- The information on the start page is redundant as it's now contained in the
documentation.
- Previously, if some unexpected issue prevented performing a full upgrade, a
user would have limited to no means to using the preview tool to help fix the
problem. This is because you could not see the summary table until the full
upgrade was performed at least once. Now, you can upgrade individual alerts and
notification channels from the beginning.
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated
* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.
* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings
* update GET API so only admins can see auto-gen
* Alerting: fix race condition in (*ngalert/sender.ExternalAlertmanager).Run
* Chore: Fix data races when accessing members of *ngalert/state.FakeInstanceStore
* Chore: Fix data races in tests in ngalert/schedule and enable some parallel tests
* Chore: fix linters
* Chore: add TODO comment to remove loopvar once we move to Go 1.22
* Add config for limit of rules per rule group
* Warn when editing big groups through normal API
* Warn on prov api writes for groups
* Wire up comp root, tests
* Also add warning to state manager warm
* Drop unnecessary conversion
* Add user uid migration to run on every startup to protect against empty values in a upgrade downgrade scenario
* Add team uid migration to run on every startup to protect against empty values in a upgrade downgrade scenario
* Run team uid migration
* streamline initialization of test databases, support on-disk sqlite test db
* clean up test databases
* introduce testsuite helper
* use testsuite everywhere we use a test db
* update documentation
* improve error handling
* disable entity integration test until we can figure out locking error
* update GetUserVisibleNamespaces to use FolderSeriver
* update GetNamespaceByUID to use FolderService.GetFolders
* update GetAlertRulesForScheduling to use FolderService.GetFolders
* Update API and GetAlertRulesForScheduling to use the folder's full path
* get full path of folder in RouteTestGrafanaRuleConfig
* fix escaping of titles for MySQL
This pull request updates our fork of Alertmanager to commit 65bdab0, which is based on commit 5658f8c in Prometheus Alertmanager.
It applies the changes from grafana/alerting#155 which removes the overrides for validation of alerts, labels and silences that we had put in place to allow alerts and silences to work for non-Prometheus datasources. However, as this is now supported in Alertmanager with the UTF-8 work, we can use the new upstream functions and remove these overrides.
The compat package is a package in Alertmanager that takes care of backwards compatibility when parsing matchers, validating alerts, labels and silences. It has three modes: classic mode, UTF-8 strict mode, fallback mode. These modes are controlled via compat.InitFromFlags. Grafana initializes the compat package without any feature flags, which is the equivalent of fallback mode. Classic and UTF-8 strict mode are used in Mimir.
While Grafana Managed Alerts have no need for fallback mode, Grafana can still be used as an interface to manage the configurations of Mimir Alertmanagers and view configurations of Prometheus Alertmanager, and those installations might not have migrated or being running on older versions. Such installations behave as if in classic mode, and Grafana must be able to parse their configurations to interact with them for some period of time. As such, Grafana uses fallback mode until we are ready to drop support for outdated installations of Mimir and the Prometheus Alertmanager.
* Add single receiver method
* Add receiver permissions
* Add single/multi GET endpoints for receivers
* Remove stable tag from time intervals
See end of PR description here: https://github.com/grafana/grafana/pull/81672
* declare new API and models GettableTimeIntervals, PostableTimeIntervals
* add new actions alert.notifications.time-intervals:read and alert.notifications.time-intervals:write.
* update existing alerting roles with the read action. Add to all alerting roles.
* add integration tests
* Create locking config store that mimics existing provisioning store
* Rename existing receivers(_test).go
* Introduce shared receiver group service
* Fix test
* Move query model to models package
* ReceiverGroup -> Receiver
* Remove locking config store
* Move convert methods to compat.go
* Cleanup
This commit prevents saving configurations containing inhibition
rules in Grafana Alertmanager. It does not reject inhibition
rules when using external Alertmanagers, such as Mimir. This meant
the validation had to be put in the MultiOrgAlertmanager instead of
in the validation of PostableUserConfig. We can remove this when
inhibition rules are supported in Grafana Managed Alerts.
AM config applied via API would use the PostableUserConfig as the AM raw
config and also the hash used to decide when the AM config has changed.
However, when applied via the periodic sync the PostableApiAlertingConfig would
be used instead.
This leads to two issues:
- Inconsistent hash comparisons when modifying the AM causing redundant applies.
- GetStatus assumed the raw config was PostableUserConfig causing the endpoint
to return correctly after a new config is applied via API and then nothing once
the periodic sync runs.
Note: Technically, the upstream GrafanaAlertamanger GetStatus shouldn't be
returning PostableUserConfig or PostableApiAlertingConfig, but instead
GettableStatus. However, this issue required changes elsewhere and is out of
scope.
* Add folder store method for fetching all folder descendants
* Modify GetDescendantCounts() to fetch folder descendants at once
* Reduce DB calls when counting library panels under dashboard
* Reduce DB calls when counting dashboards under folder
* Reduce DB calls during folder delete
* Modify folder registry to count/delete entities under multiple folders
* Reduce DB calls when counting
* Reduce DB calls when deleting
* Remove folderID from service tests
* Remove folderID from ngalert migration tests
* Remove tests related to folderIDs
* Roll back change
Before removing FolderID from this test, we need to adjust the code
* Remove FolderID from publicdashboard pkg
* Add back annotations test
* RBAC: Search add user login filter
* Switch to a userService resolving instead
* Remove unused error
* Fallback to use the cache
* account for userID filter
* Account for the error
* snake case
* Add test cases
* Add api tests
* Fix return on error
* Re-order imports
* Add API test
* Add move tests
* Fix create folder
* Fix move
* Fix test
* Drop and re-create index so that allows a folder to contain a dashboard and a subfolder with same name
* Get folder by title defaults to root folder and optionally fetches folder by provided parent folder
* Apply suggestions from code review
* Alerting: During legacy migration reduce the number of created silences
During legacy migration every migrated rule was given a label rule_uid=<uid>.
This was used to silence DatasourceError/DatasourceNoData alerts for
migrated rules that had either ExecutionErrorState/NoDataState set to
keep_state, respectively.
This could potentially create a large amount of silences and a high cardinality
label. Both of these scenarios have poor outcomes for CPU load and latency in
unified alerting.
Instead, this change creates one label per ExecutionErrorState/NoDataState when
they are set to keep_state as well as two silence rules, if rules with said
labels were created during migration. These silence rules are:
- __legacy_silence_error_keep_state__ = true
- __legacy_silence_nodata_keep_state__ = true
This will drastically reduce the number of created silence rules in most cases
as well as not create the potentially high cardinality label `rule_uid`.
These don't get marshalled and unmarshalled in the same way as they are represented in Go
This PR changes the OpenAPI spec to reflect what the API accepts and sends back
* Simple, per-base-interval jitter
* Add log just for test purposes
* Add strategy approach, allow choosing between group or rule
* Add flag to jitter rules
* Add second toggle for jittering within a group
* Wire up toggles to strategy
* Slightly improve comment ordering
* Add tests for offset generation
* Rename JitterStrategyFrom
* Improve debug log message
* Use grafana SDK labels rather than prometheus labels
* ngalert openapi: Use same `basePath` as rest of Grafana
Currently, there are two issues that prevent easily merging `ngalert` and grafana openapi specs:
- The basePath is different. `grafana` has `/api` and `ngalert` has `/api/v1`. I changed `ngalert` to use `/api`
- The `ngalert` endpoints have their basePath in the each operation path. The basePath should actually be omitted
---------
Co-authored-by: Yuriy Tseretyan <yuriy.tseretyan@grafana.com>
* Change ruler API to expect the folder UID as namespace
* Update example requests
* Fix tests
* Update swagger
* Modify FIle field in /api/prometheus/grafana/api/v1/rules
* Fix ruler export
* Modify folder in responses to be formatted as <parent UID>/<title>
* Add alerting test with nested folders
* Apply suggestion from code review
* Alerting: use folder UID instead of title in rule API (#77166)
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
* Drop a few more latent uses of namespace_id
* move getNamespaceKey to models package
* switch GetAlertRulesForScheduling to use folder table
* update GetAlertRulesForScheduling to return folder titles in format `parent_uid/title`.
* fi tests
* add tests for GetAlertRulesForScheduling when parent uid
* fix integration tests after merge
* fix test after merge
* change format of the namespace to JSON array
this is needed for forward compatibility, when we migrate to full paths
* update EF code to decode nested folder
---------
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
Co-authored-by: Virginia Cepeda <virginia.cepeda@grafana.com>
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
Co-authored-by: Alex Weaver <weaver.alex.d@gmail.com>
Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
* Alerting: Add metric to check for default AM configurations
* Use a gauge for the config hash
* don't go out of bounds when converting uint64 to float64
* expose metric for config hash
* update metrics after applying config
* add get mute timing by name to MuteTimingService
* update get mute timing request handler to use the service method
* replace validation, uniqueness and used errors with errutils
* update mute timing methods return errutil responses
* use the term "time interval" in errors bevause mute timings are deprecated in Alertmanager and will be replaced by time intervals in the future.
* update create and update methods to return struct instead of pointer
* Remove FolderID from service tests
* Add models
* Add folderID pack to publicdashboard tests
* Remove folderID from dashboard tests
* Remove folderID from folders
* Remove folderID from ngalert tests
* Remove nolint comment
* Add back some tests after rebase
* Move scope type vars to testutil package
* Expose parts of state historian for use in annotation backend
* Implement Loki ASH Annotation store
This store will only implement the `Get` method of a RepositoryImpl since alert state history
writes to Loki elsewhere.
* Use interface for Loki HTTP Client
* Add tests for Loki ASH Annotation store
* Add missing test
* Fix lint
* Organize tests
* Add filter tests
* Improve tests
* Move filter logic into outer function
* Fix lint
* Add comment
* Fix tests
* Fix lint
* Rename historian store + refactor
* Cleanup historian store
* Fix tests
* Minor cleanup
* Use new `ShouldRecordAnnotation` filter
* Fix logic and add tests for this check
* Fix typos, remove unused variables, `< 1` -> `== 0`
* More closely mimic RBAC filter from xorm to ensure correct logic
* Move off weaveworks client
* Address PR comments
* Alerting: Create feature flag for alert query optimization
Adds a feature flag alertingQueryOptimization for an already existing
functionality: alert query optimization. This feature flag will now be disabled
by default.
* Alerting: Add metrics to the remote Alertmanager struct
* rephrase http_requests_failed description
* make linter happy
* remove unnecessary metrics
* extract timed client to separate package
* use histogram collector from dskit
* remove weaveworks dependency
* capture metrics for all requests to the remote Alertmanager (both clients)
* use the timed client in the MimirAuthRoundTripper
* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function
* refactor
* less git diff
* gauge for last readiness check in seconds
* initialize LastReadinesCheck to 0, tweak metric names and descriptions
* add counters for sync attempts/errors
* last config sync and last state sync timestamps (gauges)
* change latency metric name
* metric for remote Alertmanager mode
* code review comments
* move label constants to metrics package