Commit Graph

1280 Commits

Author SHA1 Message Date
Matthew Jacobson
0ce1ccd6f9
Alerting: Fix inconsistent AM raw config when applied via sync vs API (#81655)
AM config applied via API would use the PostableUserConfig as the AM raw
config and also the hash used to decide when the AM config has changed.
However, when applied via the periodic sync the PostableApiAlertingConfig would
be used instead.

This leads to two issues:
- Inconsistent hash comparisons when modifying the AM causing redundant applies.
- GetStatus assumed the raw config was PostableUserConfig causing the endpoint
to return correctly after a new config is applied via API and then nothing once
the periodic sync runs.

Note: Technically, the upstream GrafanaAlertmanager GetStatus shouldn't be
returning PostableUserConfig or PostableApiAlertingConfig, but instead
GettableStatus. However, this issue required changes elsewhere and is out of
scope.
2024-01-31 21:05:30 +02:00
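
The hash mismatch described in 0ce1ccd6f9 can be illustrated with a minimal sketch. It is not Grafana's implementation: the dictionary layout, the use of MD5, and the helper names `config_hash` and `needs_apply` are assumptions made for illustration. The only point is that hashing two different serializations of the same logical config makes an unchanged config look changed.

```python
import hashlib
import json


def config_hash(raw: bytes) -> str:
    """Hash of the raw config bytes (MD5 is an assumption for this sketch)."""
    return hashlib.md5(raw).hexdigest()


def needs_apply(applied_hash: str, new_raw: bytes) -> bool:
    """Apply only when the hash of the incoming raw config differs."""
    return applied_hash != config_hash(new_raw)


# Hypothetical stand-in for a PostableUserConfig that wraps the
# Alertmanager-level config (the PostableApiAlertingConfig role).
user_config = {
    "template_files": {},
    "alertmanager_config": {"route": {"receiver": "default"}},
}

# API path: the whole user config is used as the raw config.
raw_via_api = json.dumps(user_config, sort_keys=True).encode()
# Sync path before the fix: only the inner config is used.
raw_via_sync = json.dumps(user_config["alertmanager_config"], sort_keys=True).encode()

applied_hash = config_hash(raw_via_api)
redundant = needs_apply(applied_hash, raw_via_sync)   # config unchanged, hash differs
consistent = needs_apply(applied_hash, raw_via_api)   # same representation, no apply
```

Using one representation on both paths makes the comparison meaningful again, which is the consistency the commit message describes.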
Ashley Harrison
39057552dc
QueryField: Handle autocomplete better (#81484)
* extract out function + add unit tests

* add feature toggle and default it to on
2024-01-31 10:01:20 +00:00
Yuri Tseretyan
131c72d655
Alerting: Fix scheduler to group folders by the unique key (orgID and UID) (#81303) 2024-01-30 17:14:11 -05:00
Sofia Papagiannaki
89d3b55bec
Folders: Reduce DB queries when counting and deleting resources under folders (#81153)
* Add folder store method for fetching all folder descendants

* Modify GetDescendantCounts() to fetch folder descendants at once

* Reduce DB calls when counting library panels under dashboard

* Reduce DB calls when counting dashboards under folder

* Reduce DB calls during folder delete

* Modify folder registry to count/delete entities under multiple folders

* Reduce DB calls when counting

* Reduce DB calls when deleting
2024-01-30 18:26:34 +02:00
William Wernert
de662810cf
Alerting: Create instance of alert rule generator in historian annotation tests (#81394)
* Create generator variable to ensure closures have correct context
2024-01-29 11:22:43 -05:00
idafurjes
f44592a97a
Remove folderID from service tests (#80615)
* Remove folderID from service tests

* Remove folderID from ngalert migration tests

* Remove tests related to folderIDs

* Roll back change

Before removing FolderID from this test, we need to adjust the code

* Remove FolderID from publicdashboard pkg

* Add back annotations test
2024-01-26 17:36:35 +02:00
Gabriel MABILLE
722b78f3e0
RBAC: Add userLogin filter to the permission search endpoint (#81137)
* RBAC: Search add user login filter

* Switch to a userService resolving instead

* Remove unused error

* Fallback to use the cache

* account for userID filter

* Account for the error

* snake case

* Add test cases

* Add api tests

* Fix return on error

* Re-order imports
2024-01-26 09:43:16 +01:00
Sofia Papagiannaki
b1eec36df3
Alerting: Fix authorisation to use namespace UIDs for scope (#81231) 2024-01-25 15:19:51 -05:00
idafurjes
7e5544ab21
Add MFolderIDsServiceCount to count folderIDs in services pkg (#81237) 2024-01-25 11:10:35 +01:00
Sofia Papagiannaki
478d7d58fa
Nested folders: Allow creating folders with duplicate names in different locations (#77076)
* Add API test

* Add move tests

* Fix create folder

* Fix move

* Fix test

* Drop and re-create index so that a folder can contain a dashboard and a subfolder with the same name

* Get folder by title defaults to root folder and optionally fetches folder by provided parent folder

* Apply suggestions from code review
2024-01-25 11:29:56 +02:00
William Wernert
2203bc2a3d
Alerting: Refactor provisioning tests/fakes (#81205)
* Fix up test Alertmanager config JSON

* Move fake AM config and provisioning stores to fakes package
2024-01-24 17:15:55 -05:00
Matthew Jacobson
71e70c424f
Alerting: During legacy migration reduce the number of created silences (#78505)
* Alerting: During legacy migration reduce the number of created silences

During legacy migration every migrated rule was given a label rule_uid=<uid>.
This was used to silence DatasourceError/DatasourceNoData alerts for
migrated rules that had either ExecutionErrorState/NoDataState set to
keep_state, respectively.

This could potentially create a large number of silences and a high cardinality
label. Both of these scenarios have poor outcomes for CPU load and latency in
unified alerting.

Instead, this change creates one label per ExecutionErrorState/NoDataState when
they are set to keep_state as well as two silence rules, if rules with said
labels were created during migration. These silence rules are:

- __legacy_silence_error_keep_state__ = true
- __legacy_silence_nodata_keep_state__ = true

This will drastically reduce the number of created silence rules in most cases
as well as not create the potentially high cardinality label `rule_uid`.
2024-01-24 15:56:19 -05:00
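
A minimal sketch of the labelling scheme described in 71e70c424f, assuming a simplified rule model. The two label names come from the commit message. The function names, the `"keep_state"` string comparison, and the matcher dictionary shape are hypothetical.

```python
ERROR_LABEL = "__legacy_silence_error_keep_state__"
NODATA_LABEL = "__legacy_silence_nodata_keep_state__"


def labels_for_rule(exec_err_state: str, no_data_state: str) -> dict:
    """Attach a shared label per keep_state setting instead of rule_uid=<uid>."""
    labels = {}
    if exec_err_state == "keep_state":
        labels[ERROR_LABEL] = "true"
    if no_data_state == "keep_state":
        labels[NODATA_LABEL] = "true"
    return labels


def silences_for(all_rule_labels: list) -> list:
    """Create at most two silences, and only for labels that were actually used."""
    used = set()
    for labels in all_rule_labels:
        used.update(labels)
    return [
        {"name": name, "value": "true", "isEqual": True}
        for name in (ERROR_LABEL, NODATA_LABEL)
        if name in used
    ]


migrated = [
    labels_for_rule("keep_state", "NoData"),
    labels_for_rule("keep_state", "keep_state"),
    labels_for_rule("Alerting", "OK"),
]
silences = silences_for(migrated)
```

However many rules are migrated, the number of silences stays at two or fewer, and no per-rule label value is introduced.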
Santiago
fbbda6c05e
Alerting: Retry readiness check to the remote Alertmanager on 5xx status code responses (#81174) 2024-01-24 21:39:06 +01:00
George Robinson
05d858635c
Alerting: Add metric for inhibition rules (#81119)
This commit adds a metric for the number of inhibition rules.
It matches the metric added upstream in #3681.
2024-01-23 19:43:17 +00:00
Jean-Philippe Quéméner
aa25776f81
Alerting: Add a feature flag to periodically save states (#80987) 2024-01-23 17:03:30 +01:00
George Robinson
85b9edcd28
Alerting: Fix incorrect initialization of logger (#81099) 2024-01-23 17:29:38 +02:00
Marcus Efraimsson
6768c6c059
Chore: Remove public vars in setting package (#81018)
Removes the public variable setting.SecretKey plus some other ones. 
Introduces some new functions for creating setting.Cfg.
2024-01-23 12:36:22 +01:00
Jean-Philippe Quéméner
eb7e1216a1
feat(alerting): add async state persister (#80763) 2024-01-22 13:07:11 +01:00
Julien Duchesne
40312c527b
ngalert openapi: Fix ObjectMatchers definition (#79477)
These don't get marshalled and unmarshalled in the same way as they are represented in Go.
This PR changes the OpenAPI spec to reflect what the API accepts and sends back
2024-01-19 14:37:11 -05:00
Alexander Weaver
18b9c8fd5f
Alerting: Nilcheck JitterStrategyFrom so it can be used in contexts without feature toggles (#80841)
Nil-check so that tests can pass nil feature toggles
2024-01-18 15:43:41 -06:00
Alexander Weaver
00a260effa
Alerting: Add setting to distribute rule group evaluations over time (#80766)
* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
2024-01-18 12:48:11 -06:00
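
One way to picture the per-base-interval jitter from 00a260effa is a deterministic offset derived from a key. The sketch below assumes that approach. The hash function, the key format, and both function names are hypothetical, not taken from the Grafana scheduler.

```python
import hashlib


def jitter_offset_ticks(key: str, interval_s: int, base_interval_s: int) -> int:
    """Map a group (or rule) key to a stable offset, in base-interval ticks."""
    ticks_per_interval = max(interval_s // base_interval_s, 1)
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ticks_per_interval


def should_evaluate(tick: int, key: str, interval_s: int, base_interval_s: int) -> bool:
    """Evaluate once per interval, on the tick matching the key's offset."""
    ticks_per_interval = max(interval_s // base_interval_s, 1)
    offset = jitter_offset_ticks(key, interval_s, base_interval_s)
    return tick % ticks_per_interval == offset


# A 60s rule group on a 10s base interval is evaluated on one of six ticks.
key = "org1/folder-uid/group-a"
offset = jitter_offset_ticks(key, 60, 10)
evaluations = sum(should_evaluate(t, key, 60, 10) for t in range(12))
```

Choosing the key per group keeps the rules of a group on the same tick, while a per-rule key spreads them as well, which mirrors the group-or-rule choice listed in the commit.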
Julien Duchesne
c9211fbd69
ngalert openapi: Use same basePath as rest of Grafana (#79025)
* ngalert openapi: Use same `basePath` as rest of Grafana
Currently, there are two issues that prevent easily merging `ngalert` and grafana openapi specs:
- The basePath is different. `grafana` has `/api` and `ngalert` has `/api/v1`. I changed `ngalert` to use `/api`
- The `ngalert` endpoints have their basePath in each operation path. The basePath should actually be omitted
---------

Co-authored-by: Yuriy Tseretyan <yuriy.tseretyan@grafana.com>
2024-01-17 11:53:16 -05:00
Jean-Philippe Quéméner
82638d059f
feat(alerting): add state persister interface (#80384) 2024-01-17 13:33:13 +01:00
Santiago
3217a0dc05
Alerting: Fix state sync errors counter increment (#80702) 2024-01-17 11:04:27 +01:00
Sofia Papagiannaki
d1dab5828d
Alerting: Update rule API to address folders by UID (#74600)
* Change ruler API to expect the folder UID as namespace

* Update example requests

* Fix tests

* Update swagger

* Modify File field in /api/prometheus/grafana/api/v1/rules

* Fix ruler export

* Modify folder in responses to be formatted as <parent UID>/<title>

* Add alerting test with nested folders

* Apply suggestion from code review

* Alerting: use folder UID instead of title in rule API (#77166)

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>

* Drop a few more latent uses of namespace_id

* move getNamespaceKey to models package

* switch GetAlertRulesForScheduling to use folder table

* update GetAlertRulesForScheduling to return folder titles in format `parent_uid/title`.

* fix tests

* add tests for GetAlertRulesForScheduling when parent uid

* fix integration tests after merge

* fix test after merge

* change format of the namespace to JSON array

this is needed for forward compatibility, when we migrate to full paths

* update EF code to decode nested folder

---------

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
Co-authored-by: Virginia Cepeda <virginia.cepeda@grafana.com>
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
Co-authored-by: Alex Weaver <weaver.alex.d@gmail.com>
Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
2024-01-17 11:07:39 +02:00
Alexander Weaver
3c796ecc8f
Alerting: Add metric counting rule groups per org (#80669)
* Refactor, fix bad map hint

* Count groups per org
2024-01-16 16:35:56 -06:00
Santiago
3afd94185c
Alerting: Add metric to check for default AM configurations (#80225)
* Alerting: Add metric to check for default AM configurations

* Use a gauge for the config hash

* don't go out of bounds when converting uint64 to float64

* expose metric for config hash

* update metrics after applying config
2024-01-16 17:12:24 +01:00
Yuri Tseretyan
4b071f5452
Alerting: Fix MuteTiming Get API to return provenance status (#80494) 2024-01-13 00:16:54 +02:00
Julien Duchesne
2fb03dfa56
fix(swagger): Mute Timing PUT OK status is 202 (#80459) 2024-01-12 16:58:20 -05:00
Yuri Tseretyan
4479e7218d
Alerting: MuteTiming service return errutil + GetTiming by name (#79772)
* add get mute timing by name to MuteTimingService
* update get mute timing request handler to use the service method

* replace validation, uniqueness and used errors with errutils
* update mute timing methods to return errutil responses
* use the term "time interval" in errors because mute timings are deprecated in Alertmanager and will be replaced by time intervals in the future.

* update create and update methods to return struct instead of pointer
2024-01-12 21:23:44 +02:00
idafurjes
cb419e799b
Remove folderid service test (#80433)
* Remove FolderID from service tests

* Add models

* Add folderID pack to publicdashboard tests

* Remove folderID from dashboard tests

* Remove folderID from folders

* Remove folderID from ngalert tests

* Remove nolint comment

* Add back some tests after rebase
2024-01-12 16:43:39 +01:00
Yuri Tseretyan
77db6a9ca4
Alerting: Fix GetAlertRulesForScheduling to use folder table and join by org_id (#80330) 2024-01-11 09:21:03 -05:00
Santiago
6c87d9a1e7
Alerting: Stop retries on 4xx status code responses (remote Alertmanager readiness check) (#80350) 2024-01-11 12:12:35 +01:00
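
Read together with fbbda6c05e (retry on 5xx), the readiness check policy can be sketched as below. This is an illustration only: `check_readiness`, `make_sender`, and the default attempt count are hypothetical, and any backoff between attempts is omitted.

```python
def check_readiness(send_request, max_attempts: int = 3) -> bool:
    """send_request() returns an HTTP status code for one readiness probe."""
    for _ in range(max_attempts):
        status = send_request()
        if 200 <= status < 300:
            return True       # ready
        if 400 <= status < 500:
            return False      # client error: retrying will not help
        # 5xx (and anything else in this sketch) is treated as retryable
    return False


def make_sender(statuses):
    """Test helper: replay canned status codes and record each call."""
    calls = []
    remaining = iter(statuses)

    def send():
        status = next(remaining)
        calls.append(status)
        return status

    return send, calls


send_5xx, calls_5xx = make_sender([503, 503, 200])
ready_after_retries = check_readiness(send_5xx)

send_4xx, calls_4xx = make_sender([404, 200])
ready_after_4xx = check_readiness(send_4xx)
```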
William Wernert
48b5ac779b
Alerting/Annotations: Add annotation backend for Loki alert state history (#78156)
* Move scope type vars to testutil package

* Expose parts of state historian for use in annotation backend

* Implement Loki ASH Annotation store

This store will only implement the `Get` method of a RepositoryImpl since alert state history
writes to Loki elsewhere.

* Use interface for Loki HTTP Client

* Add tests for Loki ASH Annotation store

* Add missing test

* Fix lint

* Organize tests

* Add filter tests

* Improve tests

* Move filter logic into outer function

* Fix lint

* Add comment

* Fix tests

* Fix lint

* Rename historian store + refactor

* Cleanup historian store

* Fix tests

* Minor cleanup

* Use new `ShouldRecordAnnotation` filter

* Fix logic and add tests for this check

* Fix typos, remove unused variables, `< 1` -> `== 0`

* More closely mimic RBAC filter from xorm to ensure correct logic

* Move off weaveworks client

* Address PR comments
2024-01-10 18:42:35 -05:00
Matthew Jacobson
afa33f12b2
Alerting: Create alertingQueryOptimization feature flag for alert query optimization (#78932)
* Alerting: Create feature flag for alert query optimization

Adds a feature flag alertingQueryOptimization for an already existing 
functionality: alert query optimization. This feature flag will now be disabled 
by default.
2024-01-10 15:52:58 -05:00
Matthew Jacobson
f365d35cf8
Alerting: Show warning when query optimized (#78751)
* Alerting: Show warning when query optimized

* Use frame.AppendNotices

* Improve warning to include why and a prompt for action
2024-01-10 14:40:00 -05:00
Santiago
9e78faa7ba
Alerting: Add metrics to the remote Alertmanager struct (#79835)
* Alerting: Add metrics to the remote Alertmanager struct

* rephrase http_requests_failed description

* make linter happy

* remove unnecessary metrics

* extract timed client to separate package

* use histogram collector from dskit

* remove weaveworks dependency

* capture metrics for all requests to the remote Alertmanager (both clients)

* use the timed client in the MimirAuthRoundTripper

* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function

* refactor

* less git diff

* gauge for last readiness check in seconds

* initialize LastReadinesCheck to 0, tweak metric names and descriptions

* add counters for sync attempts/errors

* last config sync and last state sync timestamps (gauges)

* change latency metric name

* metric for remote Alertmanager mode

* code review comments

* move label constants to metrics package
2024-01-10 11:18:24 +01:00
Matthew Jacobson
1d4419fbe4
Alerting: Fix NoData & Error alerts not resolving when rule is reset (#80184)
* Alerting: Fix NoData & Error alerts not resolving when rule is reset

On rule reset, when creating the PostableAlerts StateToPostableAlert did not
attach the correct NoData/Error alertname and rulename labels to expire/resolve
the active alerts when the previous cached state was NoData/Error.
2024-01-09 14:47:19 -05:00
Alexander Weaver
542741f748
Alerting: Log scheduler maxAttempts, guard against invalid retry counts, log retry errors (#80234)
* Log maxAttempts, add guard, log retry errors

* fix whitespace

* Initialize evaluator in TestProcessTicks
2024-01-09 13:19:37 -06:00
Matthew Jacobson
aa03b8f8a7
Alerting: Guided legacy alerting upgrade dry-run (#80071)
This PR has two steps that together create a functional dry-run capability for the migration.

By enabling the feature flag alertingPreviewUpgrade when on legacy alerting it will:
    a. Allow all Grafana Alerting background services except for the scheduler to start (multiorg alertmanager, state manager, routes, …).
    b. Allow the UI to show Grafana Alerting pages alongside legacy ones (with appropriate in-app warnings that UA is not actually running).
    c. Show a new “Alerting Upgrade” page and register associated /api/v1/upgrade endpoints that will allow the user to upgrade their organization live without restart and present a summary of the upgrade in a table.
2024-01-05 18:19:12 -05:00
Yuri Tseretyan
72182e02a4
Alerting: Mute timing service tests (#79817)
Split tests for the mute timing service into one function per method; this makes the scope of each test clear.
2024-01-06 00:26:15 +02:00
Yuri Tseretyan
494f36e0bd
Alerting: Update provisioning services that handle Alertmanager configuration to access config via storage (#79814)
* extract get and save operations to an alertmanagerConfigStore. this removes duplicated code in service (currently only mute timings) and improves testing
* replace generic errors with errutil ones with better messages.
* update provisioning services to use new store

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2024-01-05 16:15:18 -05:00
Alexander Weaver
a8fb01a502
Swap weaveworks/common utilities for equivalents in grafana/dskit (#80051)
* Replace histogram collector and grpc injectors

* Extract request timing utility

* Also vendor test file

* Suppress erroneous linter warn
2024-01-05 10:08:38 -06:00
Matthew Jacobson
3537c5440f
Alerting: Refactor migration to return pairs of legacy and upgraded structs (#79719)
Some refactoring that will simplify next changes for dry-run PRs. This should be no-op as far as the created ngalert resources and database state, though it does change some logs.

The key change here is to modify migrateOrg to return pairs of legacy struct + ngalert struct instead of actually persisting the alerts and alertmanager config. This will allow us to capture error information during dry-run migration.

It also moves most persistence-related operations such as title deduplication and folder creation to right before we persist. This will simplify eventual partial migrations (individual alerts, dashboards, channels, ...).

Additionally it changes channel code to deal with PostableGrafanaReceiver instead of PostableApiReceiver (integration instead of contact point).
2024-01-05 05:37:13 -05:00
Santiago
1f6575e65e
Alerting: Test MOA in remote secondary mode (#79828) 2024-01-05 11:05:27 +01:00
Alexander Weaver
90d4704cd7
Alerting: Fix URL timestamp conversion in historian API in annotation mode (#80026)
Fix timestamp conversion when calling annotation store
2024-01-04 12:40:21 -06:00
Yuri Tseretyan
f6a46744a6
Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
   - add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecessary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
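
The feedback loop described in f6a46744a6 can be sketched as follows, assuming a simple "greater than" condition. The function name, the string fingerprints, and the dictionary-based interface are hypothetical. The commit only states that fingerprints of Pending and Alerting states are fed back to the command as "loaded".

```python
def evaluate_with_hysteresis(values: dict, loaded: set,
                             threshold: float, recovery_threshold: float) -> dict:
    """Return, per series fingerprint, whether the condition is firing.

    `loaded` holds the fingerprints of series that were Pending or Alerting
    after the previous evaluation. Those are compared against the recovery
    threshold, all others against the regular threshold.
    """
    result = {}
    for fingerprint, value in values.items():
        limit = recovery_threshold if fingerprint in loaded else threshold
        result[fingerprint] = value > limit
    return result


# Fire above 90, recover only once the value drops to 80 or below.
firing = evaluate_with_hysteresis(
    values={"series-a": 85.0, "series-b": 85.0},
    loaded={"series-a"},          # series-a was already alerting
    threshold=90.0,
    recovery_threshold=80.0,
)
```

With the same value of 85, `series-a` keeps firing because it has not yet crossed the recovery threshold, while `series-b` does not start firing because it never exceeded 90.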
Santiago
a77ba40ed4
Alerting: Use the forked Alertmanager for remote secondary mode (#79646)
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode

* fall back to using internal AM in case of error

* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct

* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only

* extract logic to decide remote Alertmanager mode to a separate function, switch on mode

* tests

* make linter happy

* remove func to decide remote Alertmanager mode

* refactor factory function and options

* add default case to switch statement

* remove ineffectual assignment
2023-12-21 15:26:31 +01:00
Santiago
c46da8ea9b
Alerting: Update alerting package and imports from cluster and clusterpb (#79786)
* Alerting: Update alerting package

* update to latest commit

* alias for imports
2023-12-21 12:34:48 +01:00
Matthew Jacobson
0424d44b39
Alerting: In migration, create one label per channel (#76527)
* In migration, create one label per channel

This PR changes how routing is done by the legacy alerting migration.

Previously, we created a single label on each alert rule that contained an array of contact point names. Ex: __contact__="slack legacy testing","slack legacy testing2"

This label was then routed against a series of regex-matching policies with continue=true. Ex: __contacts__ =~ .*"slack legacy testing".*

In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.

This PR replaces the single __contact__ label with multiple __legacy_c_{contactname}__ labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against __legacy_use_channels__ = true for ease of organization.

This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.
2023-12-19 13:25:13 -05:00
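
The routing change in 0424d44b39 can be pictured with a small sketch. The label patterns `__legacy_c_{contactname}__` and `__legacy_use_channels__` come from the commit message. The label value `"true"` on channel labels, the `continue` flag on the nested routes, the absence of any name sanitization, and the dictionary layout are assumptions made for illustration.

```python
USE_CHANNELS_LABEL = "__legacy_use_channels__"


def channel_label(contact_name: str) -> str:
    """Label name for one contact point (no name sanitization in this sketch)."""
    return f"__legacy_c_{contact_name}__"


def rule_labels(contact_names: list) -> dict:
    """One label per contact point instead of a single array-valued label."""
    labels = {USE_CHANNELS_LABEL: "true"}
    for name in contact_names:
        labels[channel_label(name)] = "true"
    return labels


def routing_tree(all_contact_names: list) -> dict:
    """Nest one equality-matching route per contact point under a single route."""
    channel_routes = [
        {
            "receiver": name,
            "matchers": [f"{channel_label(name)} = true"],
            "continue": True,  # assumed, so an alert can reach several contact points
        }
        for name in all_contact_names
    ]
    return {
        "receiver": "default",
        "routes": [
            {"matchers": [f"{USE_CHANNELS_LABEL} = true"], "routes": channel_routes}
        ],
    }


labels = rule_labels(["slack legacy testing", "slack legacy testing2"])
tree = routing_tree(["slack legacy testing", "slack legacy testing2"])
```

Changing which contact points a rule sends to then becomes a matter of adding or removing a single label, rather than editing a quoted array that regex policies have to match.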