* Alerting: Implement GetStatus in the remote Alertmanager struct
* update tests
* fix tests, extract AlertmanagerConfig from PostableConfig
* get the remote AM config instead of the Grafana one from the remote AM
* pass grafana AM config in test
* return error in GetStatus instead of logging it (internal AM)
* Alerting: Improve error when receiver used by rule is deleted
* Remove RuleUID from public error and data
* Improve fallback error in am config post
* Refactor to expand to time intervals
* Fix message on unchecked errors to be same as before
* Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct
* send the hash of the encrypted configuration
* tests, default config hash in AM struct
* add missing default config to test
* restore build directory
* go work file...
* fix broken test
* remove unnecessary conversion to []byte
* go work again...
* make things work again with latest main branch changes
* update error messages in tests for decrypting config
* Alerting: Fix simplified routing custom group by override
Custom group by overrides for simplified routing were missing required fields
GroupBy and GroupByAll normally set during upstream Route validation.
This fix ensures those missing fields are applied to the generated routes.
* Inline GroupBy and GroupByAll initialization instead of normalize after
* Alerting: Fix simplified routes '...' groupBy creating invalid routes
There were a few ways to go about this fix:
1. Modify our copy of upstream validation to allow this
2. Modify our notification settings validation to prevent this
3. Normalize group by on save
4. Normalize group by on generate
Option 4. was chosen as the others have a mix of the following cons:
- Generated routes risk being incompatible with upstream/remote AM
- Awkward FE UX when using '...'
- Rule definition changing after save and potential pitfalls with TF
With option 4. generated routes stay compatible with external/remote AMs, FE
doesn't need to change as we allow mixed '...' and custom label groupBys, and
settings we save to db are the same ones requested.
In addition, it has the slight benefit of allowing us to hide the internal
implementation details of `alertname, grafana_folder` from the user in the
future, since we don't need to send them with every FE or TF request.
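As an illustration of option 4, here is a minimal sketch of normalizing group by at generation time. The function name `normalizeGroupBy`, the `defaultGroupBy` variable, the label ordering, and the handling of an empty list are all assumptions made for this example, not the actual implementation.

```go
package main

import "sort"

// groupByAll is the special Alertmanager value meaning "group by every label".
const groupByAll = "..."

// defaultGroupBy holds the labels named in the commit message. The variable
// name and the ordering are assumptions made for this sketch.
var defaultGroupBy = []string{"grafana_folder", "alertname"}

// normalizeGroupBy turns the group by list saved with a rule into the values
// used on a generated route. It returns the label list and whether the route
// should group by all labels.
func normalizeGroupBy(requested []string) (groupBy []string, all bool) {
	if len(requested) == 0 {
		// Assumption: nothing requested means the parent route is inherited.
		return nil, false
	}
	for _, label := range requested {
		if label == groupByAll {
			// '...' mixed with custom labels collapses to "group by all",
			// which keeps the generated route valid upstream.
			return nil, true
		}
	}
	seen := map[string]struct{}{}
	for _, label := range defaultGroupBy {
		seen[label] = struct{}{}
		groupBy = append(groupBy, label)
	}
	custom := make([]string, 0, len(requested))
	for _, label := range requested {
		if _, ok := seen[label]; ok {
			continue // skip duplicates and labels already added as defaults
		}
		seen[label] = struct{}{}
		custom = append(custom, label)
	}
	sort.Strings(custom) // stable output regardless of request order
	return append(groupBy, custom...), false
}
```

With this shape, the settings stored in the db stay exactly as requested, and only the generated route sees the normalized values.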
* Safer use of DefaultNotificationSettingsGroupBy
* Fix missed API tests
* Alerting: Persist silence state immediately on Create/Delete
Persists the silence state to the kvstore immediately instead of waiting for the
next maintenance run. This is used after Create/Delete to prevent silences from
being lost when a new Alertmanager is started before the state has persisted.
This can happen, for example, in a rolling deployment scenario.
* Fix test that requires real data
* Don't error if silence state persist fails, maintenance will correct
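A rough sketch of the persist-on-write behaviour described in this commit. The types `stateStore`, `memStore`, and `silences` are hypothetical stand-ins, and JSON is used only to keep the example short; the real state encoding is different.

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
)

// stateStore is a hypothetical stand-in for the kvstore-backed persistence.
type stateStore interface {
	Save(state []byte) error
}

// memStore is an in-memory stateStore for illustration; set fail to
// simulate a persist error.
type memStore struct {
	saved []byte
	fail  bool
}

func (m *memStore) Save(state []byte) error {
	if m.fail {
		return errors.New("store unavailable")
	}
	m.saved = state
	return nil
}

// silences keeps silence state in memory (ID to comment) and persists it
// right after every Create/Delete.
type silences struct {
	store stateStore
	state map[string]string
}

func (s *silences) Create(id, comment string) error {
	if id == "" {
		return errors.New("silence ID is required")
	}
	s.state[id] = comment
	s.persist()
	return nil
}

func (s *silences) Delete(id string) error {
	delete(s.state, id)
	s.persist()
	return nil
}

// persist writes the state immediately. A failure is only logged because
// the next maintenance run persists the state again.
func (s *silences) persist() {
	raw, err := json.Marshal(s.state)
	if err == nil {
		err = s.store.Save(raw)
	}
	if err != nil {
		log.Printf("failed to persist silence state, maintenance will retry: %v", err)
	}
}
```

The key design choice is that a persist failure never fails the Create/Delete call itself.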
* Alerting: Make retention period configurable for the notification log
* update sample.ini
* fix outdated comment (on disk -> kvstore)
* skip checking cyclomatic complexity for ReadUnifiedAlertingSettings
* Feature Flags: use FeatureToggles interface where possible
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
* Replace TestFeatureToggles with existing WithFeatures
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
---------
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
* (WIP) Alerting: Decrypt secrets before sending configuration to the remote Alertmanager
* refactor, fix tests
* test decrypting secrets
* tidy up
* test SendConfiguration, quote keys, refactor tests
* make linter happy
* decrypt configuration before comparing
* copy configuration struct before decrypting
* reduce diff in TestCompareAndSendConfiguration
* clean up remote/alertmanager.go
* make linter happy
* avoid serializing into JSON to copy struct
* codeowners
Updates Grafana Alertmanager to work with the new interface from grafana/alerting#161. This change stops passing user-defined templates to the Grafana Alertmanager by persisting them to disk and instead passes them as strings.
This commit adds basic support for time_intervals, as mute_time_intervals
is deprecated in Alertmanager and scheduled to be removed before 1.0.
It does not add support for time_intervals in API or file provisioning,
nor does it support exporting time intervals. This will be added in
later commits to keep the changes as simple as possible.
* Add notification settings to storage/domain and API models. Settings are a slice to work around XORM mapping
* Support validation of notification settings when rules are updated
* Implement a route generator for the Alertmanager configuration that fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.
* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings
* update GET API so only admins can see auto-gen
* streamline initialization of test databases, support on-disk sqlite test db
* clean up test databases
* introduce testsuite helper
* use testsuite everywhere we use a test db
* update documentation
* improve error handling
* disable entity integration test until we can figure out locking error
* Add single receiver method
* Add receiver permissions
* Add single/multi GET endpoints for receivers
* Remove stable tag from time intervals
See end of PR description here: https://github.com/grafana/grafana/pull/81672
* Create locking config store that mimics existing provisioning store
* Rename existing receivers(_test).go
* Introduce shared receiver group service
* Fix test
* Move query model to models package
* ReceiverGroup -> Receiver
* Remove locking config store
* Move convert methods to compat.go
* Cleanup
This commit prevents saving configurations containing inhibition
rules in Grafana Alertmanager. It does not reject inhibition
rules when using external Alertmanagers, such as Mimir. This meant
the validation had to be put in the MultiOrgAlertmanager instead of
in the validation of PostableUserConfig. We can remove this when
inhibition rules are supported in Grafana Managed Alerts.
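A minimal sketch of where this check sits, assuming simplified stand-in types. `alertingConfig`, `inhibitRule`, `saveConfig`, and the error value are hypothetical names used only for this example.

```go
package main

import "errors"

// inhibitRule and alertingConfig are trimmed-down stand-ins for the real
// configuration types.
type inhibitRule struct {
	SourceMatchers []string
	TargetMatchers []string
}

type alertingConfig struct {
	InhibitRules []inhibitRule
}

var errInhibitRulesNotSupported = errors.New("inhibition rules are not supported")

// saveConfig rejects inhibition rules only for the Grafana Alertmanager.
// External Alertmanagers (for example Mimir) skip the check, which is why
// it cannot live in the shared config validation.
func saveConfig(cfg alertingConfig, external bool) error {
	if !external && len(cfg.InhibitRules) > 0 {
		return errInhibitRulesNotSupported
	}
	// Persisting the configuration is out of scope for this sketch.
	return nil
}
```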
AM config applied via API would use the PostableUserConfig as the AM raw
config and also the hash used to decide when the AM config has changed.
However, when applied via the periodic sync the PostableApiAlertingConfig would
be used instead.
This leads to two issues:
- Inconsistent hash comparisons when modifying the AM, causing redundant applies.
- GetStatus assumed the raw config was PostableUserConfig, so the endpoint
returned the config correctly after a new config was applied via API and then
returned nothing once the periodic sync ran.
Note: Technically, the upstream GrafanaAlertmanager GetStatus shouldn't be
returning PostableUserConfig or PostableApiAlertingConfig, but instead
GettableStatus. However, this issue required changes elsewhere and is out of
scope.
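To illustrate the hash consistency problem, here is a small sketch in which both apply paths hash the same serialized type. The struct definitions are simplified stand-ins, and `configHash` and `shouldApply` are hypothetical helper names; the use of MD5 over JSON is an assumption for the example.

```go
package main

import (
	"crypto/md5"
	"encoding/json"
)

// apiAlertingConfig and userConfig are simplified stand-ins for
// PostableApiAlertingConfig and PostableUserConfig.
type apiAlertingConfig struct {
	Receivers []string `json:"receivers"`
}

type userConfig struct {
	TemplateFiles      map[string]string `json:"template_files"`
	AlertmanagerConfig apiAlertingConfig `json:"alertmanager_config"`
}

// configHash always hashes the full user config, so a config applied via
// the API and the same config applied by the periodic sync hash equally.
func configHash(cfg userConfig) ([md5.Size]byte, error) {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return [md5.Size]byte{}, err
	}
	return md5.Sum(raw), nil
}

// shouldApply reports whether cfg differs from the currently applied hash.
func shouldApply(current [md5.Size]byte, cfg userConfig) (bool, error) {
	next, err := configHash(cfg)
	if err != nil {
		return false, err
	}
	return next != current, nil
}
```

If one path hashed only the nested `AlertmanagerConfig`, the two hashes would never match and every sync would trigger a redundant apply.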
* Alerting: Add metric to check for default AM configurations
* Use a gauge for the config hash
* don't go out of bounds when converting uint64 to float64
* expose metric for config hash
* update metrics after applying config
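One way to expose a config hash as a gauge without losing precision is to keep only as many bits as a float64 can represent exactly. The sketch below assumes an MD5 hash and keeps 48 bits; the function names and the choice of 48 bits are assumptions, not necessarily what the commit does.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
)

// hashAsMetricValue converts a hash sum into a float64 gauge value.
// A float64 has a 53-bit mantissa, so only the first 6 bytes (48 bits)
// are used. The remaining two bytes of the buffer stay zero, which keeps
// the resulting integer exactly representable.
func hashAsMetricValue(sum [md5.Size]byte) float64 {
	small := make([]byte, 8)
	copy(small, sum[:6])
	return float64(binary.LittleEndian.Uint64(small))
}

// configHashValue hashes a raw configuration and returns the gauge value.
func configHashValue(raw []byte) float64 {
	return hashAsMetricValue(md5.Sum(raw))
}
```

Comparing this gauge against the value for the default configuration is enough to tell whether an instance still runs the default config.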
* Alerting: Add metrics to the remote Alertmanager struct
* rephrase http_requests_failed description
* make linter happy
* remove unnecessary metrics
* extract timed client to separate package
* use histogram collector from dskit
* remove weaveworks dependency
* capture metrics for all requests to the remote Alertmanager (both clients)
* use the timed client in the MimirAuthRoundTripper
* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function
* refactor
* less git diff
* gauge for last readiness check in seconds
* initialize LastReadinessCheck to 0, tweak metric names and descriptions
* add counters for sync attempts/errors
* last config sync and last state sync timestamps (gauges)
* change latency metric name
* metric for remote Alertmanager mode
* code review comments
* move label constants to metrics package
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode
* fall back to using internal AM in case of error
* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct
* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only
* extract logic to decide remote Alertmanager mode to a separate function, switch on mode
* tests
* make linter happy
* remove func to decide remote Alertmanager mode
* refactor factory function and options
* add default case to switch statement
* remove ineffectual assignment
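The mode selection with fallback could look roughly like the sketch below. All type and function names here (`remoteMode`, `forkedAM`, `newAlertmanager`, and so on) are hypothetical, and only the remote secondary mode is modelled.

```go
package main

import (
	"errors"
	"log"
)

type remoteMode int

const (
	modeInternalOnly remoteMode = iota
	modeRemoteSecondary
)

type alertmanager interface{ Name() string }

type internalAM struct{}

func (internalAM) Name() string { return "internal" }

type remoteAM struct{}

func (remoteAM) Name() string { return "remote" }

// forkedAM pairs the internal Alertmanager with a remote one.
type forkedAM struct{ internal, remote alertmanager }

func (forkedAM) Name() string { return "forked" }

var errRemoteUnavailable = errors.New("remote Alertmanager unavailable")

// newAlertmanager switches on the mode and falls back to the internal
// Alertmanager when the remote one cannot be created.
func newAlertmanager(mode remoteMode, newRemote func() (alertmanager, error)) alertmanager {
	internal := internalAM{}
	switch mode {
	case modeRemoteSecondary:
		remote, err := newRemote()
		if err != nil {
			log.Printf("failed to create remote Alertmanager, using internal only: %v", err)
			return internal
		}
		return forkedAM{internal: internal, remote: remote}
	default:
		// Unknown or unset modes behave like internal-only.
		return internal
	}
}
```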
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager
Mimir client that understands the new APIs developed for Mimir. Very much a WIP still.
* more wip
* appease the linter
* more linting
* add more code
* get state from kvstore, encode, send
* send state to the remote Alertmanager, extract fullstate logic into its own function
* pass kvstore to remote.NewAlertmanager()
* refactor
* add fake kvstore to tests
* tests
* use FileStore to get state
* always log 'completed state upload'
* refactor compareRemoteConfig
* base64-encode the state in the file store
* export silences and nflog filenames, refactor
* log 'completed state/config upload...' regardless of outcome
* add values to the state store in tests
* address code review comments
* log error from filestore
---------
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Add GetFullState method to FileStore
* make tests compile, create stateStore in NewAlertmanager
* return errors instead of logging, accept an arbitrary number of strings
* make NewAlertmanager() accept a stateStore
* Alerting: Add an empty Forked Alertmanager
* Alerting: Add methods for silences to the forked Alertmanager
* check for errors in tests
* make linter happy
* make linter happy
* Alerting: Add methods for silences to the forked Alertmanager
* Alerting: Move `ExternalAlertmanager` to its own package
This avoids import cycles when using components from other packages. In addition, I've created an `Options` approach for the multiorg Alertmanager that allows us to override how per-tenant Alertmanagers are created.
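For reference, a sketch of what such an options approach might look like using Go's functional options pattern. The names below (`OrgAlertmanagerFactory`, `WithAlertmanagerOverride`, `AlertmanagerFor`) and the trimmed-down interface are illustrative assumptions, not the real signatures.

```go
package main

// Alertmanager is a trimmed-down stand-in for the per-tenant interface.
type Alertmanager interface{ Ready() bool }

// OrgAlertmanagerFactory creates the Alertmanager for one organization.
type OrgAlertmanagerFactory func(orgID int64) (Alertmanager, error)

type MultiOrgAlertmanager struct {
	factory       OrgAlertmanagerFactory
	alertmanagers map[int64]Alertmanager
}

// Option customizes a MultiOrgAlertmanager at construction time.
type Option func(*MultiOrgAlertmanager)

// WithAlertmanagerOverride replaces how per-tenant Alertmanagers are created.
func WithAlertmanagerOverride(f OrgAlertmanagerFactory) Option {
	return func(moa *MultiOrgAlertmanager) { moa.factory = f }
}

type internalAlertmanager struct{}

func (internalAlertmanager) Ready() bool { return true }

// notReadyAlertmanager is a fake used to show an override in action.
type notReadyAlertmanager struct{}

func (notReadyAlertmanager) Ready() bool { return false }

func NewMultiOrgAlertmanager(opts ...Option) *MultiOrgAlertmanager {
	moa := &MultiOrgAlertmanager{
		// Default: every org gets the internal Alertmanager.
		factory:       func(int64) (Alertmanager, error) { return internalAlertmanager{}, nil },
		alertmanagers: map[int64]Alertmanager{},
	}
	for _, opt := range opts {
		opt(moa)
	}
	return moa
}

// AlertmanagerFor lazily creates and caches the Alertmanager for an org.
func (moa *MultiOrgAlertmanager) AlertmanagerFor(orgID int64) (Alertmanager, error) {
	if am, ok := moa.alertmanagers[orgID]; ok {
		return am, nil
	}
	am, err := moa.factory(orgID)
	if err != nil {
		return nil, err
	}
	moa.alertmanagers[orgID] = am
	return am, nil
}
```

Callers that do not pass an option keep the default behaviour, so the override only affects code that opts in.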
* switch things around
* address review comments
* fix references and warnings
* Alerting: post alerts to the remote Alertmanager and fetch them
* fix broken tests
* Alerting: Add Mimir Backend image to devenv (blocks)
* add alerting as code owner for mimir_backend block
* Alerting: Use Mimir image to run integration tests for the remote Alertmanager
* skip integration test when running all tests
* skipping integration test when no Alertmanager URL is provided
* fix bad host for mimir_backend
* remove basic auth testing until we have an nginx image in our CI
* add integration tests for alerts
* fix tests
* change SendCtx -> Send, add context.Context to Send, fix CI
* add recover() for functions from the Prometheus Alertmanager HTTP client that could panic
* add TODO to implement PutAlerts in a way that mimics what Prometheus does
* fix log format
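The recover() guard mentioned in this commit can be sketched as a small wrapper that turns a panic into an error. The helper name `safely` is hypothetical, and the wrapped function stands in for any client call that might panic.

```go
package main

import "fmt"

// safely runs fn and converts a panic into a returned error, so that a
// panicking client call cannot take down the caller.
func safely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// The named return value lets the deferred function set the
			// error that the caller receives.
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}
```

A call such as `safely(func() error { return send(alerts) })` (with `send` being whatever client function is wrapped) then returns an error instead of panicking.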