* Alerting: In migration improve deduplication of title and group
This change improves the alert titles generated by the legacy migration
when duplicate titles need to be deduplicated. When a duplicate title is
detected, we now first attempt to append a sequential index, falling
back to a random UID if none is unique within 10 attempts.
This should produce shorter, more readable deduplicated titles in most
cases.
In addition, groups are no longer deduplicated. Instead we set them
to a combination of truncated dashboard name and humanized alert
frequency. This way, alerts from the same dashboard share a group
if they have the same evaluation interval. In the event that truncation
causes overlap, it won't be a big issue as all alerts will still be in a
group with the correct evaluation interval.
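For illustration, a minimal sketch of that deduplication strategy in Go (hypothetical names, not the actual migration code):
```go
package migration

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomUID stands in for Grafana's short UID generator in this sketch.
func randomUID() string {
	b := make([]byte, 4)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// deduplicateTitle appends a sequential index to a duplicate title and
// falls back to a random UID if none is unique within 10 attempts.
func deduplicateTitle(title string, taken map[string]struct{}) string {
	if _, exists := taken[title]; !exists {
		return title
	}
	for i := 1; i <= 10; i++ {
		candidate := fmt.Sprintf("%s #%d", title, i)
		if _, exists := taken[candidate]; !exists {
			return candidate
		}
	}
	return fmt.Sprintf("%s #%s", title, randomUID())
}
```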
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager
A Mimir client that understands the new APIs developed for Mimir. Very much still a WIP.
* more wip
* appease the linter
* more linting
* add more code
* get state from kvstore, encode, send
* send state to the remote Alertmanager, extract fullstate logic into its own function
* pass kvstore to remote.NewAlertmanager()
* refactor
* add fake kvstore to tests
* tests
* use FileStore to get state
* always log 'completed state upload'
* refactor compareRemoteConfig
* base64-encode the state in the file store
* export silences and nflog filenames, refactor
* log 'completed state/config upload...' regardless of outcome
* add values to the state store in tests
* address code review comments
* log error from filestore
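Roughly, the state upload flow described in the commits above looks like this sketch (types and method names are assumptions):
```go
// uploadState reads the full state through the FileStore (which
// base64-encodes what it loads from the kvstore) and sends it to the
// remote Alertmanager. Names here are approximate.
func (am *Alertmanager) uploadState(ctx context.Context) error {
	// Logged regardless of outcome, per the commits above.
	defer am.log.Debug("completed state upload")

	state, err := am.stateStore.GetFullState(ctx, SilencesFilename, NotificationLogFilename)
	if err != nil {
		return fmt.Errorf("getting full state: %w", err)
	}
	if err := am.mimirClient.CreateGrafanaAlertmanagerState(ctx, state); err != nil {
		return fmt.Errorf("sending state: %w", err)
	}
	return nil
}
```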
---------
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Apply query optimization to eval endpoints
Previously, query optimization was applied to alert queries when scheduled but
not when run through `api/v1/eval` or `/api/v1/rule/test/grafana`. This could
lead to discrepancies between preview and scheduled alert results.
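A hedged sketch of the fix (function and package names are hypothetical): the eval endpoints now run the same optimization pass the scheduler applies.
```go
// Hypothetical sketch: run the same query optimization the scheduler
// uses before evaluating a preview, so results match scheduled runs.
func (srv *TestingApiSrv) evalCondition(ctx context.Context, cond models.Condition) (eval.Results, error) {
	// Previously only the scheduler ran this pass, so previews could
	// differ from scheduled evaluations.
	if _, err := optimizer.OptimizeAlertQueries(cond.Data); err != nil {
		return nil, err
	}
	return srv.evaluator.ConditionEval(ctx, cond)
}
```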
* Alerting: Add GetFullState method to FileStore
* make tests compile, create stateStore in NewAlertmanager
* return errors instead of logging, accept an arbitrary number of strings
* make NewAlertmanager() accept a stateStore
* Alerting: In migration, fallback to '1s' for malformed min interval
During legacy migration, when we encounter an alert datasource query
with a min interval (interval field in the query model) that is not
parseable, instead of failing the migration we fall back to a min interval
of 1s and continue.
The reason for this is a bug in legacy alerting (present for a few major
versions) that allows arbitrary dashboard variables to be used as the
min interval, even though those variables do not work and will cause
the legacy alert to fail with `interval calculation failed: time: invalid
duration`.
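As a sketch, with time.ParseDuration standing in for Grafana's interval parser:
```go
import "time"

// parseMinInterval falls back to 1s instead of failing the migration when
// the query model's interval is malformed (e.g. a dashboard variable).
func parseMinInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil {
		// e.g. raw == "$min_interval": not a valid duration in legacy either.
		return time.Second
	}
	return d
}
```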
* Remote Alertmanager(refactor): Only parse the URL once
Exactly what it says on the tin.
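That is, something along these lines (names assumed):
```go
type Mimir struct {
	endpoint *url.URL
}

// NewMimir parses the URL once at construction time; requests reuse the
// stored *url.URL instead of re-parsing the raw string on every call.
func NewMimir(rawURL string) (*Mimir, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, fmt.Errorf("invalid Mimir URL %q: %w", rawURL, err)
	}
	return &Mimir{endpoint: u}, nil
}
```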
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* use the existing tests
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager
This is our first attempt at making Grafana use Mimir as a backend - it uses a new set of APIs that we've developed on the Mimir side to upload the Grafana configuration and Alertmanager state so that it can then be ported over.
Codewise, we've introduced a couple of things:
- A client that isolates all the communication with Mimir in its own package
- Changes to the remote Alertmanager to upload the configuration and state when it starts
- A few refactors that align better with the design approach we're thinking of
- Integration tests against these newly developed APIs using a custom image
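The client's surface looks roughly like this (method names are assumptions, not the exact exported API):
```go
// MimirClient isolates all communication with Mimir in its own package.
type MimirClient interface {
	// Upload the Grafana Alertmanager configuration for Mimir to run.
	CreateGrafanaAlertmanagerConfig(ctx context.Context, cfg string) error
	// Upload the base64-encoded silences and notification-log state.
	CreateGrafanaAlertmanagerState(ctx context.Context, state string) error
}
```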
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* remove use of SignedInUserCopies
* add extra safety to not cross-assign permissions
unwind circular dependency
dashboardacl->dashboardaccess
fix missing import
* correctly set teams for permissions
* fix missing inits
* nit: check err
* exit early for api keys
* Alerting: Add an empty Forked Alertmanager
* Alerting: Add methods for silences to the forked Alertmanager
* check for errors in tests
* make linter happy
* Alerting: Add methods for alerts to the forked Alertmanager
* Alerting: Add methods for receivers to the forked Alertmanager
* Alerting: Add TestTemplate method to the forked Alertmanager
* make linter happy
* separate into both forked AMs
* fix tests
* Alerting: Add lifecycle methods to the forked Alertmanager
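Conceptually, the forked Alertmanager is a thin router in front of the internal and remote implementations; a sketch with approximate names:
```go
// ForkedAlertmanager implements the Alertmanager interface by delegating
// each method to either the internal or the remote Alertmanager.
type ForkedAlertmanager struct {
	internal Alertmanager
	remote   Alertmanager
}

// In remote-secondary mode, silences are managed by the internal
// Alertmanager, which remains the primary.
func (fam *ForkedAlertmanager) CreateSilence(ctx context.Context, silence *PostableSilence) (string, error) {
	return fam.internal.CreateSilence(ctx, silence)
}
```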
* rename testErr -> expErr
* When running in dev mode, error messages would contain an additional "error" property alongside "message". Since this caused confusion, it has been removed and error messages are now the same in both modes (using "message").
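Illustratively, the response payload now carries a single field (the struct is a sketch, not the exact type):
```go
// errorResponse sketches the JSON shape: the dev-mode-only "error"
// property is gone; "message" is used in both modes.
type errorResponse struct {
	Message string `json:"message"`
}
```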
* Alerting: Rename remote.ExternalAlertmanager to remote.Alertmanager
* Alerting: Send alerts to the remote Alertmanager
* add ticker to readiness check, add tests (see the sketch below)
* use options when creating a new sender.ExternalAlertmanager
* unexport defaultMaxQueueCapacity
* delete unused defaultConfig field
* add debug log line when sending alerts to the remote alertmanager
* move and refactor readiness check
* update tests to not include defaultConfig
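The readiness check from these commits amounts to polling on a ticker until the remote Alertmanager responds; a sketch with assumed names:
```go
// checkReadiness polls the remote Alertmanager on a ticker and marks the
// struct ready on the first successful response. Names are approximate.
func (am *Alertmanager) checkReadiness(ctx context.Context) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := am.mimirClient.IsReady(ctx); err == nil {
				am.ready = true
				return nil
			}
		}
	}
}
```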
* Alerting: Move `ExternalAlertmanager` to its own package
This lets us avoid import cycles when using components from other packages. In addition to that, I've created an `Options` approach for the multi-org Alertmanager to allow us to override how per-tenant Alertmanagers are created.
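The `Options` approach is the usual functional-options pattern; a sketch (names are assumptions):
```go
type Option func(*MultiOrgAlertmanager)

// WithAlertmanagerOverride lets callers replace how per-tenant
// Alertmanagers are created, e.g. to return a remote implementation.
func WithAlertmanagerOverride(factory func(orgID int64) (Alertmanager, error)) Option {
	return func(moa *MultiOrgAlertmanager) { moa.factory = factory }
}

func NewMultiOrgAlertmanager(opts ...Option) *MultiOrgAlertmanager {
	// newInternalAlertmanager is the assumed default per-tenant factory.
	moa := &MultiOrgAlertmanager{factory: newInternalAlertmanager}
	for _, opt := range opts {
		opt(moa)
	}
	return moa
}
```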
* switch things around
* address review comments
* fix references and warnings
* Alerting: Move migration from background service run to ngalert init
SQLite database write contention between the migration's single transaction and
dashboard provisioning's frequent commits was causing the migration to
fail with SQLITE_BUSY/SQLITE_BUSY_SNAPSHOT on all retries.
This is not a new issue for SQLite + Grafana, but the discrepancy between the
lengths of the transactions made the failure very consistent. In addition,
since a failed migration has implications for the assumed correctness of the
Alertmanager and alert rule definition state, we shut the server down on
error. This can make e2e tests, as well as some high-load provisioned
SQLite installations, flaky on startup.
The correct fix for this is better transaction management across various
services and is out of scope for this change, as we're primarily interested in
mitigating the current bout of server failures in e2e tests when using SQLite.
* Alerting: post alerts to the remote Alertmanager and fetch them
* fix broken tests
* Alerting: Add Mimir Backend image to devenv (blocks)
* add alerting as code owner for mimir_backend block
* Alerting: Use Mimir image to run integration tests for the remote Alertmanager
* skip integration test when running all tests
* skipping integration test when no Alertmanager URL is provided
* fix bad host for mimir_backend
* remove basic auth testing until we have an nginx image in our CI
* add integration tests for alerts
* fix tests
* change SendCtx -> Send, add context.Context to Send, fix CI
* add recover() for functions from the Prometheus Alertmanager HTTP client that could panic (see the sketch below)
* add TODO to implement PutAlerts in a way that mimics what Prometheus does
* fix log format
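The recover() guard mentioned above converts panics from the upstream client into errors; a sketch:
```go
// withRecover wraps calls into the Prometheus Alertmanager HTTP client,
// turning a panic into a returned error instead of crashing the caller.
func withRecover(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}
```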