Commit Graph

37 Commits

Author SHA1 Message Date
Santiago
309a7e7684
Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct (#85005)
* Alerting: Implement SaveAndApplyDefaultConfig in the remote Alertmanager struct

* send the hash of the encrypted configuration

* tests, default config hash in AM struct

* add missing default config to test

* restore build directory

* go work file...

* fix broken test

* remove unnecessary conversion to []byte

* go work again...

* make things work again with latest main branch changes

* update error messages in tests for decrypting config
2024-04-19 15:11:07 +02:00
Santiago
a2ce8fefed
Alerting: Use a struct when sending a Grafana AM configuration to the remote Alertmanager (#86451)
* Alerting: Use a struct when sending a Grafana AM configuration to the remote Alertmanager

* remove '-distroless' from mimir image name
2024-04-19 13:04:18 +02:00
Matthew Jacobson
f79dd7c7f9
Alerting: Persist silence state immediately on Create/Delete (#84705)
* Alerting: Persist silence state immediately on Create/Delete

Persists the silence state to the kvstore immediately instead of waiting for the
 next maintenance run. This is used after Create/Delete to prevent silences from
 being lost when a new Alertmanager is started before the state has persisted.
 This can happen, for example, in a rolling deployment scenario.

* Fix test that requires real data

* Don't error if silence state persist fails, maintenance will correct
2024-04-09 13:39:34 -04:00
Santiago
2e7cc68394
Alerting: Remove CleanUp method from the Alertmanager (#85650)
Alerting: Remove Cleanup method from the Alertmanager
2024-04-09 12:13:27 +02:00
Matthew Jacobson
0c3c5c5607
Alerting: Stop persisting silences and nflog to disk (#84706)
With this change, we no longer need to persist silence/nflog states to disk in addition to the kvstore
2024-03-23 00:37:33 +02:00
Santiago
a2facbecd4
Alerting: Implement ApplyConfig for remote primary mode (forked AM) (#84811)
* Alerting: Implement ApplyConfig for remote primary mode (forked AM)

* add TODO for saving the config hash in other config-related methods

* fix bad method receiver name (m -> am)

* tests

* add mutex

* remove sync loop
2024-03-22 15:17:41 +01:00
Santiago
4ad6d66479
Alerting: Remove ID from UserGrafanaConfig struct (#84602)
* Alerting: Remove ID from UserGrafanaConfig struct

* user custom mimir image withoud id in grafana config

* change mimir image name
2024-03-19 12:47:13 +01:00
Santiago
c9bb18101c
Alerting: Decrypt secrets before sending configuration to the remote Alertmanager (#83640)
* (WIP) Alerting: Decrypt secrets before sending configuration to the remote Alertmanager

* refactor, fix tests

* test decrypting secrets

* tidy up

* test SendConfiguration, quote keys, refactor tests

* make linter happy

* decrypt configuration before comparing

* copy configuration struct before decrypting

* reduce diff in TestCompareAndSendConfiguration

* clean up remote/alertmanager.go

* make linter happy

* avoid serializing into JSON to copy struct

* codeowners
2024-03-19 12:12:03 +01:00
Santiago
fbbda6c05e
Alerting: Retry readiness check to the remote Alertmanager on 5xx status code responses (#81174) 2024-01-24 21:39:06 +01:00
Santiago
3217a0dc05
Alerting: Fix state sync errors counter increment (#80702) 2024-01-17 11:04:27 +01:00
Santiago
6c87d9a1e7
Alerting: Stop retries on 4xx status code responses (remote Alertmanager readiness check) (#80350) 2024-01-11 12:12:35 +01:00
Santiago
9e78faa7ba
Alerting: Add metrics to the remote Alertmanager struct (#79835)
* Alerting: Add metrics to the remote Alertmanager struct

* rephrase http_requests_failed description

* make linter happy

* remove unnecessary metrics

* extract timed client to separate package

* use histogram collector from dskit

* remove weaveworks dependency

* capture metrics for all requests to the remote Alertmanager (both clients)

* use the timed client in the MimirAuthRoundTripper

* HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function

* refactor

* less git diff

* gauge for last readiness check in seconds

* initialize LastReadinesCheck to 0, tweak metric names and descriptions

* add counters for sync attempts/errors

* last config sync and last state sync timestamps (gauges)

* change latency metric name

* metric for remote Alertmanager mode

* code review comments

* move label constants to metrics package
2024-01-10 11:18:24 +01:00
Santiago
a77ba40ed4
Alerting: Use the forked Alertmanager for remote secondary mode (#79646)
* (WIP) Alerting: Use the forked Alertmanager for remote secondary mode

* fall back to using internal AM in case of error

* remove TODOs, clean up .ini file, add orgId as part of remote AM config struct

* log warnings and errors, fall back to remoteSecondary, fall back to internal AM only

* extract logic to decide remote Alertmanager mode to a separate function, switch on mode

* tests

* make linter happy

* remove func to decide remote Alertmanager mode

* refactor factory function and options

* add default case to switch statement

* remove ineffectual assignment
2023-12-21 15:26:31 +01:00
Santiago
9945514baa
Alerting: Validate configuration for the remote Alertmanager struct (#79691)
* Alerting: Validate configuration for the remote Alertmanager struct

* add TenantID to test

* add OrgID to config struct in tests
2023-12-19 18:41:48 +01:00
Santiago
23b4568597
Alerting: Send configuration and state to the remote Alertmanager on shutdown (#78682)
* Alerting: Send configuration and state to the remote Alertmanager on shutdown

* Alerting: Add a sync interval for ApplyConfig in remote secondary mode

* add routine to sync states and configs

* pass a cancellable context to syncRoutine(), remove tests for ApplyConfig, cache last config in memory

* extract logic to update config and state in the remote Alertmanager

* get latest config from the database

* avoid using separate goroutine for updating state and config

* clean up PR

* refactor, comments, tests

* update tests

* remove canceled context from calls to StopAndWait()

* create context with timeout and send config and state to remote Alertmanager

* update tests

* address code review comments
2023-12-13 22:53:09 +01:00
Santiago
91836e7832
Alerting: Add time-based convergence in remote secondary mode (#78809)
* Alerting: Add a sync interval for ApplyConfig in remote secondary mode

* add routine to sync states and configs

* pass a cancellable context to syncRoutine(), remove tests for ApplyConfig, cache last config in memory

* extract logic to update config and state in the remote Alertmanager

* get latest config from the database

* avoid using separate goroutine for updating state and config

* clean up PR

* refactor, comments, tests

* update tests

* add config struct for remote secondary forked Alertmanager

* use errgroups for sync operations

* use waitgroup instead of errgroup

* remove helper method to sync AMs

* check for errors instead of bool syncErr
2023-12-13 13:36:17 +01:00
Santiago
1a5c2cb55b
Alerting: Check whether the internal Alertmanager is ready in remote secondary mode (#79406)
Alerting: Check whether the internal Alertmanager is ready in remote secondary
2023-12-12 18:33:11 +01:00
gotjosh
cc3c0a2cc2
Alerting: Refactor readiness check (#78799)
* Alerting: Refactor readiness check

Moves the readiness check to the mimir client and removes the need to assert that we have senders - it already has a queue and can hold notifications until we're ready to send them.
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-12 15:34:54 +00:00
Santiago
d64c2b6f4e
Alerting: Implement ApplyConfig in the forked Alertmanager (#78684)
* Alerting: Add a sync interval for ApplyConfig in remote secondary mode

* remove out of scope code

* remove parentheses after CleanUp for consistency in test comments

* Add comment to ApplyConfig
2023-11-30 15:36:41 +01:00
Santiago
316c8b50bc
Alerting: Add SaveAndApply methods to the forked Alertmanager (remote secondary) (#78827)
* Alerting: Add configuration methods to the forked Alertmanager for remote secondary modes

* update comments
2023-11-30 15:18:56 +01:00
Santiago
73776f37eb
Alerting: Send state to the remote Alertmanager (#78538)
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager

Mimir client that understands the new APIs developed for mimir. Very much a WIP still.

* more wip

* appease the linter

* more linting

* add more code

* get state from kvstore, encode, send

* send state to the remote Alertmanager, extract fullstate logic into its own function

* pass kvstore to remote.NewAlertmanager()

* refactor

* add fake kvstore to tests

* tests

* use FileStore to get state

* always log 'completed state upload'

* refactor compareRemoteConfig

* base64-encode the state in the file store

* export silences and nflog filenames, refactor

* log 'completed state/config upload...' regardless of outcome

* add values to the state store in tests

* address code review comments

* log error from filestore

---------

Co-authored-by: gotjosh <josue.abreu@gmail.com>
2023-11-29 12:49:39 +01:00
Santiago
01d274852c
Alerting: Add GetFullState method to FileStore (#78701)
* Alerting: Add GetFullState method to FileStore

* make tests compile, create stateStore in NewAlertmanager

* return errors instead of logging, accept an arbitrary number of strings

* make NewAlertmanager() accept a stateStore
2023-11-28 15:34:45 +01:00
gotjosh
8120306fea
Remote Alertmanager(refactor): Only parse the URL once (#78631)
* Remote Alertmanager(refactor): Only parse the URL once

Exactly what it says in the tin.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* use the existing tests

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-11-24 11:05:13 +00:00
gotjosh
23fe8f4e9c
Alerting: Introduce a Mimir client as part of the Remote Alertmanager (#78357)
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager

This is our first attempt at making Grafana communicate use Mimir as a backend - it uses a new set of APIs that we've developed on the Mimir side to upload the grafana configuration and alertmanager state so that it can then be ported over.

Codewise, we've introduced a couple of things:

A client to isolate in its own package all the communication that happens with Mimir
A few changes to the remote/alertmanager to include uploading the configuration and state when it starts
A few refactors that align a bit better with the design approach that we're thinking
An integration tests again these newly developed APIs using a custom image

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
2023-11-23 16:59:36 +00:00
Santiago
4a152a0e35
Alerting: Add lifecycle methods to the forked Alertmanager (#77741)
* Alerting: Add an empty Forked Alertmanager

* Alerting: Add methods for silences to the forked Alertmanager

* check for errors in tests

* make linter happy

* Alerting: Add methods for alerts to the forked Alertmanager

* Alerting: Add methods for receivers to the forked Alertmanager

* Alerting: Add TestTemplate method to the forked Alertmanager

* make linter happy

* separate into both forked AMs

* fix tests

* Alerting: Add lifecycle methods to the forked Alertmanager
2023-11-14 11:17:17 +01:00
Santiago
8b751eb216
Alerting: Add TestTemplate method to the forked Alertmanager (#77577)
* Alerting: Add an empty Forked Alertmanager

* Alerting: Add methods for silences to the forked Alertmanager

* check for errors in tests

* make linter happy

* Alerting: Add methods for alerts to the forked Alertmanager

* Alerting: Add methods for receivers to the forked Alertmanager

* Alerting: Add TestTemplate method to the forked Alertmanager

* make linter happy

* separate into both forked AMs

* fix tests
2023-11-09 12:35:24 +01:00
Santiago
ba51c371ec
Alerting: Add methods for receivers to the forked Alertmanager (#77574)
* Alerting: Add an empty Forked Alertmanager

* Alerting: Add methods for silences to the forked Alertmanager

* check for errors in tests

* make linter happy

* Alerting: Add methods for alerts to the forked Alertmanager

* Alerting: Add methods for receivers to the forked Alertmanager

* make linter happy

* separate into both forked AMs

* fix tests

* rename testErr -> expErr
2023-11-09 11:38:16 +01:00
Santiago
e24fe96d90
Alerting: Add methods for alerts to the forked Alertmanager (#77571)
* Alerting: Add an empty Forked Alertmanager

* Alerting: Add methods for silences to the forked Alertmanager

* check for errors in tests

* make linter happy

* Alerting: Add methods for alerts to the forked Alertmanager

* make linter happy

* separate into both forked AMs

* rename testErr -> expErr
2023-11-08 13:52:04 +01:00
Santiago
197f0d2859
Alerting: Add methods for silences to the forked Alertmanager (#77805)
* Alerting: Add an empty Forked Alertmanager

* Alerting: Add methods for silences to the forked Alertmanager

* check for errors in tests

* make linter happy

* make linter happy

* Alerting: Add methods for silences to the forked Alertmanager
2023-11-08 12:03:40 +01:00
Santiago
01af8f61f1
Alerting: Separate the forked Alertmanager into two implementations (#77582) 2023-11-02 17:53:18 +01:00
Santiago
8fc9873443
Alerting: Add an empty Forked Alertmanager struct (#77550)
Alerting: Add an empty Forked Alertmanager
2023-11-02 16:49:03 +01:00
Santiago
a6b9b27673
Alerting: Remove OrgID() from the Alertmanager interface (#77398) 2023-10-31 10:58:47 +01:00
Santiago
f9fc2e4568
Alerting: Remove ConfigHash() from the Alertmanager interface (#77134) 2023-10-25 17:11:53 +02:00
Santiago
322a9c0b15
Alerting: Replace FileStore() for CleanUp() in the Alertmanager interface (#77126)
Alerting: Remplace FileStore() for CleanUp() in the Alertmanager interface
2023-10-25 13:58:28 +02:00
Santiago
01add144b8
Alerting: Send alerts to the remote Alertmanager (#77034)
* Alerting: Rename remote.ExternalAlertmanager to remote.Alertmanager

* Alerting: Send alerts to the remote Alertmanager

* add ticker to readiness check, add tests

* use options when creating a new sender.ExternaAlertmanager

* unexport defaultMaxQueueCapacity

* delete unused defaultConfig field

* add debug log line when sending alerts to the remote alertmanager

* move and refactor readiness check

* update tests to not include defaultConfig
2023-10-25 11:52:48 +02:00
Santiago
488a60aee6
Alerting: Rename remote.ExternalAlertmanager to remote.Alertmanager (#76956) 2023-10-23 15:37:14 +02:00
gotjosh
866acbd5ac
Alerting: Move ExternalAlertmanager to its own package (#76854)
* Alerting: Move `ExternalAlertmanager` to its own package

We'll avoid import cycles when using components from other packages. In addition to that, I've created an `Options` approach for the multiorg alertmanger to allow us to override how per tenant alertmanagers are created.

* switch things around

* address review comments

* fix references and warnings
2023-10-20 14:08:13 +02:00