Fixes a panic that would ocurr as we proxy 4xx responses. When this happens and the content type of the response is JSON we try to check if the response has a "message" key. Then, we assume that the key will contain a value of string but we don't take into account that this value can potentially be `null`.
This adds a type assertion check to to this assumption so that we can keep the original JSON body as the response if we're unable to extract an `message`.
* Fix Annotation creation
- Remove validation of panelID, now annotations are created irrespective on whether they're attached to a panel or not.
- Alwasy attach the annotation to an AlertID
* Fix annotation creation
* fix tests
* Alerting: Clear alerting rule evaluation errors after intermittent failures
When an alert transitioned in a way that `alerting -> error -> (alerting|nodata)`, the error provided by the `error` state would never be cleared thus the API and UI would show the health as an error.
* update AlertingEnabled and UnifiedAlertingSettings.Enabled to be pointers
* add a pseudo migration to fix the AlertingEnabled and UnifiedAlertingSettings.Enabled if the latter is not defined
* update the default configuration file to make default value for both 'enabled' flags be undefined
Misc
* update Migrator to expose DB engine. This is needed for a ualert migration to access the database while the list of migrations is created.
* add more verbose failure when migrations do not match
Co-authored-by: gotjosh <josue@grafana.com>
Co-authored-by: Yuriy Tseretyan <yuriy.tseretyan@grafana.com>
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* add value to email template
* add value to default template
* update test string
* test: fix ngalert test suite
* test: run CI
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* Alerting: accept mute_timing_intervals through the api for the embedded alertmanager
* add workaround for mutetimeinterval
* add mute timings to routes
* revert changes
* Update pkg/services/ngalert/api/api_alertmanager.go
* Update pkg/services/ngalert/api/api_alertmanager.go
* Update pkg/services/ngalert/api/api_alertmanager.go
* update prometheus/alertmanager dependency
* add some var docs
Refactor usage of legacy data contracts. Moves legacy data contracts
to pkg/tsdb/legacydata package.
Refactor pkg/expr to be a proper service/dependency that can be provided
to wire to remove some unneeded dependencies to SSE in ngalert and other places.
Refactor pkg/expr to not use the legacydata,RequestHandler and use
backend.QueryDataHandler instead.
* do not suppress NoData state
* extract conversion of state to postable alert + tests
* create a special alert instance if nodata
* use NoData when converting from Keep Last State instead of Alerting
* add silence during migration if NoData is mapped to KeepLastState.
* Use secrets service in pluginproxy
* Use secrets service in pluginxontext
* Use secrets service in pluginsettings
* Use secrets service in provisioning
* Use secrets service in authinfoservice
* Use secrets service in api
* Use secrets service in sqlstore
* Use secrets service in dashboardshapshots
* Use secrets service in tsdb
* Use secrets service in datasources
* Use secrets service in alerting
* Use secrets service in ngalert
* Break cyclic dependancy
* Refactor service
* Break cyclic dependancy
* Add FakeSecretsStore
* Setup Secrets Service in sqlstore
* Fix
* Continue secrets service refactoring
* Fix cyclic dependancy in sqlstore tests
* Fix secrets service references
* Fix linter errors
* Add fake secrets service for tests
* Refactor SetupTestSecretsService
* Update setting up secret service in tests
* Fix missing secrets service in multiorg_alertmanager_test
* Use fake db in tests and sort imports
* Use fake db in datasources tests
* Fix more tests
* Fix linter issues
* Attempt to fix plugin proxy tests
* Pass secrets service to getPluginProxiedRequest in pluginproxy tests
* Fix pluginproxy tests
* Revert using secrets service in alerting and provisioning
* Update decryptFn in alerting migration
* Rename defaultProvider to currentProvider
* Use fake secrets service in alert channels tests
* Refactor secrets service test helper
* Update setting up secrets service in tests
* Revert alerting changes in api
* Add comments
* Remove secrets service from background services
* Convert global encryption functions into vars
* Revert "Convert global encryption functions into vars"
This reverts commit 498eb19859.
* Add feature toggle for envelope encryption
* Rename toggle
Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
Co-authored-by: Joan López de la Franca Beltran <joanjan14@gmail.com>
* move evaluation function out of loop
* extract updateRule function
* isolate alertRule change. update returns new rule and the evaluation accepts the rule as argument
* extract retry loop into a function
* add function wide log context.
* refactor metrics + add tests + replace timeNow with schedule.clock
* Added an option to discord notifier to use discord's webhook name (useful for customizing notifications).
* Support ngalert system with discord username toggle
* Added ngalert discord test
* Apply suggestions from code review
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Docs updated with discord username setting
* Fix api integration test
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Validate contact point configuration during the migration
This minimises the chances of generating broken configuration as part of the migration. Originally, we wanted to generate it and not produce a hard stop in Grafana but this strategy has the chance to avoid delivering notifications for our users.
We now think it's better to hard stop the migration and let the user take care of resolving the configuration manually.
* Alerting: Fixes a bug when trying to sync broken alertmanager config
Broken alertmanager configuration has the potential to be introduced as part of a migration e.g. due to incompatible data between what grafana accepts and what the Alertmanager expects. When this happens, we expect an eventually consistent behaviour where we'll keep trying to apply the configuration until it works.
As part of change in https://github.com/grafana/grafana/pull/39237 we introduced a regression that modified this behaviour and instead tried to create a new Alertmanager for that organization everytime, which eventually ended up in a panic due to a duplicate metrics being registered.
This PR fixes that and introduces a test to catch further regressions.
* Remove disable orgs
* Encryption: Add support to encrypt/decrypt sjd
* Add datasources.Service as a proxy to datasources db operations
* Encrypt ds.SecureJsonData before calling SQLStore
* Move ds cache code into ds service
* Fix tlsmanager tests
* Fix pluginproxy tests
* Remove some securejsondata.GetEncryptedJsonData usages
* Add pluginsettings.Service as a proxy for plugin settings db operations
* Add AlertNotificationService as a proxy for alert notification db operations
* Remove some securejsondata.GetEncryptedJsonData usages
* Remove more securejsondata.GetEncryptedJsonData usages
* Fix lint errors
* Minor fixes
* Remove encryption global functions usages from ngalert
* Fix lint errors
* Minor fixes
* Minor fixes
* Remove securejsondata.DecryptedValue usage
* Refactor the refactor
* Remove securejsondata.DecryptedValue usage
* Move securejsondata to migrations package
* Move securejsondata to migrations package
* Minor fix
* Fix integration test
* Fix integration tests
* Undo undesired changes
* Fix tests
* Add context.Context into encryption methods
* Fix tests
* Fix tests
* Fix tests
* Trigger CI
* Fix test
* Add names to params of encryption service interface
* Remove bus from CacheServiceImpl
* Add logging
* Add keys to logger
Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
* Add missing key to logger
Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
* Undo changes in markdown files
* Fix formatting
* Add context to secrets service
* Rename decryptSecureJsonData to decryptSecureJsonDataFn
* Name args in GetDecryptedValueFn
* Add template back to NewAlertmanagerNotifier
* Copy GetDecryptedValueFn to ngalert
* Add logging to pluginsettings
* Fix pluginsettings test
Co-authored-by: Tania B <yalyna.ts@gmail.com>
Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
* Alerting: (wip) add template funcs
* Alerting: (wip) numeric template functions
* Alerting: (wip) template functions
* Test for the "args" function
* Alerting: (wip) Documentation for template functions
* Alerting: template functions - refactor
* code review changes
* disable linter error
* Use Prometheus implementation of TemplateExpander
* Update docs/sources/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule.md
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
* change templateCaptureValue to support using template functions
* Update pkg/services/ngalert/state/template.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Test and documentation added for reReplaceAll template function
* complete missing functions, documentation and tests
* Use the alert instance's evaluation time for expanding the template
* strvalue graphlink and tablelink functions
* delete duplicate test
* make strvalue return an empty string
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
Remove validation for labels to be accepted in the Alertmanager, This helps with datasources that produce non-compatible labels.
Adds an "object_matchers" to alert manager routers so we can support labels names with extended characters beyond prometheus/openmetrics. It only does this for the internal Grafana managed Alert Manager.
This requires a change to alert manager, so for now we use grafana/alertmanager which is a slight fork, with the intention of going back to upstream.
The frontend handles the migration of "matchers" -> "object_matchers" when the route is edited and saved. Once this is done, downgrades will not work old versions will not recognize the "object_matchers".
Co-authored-by: Kyle Brandt <kyle@grafana.com>
Co-authored-by: Nathan Rodman <nathanrodman@gmail.com>
Require guardian.New to take context.Context as first argument.
Migrates the GetDashboardAclInfoListQuery to be dispatched using context.
Ref #36734
Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
Co-authored-by: sam boyer <sam.boyer@grafana.com>
* Add method GetAllLatestAlertmanagerConfiguration to DBStore
* add method ApplyConfig to AlertManager
* update multiorg alert manager to load all alertmanager configs at once
* pass url parameters through context.Context
* fix url param names without colon prefix
* change context params to vars
* replace url vars in tests using new api
* rename vars to params
* add some comments
* rename seturlvars to seturlparams
* Chore: GetDashboardQuery should be dispatched using DispatchCtx
* Fix after merge
* Changes after review
* Various fixes
* Use GetDashboardCtx function instead of GetDashboard
* Alerting: Refactor & fix unified alerting metrics structure
Fixes and refactors the metrics structure we have for the ngalert service. Now, each component has its own metric struct that includes the JUST the metrics it uses. Additionally, I have fixed the configuration metrics and added new metrics to determine if we have discovered and started all the necessary configurations of an instance.
This allows us to alert on `grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0` to know whether an alertmanager instance did not start successfully.
* Alerting: Persist notification log and silences to the database
This removes the dependency of having persistent disk to run grafana alerting. Instead of regularly flushing the notification log and silences to disk we now flush the binary content of those files to the database encoded as a base64 string.
* Change templateCaptureValue to support using template functions
This commit changes templateCaptureValue to use float64 for the value
instead of *float64. This change means that annotations and labels can
use the float64 value with functions such as printf and avoid having to
check for nil. It also means that absent values are now printed as 0.
* Use math.NaN() instead of 0 for absent value
* Alerting: Fix alert flapping in the alertmanager
fixes a bug that caused Alerts that are evaluated at low intervals (sub 1 minute), to flap in the Alertmanager.
Mostly due to a combination of `EndsAt` and resend delay.
The Alertmanager uses `EndsAt` as a heuristic to know whenever it should resolve a firing alert, in the case that it hasn't heard
back from the alert generation system.
Because grafana sent the alert with an `EndsAt` which is equal to the `For` of the alert itself,
and we had a hard-coded 1 minute re-send delay (only applicable to firing alerts) this meant that a firing alert would resolve in the Alertmanager before we re-notify that it still firing.
This commit, increases the `EndsAt` by 3x the the resend delay or alert interval (depending on which one is higher). The resendDelay has been decreased to 30 seconds.
Fixes#30144
Co-authored-by: dsotirakis <sotirakis.dim@gmail.com>
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
Co-authored-by: Ida Furjesova <ida.furjesova@grafana.com>
Co-authored-by: Jack Westbrook <jack.westbrook@gmail.com>
Co-authored-by: Will Browne <wbrowne@users.noreply.github.com>
Co-authored-by: Leon Sorokin <leeoniya@gmail.com>
Co-authored-by: Andrej Ocenas <mr.ocenas@gmail.com>
Co-authored-by: spinillos <selenepinillos@gmail.com>
Co-authored-by: Karl Persson <kalle.persson@grafana.com>
Co-authored-by: Leonard Gram <leo@xlson.com>
Introduces org-level isolation for the Alertmanager and its components.
Silences, Alerts and Contact points are not separated by org and are not shared between them.
Co-authored with @davidmparrott and @papagian
This commit adds contact point testing to ngalerts via a new API
endpoint. This endpoint accepts JSON containing a list of
receiver configurations which are validated and then tested
with a notification for a test alert. The endpoint returns JSON
for each receiver with a status and error message. It accepts
a configurable timeout via the Request-Timeout header (in seconds)
up to a maximum of 30 seconds.
* Alerting: Expose discovered and dropped Alertmanagers
Exposes the API for discovered and dropped Alertmanagers.
* make admin config poll interval configurable
* update after rebase
* wordsmith
* More wordsmithing
* change name of the config
* settings package too
* Alerting: modify table and accessors to limit org access appropriately
* Update migration to create multiple Alertmanager configs
* Apply suggestions from code review
Co-authored-by: gotjosh <josue@grafana.com>
* replace mg.ClearMigrationEntry()
mg.ClearMigrationEntry() would create a new session.
This commit introduces a new migration for clearing an entry from migration log for replacing mg.ClearMigrationEntry() so that all dashboard alert migration operations will run inside the same transaction.
It adds also `SkipMigrationLog()` in Migrator interface for skipping adding an entry in the migration_log.
Co-authored-by: gotjosh <josue@grafana.com>
* Alerting: Send alerts to external Alertmanager(s)
Within this PR we're adding support for registering or unregistering
sending to a set of external alertmanagers. A few of the things that are
going are:
- Introduce a new table to hold "admin" (either org or global)
configuration we can change at runtime.
- A new periodic check that polls for this configuration and adjusts the
"senders" accordingly.
- Introduces a new concept of "senders" that are responsible for
shipping the alerts to the external Alertmanager(s). In a nutshell,
this is the Prometheus notifier (the one in charge of sending the alert)
mapped to a multi-tenant map.
There are a few code movements here and there but those are minor, I
tried to keep things intact as much as possible so that we could have an
easier diff.
* Alerting: Refactor `Run` of the scheduler
A bit of a refactor to make the diff easier to read for supporting
external Alertmanagers.
We'll introduce another routine that checks the database for
configuration and spawns other routines accordingly.
* Block the wait.
* Fix test