This PR has two steps that together create a functional dry-run capability for the migration.
By enabling the feature flag alertingPreviewUpgrade when on legacy alerting it will:
a. Allow all Grafana Alerting background services except for the scheduler to start (multiorg alertmanager, state manager, routes, …).
b. Allow the UI to show Grafana Alerting pages alongside legacy ones (with appropriate in-app warnings that UA is not actually running).
c. Show a new “Alerting Upgrade” page and register associated /api/v1/upgrade endpoints that will allow the user to upgrade their organization live without restart and present a summary of the upgrade in a table.
* Drop from API response
* Drop from swagger docs
* Drop from integration tests
* regenerate public swagger docs
* Drop from frontend
* Drop asserts for namespaceID field
Swagger(ngalert): Add `X-Disable-Provenance` to missing operations
I added all functions that call the `determineProvenance` function
Schema changes are from:
`make` in `pkg/services/ngalert/api/tooling`
`make swagger-clean && make openapi3-gen` in root
* Export Notification Policy correctly (#78020)
The JSON version of an exported Notification Policy now
inline correctly the policy in the same way the Yaml version
does.
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* ngalert `make`: Support GNU install on Darwin
Currently, the Makefile assumes that Darwin is using the Mac version of `sed`
I have the GNU version, so it failed. With this PR, it checks which version is installed
I also called `make` and there are some changes that came out of it
* swagger-gen
* Alerting: Apply query optimization to eval endpoints
Previously, query optimization was applied to alert queries when scheduled but
not when ran through `api/v1/eval` or `/api/v1/rule/test/grafana`. This could
lead to discrepancies between preview and scheduled alert results.
When running in dev mode, error messages would contain an additional "error" property alongside "message". Since this causes confusion, that has been removed and now error messages are the same both modes (using "message").
* Alerting: post alerts to the remote Alertmanager and fetch them
* fix broken tests
* Alerting: Add Mimir Backend image to devenv (blocks)
* add alerting as code owner for mimir_backend block
* Alerting: Use Mimir image to run integration tests for the remote Alertmanager
* skip integration test when running all tests
* skipping integration test when no Alertmanager URL is provided
* fix bad host for mimir_backend
* remove basic auth testing until we have an nginx image in our CI
* add integration tests for alerts
* fix tests
* change SendCtx -> Send, add context.Context to Send, fix CI
* add reover() for functions from the Prometheus Alertmanager HTTP client that could panic
* add TODO to implement PutAlerts in a way that mimicks what Prometheus does
* fix log format
This PR replaces the vendored models in the migration with their equivalent ngalert models. It also replaces the raw SQL selects and inserts with service calls.
It also fills in some gaps in the testing suite around:
- Migration of alert rules: verifying that the actual data model (queries, conditions) are correct 9a7cfa9
- Secure settings migration: verifying that secure fields remain encrypted for all available notifiers and certain fields migrate from plain text to encrypted secure settings correctly e7d3993
Replacing the checks for custom dashboard ACLs will be replaced in a separate targeted PR as it will be complex enough alone.
* update storage's method InstertRules to return ids of added rules as slice to keep the same order as rules in the argument
* schematize response of update rule group endpoint, add created, updated, deleted fields that contain UID of affected rules.
* update integration tests to use the new fields
* extend RuleStore interface to get namespace by UID
* add new export API endpoints
* implement request handlers
* update authorization and wire handlers to paths
* add folder error matchers to errorToResponse
* add tests for export methods
* Alerting: Manage remote Alertmanager silences
* fix typo
* check errors when encoding json in fake external AM
* take path from configured URL, check for nil responses
* Add support for `keep_firing_for` in ruler proxy
* Don't delete `keep_firing_for` when editing a rule with the field set
Co-Authored-By: Sonia Aguilar <33540275+soniaAguilarPeiron@users.noreply.github.com>
---------
Co-authored-by: Sonia Aguilar <33540275+soniaAguilarPeiron@users.noreply.github.com>
* Make identity.Requester available at Context
* Clean pkg/services/guardian/guardian.go
* Clean guardian provider and guardian AC
* Clean pkg/api/team.go
* Clean ctxhandler, datasources, plugin and live
* Clean dashboards and guardian
* Implement NewUserDisplayDTOFromRequester
* Change status code numbers for http constants
* Upgrade signature of ngalert services
* log parsing errors instead of throwing error
* Make identity.Requester available at Context
* Clean pkg/services/guardian/guardian.go
* Clean guardian provider and guardian AC
* Clean pkg/api/team.go
* Clean ctxhandler, datasources, plugin and live
* Question: what to do with the UserDisplayDTO?
* Clean dashboards and guardian
* Remove identity.Requester from ReqContext
* Implement NewUserDisplayDTOFromRequester
* Fix tests
* Change status code numbers for http constants
* Upgrade signature of ngalert services
* log parsing errors instead of throwing error
* Fix tests and add logs
* linting
* add metrics and tracing to state manager
* propagate tracer to state manager
* add scheduler metrics
* fix backtesting
* add test for state metrics
* remove StateUpdateCount
* update docs
* metrics can be null
* add tracer to new tests
* introduce a new action "alert.provisioning.secrets:read" and role "fixed:alerting.provisioning.secrets:reader"
* update alerting API authorization layer to let the user read provisioning with the new action
* let new action use decrypt flag
* add action and role to docs
* Alerting: Fix contact point testing with secure settings
Fixes double encryption of secure settings during contact point testing and removes code duplication
that helped cause the drift between alertmanager and test endpoint. Also adds integration tests to cover
the regression.
Note: provisioningStore is created to remove cycle and the unnecessary dependency.
* Alerting: Make ApplyAlertmanagerConfiguration only decrypt/encrypt new/changed secure settings
Previously, ApplyAlertmanagerConfiguration would decrypt and re-encrypt all secure settings. However, this caused re-encrypted secure settings to be included in the raw configuration when applied to the embedded alertmanager, resulting in changes to the hash. Consequently, even if no actual modifications were made, saving any alertmanager configuration triggered an apply/restart and created a new historical entry in the database.
To address the issue, this modifies ApplyAlertmanagerConfiguration, which is called by POST `api/alertmanager/grafana/config/api/v1/alerts`, to decrypt and re-encrypt only new and updated secure settings. Unchanged secure settings are loaded directly from the database without alteration.
We determine whether secure settings have changed based on the following (already in-use) assumption: Only new or updated secure settings are provided via the POST `api/alertmanager/grafana/config/api/v1/alerts` request, while existing unchanged settings are omitted.
* Ensure saving a grafana-managed contact point will only send new/changed secure settings
Previously, when saving a grafana-managed contact point, empty string values were transmitted for all unset secure settings. This led to potential backend issues, as it assumed that only newly added or updated secure settings would be provided.
To address this, we now exclude empty ('', null, undefined) secure settings, unless there was a pre-existing entry in secureFields for that specific setting. In essence, this means we only transmit an empty secure setting if a previously configured value was cleared.
* Fix linting
* refactor omitEmptyUnlessExisting
* fixup
---------
Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
* Add limit query parameter
* Drop copy paste comment
* Extend history query limit to 30 days and 250 entries
* Fix history log entries ordering
* Update no history message, add empty history test
---------
Co-authored-by: Konrad Lalik <konrad.lalik@grafana.com>
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
* replace condition validation with just structural validation
* validate conditions of only new and updated rules
* add integration tests for rule update\delete API
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Alerting: Repurpose rule testing endpoint to return potential alerts
This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations.
The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent.
This means that the endpoint will, among other things:
- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
* Alerting: Fix unique violation when updating rule group with title chains/cycles
The uniqueness constraint for titles within an org+folder is enforced on every update within a transaction instead of on commit (deferred constraint). This means that there could be a set of updates that will throw a unique constraint violation in an intermediate step even though the final state is valid. For example, a chain of updates RuleA -> RuleB -> RuleC could fail if not executed in the correct order, or a swap of titles RuleA <-> RuleB cannot be executed in any order without violating the constraint.
The exact solution to this is complex and requires determining directed paths and cycles in the update graph, adding in temporary updates to break cycles, and then executing the updates in reverse topological order (see first commit in PR if curious).
This is not implemented here.
Instead, we choose a simpler solution that works in all cases but might perform more updates than necessary. This simpler solution makes a determination of whether an intermediate collision could occur and if so, adds a temporary title on all updated rules to break any cycles and remove the need for specific ordering.
In addition, we make sure diffs are executed in the following order: DELETES, UPDATES, INSERTS.
* remove unused HasAdmin and HasEdit permission methods
* remove legacy AC from HasAccess method
* remove unused function
* update alerting tests to work with RBAC
* update to alerting 20230418161049-5f374e58cb32
* rename renamed structs in https://github.com/grafana/alerting/pull/73
* update ValidateContactPoint to use BuildReceiverConfiguration
* update logger factory according to changes
* rewrite integration builder
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Alerting: Allow hooking into request handler functions.
Adds a facility to AlertNG for hooking into API handlers, allowing the
replacement of request handlers for specific paths. One of goals of this
approach was to allow hooking as late as possible in the request, e.g.
after all middleware has been applied, to simplfiy usage.
* Update pkg/services/ngalert/api/hooks.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Update pkg/services/ngalert/api/hooks.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Update pkg/services/ngalert/ngalert.go
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Fixes to review comments
* Fix passing logger in
---------
Co-authored-by: gotjosh <josue.abreu@gmail.com>
Alerting: Add totalsFiltered to RuleResponse to facilitate hidden by filters count
Currently, when both a limit_alerts and a matcher/state filter is applied, there is not enough information to determine how many alert instances were hidden by the filters. Only enough to determine the total hidden by the limit and filter combined.
This change adds a separate totalsFiltered field alongside the AlertRule totals that will contain the count of instances after filters but before limits.
This commit adds support for limits and filters to the Prometheus Rules
API.
Limits:
It adds a number of limits to the Grafana flavour of the Prometheus Rules
API:
- `limit` limits the maximum number of Rule Groups returned
- `limit_rules` limits the maximum number of rules per Rule Group
- `limit_alerts` limits the maximum number of alerts per rule
It sorts Rule Groups and rules within Rule Groups such that data in the
response is stable across requests. It also returns summaries (totals)
for all Rule Groups, individual Rule Groups and rules.
Filters:
Alerts can be filtered by state with the `state` query string. An example
of an HTTP request asking for just firing alerts might be
`/api/prometheus/grafana/api/v1/rules?state=alerting`.
A request can filter by two or more states by adding additional `state`
query strings to the URL. For example `?state=alerting&state=normal`.
Like the alert list panel, the `firing`, `pending` and `normal` state are
first compared against the state of each alert rule. All other states are
ignored. If the alert rule matches then its alert instances are filtered
against states once more.
Alerts can also be filtered by labels using the `matcher` query string.
Like `state`, multiple matchers can be provided by adding additional
`matcher` query strings to the URL.
The match expression should be parsed using existing regular expression
and sent to the API as URL-encoded JSON in the format:
{
"name": "test",
"value": "value1",
"isRegex": false,
"isEqual": true
}
The `isRegex` and `isEqual` options work as follows:
| IsEqual | IsRegex | Operator |
| ------- | -------- | -------- |
| true | false | = |
| true | true | =~ |
| false | true | !~ |
| false | false | != |
* replace receiver errors with one from alerting
* add the converter to alerting models
* update buildReceiverIntegration to accept GrafanaReceiver
---------
Co-authored-by: George Robinson <george.robinson@grafana.com>
* define initial service and add to wire
* update caching service interface
* add skipQueryCache header handler and update metrics query function to use it
* add caching service as a dependency to query service
* working caching impl
* propagate cache status to frontend in response
* beginning of improvements suggested by Lean - separate caching logic from query logic.
* more changes to simplify query function
* Decided to revert renaming of function
* Remove error status from cache request
* add extra documentation
* Move query caching duration metric to query package
* add a little bit of documentation
* wip: convert resource caching
* Change return type of query service QueryData to a QueryDataResponse with Headers
* update codeowners
* change X-Cache value to const
* use resource caching in endpoint handlers
* write resource headers to response even if it's not a cache hit
* fix panic caused by lack of nil check
* update unit test
* remove NONE header - shouldn't show up in OSS
* Convert everything to use the plugin middleware
* revert a few more things
* clean up unused vars
* start reverting resource caching, start to implement in plugin middleware
* revert more, fix typo
* Update caching interfaces - resource caching now has a separate cache method
* continue wiring up new resource caching conventions - still in progress
* add more safety to implementation
* remove some unused objects
* remove some code that I left in by accident
* add some comments, fix codeowners, fix duplicate registration
* fix source of panic in resource middleware
* Update client decorator test to provide an empty response object
* create tests for caching middleware
* fix unit test
* Update pkg/services/caching/service.go
Co-authored-by: Arati R. <33031346+suntala@users.noreply.github.com>
* improve error message in error log
* quick docs update
* Remove use of mockery. Update return signature to return an explicit hit/miss bool
* create unit test for empty request context
* rename caching metrics to make it clear they pertain to caching
* Update pkg/services/pluginsintegration/clientmiddleware/caching_middleware.go
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
* Add clarifying comments to cache skip middleware func
* Add comment pointing to the resource cache update call
* fix unit tests (missing dependency)
* try to fix mystery syntax error
* fix a panic
* Caching: Introduce feature toggle to caching service refactor (#66323)
* introduce new feature toggle
* hide calls to new service behind a feature flag
* remove licensing flag from toggle (misunderstood what it was for)
* fix unit tests
* rerun toggle gen
---------
Co-authored-by: Arati R. <33031346+suntala@users.noreply.github.com>
Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com>
* Alerting: Tiny refactor on the eval and schedule packages
two very small things:
- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
---------
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* Alerting: Add endpoint to revert to a previous alertmanager configuration
This endpoint is meant to be used in conjunction with /api/alertmanager/grafana/config/history to
revert to a previously applied alertmanager configuration. This is done by ID instead of raw config
string in order to avoid secure field complications.
This commit adds a number of limits to the Grafana flavor of the
Prometheus Rules API:
1. `limit` limits the maximum number of Rule Groups returned
2. `limit_rules` limits the maximum number of rules per Rule Group
3. `limit_alerts` limits the maximum number of alerts per rule
It sorts Rule Groups and rules within Rule Groups such that data in the
response is stable across requests. It also returns summaries (totals) for
all Rule Groups, individual Rule Groups and rules.
* WIP
* skip invalid historic configurations instead of erroring
* add warning log when bad historic config is found
* remove unused custom marshaller for GettableHistoricUserConfig
* add id to historic user config, move limit check to store, fix typo
* swagger spec
* move export rules to definitions package
* move provisioning contact point methods to provisioning package
* move AlertRuleGroupWithFolderTitle to ngalert models and adapter functions to api's compat
* move rule_types files back to where they were before.