* Alerting: During legacy migration reduce the number of created silences
During legacy migration every migrated rule was given a label rule_uid=<uid>.
This was used to silence DatasourceError/DatasourceNoData alerts for
migrated rules that had either ExecutionErrorState/NoDataState set to
keep_state, respectively.
This could potentially create a large amount of silences and a high cardinality
label. Both of these scenarios have poor outcomes for CPU load and latency in
unified alerting.
Instead, this change creates one label per ExecutionErrorState/NoDataState when
they are set to keep_state as well as two silence rules, if rules with said
labels were created during migration. These silence rules are:
- __legacy_silence_error_keep_state__ = true
- __legacy_silence_nodata_keep_state__ = true
This will drastically reduce the number of created silence rules in most cases
as well as not create the potentially high cardinality label `rule_uid`.
* Remove FolderID from service tests
* Add models
* Add folderID pack to publicdashboard tests
* Remove folderID from dashboard tests
* Remove folderID from folders
* Remove folderID from ngalert tests
* Remove nolint comment
* Add back some tests after rebase
This PR has two steps that together create a functional dry-run capability for the migration.
By enabling the feature flag alertingPreviewUpgrade when on legacy alerting it will:
a. Allow all Grafana Alerting background services except for the scheduler to start (multiorg alertmanager, state manager, routes, …).
b. Allow the UI to show Grafana Alerting pages alongside legacy ones (with appropriate in-app warnings that UA is not actually running).
c. Show a new “Alerting Upgrade” page and register associated /api/v1/upgrade endpoints that will allow the user to upgrade their organization live without restart and present a summary of the upgrade in a table.
Some refactoring that will simplify next changes for dry-run PRs. This should be no-op as far as the created ngalert resources and database state, though it does change some logs.
The key change here is to modify migrateOrg to return pairs of legacy struct + ngalert struct instead of actually persisting the alerts and alertmanager config. This will allow us to capture error information during dry-run migration.
It also moves most persistence-related operations such as title deduplication and folder creation to the right before we persist. This will simplify eventual partial migrations (individual alerts, dashboards, channels, ...).
Additionally it changes channel code to deal with PostableGrafanaReceiver instead of PostableApiReceiver (integration instead of contact point).
* In migration, create one label per channel
This PR changes how routing is done by the legacy alerting migration.
Previously, we created a single label on each alert rule that contained an array of contact point names. Ex: __contact__="slack legacy testing","slack legacy testing2"
This label was then routed against a series of regex-matching policies with continue=true. Ex: __contacts__ =~ .*"slack legacy testing".*
In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.
This PR replaces the single __contact__ label with multiple __legacy_c_{contactname}__ labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against __legacy_use_channels__ = true for ease of organization.
This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.
* Folders: Show folders user has access to at the root level
* Refactor
* Refactor
* Hide parent folders user has no access to
* Skip expensive computation if possible
* Fix tests
* Fix potential nil access
* Fix duplicated folders
* Fix linter error
* Fix querying folders if no managed permissions set
* Update benchmark
* Add special shared with me folder and fetch available non-root folders on demand
* Fix parents query
* Improve db query for folders
* Reset benchmark changes
* Fix permissions for shared with me folder
* Simplify dedup
* Add option to include shared folder permission to user's permissions
* Fix nil UID
* Remove duplicated folders from shared list
* Folders: Fix fetching empty folder
* Nested folders: Show dashboards with directly assigned permissions
* Fix slow dashboards fetch
* Refactor
* Fix cycle dependencies
* Move shared folder to models
* Fix shared folder links
* Refactor
* Use feature flag for permissions
* Use feature flag
* Review comments
* Expose shared folder UID through frontend settings
* Add frontend type for sharedWithMeFolderUID option
* Refactor: apply review suggestions
* Fix parent uid for shared folder
* Fix listing shared dashboards for users with access to all folders
* Prevent creating folder with "shared" UID
* Add tests for shared folders
* Add test for shared dashboards
* Fix linter
* Add metrics for shared with me folder
* Add metrics for shared with me dashboards
* Fix tests
* Tests: add metrics as a dependency
* Fix access control metadata for shared with me folder
* Use constant for shared with me
* Optimize parent folders access check, fetch all folders in one query.
* Use labels for metrics
* Alerting: Add clean_upgrade config and deprecate force_migration
Upgrading to UA and rolling back will no longer delete any data by default.
Instead, each set of tables will remain unchanged when switching between
legacy and UA. As such, the force_migration config has been deprecated
and no extra configuration is required to roll back to legacy anymore.
If clean_upgrade is set to true when upgrading from legacy alerting to Unified
Alerting, grafana will first delete all existing Unified Alerting resources,
thus re-upgrading all organizations from scratch. If false or unset,
organizations that have previously upgraded will not lose their existing Unified
Alerting data when switching between legacy and Unified Alerting.
Similar to force_migration, it should be kept false when not needed as it may
cause unintended data-loss if left enabled.
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
* Alerting: Keep track of individual org migration status
Save migration status per migrated org.
Change the meaning (and key/value) of the org_id=0 entry
to store the current (previous) config value used by alerting.
This is so we can know when to upgrade/downgrade by
comparing with the new config value in
UnifiedAlerting.IsEnabled.
* Alerting: In migration improve deduplication of title and group
This change improves alert titles generated in the legacy migration
that occur when we need to deduplicate titles. Now when duplicate
titles are detected we will first attempt to append a sequential index,
falling back to a random uid if none are unique within 10 attempts.
This should cause shorter and more easily readable deduplicated
titles in most cases.
In addition, groups are no longer deduplicated. Instead we set them
to a combination of truncated dashboard name and humanized alert
frequency. This way, alerts from the same dashboard share a group
if they have the same evaluation interval. In the event that truncation
causes overlap, it won't be a big issue as all alerts will still be in a
group with the correct evaluation interval.
* Alerting: In migration, fallback to '1s' for malformed min interval
During legacy migration, when we encounter an alert datasource query
with a min interval (interval field in the query model) that is not
parseable, instead of failing the migration we fallback to a min interval
of 1s and continue.
The reason for this is a bug in legacy alerting (existing for a few major
versions) which allows arbitrary dashboard variables to be used as the
min interval, even though those variables do not work and will cause
the legacy alert to fail with `interval calculation failed: time: invalid
duration`.
* remove use of SignedInUserCopies
* add extra safety to not cross assign permissions
unwind circular dependency
dashboardacl->dashboardaccess
fix missing import
* correctly set teams for permissions
* fix missing inits
* nit: check err
* exit early for api keys
* Alerting: Move migration from background service run to ngalert init
sqlite database write contention between the migration's single transaction and
dashboard provisioning's frequent commits was causing the migration to
fail with SQLITE_BUSY/SQLITE_BUSY_SNAPSHOT on all retries.
This is not a new issue for sqlite+grafana, but the discrepancy between the
length of the transactions was causing it to be very consistent. In addition,
since a failed migration has implications on the assumed correctness of the
alertmanager and alert rule definition state, we cause a server shutdown on
error. This can make e2e tests as well as some high-load provisioned
sqlite installations flaky on startup.
The correct fix for this is better transaction management across various
services and is out of scope for this change as we're primarily interested in
mitigating the current bout of server failures in e2e tests when using sqlite.
* Fix migration of custom dashboard permissions
Dashboard alert permissions were determined by both its dashboard and
folder scoped permissions, while UA alert rules only have folder
scoped permissions.
This means, when migrating an alert, we'll need to decide if the parent folder
is a correct location for the newly created alert rule so that users, teams,
and org roles have the same access to it as they did in legacy.
To do this, we translate both the folder and dashboard resource
permissions to two sets of SetResourcePermissionCommands. Each of these
encapsulates a mapping of all:
OrgRoles -> Viewer/Editor/Admin
Teams -> Viewer/Editor/Admin
Users -> Viewer/Editor/Admin
When the dashboard permissions (including those inherited from the parent
folder) differ from the parent folder permissions alone, we need to create a
new folder to represent the access-level of the legacy dashboard.
Compromises:
When determining the SetResourcePermissionCommands we only take into account
managed and basic roles. Fixed and custom roles introduce significant complexity
and synchronicity hurdles. Instead, we log a warning they had the potential to
override the newly created folder permissions.
Also, we don't attempt to reconcile datasource permissions that were
not necessary in legacy alerting. Users without access to the necessary
datasources to edit an alert rule will need to obtain said access separate from
the migration.
This PR replaces the vendored models in the migration with their equivalent ngalert models. It also replaces the raw SQL selects and inserts with service calls.
It also fills in some gaps in the testing suite around:
- Migration of alert rules: verifying that the actual data model (queries, conditions) are correct 9a7cfa9
- Secure settings migration: verifying that secure fields remain encrypted for all available notifiers and certain fields migrate from plain text to encrypted secure settings correctly e7d3993
Replacing the checks for custom dashboard ACLs will be replaced in a separate targeted PR as it will be complex enough alone.