* WIP
* skip invalid historic configurations instead of erroring
* add warning log when bad historic config is found
* remove unused custom marshaller for GettableHistoricUserConfig
* add id to historic user config, move limit check to store, fix typo
* swagger spec
* Alerting: get alert rules on faults (#61248)
Two functions used to fetch alert rules from DB are updated:
- GetAlertRulesForScheduling
- ListAlertRules
Rows are scanned one by one so good ones are returned.
Common Error is logged with indication how many
rules failed on deserialization.
Resolved: #61248
* updates from review comments
* change error to warning, apply correct format
* Update pkg/services/ngalert/store/alertmanager.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* different wording and data
---------
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Mark AM configuration as applied
* add missing checks, make linter happy
* fix deadlock, mark as valid on save and on load
* mark configurations only if needed
* check error after applyConfig()
* code review comments
* code review changes
* more code review changes
* clean HistoricConfigFromAlertConfig function
* Add is_paused attr to the POST alert rule group endpoint
* Add is_paused to alerting API POST alert rule group
* Fixed tests
* Add is_paused to alerting gettable endpoints
* Fix integration tests
* Alerting: allow to pause existing rules (#62401)
* Display Pause Rule switch in Editing Rule form
* add isPaused property to form interface and dto
* map isPaused prop with is_paused value from DTO
Also update test snapshots
* Append '(Paused)' text on alert list state column when appropriate
* Change Switch styles according to discussion with UX
Also adding a tooltip with info what this means
* Adjust styles
* Fix alignment and isPaused type definition
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* Fix test
* Fix test
* Fix RuleList test
---------
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* wip
* Fix tests and add comments to clarify AlertRuleWithOptionals
* Fix one more test
* Fix tests
* Fix typo in comment
* Fix alert rule(s) cannot be paused via API
* Add integration tests for alerting api pausing flow
* Remove duplicated integration test
---------
Co-authored-by: Virginia Cepeda <virginia.cepeda@grafana.com>
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Add field in alert_rule model, add state to alert_instance model, and state to eval
* Remove paused state from eval package
* Skip paused alert rules in scheduler
* Add migration to add is_paused field to alert_rule table
* Convert to postable alerts only if not normal, pernding, or paused
* Handle paused eval results in state manager
* Add Paused state to eval package
* Add paused alerts logic in scheduler
* Skip alert on scheduler
* Remove paused status from eval package
* Apply suggestions from code review
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Remove state
* Rethink schedule and manager for paused alerts
* Change return to continue
* Remove unused var
* Rethink alert pausing
* Paused alerts storing annotations
* Only add one state transition
* Revert boolean method renaming refactor
* Revert take image refactor
* Make registry errors public
* Revert method extraction for getting a folder title
* Revert variable renaming refactor
* Undo unnecessary changes
* Revert changes in test
* Remove IsPause check in PatchPartiLAlertRule function
* Use SetNormal to set state
* Fix text by returning to old behaviour on alert rule deletion
* Add test in schedule_unit_test.go to test ticks with paused alerts
* Add coment to clarify usage of context.Background()
* Add comment to clarify resetStateByRuleUID method usage
* Move rule get to a more limited scope
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* rum gofmt on pkg/services/ngalert/schedule/schedule.go
* Remove defer cancel for context
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/testing.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/schedule/schedule_unit_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/models/instance_test.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* skip scheduler rule state clean up on paused alert rule
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Fix mock in test
* Add (hopefully) final suggestions
* Use error channel from recordAnnotationsSync to cancel context
* Run make gen-cue
* Place pause alert check in channel update after version check
* Reduce branching un update channel select
* Add if for error and move code inside if in state manager ResetStateByRuleUID
* Add reason to logs
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: George Robinson <george.robinson@grafana.com>
* Do not delete alert rule routine, just exit on eval if is paused
* Reduce branching and create-close a channel to avoid deadlocks
* Separate state deletion and state reset (includes history saving)
* Add current pause state in rule route in scheduler
* Split clearState and bring errCh closer to RecordStatesAsync call
* Change rule to ruleMeta in RecordStatesAsync
* copy state to be able to modify it
* Add timeout to context creation
* Shorten the timeout
* Use resetState is rule is paused and deleteState if rule is not paused
* Remove Empty state reason
* Save every rule change in historian
* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID
* Remove useless line
* Remove outdated comment
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
* add feature flag `alertingNoNormalState`
* update instance database to support exclusion of state in list operation
* do not save normal state and delete transitions to normal
* update get methods to filter out normal state
* Update config store to split between active and history tables
* Migrations to fix up indexes
* Implement migration from old format to new
* Move add migrations call
* Delete duplicated rows
* Explicitly map fields
* Quote the column name because it's a reserved word
* Lift migrations to top
* Use XORM for nearly everything, avoid any non trivial raw SQL
* Touch up indexes and zero out IDs on move
* Drop TODO that's already completed
* Fix assignment of IDs
* Update config store to split between active and history tables
* Migrations to fix up indexes
* Implement migration from old format to new
* Move add migrations call
* Delete duplicated rows
* Explicitly map fields
* Quote the column name because it's a reserved word
* Lift migrations to top
* Guardian: Use dashboard UID instead of ID
* Apply suggestions from code review
Introduce several guardian constructors and each time use
the most appropriate one.
This change marks tests in the `sender` package that use an external
process as integration tests instead of unit tests. This speeds up the
package's unit tests by about 20 seconds.
This change also reduces the number of alert instances in the `store`
package's bulk write integration test from 20_000 to 10_000. This is
still enough to exercise the bulk-write code but speeds up the package
tests from about 250s to 130s.
Put together, integration tests go to about 160s while also speeding up
unit tests by 20s.
* Nested Folders: Support getting of nested folder in folder service when feature flag is set
* Fix lint
* Fix some tests
* Fix ngalert test
* ngalert fix
* Fix API tests
* Fix some tests and lint
* Fix lint 2
* Fix library elements and panels
* Add access control to get folder
* Cleanup and minor test change
* chore: refactor CountDashboardsInFolder to use the more efficient Count() sql function
* feat(nested folders): Add CountAlertRulesInFolder to ngalert store
This commit adds CountAlertRulesInFolder and a new model for the CountAlertRulesQuery. It returns a count of alert rules associated with a given orgID and parent folder UID. (the namespace referenced inside alert rules is the parent folder).
I'm not sure where this belongs in the ngalert service, so that will come in a future commit.
* clean up and document integration test convention
* clarify integration test conventions
* clean up integration tests that don't follow convention
* mark testIntegration* functions as helpers to avoid confusion
* chore: add alias for InitTestDB and Session
Adds an alias for the sqlstore InitTestDB and Session, and updates tests using these to reduce dependencies on the sqlstore.Store.
* next pass of removing sqlstore imports
* last little bit
* remove mockstore where possible
* Chore: move folder service interface into a separate package
* copy implementation into a standalone package
* move implementation and tests to the new folder package
* remove leftovers from wire
* add test doubles for folder service
* fix tests in library panels/elements
* fix provideservice in ngalert
This commit fixes a bug where changing the Folder or Rule Group of an existing rule returns the following error in PostgreSQL "pq: missing FROM-clause for table a"
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions"
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
* Move fakeRuleStore to tests/fakes package
* Break stub dependencies on store
* Update existing tests to point to new location
* Remove unused stub of TimeNow
* Rename fake to take advantage of package name
* SQLStore: Ensure that sessions are always closed
Delete `NewSession()` in favour of `WithDbSession()`
* Add WithDbSessionForceNewSession to the interface
* Apply suggestions from code review
* Refactor state manager to not depend on rule store interface
* Refactor grafana and proxied ruler APIs to not depend on store.RuleStore
* Refactor folder subscription logic to not use store.RuleStore
* Delete dead code
* Delete store.RuleStore
* Add consumer-side store interface to state manager
* Remove dead dependency
* Delete dead dependency in API struct
* Delete store-layer InstanceStore interface
* Move fake for state's InstanceStore interface to state package
* Alerting: Change screenshots to use components
This commit changes screenshots to use a number of components instead of a set of functional wrappers.
It moves the uploading of screenshots from the screenshot package to the image package so we can re-use the same code for both uploading screenshots and server-side images; SingleFlight from the screenshot package to the image package so we can use it for both taking and uploading the screenshot, where as before it was used just for taking the screenshot; and it also removes the use of a cache because we know that screenshots can be taken at most once per tick of the scheduler.
* WIP
* Set public_suffix to a pre Ruby 2.6 version
* we don't need to install python
* Stretch->Buster
* Bump versions in lib.star
* Manually update linter
Sort of messy, but the .mod-file need to contain all dependencies that
use 1.16+ features, otherwise they're assumed to be compiled with
-lang=go1.16 and cannot access generics et al.
Bingo doesn't seem to understand that, but it's possible to manually
update things to get Bingo happy.
* undo reformatting
* Various lint improvements
* More from the linter
* goimports -w ./pkg/
* Disable gocritic
* Add/modify linter exceptions
* lint + flatten nested list
Go 1.19 doesn't support nested lists, and there wasn't an obvious workaround.
https://go.dev/doc/comment#lists
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
This change also updates some of our tests so that they run successfully against postgreSQL - we were using random Int64s, but postgres integers, which our tables use, max out at 2^31-1
* Update GetAlertRulesForScheduling to query for folders (if needed)
* Update scheduler's alertRulesRegistry to cache folder titles along with rules
* Update rule eval loop to take folder title from the
* Extract interface RuleStore
* Pre-fetch the rule keys with the version to detect changes, and query the full table only if there are changes.
* Chore: Add user service method SetUsingOrg
* Chore: Add user service method GetSignedInUserWithCacheCtx
* Use method GetSignedInUserWithCacheCtx from user service
* Fix lint after rebase
* Fix lint
* Fix lint error
* roll back some changes
* Roll back changes in api and middleware
* Add xorm tags to SignedInUser ID fields
* Wire up to full alert rule struct
* Extract group change detection logic to dedicated file
* GroupDiff -> GroupDelta for consistency
* Calculate deltas and handle backwards compatible requests
* Separate changes and insert/update/delete as needed
* Regenerate files
* Don't touch the DB if there are no changes
* Quota checking, delete unused file
* Mark modified records as provisioned
* Validation + a couple API layer tests
* Address linter errors
* Fix issue with UID assignment and rule creation
* Propagate top level group fields to all rules
* Tests for repeated updates and versioning
* Tests for quota and provenance checks
* Fix linter errors
* Regenerate
* Factor out some shared logic
* Drop unnecessary multiple nilchecks
* Use alternative strategy for rolling UIDs on inserted rules
* Fix tests, add back nilcheck, refresh UIDs during test
* Address feedback
* Add missing nil-check
* Move SignedInUser to user service and RoleType and Roles to org
* Use go naming convention for roles
* Fix some imports and leftovers
* Fix ldap debug test
* Fix lint
* Fix lint 2
* Fix lint 3
* Fix type and not needed conversion
* Clean up messages in api tests
* Clean up api tests 2
* remove support for bus from scheduler
* rename event to FolderTitleUpdated and fire only if title has changed
* add method to increase version of all rules that belong to a folder
* update ngalert service to subscribe to folder title change event call data store and update scheduler
* add tests
* Replace subquery with threshold calculation
* Use offset/limit to account for orgs with large gaps in IDs
* Collapse into one statement
* Drop dead constants
* Revert to 2 query approach
* Drop unused consts again
* update GetAlertRulesForSchedulingQuery to have result AlertRule
* update fetcher utils and registry to support AlertRule
* alertRuleInfo to use alert rule instead of version
* update updateCh hanlder of ruleRoutine to just clean up the state. The updated rule will be provided at the next evaluation
* update evalCh handler of ruleRoutine to use rule from the message and clear state as well as update extra labels
* remove unused function in ruleRoutine
* remove unused model SchedulableAlertRule
* store rule version in ruleRoutine instead of rule
* do not call the sender if nothing to send
* move fake FakeExternalAlertmanager to sender package
* move tests from scheduler to router
* update alerts router to have all fields private
* update scheduler tests to use sender mock
* make 'for' pointer to distinguish between missing field and 0
* set 'for' to -1 if the value is missing but not allow negative in the request + path -1 with the value from original rule
* update store validation to not allow negative 'for'
* update usages to use pointer
Migrations:
* add a new column alert_group_idx to alert_rule table
* add a new column alert_group_idx to alert_rule_version table
* re-index existing rules during migration
API:
* set group index on update. Use the natural order of items in the array as group index
* sort rules in the group on GET
* update the version of all rules of all affected groups. This will make optimistic lock work in the case of multiple concurrent request touching the same groups.
UI:
* update UI to keep the order of alerts in a group
* Alerting: Add first Grafana reserved label g_label
g_label holds the title of the folder container the alert. The intention of this label
is to use it as part of the new default notification policy groupBy.
* Add nil check on updateRule labels map
* Disable gocyclo lint on schedule.ruleRoutine
will remove later in a separate refactoring PR to reduce complexity.
* Address doc suggestions
* Update g_folder for rules in folder when folder title changes
* Remove global bus in FolderService
* Modify tests to fit new common g_folder label
* Add changelog entry
* Fix merge conflicts
* Switch GrafanaReservedLabelPrefix from `g_` to `grafana_`
* Chore: Exclude integration tests from running on test-backend step
* Remove -v from go test command
* Add check to skip integration tests before each integration test
* Try to restart pipeline
* Retrying to make pipeline run
* Alerting: move group update to alert rule service
* rename validateAlertRuleInterval to validateRuleGroupInterval
* init baseinterval correctly
* add seconds suffix
* extract validation function for reusability
* add context to err message
* Alerting: decapitalize log lines and use "err" as the key for errors
Found using (logger|log).(Warn|Debug|Info|Error)\([A-Z] and (logger|log).(Warn|Debug|Info|Error)\(.+"error"
* update authz to exclude entire group if user does not have access to rule
* change rule update authz to not return changes because if user does not have access to any rule in group, they do not have access to the rule
* a new query that returns alerts in group by UID of alert that belongs to that group
* collect all affected groups during calculate changes
* update authorize to check access to groups
* update tests for calculateChanges to assert new fields
* add authorization tests
Adds three functions:
`withStoredImages` iterates over a list of models.Alerts, extracting a stored image's data from storage, if available, and executing a user-provided function.
`withStoredImage` does this for an image attached to a specific alert.
`openImage` finds and opens an image file on disk.
Moves `store.Image` to `models.Image`
Simplifies `channels.ImageStore` interface and updates notifiers that use it to use the simpler methods.
Updates all pkg/alert/notifier/channels to use withStoredImage routines.
* chore: replace artisnal FakeDashboardService with generated mock
Maintaining a handcrafted FakeDashboardService is not sustainable now that we are in the process of moving the dashboard-related functions out of sqlstore.
* sqlstore: finish removing Find and SearchDashboards
Find and SearchDashboards were previously copied into the dashboard service. This commit completes that work, removing Find and SearchDashboards from the sqlstore and updating callers to use the dashboard service.
* dashboards: remove SearchDashboards from Store interface
SearchDashboards is a wrapper around FindDashboard that transforms the results, so it's been moved out of the Store entirely and the functionality moved into the Dashboard Service's search implementation.
The database tests depended heavily on the transformation, so I added testSearchDashboards, a copy of search dashboards, instead of (heavily) refactoring all the tests.
If an image token is present in an alert instance, the email notifier will attempt to find a public URL for the image token. If found, it will add that to the email as the `ImageLink` field. If only local file data is available, the notifier will attach the file to the outgoing email using the `EmbeddedImage` field.
This change adds a field to state.State and models.AlertInstance
that indicate the "Reason" that an instance has its current state. This
helps us account for cases where the state is "Normal" but the
underlying evaluation returned "NoData" or "Error", for example.
Fixes#42606
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
This change extracts screenshot data from alert messages via a private annotation `__alertScreenshotToken__` and attaches a URL to a Slack message or uploads the data to an image upload endpoint if needed.
This change also implements a few foundational functions for use in other notifiers.
The State Manager will now take screenshots when an alert instance
switches to an Alerting or Resolved state.
Signed-off-by: Joe Blubaugh joe.blubaugh@grafana.com
This commit adds a pkg/services/screenshot package for taking and uploading screenshots of Grafana dashboards. It supports taking screenshots of both dashboards and individual panels within a dashboard, using the rendering service.
The screenshot package has the following services, most of which can be composed:
BrowserScreenshotService (Takes screenshots with headless Chrome)
CachableScreenshotService (Caches screenshots taken with another service such as BrowserScreenshotService)
NoopScreenshotService (A no-op screenshot service for tests)
SingleFlightScreenshotService (Prevents duplicate screenshots when taking screenshots of the same dashboard or panel in parallel)
ScreenshotUnavailableService (A screenshot service that returns ErrScreenshotsUnavailable)
UploadingScreenshotService (A screenshot service that uploads taken screenshots)
The screenshot package does not support wire dependency injection yet. ngalert constructs its own version of the service. See https://github.com/grafana/grafana/issues/49296
This PR also adds an ImageScreenshotService to ngAlert. This is used to take screenshots with a screenshotservice and then store their location reference for use by alert instances and notifiers.
* Alerting: unwrap upsert into insert and update function
* add changelog entry
* remove changelog entry
* rename upsertrule to updaterule
* use directly alertrule model for inserts
* add test for updating a rule with a conflicting name
* add check for access to rule's data source in GET APIs
* use more general method GetAlertRules instead of GetNamespaceAlertRules.
* remove unused GetNamespaceAlertRules.
Tests:
* create a method to generate permissions for rules
* extract method to create RuleSrv
* add tests for RouteGetNamespaceRulesConfig
* move validation at the beginning of method
* remove usage of GetOrgRuleGroups because it is not necessary. All information is already available in memory.
* remove unused method
* Base-line API for provisioning notification policies
* Wire API up, some simple tests
* Return provenance status through API
* Fix missing call
* Transactions
* Clarity in package dependencies
* Unify receivers in definitions
* Fix issue introduced by receiver change
* Drop unused internal test implementation
* FGAC hooks for provisioning routes
* Polish, swap names
* Asserting on number of exposed routes
* Don't bubble up updated object
* Integrate with new concurrency token feature in store
* Back out duplicated changes
* Remove redundant tests
* Regenerate and create unit tests for API layer
* Integration tests for auth
* Address linter errors
* Put route behind toggle
* Use alternative store API and fix feature toggle in tests
* Fixes, polish
* Fix whitespace
* Re-kick drone
* Rename services to provisioning
* Use alert:create action for folder search with edit permissions. This matches the action that is used to query dashboards (the update will be addressed later)
* Update rule store to use FindDashboards instead of folder service to list folders the user has access to view alerts. Folder service does not support query type and additional filters.
* Do not check whether the user can save to folder if FGAC is enabled because it is checked on API level.