This commit adds a customizable timeout for screenshots called
capture_timeout. The default value is 10 seconds, and the maximum
value is 30 seconds. This timeout should be less than the minimum
Interval of all Evaluation Groups to avoid back pressure on alert
rule evaluation.
* Update config store to split between active and history tables
* Migrations to fix up indexes
* Implement migration from old format to new
* Move add migrations call
* Delete duplicated rows
* Explicitly map fields
* Quote the column name because it's a reserved word
* Lift migrations to top
* Use XORM for nearly everything, avoid any non trivial raw SQL
* Touch up indexes and zero out IDs on move
* Drop TODO that's already completed
* Fix assignment of IDs
Automatically forward core plugin request HTTP headers in outgoing HTTP requests.
Core datasource plugin authors don't have to specifically handle forwarding of HTTP
headers, e.g. do not have to "hardcode" the header-names in the datasource plugin,
if not having custom needs.
Fixes#57065
* refactor email to not use simplejson
* add tests
* split integration test and unit test + more unit-tests
* Remove outdated comment
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
* introduce alias for json.RawMessage with name RawMessage. This is needed to keep raw JSON and implement a marshaler for YAML, which does not seem to be used but there are tests that fail.
* replace usage of simplejson with RawMessage in NotificationChannelConfig
* remove usage of simplejson in tests
* change migration code to convert simplejson to raw message
* Set Dashboard and Panel IDs on rule group replacement
* fix comments and abbreviate test variable name
* Update pkg/services/ngalert/provisioning/alert_rules.go
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
* Update config store to split between active and history tables
* Migrations to fix up indexes
* Implement migration from old format to new
* Move add migrations call
* Delete duplicated rows
* Explicitly map fields
* Quote the column name because it's a reserved word
* Lift migrations to top
* introduce Logger interface local to channles + implementaton that wraps the Grafana logger
* make NewFactoryConfig accept LoggerFactory
* add logger field to FactoryConfig
* update usages of log.Logger to internal interface
* Guardian: Use dashboard UID instead of ID
* Apply suggestions from code review
Introduce several guardian constructors and each time use
the most appropriate one.
* Implement backtesting engine that can process regular rule specification (with queries to datasource) as well as special kind of rules that have data frame instead of query.
* declare a new API endpoint and model
* add feature toggle `alertingBacktesting`
* Move truncation code to util to mirror upstream
* Resolve merge conflicts
* Align logging of alert key
* Update tests and fix field passing bug
* Remove superfluous newline in test now that we trim whitespace
* Uptake minor log changes from upstream
This change marks tests in the `sender` package that use an external
process as integration tests instead of unit tests. This speeds up the
package's unit tests by about 20 seconds.
This change also reduces the number of alert instances in the `store`
package's bulk write integration test from 20_000 to 10_000. This is
still enough to exercise the bulk-write code but speeds up the package
tests from about 250s to 130s.
Put together, integration tests go to about 160s while also speeding up
unit tests by 20s.
This commit better defines how we set states in resultNormal,
resultAlerting, resultError and resultNoData. It changes the existing
code to call methods such as SetAlerting, SetPending, SetNormal,
SetError and NoData instead of assigning values to each individual field
whenever the state is changed. This should make it easier to understand
what fields should be set for which states and avoid cases where states are
missing, or have additional unexpected fields.
Before this change, the alerting provisioning system incorrectly used
the QuotaTarget to check if alerting's request quota had been reached.
The quota service requires the QuotaTargetSrv, which is what's
registered with the service at startup time. This is leading to errors
in the provisioning system.
* update pagerduty and opsgenie to deserialize settings using standard JSON library
* update pagerduty truncation to use a function from Alertamanger package
* update opsgenie to use payload model (same as in Alertmanager)
This commit makes a number of changes to how images work in Slack
notifications.
It adds support for uploading images to Slack via the files.upload
API when the contact point has a token. Images are no longer linked
via a URL if a token is present.
Each image uploaded to Slack is posted as a reply to the original
notification. Up to maxImagesPerThreadTs images can be posted as
replies before a final message is sent with:
There are no images than can be shown here. To see the panels for
all firing and resolved alerts please check Grafana
Incoming Webhooks cannot upload files via files.upload and so webhooks
require the image to be uploaded to cloud storage and linked via URL.
This change preallocates slices and maps where the size of the data is known before the object is created.
Co-authored-by: Joe Blubaugh <joe.blubaugh@grafana.com>
* Auth: move interface to its own file
* Auth: move to test package
* Auth: move quota consts to auth file
* Auth: move service to impl package
* Auth: move interfaces and related models to auth package
* Auth: Create sub package and type alias to avoid circular dependency
* Component that can cache and extract variable dependencies
* Component that can cache and extract variable dependencies
* Updates
* Refactoring
* Lots of refactoring and iterations of supporting both re-rendering and query re-execution
* Updated SceneCanvasText
* Updated name of file
* Updated
* Refactoring a bit
* Added back getName
* Added comment
* minor fix
* Minor fix
* Merge fixes
* Merge fixes
* Some review fixes
* Updated comment
* Added forceRender function
* Add back fail on console log
* Nested Folders: Support getting of nested folder in folder service when feature flag is set
* Fix lint
* Fix some tests
* Fix ngalert test
* ngalert fix
* Fix API tests
* Fix some tests and lint
* Fix lint 2
* Fix library elements and panels
* Add access control to get folder
* Cleanup and minor test change
* Remove URL-based alertmanagers from endpoint config
* WIP
* Add migration and alertmanagers from admin_configuration
* Empty comment removed
* set BasicAuth true when user is present in url
* Remove Alertmanagers from GET /admin_config payload
* Remove URL-based alertmanager configuration from UI
* Fix new uid generation in external alertmanagers migration
* Fix tests for URL-based external alertmanagers
* Fix API tests
* Add more tests, move migration code to separate file, and remove possible am duplicate urls
* Fix edge cases in migration
* Fix imports
* Remove useless fields and fix created_at/updated_at retrieval
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Konrad Lalik <konrad.lalik@grafana.com>
* Refactor state and manager to not depend directly on image interface
* Move generic errors to models package
* Move NotAvailableImageService to state as its only references are in state tests
* Move NoopImageService to state package
* Move mock to state package
* Fix linter error
* Fix comment styling
* Fix a couple added references introduced by rebase
* Empty commit to kick build
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
* chore: refactor CountDashboardsInFolder to use the more efficient Count() sql function
* feat(nested folders): Add CountAlertRulesInFolder to ngalert store
This commit adds CountAlertRulesInFolder and a new model for the CountAlertRulesQuery. It returns a count of alert rules associated with a given orgID and parent folder UID. (the namespace referenced inside alert rules is the parent folder).
I'm not sure where this belongs in the ngalert service, so that will come in a future commit.
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
* Reduce piecemeal state fields
* Read data directly off state instead of rule
* Unify state and context into single struct
* Expose contextual information to layer above setNextState
* Work in terms of ContextualState and call historian in batches
* Call annotations service in batches
* Export format state and reason and remove workaround in unrelated test package
* Add new method to annotation service for batch inserting
* Fix loop variable aliasing bug caught by linter, didn't change behavior
* Incl timerange on annotation tests
* Insert one at a time if tags are present
* Point to rule from ContextualState rather than copy fields
* Build annotations and copy data prior to starting goroutine
* Rename to StateTransition
* Use new bulk-insert utility
* Remove rule from StateTransition and pass in directly to historian
* Simplify annotations logic since we have only one rule
* Fix logs and context, nilcheck, simplify method name
* Regenerate mock
* clean up and document integration test convention
* clarify integration test conventions
* clean up integration tests that don't follow convention
* mark testIntegration* functions as helpers to avoid confusion
* add: IsServiceAccount to SignedInUser and IsRealUser
* fix: linting error
* refactor: add function IsServiceAccountUser()
By adding the function IsServiceAccountUser() we use it to identify for
ServiceAccounts in the HasUniqueID() since caching is built up on having
a uniqueID, see comment: https://github.com/grafana/grafana/pull/58015#discussion_r1011361880
* Add custom title to pushover contact point
* Update pkg/services/ngalert/notifier/channels/pushover.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Use simplejson
* Use more verbose variable names
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add custom title to googlechet contact point
* Update pkg/services/ngalert/notifier/channels/googlechat.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Use simplejson
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add custom title to discord contact point
* Update pkg/services/ngalert/notifier/channels/discord.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Use simplejson
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add custom title to DingDing contact point
* Update pkg/services/ngalert/notifier/channels_config/available_channels.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/notifier/channels/dingding.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add error checking before URL templating
* Remove comment
* Use simplejson
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add title and description to VictorOps contact point
* Update pkg/services/ngalert/notifier/channels_config/available_channels.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Add description and details to Kafka notifier
* Fixed testing and add new logic testing
* Add proper description to kafka contact point UI
* Update pkg/services/ngalert/notifier/channels_config/available_channels.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Update pkg/services/ngalert/notifier/channels_config/available_channels.go
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
* make TimeRange interface and add relative range
* make Execute methods support the current time
* update resample to support relative time range
* update DSNode to support relative time range
* update query service to create queries with absolute time
* make alerting evaluator create relative time ranges
* Revert "Revert "Prometheus: Type and flavor configuration (#56496)" (#57552)"
This reverts commit 2432ce619a.
* Adds new fields and documentation for Prometheus datasource configuration: prometheus type, and version
* Adding two new fields to the data JSON in the prometheus datasource configuration: prometheusType, and prometheusVersion.
* Version field will attempt to auto-detect via buildinfo API when prometheus Type is selected
* Touch up log statements, fix casing, add and normalize contexts
* Dedicated logger for dashboard resolver
* Avoid injecting logger to historian
* More minor log touch-ups
* Dedicated logger for state manager
* Use rule context in annotation creator
* Rename base logger and avoid redundant contextual loggers
* Audit logs in sender package
* Fix casing and touch up a few key names
* Avoid logging entire alert struct
* Log configuration ID being applied
* Revert change to errorf rather than log
* Tune levels further and remove some redundancies
* Adjust logger naming and standardize log context
* Adjust logger naming in router
* Move log and get rid of dead error handling code
* Define EvaluationContext
* Refactor ConditionEval to use new context struct
* Refactor QueriesAndExpressionsEval to use EvaluationContext
* Remove dead field from AlertExecCtx
* Refactor Validate to use EvaluationContext
* Get rid of privately used AlertExecCtx
* Move EvaluationContext to new file and add helper
* Add builder pattern and bind rule info to context
* Extract header logic and add rule UID header
* Fix missing call
* chore: add alias for InitTestDB and Session
Adds an alias for the sqlstore InitTestDB and Session, and updates tests using these to reduce dependencies on the sqlstore.Store.
* next pass of removing sqlstore imports
* last little bit
* remove mockstore where possible
This change adds new functionality to the wecom alerting contact point. In addition to a webhook address, you can now send alerts to the wecom apiapp endpoint.
Based on https://github.com/grafana/grafana/discussions/55883
Signed-off-by: aimuz <mr.imuz@gmail.com>
* Create caching dashboard resolver
* A couple tests for dashboard resolving
* Log warning on not found
* Additional polish + review nits
* Move to singleflight instead of a plain mutex
* Store errors instead of -1 in cache and use reflection when reading
* Address linter error
* One more linter error
* Chore: move folder service interface into a separate package
* copy implementation into a standalone package
* move implementation and tests to the new folder package
* remove leftovers from wire
* add test doubles for folder service
* fix tests in library panels/elements
* fix provideservice in ngalert
We have received a lot of feedback regarding the ValueString in alert notifications. Perhaps one of the most frequent complaints about ValueString is that it is difficult to read because it contains a lot of information, and the information is shown as a JSON-like string. Users have often asked how it can be templated and the answer is that it can't.
Until now users have been able to add custom annotations to their alert rules which contains values via the $values variable added in previous versions of Grafana. However, these custom annotations must be added for each of the user's alert rule, instead of once in a template that all of their alerts can be notified via.
This commit adds then the much requested feature to support values in notification templates. Users can then create a single template that prints the annotations, labels and values of their alerts in a format of their choice!
This commit fixes a bug where changing the Folder or Rule Group of an existing rule returns the following error in PostgreSQL "pq: missing FROM-clause for table a"
grafana.com/grafana/prometheus-alertmanager has been updated to a
version that fixes some bugs upstream. This change just updates that
dependency and a few shared ones.
* remove ResetAllStates because it's not used
* refactor cache to accept logs, metrics and url as method args
* update manager Warm method to set the entire state at once
* remove unused reset method
* introduce ruleStates
* change getOrCreate to belong to ruleStates
* update Get to not return error
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions"
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
* Move annotation functionality behind a history persistence interface
* Rename to RecordState
* Fix lint error in import aliasing
* One more import linter error
* (WIP) switch to fork AM, first implementation of the API, generate spec
* get receivers avoiding race conditions
* use latest version of our forked AM, tests
* make linter happy, delete TODO comment
* update number of expected paths to += 2
* delete unused endpoint code, code review comments, tests
* Update pkg/services/ngalert/notifier/alertmanager.go
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
* remove call to fmt.Println
* clear naming for fields
* shorter variable names in GetReceivers
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
* Move fakeRuleStore to tests/fakes package
* Break stub dependencies on store
* Update existing tests to point to new location
* Remove unused stub of TimeNow
* Rename fake to take advantage of package name
* SQLStore: Ensure that sessions are always closed
Delete `NewSession()` in favour of `WithDbSession()`
* Add WithDbSessionForceNewSession to the interface
* Apply suggestions from code review
* Refactor state manager to not depend on rule store interface
* Refactor grafana and proxied ruler APIs to not depend on store.RuleStore
* Refactor folder subscription logic to not use store.RuleStore
* Delete dead code
* Delete store.RuleStore
This commit is one of two commits to make the data frames for all queries and expressions in an alert rule available to the state package for rendering a graph. It renames Result to Condition, and creates an additional field called
Results that is a map of Ref ID to data.Frames.
* Add consumer-side store interface to state manager
* Remove dead dependency
* Delete dead dependency in API struct
* Delete store-layer InstanceStore interface
* Move fake for state's InstanceStore interface to state package
* Move ticker files to dedicated package with no changes
* Fix package naming and resolve naming conflicts
* Fix up all existing references to moved objects
* Remove all alerting-specific references from shared util
* Rename TickerMetrics to simply Metrics
* Rename base ticker type to T and rename NewTicker to simply New
* PluginDetails: Make plugin details page look good in topnav
* Minor style tweak aligning things
* minor refactoring where I moved the logic to decide the default tab into its own hook.
* refactor(plugindetails): first pass at using navmodel for usePluginDetailsTabs hook
* refactor(plugindetails): move "reset page when uninstalling plugin" to installcontrols
this prevents a user from seeing a blank page if they uninstall an app plugin whilst viewing a
config page
* refactor(plugindetails): remove usage of toIconName and reduce nested if
* Trying to fix tests
* minor fix
* test(plugindetails): update selectors causing failing tests
* chore(plugindetails): remove commented out test code
* test(plugindetails): clean up - remove unnecesary usage of waitFor
Co-authored-by: Marcus Andersson <marcus.andersson@grafana.com>
Co-authored-by: Jack Westbrook <jack.westbrook@gmail.com>
* Alerting: Change screenshots to use components
This commit changes screenshots to use a number of components instead of a set of functional wrappers.
It moves the uploading of screenshots from the screenshot package to the image package so we can re-use the same code for both uploading screenshots and server-side images; SingleFlight from the screenshot package to the image package so we can use it for both taking and uploading the screenshot, where as before it was used just for taking the screenshot; and it also removes the use of a cache because we know that screenshots can be taken at most once per tick of the scheduler.
* Create struct for Slack's receiver settings
* Remove one layer of indirection when building slack notifier
* Delete un-used struct
* Validate against settings struct instead of simplejson object
* Genericize settings marshalling
* Remove repetition between fields on notifier and fields on settings struct
* Rename unmarshal settings wrapper
* Handle comma separated strings at marshalling time rather than validation time
* Address misc review feedback
This commit fixes a bug where we did not send resolved alerts to Alertmanager for resolved alert instances. This meant that resolved notifications did not have the annotations from the resolved state, and a result did not also have the resolved screenshot.
* access control to log user name if it does not have permissions
* update ngalert Evaluator to accept user instead of creating a pseudo one
* update alerting eval (rule\query testing) API to provide the real user to the Evaluator
* update scheduler to create a pseudo user with proper permissions
* WIP
* Set public_suffix to a pre Ruby 2.6 version
* we don't need to install python
* Stretch->Buster
* Bump versions in lib.star
* Manually update linter
Sort of messy, but the .mod-file need to contain all dependencies that
use 1.16+ features, otherwise they're assumed to be compiled with
-lang=go1.16 and cannot access generics et al.
Bingo doesn't seem to understand that, but it's possible to manually
update things to get Bingo happy.
* undo reformatting
* Various lint improvements
* More from the linter
* goimports -w ./pkg/
* Disable gocritic
* Add/modify linter exceptions
* lint + flatten nested list
Go 1.19 doesn't support nested lists, and there wasn't an obvious workaround.
https://go.dev/doc/comment#lists
* Alerting: Sanitize invalid label/annotation names for external alertmanagers
Grafana's built-in Alertmanager supports both Unicode label keys and values; however, if using an external
Prometheus Alertmanager label keys must be compatible with their data model.
This means label keys must only contain ASCII letters, numbers, as well as underscores and match the regex
`[a-zA-Z_][a-zA-Z0-9_]*`.
Any invalid characters will now be removed or replaced by the Grafana alerting engine before being sent to
the external Alertmanager according to the following rules:
- `Whitespace` will be removed.
- `ASCII characters` will be replaced with `_`.
- `All other characters` will be replaced with their lower-case hex representation.
* Prefix hex replacements with `0x`
* Refactor for clarity
* Apply suggestions from code review
Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
This change also updates some of our tests so that they run successfully against postgreSQL - we were using random Int64s, but postgres integers, which our tables use, max out at 2^31-1
* Update GetAlertRulesForScheduling to query for folders (if needed)
* Update scheduler's alertRulesRegistry to cache folder titles along with rules
* Update rule eval loop to take folder title from the
* Extract interface RuleStore
* Pre-fetch the rule keys with the version to detect changes, and query the full table only if there are changes.
Removes various custom headers logic sprinkled around in the backend.
It should automatically be applied to outgoing HTTP requests via the
CustomHeadersMiddleware.
This also removes decryption of SecureJSONData to populate custom
headers in ngalert which seemed to have caused a ton of CPU usage.
* update RouteDeleteAlertRules rules to update as a group
* remove expecter from scheduler mock to support variadic function
* create function to check for provisioning status + tests
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>