Commit Graph

1038 Commits

Author SHA1 Message Date
Santiago
aba91d3053 Alerting: Fetch all applied alerting configurations (#65728)
* WIP

* skip invalid historic configurations instead of erroring

* add warning log when bad historic config is found

* remove unused custom marshaller for GettableHistoricUserConfig

* add id to historic user config, move limit check to store, fix typo

* swagger spec
2023-03-31 17:43:04 -03:00
Alexander Weaver
da4832724e Alerting: Delete stub for SQL alert state history backend (#65667)
Delete stub for SQL backend
2023-03-31 11:15:56 -05:00
Matthew Jacobson
b9dc04139a Alerting: Respect "For" Duration for NoData alerts (#65574)
* Alerting: Respect "For" Duration for NoData alerts

This change modifies `resultNoData` to be more inline with the logic of the other state handlers.

The main effects of this are:

1) NoData states with NoDataState config set to Alerting will respect "For" duration.
2) Prevents zero value in StartsAt and EndsAt for alerts that have only even been in normal state. This includes state transitions from NoDataState=OK and ExecErrState=OK.
3) Better state transition logging.
2023-03-31 19:05:15 +03:00
Steve Simpson
04336d53a9 Alerting: Update prometheus version (#65688) 2023-03-31 16:34:35 +02:00
Yuri Tseretyan
622c23716a Alerting: Use logger with context in the state cache (#65663) 2023-03-31 10:11:30 -04:00
Alexander Weaver
b2abb63286 Alerting: Introduce proper feature toggles for common state history backend combinations (#65497)
* define 3 feature toggles for rollout phases

* Pass feature toggles along

* Implement first feature toggle

* Try a different strategy with fall-throughs to specific configurations

* Apply toggle overrides once outside of backend composition

* Emit log messages when we coerce backends

* Run code generator for feature toggle files

* Improve wording in flag descs

* Re-run generator

* Use code-generated constants instead of plain strings

* Use converted enum values rather than strings for pre-parsing
2023-03-30 13:53:21 -05:00
Alexander Weaver
5e87ea745d Alerting: Fix and re-enable filters instance labels in log line test (#65618)
Fix and reenable test
2023-03-30 09:02:18 -05:00
Dimitris Sotirakis
e758b017d0 Alerting: Disable filters instance labels in log line test (#65610)
* Disable filters instance labels in log line test

* Add drone reference
2023-03-30 16:04:29 +03:00
Yuri Tseretyan
9eaffdf5a8 Alerting: Remove dependency on alerting package in definitions (#65390)
* move export rules to definitions package
* move provisioning contact point methods to provisioning package
* move AlertRuleGroupWithFolderTitle to ngalert models and adapter functions to api's compat
* move rule_types files back to where they were before.
2023-03-29 13:34:59 -04:00
Alexander Weaver
a416100abc Alerting: No longer index state history log streams by instance labels (#65474)
* Remove private labels

* No longer index by instance labels

* Labels are now invariant, only build them once

* Remove bucketing since everything is in a single stream

* Refactor statesToStreams to only return a single unified log stream

* Don't query on labels that no longer exist

* Move selector logic to loki layer, genericize client to work in terms of straight logQL

* Add support for line-level label filters in query

* Combine existing selector tests for better parallelism

* Tests for logQL construction

* Underscore instead of dot for unwrapping labels in logql
2023-03-29 11:52:11 -05:00
Santiago
7b92849508 Alerting: Add CustomDetails field in PagerDuty contact point (#64860)
* Alerting: Add CustomDetails for PagerDuty

* fix default value for 'severity' from 'error' to 'critical'

* minimal docs for notifiers, specifying config for PagerDuty

* replace notifier -> integration

* replace notifier -> integration
2023-03-29 10:35:01 -03:00
Alexander Weaver
de1637afe5 Alerting: Add alert instance labels to Loki log lines in addition to stream labels (#65403)
Add instance labels to log line
2023-03-28 08:57:51 -05:00
Alexander Weaver
dd04757fc9 Alerting: Add "backend" label to state history writes metrics (#65395)
* Add backend label to state history writes metrics

* Update test expectations
2023-03-28 08:49:51 -05:00
Giuseppe Guerra
a89202eab2 Plugins: Improve instrumentation by adding metrics and tracing (#61035)
* WIP: Plugins tracing

* Trace ID middleware

* Add prometheus metrics and tracing to plugins updater

* Add TODOs

* Add instrumented http client

* Add tracing to grafana update checker

* Goimports

* Moved plugins tracing to middleware

* goimports, fix tests

* Removed X-Trace-Id header

* Fix comment in NewTracingHeaderMiddleware

* Add metrics to instrumented http client

* Add instrumented http client options

* Removed unused function

* Switch to contextual logger

* Refactoring, fix tests

* Moved InstrumentedHTTPClient and PrometheusMetrics to their own package

* Tracing middleware: handle errors

* Report span status codes when recording errors

* Add tests for tracing middleware

* Moved fakeSpan and fakeTracer to pkg/infra/tracing

* Add TestHTTPClientTracing

* Lint

* Changes after PR review

* Tests: Made "ended" in FakeSpan private, allow calling End only once

* Testing: panic in FakeSpan if span already ended

* Refactoring: Simplify Grafana updater checks

* Refactoring: Simplify plugins updater error checks and logs

* Fix wrong call to checkForUpdates -> instrumentedCheckForUpdates

* Tests: Fix wrong call to checkForUpdates -> instrumentedCheckForUpdates

* Log update checks duration, use Info log level for check succeeded logs

* Add plugin context span attributes in tracing_middleware

* Refactor prometheus metrics as httpclient middleware

* Fix call to ProvidePluginsService in plugins_test.go

* Propagate context to update checker outgoing http requests

* Plugin client tracing middleware: Removed operation name in status

* Fix tests

* Goimports tracing_middleware.go

* Goimports

* Fix imports

* Changed span name to plugins client middleware

* Add span name assertion in TestTracingMiddleware

* Removed Prometheus metrics middleware from grafana and plugins updatechecker

* Add span attributes for ds name, type, uid, panel and dashboard ids

* Fix http header reading in tracing middlewares

* Use contexthandler.FromContext, add X-Query-Group-Id

* Add test for RunStream

* Fix imports

* Changes from PR review

* TestTracingMiddleware: Changed assert to require for didPanic assertion

* Lint

* Fix imports
2023-03-28 11:01:06 +02:00
Serge Zaitsev
0beb768427 Chore: Remove result fields from ngalert (#65410)
* remove result fields from ngalert

* remove duplicate imports
2023-03-28 10:34:35 +02:00
Yuri Tseretyan
ec4152c7e5 Alerting: Remove dependency on secrets in definitions package (#65391) 2023-03-27 16:35:54 -04:00
Yuri Tseretyan
52a0f59706 Alerting: introduce AlertQuery in definitions package (#63825)
* copy AlertQuery from ngmodels to the definition package
* replaces usages of ngmodels.AlertQuery in API models
* create a converter between models of AlertQuery
---------

Co-authored-by: Alex Moreno <alexander.moreno@grafana.com>
2023-03-27 11:55:13 -04:00
Alexander Weaver
07368dec74 Alerting: Fix attachment of external labels to Loki state history log streams (#65140)
Fix attachment of external labels, add tests
2023-03-21 18:00:59 -05:00
Alexander Weaver
bf54f2672e Alerting: Switch to snappy-compressed-protobuf for outgoing push requests to Loki (#65077)
* Encode with snappy, always

* JSON encoder type

* Headers

* Copy labels formatter from promtail

* Implement snappy-proto encoding

* Create encoder interface, test both encoders, choose snappy-proto by default

* Make encoder configurable at the LokiCfg level

* Export both encoders

* Touch up comment and tests

* Drop unnecessary conversions after move to plain strings to appease linter
2023-03-21 13:38:42 -05:00
Alexander Weaver
cc7e5ce62e Alerting: Fix ambiguous handling of equals in labels when bucketing Loki state history streams (#65013)
* Use JSON instead of data.Labels string format as label repr

* Drop debug log line
2023-03-21 12:33:27 -05:00
Alexander Weaver
e39d7f44c9 Alerting: Elide requests to Loki if nothing should be recorded (#65011)
Exit early if no log streams or annotations
2023-03-21 09:30:56 -05:00
Alexander Weaver
40c5713cbd Vendor errors.Join from Go standard library to avoid version incompatibilities (#64985)
Vendor errors.Join from std lib
2023-03-17 14:07:58 -05:00
Alexander Weaver
a31672fa40 Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774)
* Rename RecordStatesAsync to Record

* Rename QueryStates to Query

* Implement fanout writes

* Implement primary queries

* Simplify error joining

* Add test for query path

* Add tests for writes and error propagation

* Allow fanout backend to be configured

* Touch up log messages and config validation

* Consistent documentation for all backend structs

* Parse and normalize backend names more consistently against an enum

* Touch-ups to documentation

* Improve clarity around multi-record blocking

* Keep primary and secondaries more distinct

* Rename fanout backend to multiple backend

* Simplify config keys for multi backend mode
2023-03-17 12:41:18 -05:00
Alexander Weaver
9bcf8819d3 Alerting: Handful of small adjustments to log levels and parameters (#64572)
Calculate duration earlier in scheduler
2023-03-17 12:15:49 +00:00
gotjosh
02a8f62021 Alerting: Fix stats that display alert count when using unified alerting (#64852)
* Alerting: Fix stats when using unified alerting
2023-03-17 11:19:18 +00:00
George Robinson
0b506b4ccc Alerting: Update github.com/grafana/alerting (#64882) 2023-03-16 13:59:35 +00:00
Yuri Tseretyan
85a954cd81 Alerting: Update scheduler to get updates only from database (#64635)
* stop using the scheduler's Update and Delete methods all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
2023-03-14 18:02:51 -04:00
Alexander Weaver
faef3a8258 Alerting: Log error but don't fail initialization if state history connection test fails (#64699)
Don't return init error if ping fails, add tests
2023-03-13 15:54:46 -05:00
Andrej Ocenas
6647217208 Phlare: Use enum config to send deduplicated func and filenames (#64435) 2023-03-13 11:06:04 +01:00
Jean-Philippe Quéméner
fb5ed0b0b3 Alerting: fix flaky cache test (#64499) 2023-03-09 06:08:05 -05:00
Ryan McKinley
42e7ec9fe4 Chore: cleanup dashboard service names (#64442) 2023-03-08 14:37:45 -05:00
George Robinson
0c8876c3a2 Alerting: Return errors when expanding templates (#63662)
This commit changes the state package so that errors encountered while
expanding templates for custom labels and annotations are returned
from the function. This is not used at present, but will be used in the
future as we look at how to offer better feedback to users who don't
have access to logs, for example our customers who use Hosted Grafana.
2023-03-08 12:25:02 +00:00
Alexander Weaver
4a1c18abf6 Alerting: Fix intermittency when seeding database in rule store tests (#64322)
Force unique IDs when seeding database
2023-03-07 09:40:55 -05:00
George Robinson
ed71012ced Alerting: Fix Classic Conditions $values variable (#64243)
This commit fixes a bug in the $values variable in notification
templates when using Classic Conditions. Since Classic Conditions
are not multi-dimensional, the values of each series that exceeded
the condition should be available as a RefID and offset. For example,
B0, B1, etc. However, this bug meant that instead just a single
condition would be printed as B, not B0.
2023-03-06 12:08:00 -05:00
Alexander Weaver
19d01dff91 Alerting: Expose Prometheus metrics for persisting state history (#63157)
* Create historian metrics and dependency inject

* Record counter for total number of state transitions logged

* Track write failures

* Track current number of active write goroutines

* Record histogram of how long it takes to write history data

* Don't copy the registerer

* Adjust naming of write failures metric

* Introduce WritesTotal to complement WritesFailedTotal

* Measure TransitionsFailedTotal to complement TransitionsTotal

* Rename all to state_history

* Remove redundant Total suffix

* Increment totals all the time, not just on success

* Drop ActiveWriteGoroutines

* Drop PersistDuration in favor of WriteDuration

* Drop unused gauge

* Make writes and writesFailed per org

* Add metric indicating backend and a spot for future metadata

* Drop _batch_ from names and update help

* Add metric for bytes written

* Better pairing of total + failure metric updates

* Few tweaks to wording and naming

* Record info metric during composition

* Create fakeRequester and simple happy path test using it

* Blocking test for the full historian and test for happy path metrics

* Add tests for failure case metrics

* Smoke test for full annotation persistence

* Create test for metrics on annotation persistence, both happy and failing paths

* Address linter complaints

* More linter complaints

* Remove unnecessary whitespace

* Consistency improvements to help texts

* Update tests to match new descs
2023-03-06 10:40:37 -06:00
gotjosh
5422f7cf56 Alerting: Add metrics for active receiver and integrations (#64050)
* Alerting: Add metrics for active receiver and integrations

Introduces metrics that allows us to track the number of configured receivers and integration in the Alertmanager for all orgs.

As a bonus, I realised that the alert reception metrics where not being exported nor collected. This does that too.
2023-03-06 16:37:07 +00:00
Serge Zaitsev
0bdb105df2 Chore: Remove xorcare/pointer dependency (#63900)
* Chore: remove pointer dependency

* fix type casts

* deprecate xorcare/pointer library in linter

* rooky mistake
2023-03-06 05:23:15 -05:00
Ieva
a52999a886 Access Control: revert to using folder store from the scope resolvers (#64132)
* revert to using folder store from the resolvers

* fixing tests after revert

* api test fixes

---------

Co-authored-by: Kristin Laemmert <mildwonkey@users.noreply.github.com>
2023-03-03 10:56:33 -05:00
Yuri Tseretyan
e760f22402 Alerting: Use background context for maintenance function (#64065) 2023-03-02 14:19:52 -05:00
Emil Tullstedt
10ee900beb Errors: Remove direct dependencies on github.com/pkg/errors (#64026)
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
2023-03-02 16:28:10 +01:00
Kristin Laemmert
bb798e24f3 chore(services): replace dependencies on dashboard store with dashboard service (#63937)
* chore(services): replace dependencies on dashboard store with dashboard service

This continues the backend service/store split by replacing dashboard store dependencies with service dependencies. the folder service remains the single exception for now; otherwise we'd have a dependency cycle between the folder and dashboard services. I have some ideas for that, but I'll take care of all the easy parts first.

While doing this, I identified and removed a number of unused arguments from the following functions:

NewFolderNameScopeResolver
NewFolderIDScopeResolver
NewFolderUIDScopeResolver
NewDashboardIDScopeResolver
NewDashboardUIDScopeResolver
resolveDashboardScope

I have a small enterprise PR to support this commit.

* lingering fmt
2023-03-02 08:09:57 -05:00
Yuri Tseretyan
5e2a661dec Alerting: update API models to user NoDataState and ExecutionErrorState from definitions instead of models (#63824) 2023-02-28 16:21:41 -05:00
Yuri Tseretyan
f561e71de8 Alerting: decouple api models from domain\dto models: separate Provenance status + converters (#63594)
* move conversions of domain models to api models and reverse from definition package to api package
2023-02-27 17:57:15 -05:00
Alexander Weaver
e77621649d Alerting: Instrument outgoing state history requests using weaveworks/common (#63600)
* Loki backend and client depend on a requester

* Instrument all requests to loki using weaveworks TimedClient

* Construct collector in metrics package
2023-02-23 17:52:02 -06:00
Yuri Tseretyan
98e1aeaebd Alerting: Fix client to external Alertmanager to correctly build URL for Mimir Alertmanager (#63676) 2023-02-23 13:55:26 -05:00
Emil Tullstedt
3abaf32cf2 Chore: Upgrade golangci-lint to v1.51.2 (#63630)
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
2023-02-23 15:10:03 +01:00
Alex Moreno
f60dc4441f Alerting: Add status label to GroupRules metric (#63454)
* Add status label to GroupRules metric

* Add state (active and paused) label to GrouRules

* Add active/paused metrics tests
2023-02-23 12:38:27 +01:00
George Robinson
f93a9c794d Alerting: Fix incorrect comment in eval.go (#63510)
This commit fixes an incorrect comment in the Result struct in eval.go
that I had written some time ago. The comment now documents the
actual behaviour and content of this field.
2023-02-21 15:42:04 +00:00
bla2ej
56c8661929 Alerting: Get alert rules on faults (#61248) (#63051)
* Alerting: get alert rules on faults (#61248)

Two functions used to fetch alert rules from DB are updated:
- GetAlertRulesForScheduling
- ListAlertRules
Rows are scanned one by one so good ones are returned.
Common Error is logged with indication how many
rules failed on deserialization.

Resolved: #61248

* updates from review comments
2023-02-21 08:54:20 -06:00
George Robinson
9f2fb3fa27 Alerting: Add filter and remove funcs for custom labels and annotations (#63437)
This commit adds filterLabels, filterLabelsRe, removeLabels, and
removeLabelsRe functions to templates for custom labels and annotations.
It allows for use cases such as removing all private labels.
2023-02-20 14:40:26 +00:00