This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
This commit adds debug logs for previous_ends_at and next_ends_at
to state.go to help us debug issues where alerts are resolved in
Alertmanager due to expiration. This change is in response to a
support escalation where this information was needed but unavailable.
* Alerting: Repurpose rule testing endpoint to return potential alerts
This feature replaces the Grafana ruler testing API endpoint /api/v1/rule/test/grafana, which is no longer in use. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in and interpolated labels and annotations.
The endpoint's main goal is to match what the ruler would generate as closely as possible, except that the resulting alerts are not filtered down to the Resolved / Firing alerts that are ready to be sent.
This means that the endpoint will, among other things:
- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
* use tokens or urls in image annotations
* improve tests, fix some comments
* fix empty tokens
* code review changes, check for url before checking for token (support old token formats)
* Alerting: Remove and revert flag alertingBigTransactions
This is a partial revert of #56575 and a removal of the `alertingBigTransactions` flag.
Real-world use has shown no clear performance incentive to maintain this flag. The lower DB connection count
came at the cost of a significant increase in CPU usage and query latency.
* Fix lint backend
* Removed last bits of alertingBigTransactions
---------
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
* Add fresh context with timeout and same log properties, re-derive logger
* Unify timeout constants
* Move ctx after shortcut that got added through rebasing
* Unify timeouts
* Port opentracing's SpanFromContext and ContextFromSpan to the grafana tracing package
* Support both opentracing and otel variants
* Better document why we're creating a new ctx
* Add new func to FakeSpan which was added after rebase
* Support grafana-specific traceID key in both tracer implementations
* Alerting: Respect "For" Duration for NoData alerts
This change modifies `resultNoData` to be more in line with the logic of the other state handlers.
The main effects of this are:
1) NoData states with NoDataState config set to Alerting will respect "For" duration.
2) Prevents zero values in StartsAt and EndsAt for alerts that have only ever been in normal state. This includes state transitions from NoDataState=OK and ExecErrState=OK.
3) Better state transition logging.
* Remove private labels
* No longer index by instance labels
* Labels are now invariant, only build them once
* Remove bucketing since everything is in a single stream
* Refactor statesToStreams to only return a single unified log stream
* Don't query on labels that no longer exist
* Move selector logic to loki layer, genericize client to work in terms of straight logQL
* Add support for line-level label filters in query
* Combine existing selector tests for better parallelism
* Tests for logQL construction
* Underscore instead of dot for unwrapping labels in logql
* Encode with snappy, always
* JSON encoder type
* Headers
* Copy labels formatter from promtail
* Implement snappy-proto encoding
* Create encoder interface, test both encoders, choose snappy-proto by default
* Make encoder configurable at the LokiCfg level
* Export both encoders
* Touch up comment and tests
* Drop unnecessary conversions after move to plain strings to appease linter
* Rename RecordStatesAsync to Record
* Rename QueryStates to Query
* Implement fanout writes
* Implement primary queries
* Simplify error joining
* Add test for query path
* Add tests for writes and error propagation
* Allow fanout backend to be configured
* Touch up log messages and config validation
* Consistent documentation for all backend structs
* Parse and normalize backend names more consistently against an enum
* Touch-ups to documentation
* Improve clarity around multi-record blocking
* Keep primary and secondaries more distinct
* Rename fanout backend to multiple backend
* Simplify config keys for multi backend mode
This commit changes the state package so that errors encountered while
expanding templates for custom labels and annotations are returned
from the function. The returned errors are not used at present, but will be
used in the future as we look at how to offer better feedback to users who
don't have access to logs, for example, our customers who use Hosted Grafana.
This commit fixes a bug in the $values variable in notification
templates when using Classic Conditions. Since Classic Conditions
are not multi-dimensional, the values of each series that exceeded
the condition should be available as a RefID and offset. For example,
B0, B1, etc. However, because of this bug only a single value was
printed, and it was keyed as B instead of B0.
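The expected naming can be sketched like this. `valueKeys` is a hypothetical helper written for illustration only. It is not part of the fix and only shows the RefID plus offset scheme described above.

```go
package main

import (
	"fmt"
	"strconv"
)

// valueKeys returns the keys expected in $values for a Classic Condition
// with the given RefID when n values exceeded the condition: the RefID
// followed by a zero-based offset.
func valueKeys(refID string, n int) []string {
	keys := make([]string, 0, n)
	for i := 0; i < n; i++ {
		keys = append(keys, refID+strconv.Itoa(i))
	}
	return keys
}

func main() {
	fmt.Println(valueKeys("B", 2)) // prints [B0 B1]
}
```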
* Create historian metrics and dependency inject
* Record counter for total number of state transitions logged
* Track write failures
* Track current number of active write goroutines
* Record histogram of how long it takes to write history data
* Don't copy the registerer
* Adjust naming of write failures metric
* Introduce WritesTotal to complement WritesFailedTotal
* Measure TransitionsFailedTotal to complement TransitionsTotal
* Rename all to state_history
* Remove redundant Total suffix
* Increment totals all the time, not just on success
* Drop ActiveWriteGoroutines
* Drop PersistDuration in favor of WriteDuration
* Drop unused gauge
* Make writes and writesFailed per org
* Add metric indicating backend and a spot for future metadata
* Drop _batch_ from names and update help
* Add metric for bytes written
* Better pairing of total + failure metric updates
* Few tweaks to wording and naming
* Record info metric during composition
* Create fakeRequester and simple happy path test using it
* Blocking test for the full historian and test for happy path metrics
* Add tests for failure case metrics
* Smoke test for full annotation persistence
* Create test for metrics on annotation persistence, both happy and failing paths
* Address linter complaints
* More linter complaints
* Remove unnecessary whitespace
* Consistency improvements to help texts
* Update tests to match new descs
* Loki backend and client depend on a requester
* Instrument all requests to loki using weaveworks TimedClient
* Construct collector in metrics package
This commit adds filterLabels, filterLabelsRe, removeLabels, and
removeLabelsRe functions to templates for custom labels and annotations.
It allows for use cases such as removing all private labels.
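A minimal sketch of the regex variants, assuming the pattern is matched against label names and that private labels are those wrapped in double underscores. The `Labels` type and the function bodies below are illustrative, not the code added by this commit.

```go
package main

import (
	"fmt"
	"regexp"
)

// Labels is a hypothetical stand-in for the label set passed to templates.
type Labels map[string]string

// selectLabels keeps the labels whose name matches re when keep is true,
// and drops them when keep is false. It never modifies the input.
func selectLabels(l Labels, re *regexp.Regexp, keep bool) Labels {
	out := make(Labels, len(l))
	for name, value := range l {
		if re.MatchString(name) == keep {
			out[name] = value
		}
	}
	return out
}

// filterLabelsRe returns only the labels whose name matches pattern.
func filterLabelsRe(l Labels, pattern string) (Labels, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return selectLabels(l, re, true), nil
}

// removeLabelsRe returns the labels whose name does not match pattern.
func removeLabelsRe(l Labels, pattern string) (Labels, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return selectLabels(l, re, false), nil
}

func main() {
	labels := Labels{
		"alertname":          "HighCPU",
		"__alert_rule_uid__": "abc123",
	}
	// Remove all private labels, assumed here to look like __name__.
	public, err := removeLabelsRe(labels, `^__.+__$`)
	if err != nil {
		panic(err)
	}
	fmt.Println(public) // map[alertname:HighCPU]
}
```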
This commit changes the Data struct in template.go to use Labels
instead of map[string]string. It changes how labels are printed
when using {{ .Labels }} from map[foo:bar bar:baz] to
foo=bar, bar=baz.
This commit changes how labels are printed in templates for custom
annotations and labels from map[foo:bar bar:baz] to foo=bar, bar=baz.
Labels are comma separated, and sorted in increasing order.
This commit moves templating from the state package to a sub-package
called template. This sub-package will be the logical package for
future ease-of-use improvements to templating custom annotations
and labels.
* Use existing row struct instead of [2]string, add deserialization helper
* Replace Stream struct with stream struct which is exactly the same
* Drop unused status field
* Don't export queryRes and queryData
* Tests for custom marshalling
* Rename row fields to T and V for consistency with prometheus samples
* Rename row to sample
The `rule_groups_rules` metric is currently defined and computed by `State`.
It makes more sense to compute this metric from the configured rule set
rather than from the rule evaluation state. There could be an edge case
where a rule does not have a state yet, and so is not counted.
Additionally, we would like this metric (and others) to have a `rule_group`
label, and this is much easier to achieve if the metric is produced from the
`Scheduler` package.