Alerting docs: recovery threshold (#81069)

* Alerting docs: recovery threshold

* ran prettier

* Adds note that only available in oss

* ran prettier
brendamuir 2024-01-25 10:28:32 +01:00 committed by GitHub
parent 12e19d5364
commit 030a68bbf7
2 changed files with 29 additions and 0 deletions


@ -71,6 +71,7 @@ Define a query to get the data you want to measure and a condition that needs to
All alert rules are managed by Grafana by default. If you want to switch to a data source-managed alert rule, click **Switch to data source-managed alert rule**.
1. Add one or more [expressions][expression-queries].
a. For each expression, select either **Classic condition** to create a single alert rule, or choose from the **Math**, **Reduce**, and **Resample** options to generate a separate alert for each series.
{{% admonition type="note" %}}
@ -79,6 +80,14 @@ Define a query to get the data you want to measure and a condition that needs to
b. Click **Preview** to verify that the expression is successful.
{{% admonition type="note" %}}
The recovery threshold feature is currently only available in OSS.
{{% /admonition %}}
1. To add a recovery threshold, turn the **Custom recovery threshold** toggle on and fill in a value for when your alert rule should stop firing.
You can only add one recovery threshold in a query and it must be the alert condition.
1. Click **Set as alert condition** on the query or expression you want to set as your alert condition.
## Set alert evaluation behavior


@ -128,6 +128,26 @@ When the queried data satisfies the defined condition, Grafana triggers the asso
By default, the last expression added is used as the alert condition.
## Recovery threshold
{{% admonition type="note" %}}
The recovery threshold feature is currently only available in OSS.
{{% /admonition %}}
To reduce the noise of flapping alerts, you can set a recovery threshold that is different from the alert threshold.
Flapping alerts occur when a metric hovers around the alert threshold condition and may lead to frequent state changes, resulting in too many notifications being generated.
Grafana-managed alert rules are evaluated for a specific interval of time. During each evaluation, the result of the query is checked against the threshold set in the alert rule. If the value of a metric is above the threshold, an alert rule fires and a notification is sent. When the value goes below the threshold and there is an active alert for this metric, the alert is resolved, and another notification is sent.
It can be tricky to create an alert rule for a noisy metric, that is, one whose value continually goes above and below the threshold. This is called flapping and results in a series of firing - resolved - firing notifications and a noisy alert state history.
For example, if you have an alert for latency with a threshold of 1000ms and the value fluctuates around 1000 (say 980 -> 1010 -> 990 -> 1020, and so on), then each of those crossings triggers a notification.
To solve this problem, you can set a (custom) recovery threshold, which means having two thresholds instead of one. An alert is triggered when the alert threshold is crossed and is resolved only when the recovery threshold is crossed.
For example, you could set a threshold of 1000ms and a recovery threshold of 900ms. This way, the alert rule only stops firing when latency goes under 900ms, which reduces flapping.
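The hysteresis behavior described above can be sketched in a few lines of Go. This is purely illustrative and not Grafana's implementation; the `alertState` type and `eval` function are made up for this example, using the 1000ms alert threshold and 900ms recovery threshold from the paragraph above.

```go
package main

import "fmt"

// alertState tracks whether the (hypothetical) alert rule is currently firing.
type alertState struct {
	firing bool
}

// eval applies the two-threshold (hysteresis) rule: start firing when the
// value rises above the alert threshold, resolve only when it falls below
// the recovery threshold. Values in between leave the state unchanged.
func (s *alertState) eval(value, alertThreshold, recoveryThreshold float64) {
	switch {
	case !s.firing && value > alertThreshold:
		s.firing = true
	case s.firing && value < recoveryThreshold:
		s.firing = false
	}
}

func main() {
	// Latency samples from the example above, in milliseconds.
	latencies := []float64{980, 1010, 990, 1020, 880}
	s := &alertState{}
	for _, v := range latencies {
		s.eval(v, 1000, 900)
		fmt.Printf("latency=%vms firing=%v\n", v, s.firing)
	}
	// The alert stays firing at 990 and 1020, and only resolves at 880,
	// once latency drops below the 900ms recovery threshold.
}
```

With a single 1000ms threshold, the same series would resolve at 990 and fire again at 1020; the recovery threshold keeps the alert firing through that dip, so only one notification pair is sent.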
{{% docs/reference %}}
[data-source-alerting]: "/docs/grafana/ -> /docs/grafana/<GRAFANA VERSION>/alerting/fundamentals/data-source-alerting"
[data-source-alerting]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/data-source-alerting"