Alerting docs: recovery threshold (#81069)

* Alerting docs: recovery threshold

* ran prettier

* Adds note that only available in oss

* ran prettier
This commit is contained in:
brendamuir
2024-01-25 10:28:32 +01:00
committed by GitHub
parent 12e19d5364
commit 030a68bbf7
2 changed files with 29 additions and 0 deletions

@@ -128,6 +128,26 @@ When the queried data satisfies the defined condition, Grafana triggers the asso
By default, the last expression added is used as the alert condition.
## Recovery threshold
{{% admonition type="note" %}}
The recovery threshold feature is currently only available in OSS.
{{% /admonition %}}
To reduce the noise of flapping alerts, you can set a recovery threshold that is different from the alert threshold.
Flapping occurs when a metric hovers around the alert threshold, which can lead to frequent state changes and too many notifications being generated.
Grafana-managed alert rules are evaluated for a specific interval of time. During each evaluation, the result of the query is checked against the threshold set in the alert rule. If the value of a metric is above the threshold, an alert rule fires and a notification is sent. When the value goes below the threshold and there is an active alert for this metric, the alert is resolved, and another notification is sent.
It can be tricky to create an alert rule for a noisy metric, that is, one whose value continually crosses above and below the threshold. This is called flapping and results in a series of firing -> resolved -> firing notifications and a noisy alert state history.
For example, if you have an alert rule for latency with a threshold of 1000ms and the value fluctuates around 1000 (say 980 -> 1010 -> 990 -> 1020, and so on), then each threshold crossing triggers a notification.
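To make the effect concrete, here is a minimal sketch (illustrative Python, not Grafana's implementation) of evaluating that series against a single 1000ms threshold; every state change produces a notification:

```python
# Illustrative sketch only, not Grafana source code.
THRESHOLD_MS = 1000

def evaluate(series):
    """Evaluate a latency series against a single threshold and
    collect the notifications generated by each state change."""
    firing = False
    notifications = []
    for value in series:
        if not firing and value > THRESHOLD_MS:
            firing = True
            notifications.append(f"firing (latency={value}ms)")
        elif firing and value <= THRESHOLD_MS:
            firing = False
            notifications.append(f"resolved (latency={value}ms)")
    return notifications

# The fluctuating series from the example above yields three
# notifications in a row: firing, resolved, firing.
print(evaluate([980, 1010, 990, 1020]))
# ['firing (latency=1010ms)', 'resolved (latency=990ms)', 'firing (latency=1020ms)']
```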
To solve this problem, you can set a custom recovery threshold, which means having two thresholds instead of one. An alert fires when the alert threshold is crossed and resolves only when the recovery threshold is crossed.
For example, you could set a threshold of 1000ms and a recovery threshold of 900ms. This way, the alert rule stops firing only when latency goes under 900ms, and flapping is reduced.
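Extending the same sketch with a recovery threshold (again, an illustration of the behavior rather than Grafana's implementation) shows how the two thresholds suppress the noise:

```python
# Illustrative sketch only, not Grafana source code.
THRESHOLD_MS = 1000
RECOVERY_THRESHOLD_MS = 900

def evaluate_with_recovery(series):
    """Fire when the alert threshold is crossed; resolve only once
    the value drops below the recovery threshold."""
    firing = False
    notifications = []
    for value in series:
        if not firing and value > THRESHOLD_MS:
            firing = True
            notifications.append(f"firing (latency={value}ms)")
        elif firing and value < RECOVERY_THRESHOLD_MS:
            firing = False
            notifications.append(f"resolved (latency={value}ms)")
    return notifications

# The same fluctuating series now produces a single firing notification;
# the alert resolves only when latency drops below 900ms.
print(evaluate_with_recovery([980, 1010, 990, 1020, 870]))
# ['firing (latency=1010ms)', 'resolved (latency=870ms)']
```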
{{% docs/reference %}}
[data-source-alerting]: "/docs/grafana/ -> /docs/grafana/<GRAFANA VERSION>/alerting/fundamentals/data-source-alerting"
[data-source-alerting]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/data-source-alerting"