mirror of
https://github.com/grafana/grafana.git
synced 2025-02-25 18:55:37 -06:00
update for get-started-part-3 (#98601)
* update for get-started-part-3
* update
* pretty
* typo
* clarification
* clarification2
* pretty2
This commit is contained in:
parent
69adb3d25b
commit
2fd355ebca
@ -41,15 +41,17 @@ refs:
|
||||
|
||||
The Get started with Grafana Alerting tutorial Part 3 is a continuation of [Get started with Grafana Alerting tutorial Part 2](http://www.grafana.com/tutorials/alerting-get-started-pt2/).
|
||||
|
||||
Alert grouping in Grafana Alerting reduces notification noise by combining related alerts into a single, concise notification. This is essential for on-call engineers, ensuring they focus on resolving incidents instead of sorting through a flood of notifications.
|
||||
Grouping in Grafana Alerting reduces notification noise by combining related alert instances into a single, concise notification. This is useful for on-call engineers, ensuring they focus on resolving incidents instead of sorting through a flood of notifications.
|
||||
|
||||
Grouping is configured by using labels in the notification policy that reference the labels that are generated by the alert instances. With notification policies, you can also configure how often notifications are sent for each group of alerts.
|
||||
Grouping is configured using labels in the notification policy. These labels reference those generated by alert instances or configured by the user.
|
||||
|
||||
Notification policies also allow you to define how often notifications are sent for each group of alert instances.
|
||||
|
||||
In this tutorial, you will:
|
||||
|
||||
- Learn how alert rule grouping works.
|
||||
- Create a notification policy to handle grouping.
|
||||
- Define an alert rule for a real-world scenario.
|
||||
- Define alert rules for a real-world scenario.
|
||||
- Receive and review grouped alert notifications.
|
||||
|
||||
<!-- INTERACTIVE page intro.md END -->
|
||||
@ -164,25 +166,26 @@ Alert notification grouping is configured with **labels** and **timing options**
|
||||
|
||||
### Types of Labels
|
||||
|
||||
1. **Reserved labels** (default):
|
||||
**Reserved labels** (default):
|
||||
|
||||
- Automatically generated by Grafana, e.g., `alertname`, `grafana_folder`.
|
||||
- Example: `alertname="High CPU usage"`.
|
||||
- Automatically generated by Grafana, e.g., `alertname`, `grafana_folder`.
|
||||
- Example: `alertname="High CPU usage"`.
|
||||
|
||||
1. **User-configured labels**:
|
||||
**User-configured labels**:
|
||||
|
||||
- Added manually to the alert rule.
|
||||
- Example: `severity`, `priority`.
|
||||
- Added manually to the alert rule.
|
||||
- Example: `severity`, `priority`.
|
||||
|
||||
1. **Query labels**:
|
||||
- Returned by the data source query.
|
||||
- Example: `region`, `service`, `environment`.
|
||||
**Query labels**:
|
||||
|
||||
- Returned by the data source query.
|
||||
- Example: `region`, `service`, `environment`.
|
||||
|
||||
### Timing Options
|
||||
|
||||
1. **Group wait**: Time before sending the first notification.
|
||||
1. **Group interval**: Time between notifications for a group.
|
||||
1. **Repeat interval**: Time before resending notifications for an unchanged group.
|
||||
**Group wait**: Time before sending the first notification.
|
||||
**Group interval**: Time between notifications for a group.
|
||||
**Repeat interval**: Time before resending notifications for an unchanged group.
|
||||
|
||||
Alerts sharing the **same label values** are grouped together, and timing options determine notification frequency.
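The grouping rule above can be sketched in a few lines of Python. This is only an illustration, assuming alert instances are plain dictionaries of labels; the `group_by_labels` helper and the sample instances are hypothetical and are not Grafana's implementation.

```python
from collections import defaultdict


def group_by_labels(instances, group_by):
    """Group alert instances (label dicts) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in instances:
        # In this sketch, a missing label is treated as an empty string.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alert instances, for illustration only.
instances = [
    {"alertname": "High CPU usage", "region": "us-west", "instance": "server-02"},
    {"alertname": "High CPU usage", "region": "us-east", "instance": "server-03"},
    {"alertname": "High Memory usage", "region": "us-west", "instance": "server-10"},
]

groups = group_by_labels(instances, ["region"])
for key, members in groups.items():
    print(key, [m["instance"] for m in members])
```

Grouping by `region` puts the two `us-west` instances into one group even though they come from different alert rules, which is the behavior the tutorial relies on later.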
|
||||
|
||||
@ -198,7 +201,7 @@ For more details, see:
|
||||
|
||||
### Scenario: monitoring a distributed application
|
||||
|
||||
You’re monitoring metrics like CPU usage, memory utilization, and network latency across multiple regions. Alert rules include labels such as `region: us-west` and `region: us-east`. If multiple alerts trigger across these regions, they can result in notification floods.
|
||||
You’re monitoring metrics like CPU usage, memory utilization, and network latency across multiple regions. Some of these alert rules include labels such as `region: us-west` and `region: us-east`. If multiple alert rules trigger across these regions, they can result in notification floods.
|
||||
|
||||
### How to manage grouping
|
||||
|
||||
@ -206,10 +209,13 @@ To group alert rule notifications:
|
||||
|
||||
1. **Define labels**: Use `region`, `metric`, or `instance` labels to categorize alerts.
|
||||
1. **Configure Notification policies**:
|
||||
- Group alerts by the `region` label.
|
||||
- Group alerts by the **query label** "region".
|
||||
- Example:
|
||||
- Alerts for `region: us-west` go to the West Coast team.
|
||||
- Alerts for `region: us-east` go to the East Coast team.
|
||||
- Alert notifications for `region: us-west` go to the West Coast team.
|
||||
- Alert notifications for `region: us-east` go to the East Coast team.
|
||||
1. Specify the **timing options** for sending notifications to control their frequency.
|
||||
- Example:
|
||||
- **Group interval**: This setting determines how often updates for the same alert group are sent. By default, this interval is set to 5 minutes, but you can customize it to be shorter or longer based on your needs.
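To make the interaction of the three timing options more concrete, here is a deliberately simplified model in Python. The `Timing` class and `should_notify` function are hypothetical names, and the logic is an assumption based on the definitions in this tutorial; it does not reproduce Grafana's actual scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    group_wait: int       # seconds before the first notification for a new group
    group_interval: int   # seconds between notifications when the group changes
    repeat_interval: int  # seconds before resending for an unchanged group


def should_notify(
    timing: Timing,
    seconds_since_group_created: int,
    seconds_since_last_notification: Optional[int],
    group_changed: bool,
) -> bool:
    """Return True if a notification is due under this simplified model."""
    if seconds_since_last_notification is None:
        # Nothing has been sent yet: wait for group_wait to elapse.
        return seconds_since_group_created >= timing.group_wait
    if group_changed:
        # New or resolved instances in the group: wait for group_interval.
        return seconds_since_last_notification >= timing.group_interval
    # Unchanged group: only resend after repeat_interval.
    return seconds_since_last_notification >= timing.repeat_interval


# Values used later in this tutorial for us-west: 30s, 2m, 4h.
us_west = Timing(group_wait=30, group_interval=120, repeat_interval=4 * 3600)
print(should_notify(us_west, 10, None, False))
print(should_notify(us_west, 500, 120, True))
```

The point of the sketch is the ordering of the checks: group wait applies only to the first notification, group interval applies when the group has changed, and repeat interval applies when it has not.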
|
||||
|
||||
<!-- INTERACTIVE page step3.md END -->
|
||||
<!-- INTERACTIVE page step4.md START -->
|
||||
@ -218,9 +224,7 @@ To group alert rule notifications:
|
||||
|
||||
### Notification Policy
|
||||
|
||||
[Notification policies](ref:notification-policies) group alert instances and route notifications to specific contact points.
|
||||
|
||||
To follow the above example, we will create notification policies that route alert instances based on the `region` label to specific contact points. This setup ensures that alerts for a given region are consolidated into a single notification. Additionally, we will fine-tune the **timing settings** for each region by overriding the default parent policy, allowing more granular control over when notifications are sent.
|
||||
Following the above example, [notification policies](ref:notification-policies) are created to route alert instances that have a `region` label to a specific contact point. The goal is to receive one consolidated notification per region. To demonstrate how grouping works, alert notifications for the East Coast team are not grouped. Regarding timing, a specific schedule is defined for the `us-west` region only. This setup overrides the parent's settings to fine-tune the behavior for specific labels (that is, regions).
|
||||
|
||||
<!-- INTERACTIVE ignore START -->
|
||||
|
||||
@ -255,7 +259,7 @@ To follow the above example, we will create notification policies that route ale
|
||||
1. Override grouping settings:
|
||||
|
||||
- Toggle **Override grouping**.
|
||||
- **Group by**: `region`.
|
||||
- **Group by**: Add `region` as a label. Remove any existing labels.
|
||||
|
||||
**Group by** consolidates alerts that share the same grouping label into a single notification. For example, all alerts with `region=us-west` will be combined into one notification, making it easier to manage and reducing alert fatigue.
|
||||
|
||||
@ -268,11 +272,11 @@ To follow the above example, we will create notification policies that route ale
|
||||
|
||||
1. Save and repeat:
|
||||
|
||||
- Repeat for `region = us-east` with a different webhook or a different contact point.
|
||||
- Repeat the steps above for `region = us-east` but without overriding grouping and timing options. Use a different webhook endpoint as the contact point.
|
||||
|
||||
{{< figure src="/media/docs/alerting/notificaiton-policies-region.png" max-width="750px" alt="Two nested notification policies to route and group alert notifications" >}}
|
||||
|
||||
These nested policies should route alert instances where the region label is either us-west or us-east.
|
||||
These nested policies should route alert instances where the region label is either us-west or us-east. Only the us-west region team should receive grouped alert notifications.
|
||||
|
||||
{{< admonition type="note" >}}
|
||||
In Grafana, each label within a notification policy must have a unique key. If you attempt to add the same label key (e.g., region) with different values (us-west and us-east), only the last entry is saved, and the previous one is discarded. This is because labels are stored as associative arrays (maps), where each key must be unique.
|
||||
@ -383,66 +387,125 @@ Grafana includes a [test data source](https://grafana.com/docs/grafana/latest/da
|
||||
|
||||
Every alert rule is assigned to an evaluation group. You can assign the alert rule to an existing evaluation group or create a new one.
|
||||
|
||||
1. In **Folder**, click **+ New folder** and enter a name. For example: `Multi-region CPU alerts`. This folder contains our alert rules.
|
||||
1. In the **Evaluation group**, repeat the above step to create a new evaluation group. Name it `Multi-region CPU group`.
|
||||
1. In **Folder**, click **+ New folder** and enter a name. For example: `Multi-region alerts`. This folder contains our alert rules.
|
||||
1. In the **Evaluation group**, repeat the above step to create a new evaluation group. Name it `Multi-region group`.
|
||||
1. Choose an **Evaluation interval** (how often the alert rule is evaluated). Choose `1m`.
|
||||
|
||||
The evaluation interval of 1 minute allows Grafana to detect changes quickly, while the longer **Group wait** (from our notification policy) and **Group interval** (inherited from the Default notification policy) allow for efficient grouping of alerts and minimize unnecessary notifications.
|
||||
|
||||
1. Set the pending period to `0s` (zero seconds), so the alert rule fires the moment the condition is met (this minimizes the waiting time for the demonstration).
|
||||
|
||||
### Configure labels and notifications
|
||||
|
||||
Choose the notification policy where you want to receive your alert notifications.
|
||||
Select who should receive a notification when an alert rule fires.
|
||||
|
||||
1. Select **Use notification policy**.
|
||||
1. Click **Preview routing** to ensure correct matching.
|
||||
|
||||
{{< figure src="/media/docs/alerting/region-notification-policy-routing-preview.png" max-width="750px" alt="Preview of alert instance routing with the region label matcher" >}}
|
||||
|
||||
The preview shows that the region label from our data source is successfully matching the notification policies that we created earlier thanks to the label matcher that we configured.
|
||||
The preview should show that the `region` label from our data source successfully matches the notification policies that we created earlier, thanks to the label matcher that we configured.
|
||||
|
||||
1. Click **Save rule and exit**.
|
||||
|
||||
### Create a second alert rule
|
||||
|
||||
Repeat the steps above to create a second alert rule that alerts on high memory usage.
|
||||
|
||||
1. Duplicate the alert rule by clicking on **More > Duplicate**.
|
||||
1. Name it `High Memory usage - Multi-region`.
|
||||
1. Use the below CSV data to simulate a data source returning memory usage.
|
||||
|
||||
```
region,memory-usage,service,instance
us-west,42,cache-server-1,server-09
us-west,88,cache-server-1,server-10
us-east,74,api-server-1,server-11
us-east,90,api-server-1,server-12
us-west,53,analytics-server-1,server-13
us-east,81,analytics-server-2,server-14
us-west,77,analytics-server-1,server-15
us-east,94,analytics-server-2,server-16
```
|
||||
|
||||
1. Click **Save rule and exit**.
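As a quick sanity check on the sample data, the following Python sketch parses the CSV above and lists the rows that would exceed the threshold. It assumes the alert condition is "value above 75", as described later in this tutorial; the `firing_instances` helper is a hypothetical name and this is not how Grafana evaluates the rule internally.

```python
import csv
import io

# Memory usage sample data from the tutorial step above.
CSV_DATA = """region,memory-usage,service,instance
us-west,42,cache-server-1,server-09
us-west,88,cache-server-1,server-10
us-east,74,api-server-1,server-11
us-east,90,api-server-1,server-12
us-west,53,analytics-server-1,server-13
us-east,81,analytics-server-2,server-14
us-west,77,analytics-server-1,server-15
us-east,94,analytics-server-2,server-16
"""

THRESHOLD = 75  # percent; assumed alert condition


def firing_instances(csv_text, value_column, threshold):
    """Return the CSV rows whose value column is above the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[value_column]) > threshold]


firing = firing_instances(CSV_DATA, "memory-usage", THRESHOLD)
for row in firing:
    print(row["region"], row["instance"], row["memory-usage"])
```

Each returned row corresponds to one alert instance, and its `region`, `service`, and `instance` columns become the query labels used for routing and grouping.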
|
||||
|
||||
<!-- INTERACTIVE page step5.md END -->
|
||||
<!-- INTERACTIVE page step6.md START -->
|
||||
|
||||
## Receiving grouped alert notifications
|
||||
|
||||
Now that the alert rule has been configured, you should receive alert notifications in the contact point whenever alerts trigger.
|
||||
Now that the alert rules have been configured, you should receive alert notifications in the contact point(s) whenever alerts trigger.
|
||||
|
||||
When the configured alert rule detects CPU usage higher than 75% across multiple regions, it will evaluate the metric every minute. If the condition persists, notifications will be grouped together, with a **Group wait** of 30 seconds before the first alert is sent. Follow-up notifications are sent every 2 minutes for quick updates in this demonstration, but for reducing alert frequency, consider using the default or increasing the interval. If the condition continues for an extended period, a **Repeat interval** of 4 hours ensures that the alert is only resent if the issue persists
|
||||
The configured alert rules are evaluated every minute and fire when CPU or memory usage is higher than 75% in any region. If the condition persists, notifications are grouped together, with a **Group wait** of 30 seconds before the first notification is sent. For `us-west` alert instances only, follow-up notifications for the same alert group are sent every 2 minutes, which increases the frequency of the grouped alert notifications. For `us-east` alert instances, follow-up notifications should be sent at the default interval of 5 minutes. If the condition continues for an extended period, a **Repeat interval** of 4 hours ensures that the notification is resent only if the issue persists.
|
||||
|
||||
As a result, our notification policy will route two notifications: one notification grouping the three alert instances from the `us-east` region and another grouping the two alert instances from the `us-west` region
|
||||
As a result, our notification policies should route three notifications: one notification grouping both CPU and memory alert instances from the `us-west` region, and two separate notifications (one per alert rule) with alert instances from the `us-east` region.
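The count of three notifications can be reproduced with a small Python sketch. It assumes that the `us-east` policy inherits a grouping by `grafana_folder` and `alertname` from the default policy, and it uses a handful of illustrative firing instances; the `POLICIES` table and `notifications` function are hypothetical and only model the outcome.

```python
from collections import defaultdict

FOLDER = "Multi-region alerts"

# Illustrative firing alert instances (labels only).
firing = [
    {"alertname": "High CPU usage - Multi-region", "grafana_folder": FOLDER,
     "region": "us-west", "instance": "server-05"},
    {"alertname": "High Memory usage - Multi-region", "grafana_folder": FOLDER,
     "region": "us-west", "instance": "server-10"},
    {"alertname": "High CPU usage - Multi-region", "grafana_folder": FOLDER,
     "region": "us-east", "instance": "server-03"},
    {"alertname": "High Memory usage - Multi-region", "grafana_folder": FOLDER,
     "region": "us-east", "instance": "server-12"},
]

# us-west overrides grouping with `region`; us-east keeps the inherited
# grouping (assumed here to be grafana_folder + alertname).
POLICIES = {
    "us-west": {"contact_point": "US-West-Alerts", "group_by": ["region"]},
    "us-east": {"contact_point": "US-East-Alerts",
                "group_by": ["grafana_folder", "alertname"]},
}


def notifications(instances):
    """Map (contact point, group key) to the instances in that notification."""
    groups = defaultdict(list)
    for labels in instances:
        policy = POLICIES.get(labels.get("region"))
        if policy is None:
            continue  # would fall to the default policy; ignored in this sketch
        key = tuple((name, labels.get(name, "")) for name in policy["group_by"])
        groups[(policy["contact_point"], key)].append(labels["instance"])
    return dict(groups)


result = notifications(firing)
for (contact_point, key), members in result.items():
    print(contact_point, dict(key), members)
```

Because `us-west` groups only by `region`, its CPU and memory instances share one group key, while the `us-east` instances differ in `alertname` and therefore end up in two separate notifications.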
|
||||
|
||||
Grouped notifications example:
|
||||
|
||||
Webhook - US East
|
||||
```json
{
  "receiver": "US-West-Alerts",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "High CPU usage - Multi-region",
        "grafana_folder": "Multi-region alerts",
        "instance": "server-05",
        ...
    {
      "status": "firing",
      "labels": {
        "alertname": "High Memory usage - Multi-region",
        "grafana_folder": "Multi-region alerts",
        "instance": "server-10",
      },

...}
```
|
||||
|
||||
_Detail of CPU and memory alert instances grouped into a single notification for us-west contact point._
|
||||
|
||||
```json
|
||||
{
|
||||
"receiver": "webhook-us-east",
|
||||
"receiver": "US-East-Alerts",
|
||||
"status": "firing",
|
||||
"alerts": [{ "instance": "server-03" }, { "instance": "server-06" }, { "instance": "server-08" }]
|
||||
}
|
||||
"alerts": [
|
||||
{
|
||||
"status": "firing",
|
||||
"labels": {
|
||||
"alertname": "High CPU usage - Multi-region",
|
||||
"grafana_folder": "Multi-region alerts",
|
||||
"instance": "server-03",
|
||||
"region": "us-east",
|
||||
"service": "web-server-2"
|
||||
...}}}
|
||||
```
|
||||
|
||||
Webhook - US West
|
||||
_Detail of CPU alert instances grouped into a separate notification for us-east contact point._
|
||||
|
||||
```json
|
||||
{
|
||||
"receiver": "webhook-us-west",
|
||||
"receiver": "US-East-Alerts",
|
||||
"status": "firing",
|
||||
"alerts": [{ "instance": "server-02" }, { "instance": "server-07" }]
|
||||
}
|
||||
"alerts": [
|
||||
{
|
||||
"status": "firing",
|
||||
"labels": {
|
||||
"alertname": "High Memory usage - Multi-region",
|
||||
"grafana_folder": "Multi-region alerts",
|
||||
"instance": "server-12",
|
||||
"region": "us-east"
|
||||
...}}}
|
||||
```
|
||||
|
||||
_Detail of memory alert instances grouped into a separate notification for us-east contact point._
|
||||
|
||||
<!-- INTERACTIVE page step6.md END -->
|
||||
|
||||
<!-- INTERACTIVE page finish.md START -->
|
||||
|
||||
## Conclusion
|
||||
|
||||
Alert rule grouping simplifies incident management by consolidating related alerts. By configuring **notification policies** and using **labels** (such as _region_), you can group alerts based on specific criteria and route them to the appropriate teams. Fine-tuning **timing options**—including group wait, group interval, and repeat interval—further reduces noise and ensures notifications remain actionable without overwhelming on-call engineers.
|
||||
By configuring **notification policies** and using **labels** (such as _region_), you can group alert notifications based on specific criteria and route them to the appropriate teams. Fine-tuning **timing options** (group wait, group interval, and repeat interval) can further reduce noise and ensure notifications remain actionable without overwhelming on-call engineers.
|
||||
|
||||
<!-- INTERACTIVE page finish.md END -->
|
||||
|