Docs: Updates intro alerting topics (#66958)

* Updates intro alerting topics

* taking out cortex

* Adds choice table for alert rules type

* updates with Vikas feedback
brendamuir 2023-04-21 13:24:43 +02:00 committed by GitHub
parent 1d0387dcc2
commit c742503d2c
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 133 additions and 79 deletions

@ -13,7 +13,13 @@ weight: 114
# Alerting
Grafana Alerting allows you to learn about problems in your systems moments after they occur. Create, manage, and take action on your alerts in a single, consolidated view, and improve your team's ability to identify and resolve issues quickly.
Grafana Alerting allows you to learn about problems in your systems moments after they occur.
Monitor your incoming metrics data or log entries and set up your Alerting system to watch for specific events or circumstances and then send notifications when those things are found.
In this way, you eliminate the need for manual monitoring and provide a first line of defense against system outages or changes that could turn into major incidents.
Using Grafana Alerting, you create queries and expressions from multiple data sources, no matter where your data is stored, giving you the flexibility to combine your data and alert on your metrics and logs in new and unique ways. You can then create, manage, and take action on your alerts from a single, consolidated view, and improve your team's ability to identify and resolve issues quickly.
Grafana Alerting is available for Grafana OSS, Grafana Enterprise, or Grafana Cloud. With Mimir and Loki alert rules you can run alert expressions closer to your data and at massive scale, all managed by the Grafana UI you are already familiar with.
@ -21,33 +27,7 @@ Watch this video to learn more about Grafana Alerting: {{< vimeo 720001629 >}}
_Refer to [Manage your alert rules]({{< relref "../alerting/alerting-rules/" >}}) for current instructions._
## Overview
The following diagram gives you an overview of how Grafana Alerting works and introduces you to some of the key concepts that work together and form the core of our flexible and powerful alerting engine.
{{< figure src="/static/img/docs/alerting/unified/about-alerting-flow-diagram-latest.png" caption="Grafana Alerting overview" >}}
1. Alert rules
Set evaluation criteria that determine whether an alert instance will fire. An alert rule consists of one or more queries and expressions, a condition, the frequency of evaluation, and optionally, the duration over which the condition is met.
Grafana managed alerts support multi-dimensional alerting, which means that each alert rule can create multiple alert instances. This is exceptionally powerful if you are observing multiple series in a single expression.
Once an alert rule has been created, it goes through various states and transitions. The state and health of alert rules help you understand several key status indicators about your alerts.
1. Labels
Match an alert rule and its instances to notification policies and silences. They can also be used to group your alerts by severity.
1. Notification policies
Set where, when, and how the alerts get routed. Each notification policy specifies a set of label matchers to indicate which alerts they are responsible for. A notification policy has a contact point assigned to it that consists of one or more notifiers.
1. Contact points
Define how your contacts are notified when an alert fires. We support a multitude of ChatOps tools to ensure the alerts come to your team.
## Features
## Key features and benefits
**One page for all alerts**
@ -55,24 +35,55 @@ A single Grafana Alerting page consolidates both Grafana-managed alerts and aler
**Multi-dimensional alerts**
Alert rules can create multiple individual alert instances per alert rule, known as multi-dimensional alerts, giving you the power and flexibility to gain visibility into your entire system with just a single alert.
Alert rules can create multiple individual alert instances per alert rule, known as multi-dimensional alerts, giving you the power and flexibility to gain visibility into your entire system with just a single alert rule. You do this by adding labels to your query to specify which component is being monitored and generate multiple alert instances for a single alert rule. For example, if you want to monitor each server in a cluster, a multi-dimensional alert will alert on each CPU, whereas a standard alert will alert on the overall server.
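The per-CPU example above can be illustrated with a minimal sketch. It assumes a query that returns one value per labelled series; the function name `evaluate_rule`, the label names, and the threshold are hypothetical and are not part of Grafana's API.

```python
from typing import Dict, List, Tuple

# A series is identified by its label set, stored as a tuple of (name, value) pairs.
Labels = Tuple[Tuple[str, str], ...]


def evaluate_rule(series: Dict[Labels, float], threshold: float) -> List[dict]:
    """Return one alert instance per labelled series (hypothetical helper)."""
    instances = []
    for labels, value in series.items():
        instances.append({
            "labels": dict(labels),
            "value": value,
            # Each series is compared against the threshold independently.
            "state": "Alerting" if value > threshold else "Normal",
        })
    return instances


# Illustrative data: one rule, three series, so three alert instances.
cpu_usage = {
    (("instance", "server-1"), ("cpu", "0")): 93.0,
    (("instance", "server-1"), ("cpu", "1")): 41.5,
    (("instance", "server-2"), ("cpu", "0")): 12.0,
}
instances = evaluate_rule(cpu_usage, threshold=90.0)
```

Only the instance for `cpu="0"` on `server-1` breaches the threshold here, while the other two instances of the same rule stay in the Normal state.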
**Routing alerts**
**Route alerts**
Route each alert instance to a specific contact point based on labels you define. Notification policies are the set of rules for where, when, and how the alerts are routed to contact points.
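As a rough sketch of label-based routing, the snippet below matches an alert instance's labels against a flat, ordered list of policies. It is a simplification under stated assumptions: it supports only equality matchers and first-match-wins ordering, and the `Policy` class, `route` function, and contact point names are hypothetical rather than Grafana's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Policy:
    """Hypothetical notification policy: label matchers plus a contact point."""
    matchers: Dict[str, str]
    contact_point: str


def route(labels: Dict[str, str], policies: List[Policy], default: str) -> str:
    """Return the contact point of the first policy whose matchers all match."""
    for policy in policies:
        if all(labels.get(name) == value for name, value in policy.matchers.items()):
            return policy.contact_point
    # No specific policy matched, so fall back to the default contact point.
    return default


policies = [
    Policy({"team": "db", "severity": "critical"}, "db-oncall-pager"),
    Policy({"team": "db"}, "db-slack"),
]
```

For example, `route({"team": "db", "severity": "warning"}, policies, "default-email")` skips the first policy because the severity does not match and selects `db-slack`.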
**Silencing alerts**
**Silence alerts**
Silences allow you to stop receiving persistent notifications from one or more alerting rules. You can also partially pause an alert based on certain criteria. Silences have their own dedicated section for better organization and visibility, so that you can scan your paused alert rules without cluttering the main alerting view.
Silences stop notifications from getting created and last for only a specified window of time.
Silences allow you to stop receiving persistent notifications from one or more alert rules. You can also partially pause an alert based on certain criteria. Silences have their own dedicated section for better organization and visibility, so that you can scan your paused alert rules without cluttering the main alerting view.
**Mute timings**
With mute timings, you can specify a time interval when you don't want new notifications to be generated or sent. You can also freeze alert notifications for recurring periods of time, such as during a maintenance period.
A mute timing is a recurring interval of time when no new notifications for a policy are generated or sent. Use mute timings to suppress notifications during a specific, recurring period, for example, a regular maintenance period.
Similar to silences, mute timings do not prevent alert rules from being evaluated, nor do they stop alert instances from being shown in the user interface. They only prevent notifications from being created.
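The difference between a one-off silence and a recurring mute timing can be sketched as two time checks. This is an illustrative model only: the function names, the weekday encoding (Python's `datetime.weekday()`, where Monday is 0), and the chosen windows are assumptions, not Grafana's configuration format.

```python
from datetime import datetime, time
from typing import Set


def is_silenced(now: datetime, starts_at: datetime, ends_at: datetime) -> bool:
    """A silence covers a single, fixed window of time."""
    return starts_at <= now < ends_at


def is_muted(now: datetime, weekdays: Set[int], start: time, end: time) -> bool:
    """A mute timing recurs: it applies on the given weekdays within a daily window."""
    return now.weekday() in weekdays and start <= now.time() < end


# Silence for one specific day; mute every Saturday (weekday 5) from 02:00 to 04:00.
silence_start = datetime(2023, 4, 22, 0, 0)
silence_end = datetime(2023, 4, 23, 0, 0)
saturdays = {5}
window_start, window_end = time(2, 0), time(4, 0)
```

In both cases the check only decides whether a notification is created. Rule evaluation itself is unaffected, as described above.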
## Design your Alerting system
Monitoring complex IT systems and understanding whether everything is up and running correctly is a difficult task. Setting up an effective alert management system is therefore essential to inform you when things are going wrong before they start to impact your business outcomes.
Designing and configuring an alert management set up that works takes time.
Here are some tips on how to create an effective alert management set up for your business:
**Which are the key metrics for your business that you want to monitor and alert on?**
- Find events that are important to know about and not so trivial or frequent that recipients ignore them.
- Alerts should only be created for big events that require immediate attention or intervention.
- Consider quality over quantity.
**Which type of Alerting do you want to use?**
- Choose between Grafana-managed Alerting or Grafana Mimir or Loki-managed Alerting; or both.
**How do you want to organize your alerts and notifications?**
- Be selective about who you set to receive alerts. Consider sending them to whoever is on call or a specific Slack channel.
- Automate as far as possible using the Alerting API or alerts as code (Terraform).
**How can you reduce alert fatigue?**
- Avoid noisy, unnecessary alerts by using silences, mute timings, or pausing alert rule evaluation.
- Continually tune your alert rules to review effectiveness. Remove alert rules to avoid duplication or ineffective alerts.
- Think carefully about priority and severity levels.
- Continually review your thresholds and evaluation rules.
## Useful links
- [Fundamental concepts]({{< relref "/docs/grafana/latest/alerting/fundamentals" >}}) of Grafana Alerting.
- [Role-based access control]({{< relref "/docs/grafana/latest/administration/roles-and-permissions/access-control" >}}) in Grafana Enterprise.
- [High availability]({{< relref "/docs/grafana/latest/alerting/fundamentals/high-availability" >}})
- [Introduction to Alerting]({{< relref "/docs/grafana/latest/alerting/fundamentals" >}})

@ -10,42 +10,40 @@ weight: 105
Whether you're starting or expanding your implementation of Grafana Alerting, learn more about the key concepts and available features that help you create, manage, and take action on your alerts and improve your team's ability to resolve issues quickly.
First of all, let's look at the different alert rule types that Grafana Alerting offers.
The following diagram gives you an overview of how Grafana Alerting works and introduces you to some of the key concepts that work together and form the core of our flexible and powerful alerting engine.
## Alert rule types
{{< figure src="/media/docs/alerting/how-alerting-works.png" max-width="750px" caption="How Alerting works" >}}
### Grafana-managed rules
You can either create your alerting resources (alert rules, notification policies, and so on) directly in the Grafana UI, using provisioning, or in your Grafana Mimir or Loki instances.
Grafana-managed rules are the most flexible alert rule type. They allow you to create alerts that can act on data from any of our supported data sources.
In addition to supporting multiple data sources, you can also add expressions to transform your data and set alert conditions.
This is the only type of rule that allows alerting from multiple data sources in a single rule definition.
**Alert rules**
### Mimir and Loki rules
An alert rule is a set of evaluation criteria for when an alert rule should fire. An alert rule consists of one or more queries and expressions, a condition, and the duration over which the condition needs to be met to start firing.
To create Mimir or Loki alerts you must have a compatible Prometheus or Loki data source. You can check if your data source supports rule creation via Grafana by testing the data source and observing if the ruler API is supported.
Add annotations to your alert rule to provide additional information about the alert rule and add labels to uniquely identify your alert rule and configure alert routing. Labels link alert rules to notification policies, so you can easily manage which policy should handle which alerts and who gets notified.
### Recording rules
Once alert rules are created, they go through various states and transitions. An alert rule can produce multiple alert instances - one alert instance for each time series.
Recording rules are only available for compatible Prometheus or Loki data sources.
A recording rule allows you to pre-compute frequently needed or computationally expensive expressions and save their result as a new set of time series. This is useful if you want to run alerts on aggregated data or if you have dashboards that query computationally expensive expressions repeatedly.
Grafana Enterprise offers an alternative to recording rules in the form of [recorded queries](https://grafana.com/docs/grafana/v9.0/enterprise/recorded-queries/) that can be executed against any data source.
The alert rule state is determined by the “worst case” state of the alert instances produced and the states can be Normal, Pending, or Firing. For example, if one alert instance is firing, the alert rule state will also be firing.
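The "worst case" rule can be expressed as taking the most severe state among the instances. The sketch below assumes instance states are reported with the same three names as the rule states, and that a rule with no instances counts as Normal; both are assumptions made for illustration.

```python
from typing import Iterable

# Higher number means more severe; the ordering follows the text above.
SEVERITY = {"Normal": 0, "Pending": 1, "Firing": 2}


def rule_state(instance_states: Iterable[str]) -> str:
    """Return the most severe state found among the alert instances."""
    states = list(instance_states)
    if not states:
        # Assumption: a rule without instances is treated as Normal.
        return "Normal"
    return max(states, key=lambda state: SEVERITY[state])
```

With this model, `rule_state(["Normal", "Firing", "Pending"])` yields `"Firing"`, because a single firing instance is enough to mark the whole rule as firing.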
## Key concepts and features
The alert rule health is determined by the status of the evaluation of the alert rule, which can be Ok, Error, or NoData.
The following table includes a list of key concepts, features and their definitions, designed to help you make the most of Grafana Alerting.
**Alert instances**
| Key concept or feature | Definition |
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Data sources for Alerting | Select data sources you want to query and visualize metrics, logs and traces from. |
| Provisioning for Alerting | Manage your alerting resources and provision them into your Grafana system using file provisioning or Terraform. |
| Scheduler | Evaluates your alert rules; think of it as the component that periodically runs your query against data sources. It is only applicable to Grafana-managed rules. |
| Alertmanager | Manages the routing and grouping of alert instances. |
| Alert rule | A set of evaluation criteria for when an alert rule should fire. An alert rule consists of one or more queries and expressions, a condition, the frequency of evaluation, and the duration over which the condition is met. An alert rule can produce multiple alert instances. |
| Alert instance | An alert instance is created when an alert rule fires. An alert rule can create one or more alert instances. When multiple instances are created as a result of one alert rule, this is referred to as a multi-dimensional alert. |
| Alert group | The Alertmanager groups alert instances by default using the labels for the root notification policy. This controls de-duplication and groups of alert instances which are sent to contact points. |
| Contact point | Define how your contacts are notified when an alert rule fires. |
| Message templating | Create reusable custom templates and use them in contact points. |
| Notification policy | Set of rules for where, when, and how the alerts are grouped and routed to contact points. |
| Labels and label matchers | Labels uniquely identify alert rules. They link alert rules to notification policies and silences, determining which policy should handle them and which alert rules should be silenced. |
| Silences                  | Stop notifications from one or more alert instances. Silences use label matchers to select the alert instances to silence. Unlike a mute timing, which recurs on a schedule, a silence lasts only for a specified window of time. |
| Mute timings              | Specify a time interval when you don't want new notifications to be generated or sent. You can also freeze alert notifications for recurring periods of time, such as during a maintenance period. Must be linked to an existing notification policy. |
For Grafana-managed alert rules, multiple alert instances can be created as a result of one alert rule (also known as multi-dimensional alerting).
Both Grafana-managed and Mimir or Loki-managed alert instances can be in the Normal, Pending, Alerting, No Data, or Error state.
**Note:** For Mimir or Loki-managed alert rules, alert instances are only created when the threshold condition defined in an alert rule is breached.
Alert instances in the Alerting state are grouped by labels according to the notification policy. This grouping controls de-duplication and determines which alert instances are sent together to your contact points.
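Grouping by labels can be pictured as bucketing alert instances by the values of a chosen set of label names. The following is a minimal sketch, assuming each instance is represented by a plain dictionary of labels; `group_instances` and the `group_by` parameter are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_instances(
    instances: List[Dict[str, str]], group_by: List[str]
) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Bucket alert instances by the values of the group-by labels."""
    groups: Dict[Tuple[str, ...], List[Dict[str, str]]] = defaultdict(list)
    for labels in instances:
        # Instances missing a group-by label share the empty-string bucket.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


firing = [
    {"alertname": "HighCPU", "instance": "server-1"},
    {"alertname": "HighCPU", "instance": "server-2"},
    {"alertname": "DiskFull", "instance": "server-1"},
]
groups = group_instances(firing, group_by=["alertname"])
```

Each resulting group would correspond to one notification, so the two `HighCPU` instances are reported together rather than separately.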
**Notification policy**
Set where, when, and how firing alert instances get routed.
Each notification policy contains a set of label matchers to indicate which alert rules or instances it is responsible for. It also has a contact point assigned to it that consists of one or more contact point types, such as Slack or email. Contact points define how your contacts are notified when an alert instance fires.
Use message templates for your notifications to create reusable custom templates and use them in contact points.
Add silences to stop notifications from one or more alert instances or use mute timings to specify time intervals when you don't want new notifications to be generated or sent out. The difference between the two is that a silence lasts only for a specified window of time, whereas a mute timing recurs on a schedule, for example, during a maintenance period.

@ -8,9 +8,9 @@ title: Alert rules
weight: 101
---
# About alert rules
# Alert rules
An alerting rule is a set of evaluation criteria that determine whether an alert instance will fire. The rule consists of one or more queries and expressions, a condition, the frequency of evaluation, and optionally, the duration over which the condition is met.
An alert rule is a set of evaluation criteria for when an alert rule should fire. An alert rule consists of one or more queries and expressions, a condition, and the duration over which the condition needs to be met to start firing.
While queries and expressions select the data set to evaluate, a condition sets the threshold that an alert must meet or exceed to create an alert.
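How the condition and the duration interact can be sketched as a small state function. It assumes the caller tracks how long the condition has been continuously met; the function name, parameter names, and use of seconds are illustrative choices, not Grafana's internals.

```python
def next_state(condition_met: bool, met_for_seconds: int, for_seconds: int) -> str:
    """Derive the state from the condition and how long it has held (hypothetical helper)."""
    if not condition_met:
        # The threshold is not breached, so nothing is pending or firing.
        return "Normal"
    if met_for_seconds >= for_seconds:
        # The condition has held for the full configured duration.
        return "Firing"
    # The condition is met, but not yet for long enough to fire.
    return "Pending"
```

For a rule with a five-minute duration, `next_state(True, 120, 300)` gives `"Pending"` and `next_state(True, 300, 300)` gives `"Firing"`.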

@ -10,24 +10,69 @@ weight: 102
# Alert rule types
Grafana supports several alert rule types. The following sections explain their merits and demerits and help you choose the right alert type for your use case.
Grafana supports several different alert rule types. Learn more about each of the alert rule types, how they work, and decide which one is best for your use case.
## Grafana managed rules
## Grafana-managed alert rules
Grafana-managed rules are the most flexible alert rule type. They allow you to create alerts that can act on data from any of your existing data sources.
Grafana-managed alert rules are the most flexible alert rule type. They allow you to create alerts that can act on data from any of our supported data sources.
In addition to supporting any data source, you can add [expressions]({{< relref "/docs/grafana/latest/panels-visualizations/query-transform-data/expression-queries" >}}) to transform your data and express alert conditions.
In addition to supporting multiple data sources, you can also add expressions to transform your data and set alert conditions. Using images in alert notifications is also supported. This is the only type of rule that allows alerting from multiple data sources in a single rule definition.
## Mimir, Loki and Cortex rules
The following diagram shows how Grafana-managed alerting works.
To create Mimir, Loki or Cortex alerts you must have a compatible Prometheus data source. You can check if your data source is compatible by testing the data source and checking in the details whether the ruler API is supported.
{{< figure src="/media/docs/alerting/grafana-managed-rule.png" max-width="750px" caption="How Alerting works" >}}
{{< figure src="/static/img/docs/alerting/unified/mimir-datasource-check.png" caption="Successfully connected to a Mimir Prometheus datasource" max-width="40%" >}}
1. Alert rules are created within Grafana based on one or more data sources.
1. Alert rules are evaluated by the Alert Rule Evaluation Engine from within Grafana.
1. Alerts are delivered using the internal Grafana Alertmanager.
**Note:**
You can also configure alerts to be delivered using an external Alertmanager, or use both internal and external Alertmanagers.
For more information, see Add an external Alertmanager.
## Grafana Mimir or Loki-managed alert rules
To create Grafana Mimir or Grafana Loki-managed alert rules, you must have a compatible Prometheus or Loki data source.
You can check if your data source supports rule creation via Grafana by testing the data source and observing if the Ruler API is supported.
For more information on the Ruler API, refer to [Ruler API](docs/loki/latest/api/#ruler).
The following diagram shows how Grafana Mimir or Grafana Loki-managed alerting works.
{{< figure src="/media/docs/alerting/loki-mimir-rule.png" max-width="750px" caption="How Alerting works" >}}
1. Alert rules are created and stored within the data source itself.
1. Alert rules can only be created based on Prometheus data.
1. Alert rule evaluation and delivery is distributed across multiple nodes for high availability and fault tolerance.
## Recording rules
Recording rules are only available for compatible Prometheus data sources like Mimir, Loki and Cortex.
Recording rules are only available for compatible Prometheus or Loki data sources.
A recording rule allows you to save an expression's result to a new set of time series. This is useful if you want to run alerts on aggregated data or if you have dashboards that query the same expression repeatedly.
A recording rule allows you to pre-compute frequently needed or computationally expensive expressions and save their result as a new set of time series. This is useful if you want to run alerts on aggregated data or if you have dashboards that query computationally expensive expressions repeatedly.
Read more about [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) in Prometheus.
Grafana Enterprise offers an alternative to recording rules in the form of recorded queries that can be executed against any data source.
For more information on recording rules in Prometheus, refer to [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/).
## Choose an alert rule type
When choosing which alert rule type to use, consider the following comparison between Grafana-managed alert rules and Grafana Mimir or Loki alert rules.
| Feature | Grafana-managed alert rule | Loki/Mimir-managed alert rule |
| ----------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Create alert rules based on data from any of our supported data sources | Yes | No: You can only create alert rules that are based on Prometheus data. The data source must have the Ruler API enabled. |
| Mix and match data sources | Yes | No |
| Includes support for recording rules | No | Yes |
| Add expressions to transform your data and set alert conditions | Yes | No |
| Use images in alert notifications | Yes | No |
| Scaling | More resource intensive, depend on the database, and are likely to suffer from transient errors. They only scale vertically. | Store alert rules within the data source itself and allow for “infinite” scaling. Generate and send alert notifications from the location of your data. |
| Alert rule evaluation and delivery | Alert rule evaluation and delivery is done from within Grafana, using an external Alertmanager; or both. | Alert rule evaluation and alert delivery is distributed, meaning there is no single point of failure. |
**Note:**
If you are using non-Prometheus data, we recommend choosing Grafana-managed alert rules. Otherwise, choose Grafana Mimir or Grafana Loki alert rules where possible.
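The comparison table and the note can be condensed into a small decision helper. This is only one possible reading of the guidance above; the function name, parameter names, and returned strings are hypothetical.

```python
def choose_rule_type(
    prometheus_compatible: bool,
    ruler_api_enabled: bool = True,
    needs_multiple_data_sources: bool = False,
    needs_expressions: bool = False,
    needs_images: bool = False,
) -> str:
    """Suggest an alert rule type based on the comparison table (illustrative only)."""
    # Mimir or Loki-managed rules require Prometheus-compatible data and the Ruler API.
    if not (prometheus_compatible and ruler_api_enabled):
        return "grafana-managed"
    # These features are listed as available only for Grafana-managed rules.
    if needs_multiple_data_sources or needs_expressions or needs_images:
        return "grafana-managed"
    return "mimir-or-loki-managed"
```

For instance, `choose_rule_type(prometheus_compatible=True)` suggests a Mimir or Loki-managed rule, while adding `needs_images=True` switches the suggestion to a Grafana-managed rule.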