Alerting: Document Grafana 8 High Availability setup (#41503)

* Alerting: Document Grafana 8 High Availability setup

* Apply suggestions from code review

Co-authored-by: Yuriy Tseretyan <tceretian@gmail.com>

* Wordsmithing

* Apply suggestions from code review

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* Address review feedback

* Apply suggestions from code review

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Update docs/sources/alerting/unified-alerting/high-availability.md

Co-authored-by: George Robinson <george.robinson@grafana.com>

* address review feedback

* Prettier

Co-authored-by: Yuriy Tseretyan <tceretian@gmail.com>
Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>
2025-02-25 18:55:37 -06:00 · 2021-11-12 18:47:39 +00:00 · 2021-11-12 18:47:39 +00:00 · dfa14e9500
commit dfa14e9500
parent bcc46990aa
4 changed files with 74 additions and 5 deletions
--- a/docs/sources/administration/set-up-for-high-availability.md
+++ b/docs/sources/administration/set-up-for-high-availability.md
@@ -22,7 +22,15 @@ Grafana will now persist all long term data in the database. How to configure th

 ## Alerting

-Currently alerting supports a limited form of high availability. [Alert notifications]({{< relref "../alerting/old-alerting/notifications.md" >}}) are deduplicated when running multiple servers. This means all alerts are executed on every server but alert notifications are only sent once per alert. Grafana does not support load distribution between servers.
+**Grafana 8 alerts**
+
+Grafana 8 alerting introduces a new high availability model. It preserves the previous semantics: every server evaluates all alerts, and notifications are sent only once per alert. Load distribution between servers is not supported at this time.
+
+For configuration, [follow the guide]({{< relref "../alerting/unified-alerting/high-availability.md" >}}).
+
+**Legacy dashboard alerts**
+
+Legacy Grafana alerting supports a limited form of high availability. [Alert notifications]({{< relref "../alerting/old-alerting/notifications.md" >}}) are deduplicated when running multiple servers. This means all alerts are executed on every server but alert notifications are only sent once per alert. Grafana does not support load distribution between servers.

 ## Grafana Live

--- a/docs/sources/alerting/_index.md
+++ b/docs/sources/alerting/_index.md
@@ -7,11 +7,11 @@ weight = 110

 Alerts allow you to learn about problems in your systems moments after they occur. Robust and actionable alerts help you identify and resolve issues quickly, minimizing disruption to your services.

-Grafana 8.0 has new and improved alerting that centralizes alerting information in a single, searchable view. It allows you to to:
+Grafana 8.0 has new and improved alerting that centralizes alerting information in a single, searchable view. It allows you to:

-- Create and manage Grafana managed alerts
+- Create and manage Grafana alerts
 - Create and manage Cortex and Loki managed alerts
-- View alerting information from Prometheus compatible data sources
+- View alerting information from Prometheus and Alertmanager compatible data sources

 Grafana 8 alerting has four key components:

--- a/docs/sources/alerting/unified-alerting/_index.md
+++ b/docs/sources/alerting/unified-alerting/_index.md
@@ -22,5 +22,4 @@ Before you begin using Grafana 8 alerting, we recommend that you familiarize you

 ## Limitations

-- Grafana 8 alerting doesn’t support high availability. Alert notifications are not de-duplicated and load balancing is not supported between instances. For example, silences from one instance will not appear in another.
 - The Grafana 8 alerting system can retrieve rules from all available Prometheus, Loki, and Alertmanager data sources. It might not be able to fetch rules from other supported data sources.
--- a/docs/sources/alerting/unified-alerting/high-availability.md
+++ b/docs/sources/alerting/unified-alerting/high-availability.md
@@ -0,0 +1,62 @@
++++
+title = "Configure high availability"
+description = "High Availability"
+keywords = ["grafana", "alerting", "tutorials", "ha", "high availability"]
+weight = 450
++++
+
+# High availability
+
+The Grafana alerting system has two main components: a `Scheduler` and an internal `Alertmanager`. The `Scheduler` is responsible for the evaluation of your [alert rules]({{< relref "./fundamentals/evaluate-grafana-alerts.md" >}}) while the internal Alertmanager takes care of the **routing** and **grouping**.
+
+When you run Grafana alerting in high availability, the operational mode of the scheduler is unaffected: all alerts continue to be evaluated in each Grafana instance. The operational change happens in the Alertmanager, which **deduplicates** alert notifications across Grafana instances.
+
+```
+  .─────.
+ ╱       ╲                                                                      ┌────────────────┐
+(  User   )──────┐                        ┌──────────────────────────────────┐  │                │
+ `.     ,'       │                        │┌─────────┐      ┌──────────────┐ │  │                ▼
+   `───'         │                        ││Scheduler│──────▶ Alertmanager │─┼──┘    ┌──────────────────────┐
+                 │      ┌───────────┐  ┌─▶│└─────────┘      ▲──────────────┤ │       │                      │
+  .─────.        │      │   Load    │  │  │Grafana          │              │ │       │                      │
+ ╱       ╲       │      │ Balancing │  │  └─────────────────┼──────────────┼─┘       │     Integrations     │
+(  User   )──────┼─────▶│  Reverse  │──┤  ┌─────────────────┼──────────────┼─┐       │                      │
+ `.     ,'       │      │   Proxy   │  │  │┌─────────┐      ├──────────────▼ │       │                      │
+   `───'         │      └───────────┘  │  ││Scheduler│──────▶ Alertmanager │─┼──┐    └──────────────────────┘
+                 │                     └─▶│└─────────┘      └──────────────┘ │  │                ▲
+  .─────.        │                        │Grafana                           │  │                │
+ ╱       ╲       │                        └──────────────────────────────────┘  └────────────────┘
+(  User   )──────┘
+ `.     ,'
+   `───'
+```
+
+The coordination between Grafana instances happens via [a Gossip protocol](https://en.wikipedia.org/wiki/Gossip_protocol). Alerts are not gossiped between instances. It is expected that each scheduler delivers the same alerts to each Alertmanager.
+
+The two types of messages that are gossiped between instances are:
+
+- Notification logs: Who (which instance) notified what (which alert)
+- Silences: Whether notifications for an alert should be suppressed
+
+These two states are persisted in the database periodically and when Grafana shuts down gracefully.
+
+## Enable high availability
+
+To enable high availability support, add at least one Grafana instance to the [`ha_peers` configuration option]({{< relref "../../administration/configuration.md#unified_alerting" >}}) within the `[unified_alerting]` section:
+
+1. In your custom configuration file ($WORKING_DIR/conf/custom.ini), go to the `[unified_alerting]` section.
+2. Set `ha_peers` to the list of hosts for each Grafana instance in the cluster, in `host:port` format, for example `ha_peers=10.0.0.5:9094,10.0.0.6:9094,10.0.0.7:9094`.
+3. Gossiping of notifications and silences uses both TCP and UDP port 9094. Each Grafana instance will need to be able to accept incoming connections on these ports.
+4. Set `ha_listen_address` to the instance IP address in `host:port` format (or the [Pod's](https://kubernetes.io/docs/concepts/workloads/pods/) IP if you are using Kubernetes). By default, it listens on all interfaces (`0.0.0.0`).
+
+## Kubernetes
+
+If you are using Kubernetes, you can expose the pod IP [through an environment variable](https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/) via the container definition such as:
+
+```yaml
+env:
+- name: POD_IP
+  valueFrom:
+    fieldRef:
+      fieldPath: status.podIP
+```
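
Note (not part of the commit): the following is a minimal sketch of how the `POD_IP` variable above might be wired into the `[unified_alerting]` settings on Kubernetes. It assumes Grafana's `GF_<SECTION>_<KEY>` environment variable override convention applies to `ha_listen_address` and `ha_peers`. The container name, image, and port names are hypothetical, and the peer addresses are reused from the `ha_peers` example in the new page.

```yaml
# Illustrative container spec fragment; adapt names and addresses to your cluster.
containers:
  - name: grafana            # hypothetical container name
    image: grafana/grafana   # pin a specific version in practice
    ports:
      # Gossip uses both TCP and UDP on port 9094.
      - name: gossip-tcp
        containerPort: 9094
        protocol: TCP
      - name: gossip-udp
        containerPort: 9094
        protocol: UDP
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      # Assumed override for ha_listen_address in [unified_alerting].
      # $(POD_IP) expands only because POD_IP is defined earlier in this list.
      - name: GF_UNIFIED_ALERTING_HA_LISTEN_ADDRESS
        value: "$(POD_IP):9094"
      # Assumed override for ha_peers; addresses taken from the example above.
      - name: GF_UNIFIED_ALERTING_HA_PEERS
        value: "10.0.0.5:9094,10.0.0.6:9094,10.0.0.7:9094"
```

Because Pod IPs change when Pods are rescheduled, a static peer list like this one is only suitable for illustration; a stable DNS name for the peers would likely be needed in a real deployment.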