docs(alerting): add section about running redis for HA (#73153)

2025-02-25 18:55:37 -06:00 · 2023-08-11 13:57:00 +02:00
parent a70d2d39f6
commit 44b94eca19
1 changed files with 27 additions and 2 deletions
--- a/docs/sources/alerting/set-up/configure-high-availability/_index.md
+++ b/docs/sources/alerting/set-up/configure-high-availability/_index.md
@@ -23,7 +23,7 @@ weight: 400
 You can enable alerting high availability support by updating the Grafana configuration file. If you run Grafana in a Kubernetes cluster, additional steps are required. Both options are described below.
 Please note that the deduplication is done for the notification, but the alert will still be evaluated on every Grafana instance. This means that events in alerting state history will be duplicated by the number of Grafana instances running.

-## Enable alerting high availability in Grafana
+## Enable alerting high availability in Grafana using Memberlist

 ### Before you begin

@@ -36,7 +36,32 @@ Since gossiping of notifications and silences uses both TCP and UDP port `9094`,
   You must have at least one (1) Grafana instance added to the [`[ha_peer]` section.
 3. Set `[ha_listen_address]` to the instance IP address using a format of `host:port` (or the [Pod's](https://kubernetes.io/docs/concepts/workloads/pods/) IP in the case of using Kubernetes).
   By default, it is set to listen to all interfaces (`0.0.0.0`).
-4. Set `[ha_peer_timeout]` in the `[unified_alerting]` section of the custom.ini to specify the time to wait for an instance to send a notification via the Alertmanager. The default value is 15s, but it may increase if Grafana servers are located in different geographic regions or if the network latency between them is high
+4. Set `[ha_peer_timeout]` in the `[unified_alerting]` section of the custom.ini to specify the time to wait for an instance to send a notification via the Alertmanager. The default value is 15s, but it may increase if Grafana servers are located in different geographic regions or if the network latency between them is high.
+
+## Enable alerting high availability in Grafana using Redis
+
+As an alternative to Memberlist, you can use Redis for high availability. This is useful if you want to have a central
+database for HA and cannot support the meshing of all Grafana servers.
+
+1. Make sure you have a redis server that supports pub/sub. If you use a proxy in front of your redis cluster, make sure the proxy supports pub/sub.
+2. In your custom configuration file ($WORKING_DIR/conf/custom.ini), go to the [unified_alerting] section.
+3. Set `ha_redis_address` to the redis server address Grafana should connect to.
+4. [Optional] Set the username and password if authentication is enabled on the redis server using `ha_redis_username` and `ha_redis_password`.
+5. [Optional] Set `ha_redis_prefix` to something unique if you plan to share the redis server with multiple Grafana instances.
+
+The following metrics can be used for meta monitoring, exposed by Grafana's `/metrics` endpoint:
+
+| Metric                                               | Description                                                                                                    |
+| ---------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
+| alertmanager_cluster_messages_received_total         | Total number of cluster messages received.                                                                     |
+| alertmanager_cluster_messages_received_size_total    | Total size of cluster messages received.                                                                       |
+| alertmanager_cluster_messages_sent_total             | Total number of cluster messages sent.                                                                         |
+| alertmanager_cluster_messages_sent_size_total        | Total number of cluster messages received.                                                                     |
+| alertmanager_cluster_messages_publish_failures_total | Total number of messages that failed to be published.                                                          |
+| alertmanager_cluster_members                         | Number indicating current number of members in cluster.                                                        |
+| alertmanager_peer_position                           | Position the Alertmanager instance believes it's in. The position determines a peer's behavior in the cluster. |
+| alertmanager_cluster_pings_seconds                   | Histogram of latencies for ping messages.                                                                      |
+| alertmanager_cluster_pings_failures_total            | Total number of failed pings.                                                                                  |

 ## Enable alerting high availability using Kubernetes