Docs alerting: meta monitoring topics (#74440)
* Docs alerting: adds insights logs and updates metamonitoring topic
* updates meta monitoring
* Update docs/sources/alerting/monitor/_index.md (Co-authored-by: Jack Baldry <jack.baldry@grafana.com>)
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: Jack Baldry <jack.baldry@grafana.com>)
* removes alias
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: Jennifer Villa <jvilla2013@gmail.com>)
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: Jennifer Villa <jvilla2013@gmail.com>)
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: Jennifer Villa <jvilla2013@gmail.com>)
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: Jennifer Villa <jvilla2013@gmail.com>)
* updates numbering
* adds codeblock
* updates
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: George Robinson <george.robinson@grafana.com>)
* Update docs/sources/alerting/monitor/meta-monitoring/_index.md (Co-authored-by: George Robinson <george.robinson@grafana.com>)

---------

Co-authored-by: Jack Baldry <jack.baldry@grafana.com>
Co-authored-by: Jennifer Villa <jvilla2013@gmail.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>
This commit is contained in:
parent 0d1845f857
commit b22cfcd336
docs/sources/alerting/monitor/_index.md (new file, 27 lines)
@@ -0,0 +1,27 @@
---
canonical: https://grafana.com/docs/grafana/latest/alerting/monitor/
description: Monitor alerting metrics and data
keywords:
  - grafana
  - alert
  - monitoring
labels:
  products:
    - cloud
    - enterprise
    - oss
menuTitle: Monitor
title: Monitor alerting metrics
weight: 180
---

## Monitor alerting metrics

Monitor your alerting metrics to ensure you identify potential issues before they become critical.

[Meta monitoring][meta-monitoring]

{{% docs/reference %}}
[meta-monitoring]: "/docs/grafana/ -> /docs/grafana/<GRAFANA VERSION>/alerting/monitor/meta-monitoring"
[meta-monitoring]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/meta-monitoring"
{{% /docs/reference %}}
docs/sources/alerting/monitor/meta-monitoring/_index.md

@@ -1,8 +1,8 @@
---
aliases:
  - meta-monitoring/
  - ../set-up/meta-monitoring/
canonical: https://grafana.com/docs/grafana/latest/alerting/monitor/meta-monitoring/
description: Meta monitoring
keywords:
  - grafana
@@ -12,19 +12,33 @@ labels:
  products:
    - enterprise
    - oss
    - cloud
title: Meta monitoring
weight: 100
---

# Meta monitoring

Meta monitoring is the process of monitoring your monitoring system and alerting when your monitoring is not working as it should.

Whether you use Grafana-managed alerts or Grafana Mimir-managed alerts, meta monitoring is possible both on-premise and in Grafana Cloud.

To enable meta monitoring, Grafana provides predefined metrics for both on-premise and Cloud installations.

Identify which metrics are critical to your monitoring system (that is, Grafana), and then set up how you want to monitor them.

## Metrics for Grafana-managed alerts

{{% admonition type="note" %}}
There are no metrics in the `grafanacloud-usage` data source at this time that can be used to meta monitor Grafana-managed alerts.
{{% /admonition %}}

To meta monitor Grafana-managed alerts, you need a Prometheus server, or other metrics database, to collect and store metrics exported by Grafana.

For example, if you are using Prometheus, add a `scrape_config` to Prometheus to scrape metrics from Grafana, Alertmanager, or your data sources.

### Example

```yaml
- job_name: grafana
```

@@ -39,6 +53,8 @@ Here is an example of how this might look:

```yaml
    - grafana:3000
```
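The hunk above omits the middle of the scrape configuration. The following is a minimal sketch of what a complete `scrape_config` could look like; the scrape interval, metrics path, and the `grafana:3000` target are assumptions based on the fragment shown and on Grafana's default HTTP port, so adjust them to match your deployment.

```yaml
scrape_configs:
  # Scrape Grafana's /metrics endpoint so that the alerting metrics described below are collected.
  - job_name: grafana
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          # Assumes Grafana is reachable at its default HTTP port 3000.
          - grafana:3000
```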

### List of available metrics

The Grafana ruler, which is responsible for evaluating alert rules, and the Grafana Alertmanager, which is responsible for sending notifications of firing and resolved alerts, provide a number of metrics that let you observe them.

#### grafana_alerting_alerts

@@ -67,19 +83,31 @@ This metric is a histogram that shows you the number of seconds taken to send notifications

## Metrics for Mimir-managed alerts

{{% admonition type="note" %}}
These metrics are available in OSS, on-premise, and Grafana Cloud.
{{% /admonition %}}

To meta monitor Grafana Mimir-managed alerts, open source and on-premise users need a Prometheus/Mimir server, or another metrics database, to collect and store metrics exported by the Mimir ruler.

#### rule_evaluation_failures_total

This metric is a counter that shows you the total number of rule evaluation failures.
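As an illustration of how you might alert on this counter, here is a sketch of a Prometheus alerting rule that fires when any rule evaluations fail over a five-minute window. The rule name, threshold, and window are assumptions rather than part of the Grafana documentation, and depending on your Mimir version the metric may still be exposed with a `cortex_prometheus_` prefix.

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: RuleEvaluationFailures
        # Fires when the ruler reports any failed rule evaluations in the last 5 minutes.
        expr: sum(increase(rule_evaluation_failures_total[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Alerting or recording rule evaluations are failing.
```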

## Metrics for Alertmanager

{{% admonition type="note" %}}
These metrics are available for OSS and on-premise installations, and some are available in Grafana Cloud.

To view all of the meta monitoring and usage metrics available in Grafana Cloud, use the Metrics browser with the `grafanacloud-usage` data source, which is provisioned for all Grafana Cloud customers.
{{% /admonition %}}

To meta monitor the Alertmanager, you need a Prometheus/Mimir server, or another metrics database, to collect and store metrics exported by Alertmanager.

For example, if you are using Prometheus, add a `scrape_config` to Prometheus to scrape metrics from your Alertmanager.

### Example

```yaml
- job_name: alertmanager
```

@@ -94,9 +122,11 @@ Here is an example of how this might look:

```yaml
    - alertmanager:9093
```
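As with the Grafana job earlier, the hunk omits the middle of this example. A minimal sketch mirroring that job, assuming Alertmanager listens on its default port 9093:

```yaml
scrape_configs:
  # Mirrors the Grafana scrape job, pointed at Alertmanager instead.
  - job_name: alertmanager
    static_configs:
      - targets:
          # Assumes Alertmanager is reachable at its default port 9093.
          - alertmanager:9093
```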

### List of available metrics

#### alertmanager_alerts

This metric is a counter that shows you the number of active, suppressed, and unprocessed alerts in Alertmanager. Suppressed alerts are silenced alerts, and unprocessed alerts are alerts that have been sent to the Alertmanager but have not been processed.

#### alertmanager_alerts_invalid_total

@@ -116,9 +146,13 @@ This metric is a histogram that shows you the amount of time it takes Alertmanag

## Metrics for Alertmanager in high availability mode

{{% admonition type="note" %}}
These metrics are not available in Grafana Cloud as it uses a different high availability strategy than on-premise Alertmanagers.
{{% /admonition %}}

If you are using Alertmanager in high availability mode, there are a number of additional metrics that you might want to create alerts for.

#### alertmanager_cluster_members

@@ -140,14 +174,48 @@ This metric is a gauge. It has a constant value `1`, and contains a label called

This metric is a counter that shows you the number of failed peer connection attempts. In most cases you will want to use the `rate` function to understand how often reconnections fail, as this may be indicative of an issue or instability in your network.
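The metric heading for this description is elided by the hunk above; in upstream Alertmanager the failed-reconnection counter is exposed as `alertmanager_cluster_reconnections_failed_total`, which the following sketch assumes. The rule name, window, and threshold are likewise assumptions, shown only to illustrate applying `rate` to this counter.

```yaml
groups:
  - name: alertmanager-ha
    rules:
      - alert: AlertmanagerPeerReconnectionsFailing
        # Alert when peer reconnection attempts are failing over the last 10 minutes.
        # The metric name is an assumption; confirm it against your Alertmanager's /metrics output.
        expr: sum(rate(alertmanager_cluster_reconnections_failed_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Alertmanager cluster peers are failing to reconnect.
```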

## Monitor your metrics

To monitor your metrics, you can:

1. [Optional] Create a dashboard in Grafana that uses these metrics in a panel (just like you would for any other kind of metric).
1. [Optional] Create an alert rule in Grafana that checks these metrics regularly (just like you would do for any other kind of alert rule).
1. [Optional] Use the Explore module in Grafana.

## Use insights logs

{{% admonition type="note" %}}
For Grafana Cloud only.
{{% /admonition %}}

Use insights logs to help you determine which alerting and recording rules are failing to evaluate and why. These logs contain helpful information on specific alert rules that are failing, provide you with the actual error message, and help you diagnose what is going wrong.

### Before you begin

To view your insights logs, you must have the following:

- A Grafana Cloud account
- Admin or Editor user permissions for the managed Grafana Cloud instance

### Procedure

To explore logs pertaining to failing alerting and recording rules, complete the following steps.

1. Log in to your instance and click the **Explore** (compass) icon in the menu sidebar.
1. Use the data sources dropdown located at the top of the page to select the data source.

   The data source name should be similar to `grafanacloud-<yourstackname>-usage-insights`.

1. To find the logs you want to see, use the **Label filters** and **Line contains** options in the query editor.

   To look at a particular stack, you can filter by **instance_id** instead of **org_id**.

   The following is an example query that would surface insights logs:

   ```
   {org_id="<your-org-id>"} | logfmt | component = `ruler` | msg = `Evaluating rule failed`
   ```

1. Click **Run query**.

1. In the **Logs** section, view specific information on which alert rule is failing and why.

1. You can see the rule contents (in the `rule` field), the rule name (in the `name` field), the name of the group it's in (in the `group` field), and the error message (in the `err` field).