grafana

mirror of https://github.com/grafana/grafana.git synced 2024-11-25 18:30:41 -06:00

Author	SHA1	Message	Date
Steve Simpson	ad7f804255	Alerting: Fix evaluation metrics to not count retries (#85873 ) * Change evaluation metrics to only count once per eval, and add new metrics. * Cosmetic: Move eval total Inc() to orginal place.	2024-04-12 16:20:46 +02:00
William Wernert	fabaff9a24	Alerting: Create metric for rules using simple notifications (#82904 ) --------- Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>	2024-02-16 19:01:49 +02:00
George Robinson	c8ccc4649c	Alerting: Support UTF-8 (#81512 ) This pull request updates our fork of Alertmanager to commit 65bdab0, which is based on commit 5658f8c in Prometheus Alertmanager. It applies the changes from grafana/alerting#155 which removes the overrides for validation of alerts, labels and silences that we had put in place to allow alerts and silences to work for non-Prometheus datasources. However, as this is now supported in Alertmanager with the UTF-8 work, we can use the new upstream functions and remove these overrides. The compat package is a package in Alertmanager that takes care of backwards compatibility when parsing matchers, validating alerts, labels and silences. It has three modes: classic mode, UTF-8 strict mode, fallback mode. These modes are controlled via compat.InitFromFlags. Grafana initializes the compat package without any feature flags, which is the equivalent of fallback mode. Classic and UTF-8 strict mode are used in Mimir. While Grafana Managed Alerts have no need for fallback mode, Grafana can still be used as an interface to manage the configurations of Mimir Alertmanagers and view configurations of Prometheus Alertmanager, and those installations might not have migrated or being running on older versions. Such installations behave as if in classic mode, and Grafana must be able to parse their configurations to interact with them for some period of time. As such, Grafana uses fallback mode until we are ready to drop support for outdated installations of Mimir and the Prometheus Alertmanager.	2024-02-06 08:33:47 +00:00
George Robinson	05d858635c	Alerting: Add metric for inhibition rules (#81119 ) This commit adds a metric for the number of inhibition rules. It matches the metric added upstream in #3681.	2024-01-23 19:43:17 +00:00
Jean-Philippe Quéméner	aa25776f81	Alerting: Add a feature flag to periodically save states (#80987 )	2024-01-23 17:03:30 +01:00
Alexander Weaver	3c796ecc8f	Alerting: Add metric counting rule groups per org (#80669 ) * Refactor, fix bad map hint * Count groups per org	2024-01-16 16:35:56 -06:00
Santiago	3afd94185c	Alerting: Add metric to check for default AM configurations (#80225 ) * Alerting: Add metric to check for default AM configurations * Use a gauge for the config hash * don't go out of bounds when converting uint64 to float64 * expose metric for config hash * update metrics after applying config	2024-01-16 17:12:24 +01:00
Santiago	9e78faa7ba	Alerting: Add metrics to the remote Alertmanager struct (#79835 ) * Alerting: Add metrics to the remote Alertmanager struct * rephrase http_requests_failed description * make linter happy * remove unnecessary metrics * extract timed client to separate package * use histogram collector from dskit * remove weaveworks dependency * capture metrics for all requests to the remote Alertmanager (both clients) * use the timed client in the MimirAuthRoundTripper * HTTPRequestsDuration -> HTTPRequestDuration, clean up mimir client factory function * refactor * less git diff * gauge for last readiness check in seconds * initialize LastReadinesCheck to 0, tweak metric names and descriptions * add counters for sync attempts/errors * last config sync and last state sync timestamps (gauges) * change latency metric name * metric for remote Alertmanager mode * code review comments * move label constants to metrics package	2024-01-10 11:18:24 +01:00
Alexander Weaver	a8fb01a502	Swap weaveworks/common utilities for equivalents in grafana/dskit (#80051 ) * Replace histogram collector and grpc injectors * Extract request timing utility * Also vendor test file * Suppress erroneous linter warn	2024-01-05 10:08:38 -06:00
William Wernert	f7bf818527	Alerting: Make alert state history Loki http client public (#78291 ) * Make state history Loki client public * Make historian metrics subsystem configurable	2023-11-27 09:20:50 -05:00
gotjosh	e877174501	Alerting: Expose metrics for Alertmanager Alerts - `grafana_alerting_alertmanager_alerts` (#75802 ) * Alerting: Expose metrics for Alertmanager Alerts In Grafana, the alert evaluation and alert delivery are combined. We're always used a metric named `grafana_alerting_alerts` to get a sense of what are the alerts that are currently firing (these come from the evaluation side) and opted to not map the alertmanager alerts metric directly. I think it's important that we make a disction between alerts that happen at evaluation vs alerts that are received for delivery by the internal Alertmanager as we have options to skip the delivery of these alerts to the internal alertmanager altogether.	2023-10-02 16:36:23 +01:00
gotjosh	59694fb2be	Alerting: Don't use a separate collection system for metrics (#75296 ) * Alerting: Don't use a separate collection system for metrics The state package had a metric collection system that ran every 15s updating the values of the metrics - there is a common pattern for this in the Prometheus ecosystem called "collectors". I have removed the behaviour of using a time-based interval to "set" the metrics in favour of a set of functions as the "value" that get called at scrape time.	2023-09-25 10:27:30 +01:00
Yuri Tseretyan	938e26b59f	Alerting: Add new metrics and tracings to state manager and scheduler (#71398 ) * add metrics and tracing to state manager * propagate tracer to state manager * add scheduler metrics * fix backtesting * add test for state metrics * remove StateUpdateCount * update docs * metrics can be null * add tracer to new tests	2023-08-16 09:04:18 +02:00
George Robinson	f085e99d3c	Alerting: Add matchers metrics to Alertmanager (#69855 )	2023-06-15 09:18:01 +01:00
Alexander Weaver	dd04757fc9	Alerting: Add "backend" label to state history writes metrics (#65395 ) * Add backend label to state history writes metrics * Update test expectations	2023-03-28 08:49:51 -05:00
Alexander Weaver	19d01dff91	Alerting: Expose Prometheus metrics for persisting state history (#63157 ) * Create historian metrics and dependency inject * Record counter for total number of state transitions logged * Track write failures * Track current number of active write goroutines * Record histogram of how long it takes to write history data * Don't copy the registerer * Adjust naming of write failures metric * Introduce WritesTotal to complement WritesFailedTotal * Measure TransitionsFailedTotal to complement TransitionsTotal * Rename all to state_history * Remove redundant Total suffix * Increment totals all the time, not just on success * Drop ActiveWriteGoroutines * Drop PersistDuration in favor of WriteDuration * Drop unused gauge * Make writes and writesFailed per org * Add metric indicating backend and a spot for future metadata * Drop _batch_ from names and update help * Add metric for bytes written * Better pairing of total + failure metric updates * Few tweaks to wording and naming * Record info metric during composition * Create fakeRequester and simple happy path test using it * Blocking test for the full historian and test for happy path metrics * Add tests for failure case metrics * Smoke test for full annotation persistence * Create test for metrics on annotation persistence, both happy and failing paths * Address linter complaints * More linter complaints * Remove unnecessary whitespace * Consistency improvements to help texts * Update tests to match new descs	2023-03-06 10:40:37 -06:00
gotjosh	5422f7cf56	Alerting: Add metrics for active receiver and integrations (#64050 ) * Alerting: Add metrics for active receiver and integrations Introduces metrics that allows us to track the number of configured receivers and integration in the Alertmanager for all orgs. As a bonus, I realised that the alert reception metrics where not being exported nor collected. This does that too.	2023-03-06 16:37:07 +00:00
Alexander Weaver	e77621649d	Alerting: Instrument outgoing state history requests using weaveworks/common (#63600 ) * Loki backend and client depend on a requester * Instrument all requests to loki using weaveworks TimedClient * Construct collector in metrics package	2023-02-23 17:52:02 -06:00
Alex Moreno	f60dc4441f	Alerting: Add status label to GroupRules metric (#63454 ) * Add status label to GroupRules metric * Add state (active and paused) label to GrouRules * Add active/paused metrics tests	2023-02-23 12:38:27 +01:00
Steve Simpson	4d1a2c3370	Alerting: Move `rule_groups_rules` metric from State to Scheduler. (#63144 ) The `rule_groups_rules` metric is currently defined and computed by `State`. It makes more sense for this metric to be computed off of the configured rule set, not based on the rule evaluation state. There could be an edge condition where a rule does not have a state yet, and so is uncounted. Additionally, we would like this metric (and others), to have a `rule_group` label, and this is much easier to achieve if the metric is produced from the `Scheduler` package.	2023-02-09 17:05:19 +01:00
Steve Simpson	c44e9f6b71	Alerting: Add metrics around notification delivery. (#62778 ) This change exposes more metrics from the embedded Alertmanager, which are valuable for troubleshooting Alertmanager operation particularly in HA setups. ``` grafana_alerting_notifications_total grafana_alerting_notifications_failed_total grafana_alerting_notification_requests_total grafana_alerting_notification_requests_failed_total grafana_alerting_notification_latency_seconds grafana_alerting_nflog_gc_duration_seconds grafana_alerting_nflog_snapshot_duration_seconds grafana_alerting_nflog_snapshot_size_bytes grafana_alerting_nflog_queries_total grafana_alerting_nflog_query_errors_total grafana_alerting_nflog_query_duration_seconds grafana_alerting_nflog_gossip_messages_propagated_total grafana_alerting_dispatcher_aggregation_groups grafana_alerting_dispatcher_alert_processing_duration_seconds ``` Note that `alertmanager_dispatcher_aggregation_group_limit_reached_total` is explicitly not exposed, as the group limit metrics are not enabled.	2023-02-02 14:44:20 +01:00
gotjosh	55e7cf1aed	Alerting: Introduce Metric Aggregation starting with Silences (#62512 ) * Alerting: Introduce Metric Aggregation starting with Silences --------- Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>	2023-01-31 19:54:38 +00:00
Serge Zaitsev	d6d4097567	Chore: Fix goimports grouping in alerting (#62424 ) * fix goimports * fix goimports order	2023-01-30 09:55:35 +01:00
gotjosh	3c616da83f	Alerting: Refactor metrics/ngalert.go into seperate files (#62362 ) * Alerting: Refactor metrics/ngalert.go into seperate files	2023-01-27 18:49:49 +00:00
idafurjes	6c5a573772	Chore: Move ReqContext to contexthandler service (#62102 ) * Chore: Move ReqContext to contexthandler service * Rename package to contextmodel * Generate ngalert files * Remove unused imports	2023-01-27 08:50:36 +01:00
Alexander Weaver	9977c7ea43	Alerting: Simplify scheduler configuration and remove dependency on Grafana-wide settings (#59735 ) * Make scheduler not depend directly on grafana-wide settings * Re-add missing interval	2022-12-02 16:02:07 -06:00
Alexander Weaver	bd6a5c900f	Alerting: Extract ticker into shared package (#55703 ) * Move ticker files to dedicated package with no changes * Fix package naming and resolve naming conflicts * Fix up all existing references to moved objects * Remove all alerting-specific references from shared util * Rename TickerMetrics to simply Metrics * Rename base ticker type to T and rename NewTicker to simply New	2022-09-26 12:35:33 -05:00
Ben Kochie	68691d7775	Convert some metrics to Histograms (#50420 ) Because Summary metrics can not be aggreated, convert them to histograms so that users with HA deployments can use these metrics. * Convert metrics registration to promauto. * Improve help text style. Signed-off-by: SuperQ <superq@gmail.com>	2022-06-15 13:19:43 +02:00
gotjosh	c59938b235	Alerting: Schedule Alert rules metric tracking (#50415 ) * Alerting: Schedule Alert rules metric tracking Change the record of metrics from one place to two as an attempt to have a semi-accurate record.	2022-06-08 18:37:33 +01:00
Yuriy Tseretyan	a89d4a5be7	Alerting: Scheduler to drop ticks if a rule's evaluation is too slow (#48885 ) * drop ticks if evaluation of a rule is too slow. * add metric schedule_rule_evaluations_missed_total	2022-06-08 12:50:44 -04:00
George Robinson	c83f84348c	Alerting: Fix database unavailable removes rules from scheduler (#49874 )	2022-06-07 16:20:06 +01:00
sh0rez	3ca3a59079	pkg/web: remove dependency injection (#49123 ) * pkg/web: store http.Handler internally * pkg/web: remove injection Removes any injection code from pkg/web. It already was no longer functional, as we already only injected into `http.Handler`, meaning we only inject ctx.Req and ctx.Resp. Any other types (Context, ReqContext) were already accessed using the http.Request.Context.Value() method. * : remove type mappings Removes any call to the previously removed TypeMapper, as those were non-functional already. pkg/web: remove Context.Invoke was no longer used outside of pkg/web and also no longer functional	2022-05-24 15:35:08 -04:00
Yuriy Tseretyan	75ba4e98c6	Alerting: Remove unused features from ticker + metric + tests (#47828 ) * remove not used code: - remove offset in ticket because it is not used - remove unused ticker and scheduler methods * use duration for interval * add metrics grafana_alerting_ticker_last_consumed_tick_timestamp_seconds, grafana_alerting_ticker_next_tick_timestamp_seconds, grafana_alerting_ticker_interval_seconds	2022-04-22 15:09:47 -04:00
Sofia Papagiannaki	54962c2f0c	Alerting: Rename Recipient path parameter to DatasourceID (#47949 )	2022-04-20 16:20:17 +03:00
George Robinson	5e2280ceee	Add metrics to ngalert scheduler (#44602 ) This pull request adds metrics to the ngalert scheduler so we can see how long it takes to evaluate a tick.	2022-01-31 16:56:43 +00:00
Serge Zaitsev	57fcfd578d	Chore: replace macaron with web package (#40136 ) * replace macaron with web package * add web.go	2021-10-11 14:30:59 +02:00
gotjosh	35e5bfce40	Alerting: Metrics should have the label `org` instead of `user` (#39353 ) An user within Grafana has a completely different meaning. Multi-tenancy is done via Organizations as a top-level concept.	2021-09-17 17:17:26 +01:00
gotjosh	7db97097c9	Alerting: Support Unified Alerting with Grafana HA (#37920 ) * Alerting: Support Unified Alerting in Grafana's HA mode.	2021-09-16 15:33:51 +01:00
Serge Zaitsev	063160aae2	Chore: pass url parameters through context.Context (#38826 ) * pass url parameters through context.Context * fix url param names without colon prefix * change context params to vars * replace url vars in tests using new api * rename vars to params * add some comments * rename seturlvars to seturlparams	2021-09-14 18:34:56 +02:00
gotjosh	a2f4344bf2	Alerting: Refactor & fix unified alerting metrics structure (#39151 ) * Alerting: Refactor & fix unified alerting metrics structure Fixes and refactors the metrics structure we have for the ngalert service. Now, each component has its own metric struct that includes the JUST the metrics it uses. Additionally, I have fixed the configuration metrics and added new metrics to determine if we have discovered and started all the necessary configurations of an instance. This allows us to alert on `grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0` to know whether an alertmanager instance did not start successfully.	2021-09-14 12:55:01 +01:00
Arve Knudsen	78596a6756	Migrate to Wire for dependency injection (#32289 ) Fixes #30144 Co-authored-by: dsotirakis <sotirakis.dim@gmail.com> Co-authored-by: Marcus Efraimsson <marcus.efraimsson@gmail.com> Co-authored-by: Ida Furjesova <ida.furjesova@grafana.com> Co-authored-by: Jack Westbrook <jack.westbrook@gmail.com> Co-authored-by: Will Browne <wbrowne@users.noreply.github.com> Co-authored-by: Leon Sorokin <leeoniya@gmail.com> Co-authored-by: Andrej Ocenas <mr.ocenas@gmail.com> Co-authored-by: spinillos <selenepinillos@gmail.com> Co-authored-by: Karl Persson <kalle.persson@grafana.com> Co-authored-by: Leonard Gram <leo@xlson.com>	2021-08-25 15:11:22 +02:00
David Parrott	7fbeefc090	Alerting: create wrapper for Alertmanager to enable org level isolation (#37320 ) Introduces org-level isolation for the Alertmanager and its components. Silences, Alerts and Contact points are not separated by org and are not shared between them. Co-authored with @davidmparrott and @papagian	2021-08-24 11:28:09 +01:00
Owen Diehl	8f350bc353	actually register metrics this time (#34444 )	2021-05-19 22:09:12 +02:00
Owen Diehl	c48c701791	adds missing metric name (#34307 )	2021-05-18 17:24:38 -04:00
Owen Diehl	1367f7171e	Alerting/ruler metrics (#34144 ) * adds active configurations metric * rule evaluation metrics * ruler metrics * pr feedback	2021-05-14 16:13:44 -04:00
Owen Diehl	5e48b54549	Alerting/metrics (#33547 ) * moves alerting metrics to their own pkg * adds grafana_alerting_alerts (by state) metric * alerts_received_{total,invalid} * embed alertmanager alerting struct in ng metrics & remove duplicated notification metrics (already embed alertmanager notifier metrics) * use silence metrics from alertmanager lib * fix - manager has metrics * updates ngalert tests * comment lint Signed-off-by: Owen Diehl <ow.diehl@gmail.com> * cleaner prom registry code * removes ngalert global metrics * new registry use in all tests * ngalert metrics impl service, hack testinfra code to prevent duplicate metric registrations * nilmetrics unexported	2021-04-30 12:28:06 -04:00

46 Commits