Alerting: Refactor & fix unified alerting metrics structure (#39151)

* Alerting: Refactor & fix unified alerting metrics structure

Fixes and refactors the metrics structure we have for the ngalert service. Now, each component has its own metric struct that includes just the metrics it uses. Additionally, I have fixed the configuration metrics and added new metrics to determine if we have discovered and started all the necessary configurations of an instance.

This allows us to alert on `grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0` to know whether an alertmanager instance did not start successfully.
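
As a rough illustration of the alert condition in the commit message, the sketch below evaluates the same comparison in plain Go. The `configSnapshot` type and `shouldAlert` function are hypothetical names introduced only for this example; they are not part of the ngalert code, and a real deployment would express the condition as a Prometheus alerting rule rather than in Go.

```go
package main

import "fmt"

// configSnapshot is a hypothetical stand-in for one scrape of the two
// gauges named in the commit message. It is not a type from ngalert.
type configSnapshot struct {
	Discovered float64 // value of grafana_alerting_discovered_configurations
	Active     float64 // value of grafana_alerting_active_configurations
}

// shouldAlert mirrors the expression
//   grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0
// It reports true when the two gauges disagree, i.e. when some discovered
// configuration has not been started.
func shouldAlert(s configSnapshot) bool {
	return s.Discovered-s.Active != 0
}

func main() {
	healthy := configSnapshot{Discovered: 3, Active: 3}
	degraded := configSnapshot{Discovered: 3, Active: 2}

	fmt.Println(shouldAlert(healthy))  // prints "false"
	fmt.Println(shouldAlert(degraded)) // prints "true"
}
```

Comparing with `!= 0` rather than `> 0` means the condition also holds if the active count ever exceeds the discovered count, which would likewise indicate inconsistent state.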
2025-02-25 18:55:37 -06:00 · 2021-09-14 12:55:01 +01:00
parent 1edd415ddf
commit a2f4344bf2
21 changed files with 243 additions and 119 deletions
--- a/pkg/services/ngalert/sender/sender.go
+++ b/pkg/services/ngalert/sender/sender.go
@@ -41,7 +41,7 @@ type Sender struct {
 	sdManager *discovery.Manager
 }

-func New(metrics *metrics.Metrics) (*Sender, error) {
+func New(_ *metrics.Scheduler) (*Sender, error) {
 	l := log.New("sender")
 	sdCtx, sdCancel := context.WithCancel(context.Background())
 	s := &Sender{
@@ -51,6 +51,8 @@ func New(metrics *metrics.Metrics) (*Sender, error) {
 	}

 	s.manager = notifier.NewManager(
+		// Injecting a new registry here means these metrics are not exported.
+		// Once we fix the individual Alertmanager metrics we should fix this scenario too.
 		&notifier.Options{QueueCapacity: defaultMaxQueueCapacity, Registerer: prometheus.NewRegistry()},
 		s.gokitLogger,
 	)