Commit Graph

1608 Commits

Author SHA1 Message Date
Julien Duchesne
087df1d8e5
OpenAPI: Fix alerting DeleteMuteTiming errors (#91109)
The `GenericPublicError` is not what is actually returned by the API. Using `PublicError` describes the API correctly
2024-08-26 10:04:36 -04:00
Julien Duchesne
0075abe383
OpenAPI: Fix ProvisionedAlertRule.for type (#90841)
This override has been in the client for a while now: 9ef6983612/scripts/pull-schema.sh (L34-L39)

The API expects a string here and transforms it to a duration internally
2024-08-26 10:04:15 -04:00
Tito Lins
4a124469fa
Check is config is default by comparing hashes (#92296) 2024-08-23 11:22:06 +02:00
Christian Inkster
922babb157
Alerting: Add mutex to Redis HA subs (#89870) 2024-08-22 16:01:33 +01:00
Alexander Akhmetov
832bb01f36
Alerting: Add MQTT notifications receiver (#91487)
* Alerting: Add MQTT notifications receiver
* Update alerting to 9daa6239cc41dc42bff0e916c8d0d27766caa8b9 (main)
---------

Co-authored-by: Jack Baldry <jack.baldry@grafana.com>
Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
2024-08-22 16:47:48 +02:00
William Wernert
dfbddd8262
Alerting: Fix recording rule export (#91405)
* Fix HCL export

* Update rule export struct to support new optional fields

* Omit `for` field in export API if empty
2024-08-22 09:04:21 -04:00
Dave Henderson
df3d8915ba
Chore: Bump Go to 1.23.0 (#92105)
* chore: Bump Go to 1.23.0

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* update swagger files

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* chore: update .bingo/README.md formatting to satisfy prettier

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* chore(lint): Fix new lint errors found by golangci-lint 1.60.1 and Go 1.23

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* keep golden file

* update openapi

* add name to expected output

* chore(lint): rearrange imports to a sensible order

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

---------

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
2024-08-21 11:40:42 -04:00
Fayzal Ghantiwala
8d725a641c
Alerting: Integration test for testing template via remote alertmanager (#92147)
* Add integration test for testing template

* Update drone signature
2024-08-21 13:06:01 +01:00
Yuri Tseretyan
d27c3822f2
Alerting: Add Create and Update methods to Template service (#91981)
* rename SetTemplate to UpsertTemplate

* Introduce Create\Update methods

* update api endpoint to use GetTemplate
2024-08-20 15:23:01 -04:00
Alexander Akhmetov
d32e1e009b
Alerting: Update prometheus/client_golang to v1.20 (#92070)
Update prometheus/client_golang to v1.20
2024-08-20 11:26:06 +02:00
Alexander Weaver
ac5ebe6e4d
Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
Yuri Tseretyan
135f6571a9
Alerting: Update Time Interval service to support renaming of resources (#91856)
* add RenameTimeIntervalInNotificationSettings to storage
* update dependencies when the time interval is renamed
---------

Co-authored-by: William Wernert <william.wernert@grafana.com>
2024-08-16 20:55:03 +03:00
Fayzal Ghantiwala
e321dbb690
Alerting: Use remote Alertmanager to test templates and receivers when enabled (#91570)
* Initial impl

* Add code to test templates and receivers

* Fix linter

* Fix forked am tests

* Update mimir client

* Remove trailing whitespace

* re-trigger CI
2024-08-15 16:56:14 +01:00
Santiago
f852bf684a
Alerting: Fix duplicated silences in remote primary mode bug (#91902)
* Alerting: Fix duplicated silences in remote primary mode bug

* test that a new silence id returned by calling CreateSilence() on the internal Alertmanager is ignored
2024-08-15 17:14:55 +02:00
Yuri Tseretyan
5834981f86
Alerting: Add GetTemplate to template service and update tests (#91854)
* add GetTemplate to template service
* refactor GetTemplates to fetch all provenances at once
* refactor tests
2024-08-15 09:17:48 -04:00
Alexander Akhmetov
c7fdf8ce70
Alerting: Add error to annotations on data source errors (#91594) 2024-08-15 12:34:50 +02:00
Alexander Weaver
34ab5fe1f3
Alerting: Restart rule routines if the type changes (#90867)
* Restart when types change

* Wire up test hooks correctly

* testing
2024-08-14 14:57:47 -05:00
Yuri Tseretyan
7b919e3277
Alerting: MuteTimeService to support TimeInterval and MuteTimeInterval fields in Alertmanager config (#91500)
* update notification policy provisioing to consider time intervals
* change names of intervals to be in order
2024-08-13 11:37:21 -04:00
Yuri Tseretyan
db33df5041
Alerting: Template API to return errutil errors (#91821) 2024-08-13 10:59:19 -04:00
Alexander Akhmetov
149f02aebe
Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#88289)
* Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#62361)

* Alerting: Delete rule group metrics when the rule group is deleted

This commit addresses the issue where the GroupRules metric (a GaugeVec)
keeps its value and is not deleted when an alert rule is removed from the rule registry.
Previously, when an alert rule with orgID=1 was active, the metric was:

  grafana_alerting_rule_group_rules{org="1",state="active"} 1

However, after deleting this rule, subsequent calls to updateRulesMetrics
did not update the gauge value, causing the metric to incorrectly remain at 1.

The fix ensures that when updateRulesMetrics is called it
also deletes the group rule metrics with the corresponding label values if needed.
2024-08-13 13:27:23 +02:00
Alexander Akhmetov
b2eeb0dd6e
Alerting: update rule versions on folder move (#88376)
* Alerting: update rule versions on folder move (#88361)
* Add tracing to folder.Move and folder.Update
2024-08-13 12:26:26 +02:00
Karl Persson
8bcd9c2594
Identity: Remove typed id (#91801)
* Refactor identity struct to store type in separate field

* Update ResolveIdentity to take string representation of typedID

* Add IsIdentityType to requester interface

* Use IsIdentityType from interface

* Remove usage of TypedID

* Remote typedID struct

* fix GetInternalID
2024-08-13 10:18:28 +02:00
Konrad Lalik
b67bcdb9b8
Alerting: Handle namespace and group query string params in Ruler API (#91533)
* Handle namespace and group query string params in Ruler API

* Use the new namespace and group query params when slashes in names

* Add validation, add group handling in GMA Api

* Move constants

* Use checkForPathSeparator function

* Fix linter issue
2024-08-13 08:31:07 +02:00
Fayzal Ghantiwala
25dbb32cea
Alerting: Vendor in latest grafana/alerting package (#91786)
* temp

* vendor

* Remove dead code

* Vendoring
2024-08-12 15:37:15 +01:00
Santiago
5487ea444a
Alerting: Update remote Alertmanager config marshalling in test (#91791) 2024-08-12 13:45:22 +02:00
Yuri Tseretyan
1108a00668
Alerting: Support for optimistic concurrency in priovisioning Tempate API (#91195)
* support optimistic concurrency in template service

* update request handler to get version from query parameter

* return not found if a new template is set with version

* update PUT api to set version

* update documentation + for mute timings

---------

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
2024-08-09 11:40:07 -04:00
Karl Persson
bcfb66b416
Identity: remove GetTypedID (#91745) 2024-08-09 18:20:24 +03:00
Kristin Laemmert
a117b090cf
chore: preallocate slices where we have a good idea of requirements (#91596)
* chore: preallocate slices where we have a good idea of requirements

* pr feedback
2024-08-07 08:34:52 -04:00
Yuri Tseretyan
ee78bb653f
Alerting: Log rule evaluation error in scheduler (#91585) 2024-08-06 19:27:02 +03:00
Matthew Jacobson
53cfdf0ef8
Alerting: Remove option to return settings from api/v1/receivers and restrict provisioning action access (#90861)
* Remove provisioning action access to v1/receivers api

* Separate ListOnly functionality to its own method without decryption
2024-08-05 11:49:23 -04:00
AvivGuiser
93aa5a56ad
Alerting: Use stable identifier of a group,contact point,mute timing when export to HCL (#90917)
---------

Signed-off-by: Aviv Guiser <avivguiser@gmail.com>
2024-08-05 09:56:17 -04:00
Matthew Jacobson
a397bca02e
Alerting: Fix panic with nil annotations & Nodata=alerting/ok/keep (#91506) 2024-08-02 22:15:57 +03:00
Alexander Weaver
72ecde5045
Alerting: Make orgID a direct arg of writer interface (#91422)
make orgID a direct arg of writer interface
2024-08-02 09:37:28 -05:00
Alexander Akhmetov
3952f627eb
Alerting: Parse secret fields case-insensitively when creating or updating a contact point (#90968)
* Alerting: Handle case-insensitive secret fields in contact point settings
2024-08-01 19:03:47 +02:00
William Wernert
a1ee84f757
Alerting: Remove duplicate tracing middleware from prom writer (#91353)
Remove duplicate tracing middleware from prom writer
2024-08-01 11:57:14 -04:00
Ieva
2e2ddc5c42
Folders: Allow folder editors and admins to create subfolders without any additional permissions (#91215)
* separate permissions for root level folder creation and subfolder creation

* fix tests

* fix tests

* fix tests

* frontend fix

* Update pkg/api/accesscontrol.go

Co-authored-by: Eric Leijonmarck <eric.leijonmarck@gmail.com>

* fix frontend when action sets are disabled

---------

Co-authored-by: Eric Leijonmarck <eric.leijonmarck@gmail.com>
2024-08-01 18:20:38 +03:00
Yuri Tseretyan
537f1fb857
Alerting: Fix persisting result fingerprint that is used by recovery threshold (#91224)
* fix persister to save result fingerprint

* revert change

* fmt
2024-07-30 18:07:13 -04:00
Nihal
9ad9b4989b
Alerting: Include a list of ref_Id and aggregated datasource UIDs to alerts when state reason is NoData (#88819)
* include a list of ref_Id and datasource UID to alerts when state reason is NoData. 

---------

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>
2024-07-30 12:55:59 -04:00
Alexander Weaver
4c71cadd5f
Alerting: Detach condition validator from condition evaluator (#91150)
* Detach validator from evaluator

* Drop unnecessary interface and type
2024-07-30 10:55:37 -05:00
github-actions[bot]
66b1a219f4
Alerting: Update Swagger spec (#79850)
* chore: update alerting swagger spec
* update public swagger

---------

Co-authored-by: rwwiv <rwwiv@users.noreply.github.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-07-30 18:17:23 +03:00
Yuri Tseretyan
2023821100
Alerting: update Loki backend of state history to batch requests by folder (#89865)
* refactor `selectorString` and remove Selector struct

* move code from selector string to BuildLogQuery

* batch requests by folder UID

* update historian annotation store to handle multiple queries

* sort folder uids to make consistent queries

* add logs to loki http

* log batch size but not content. content is logged by the client
2024-07-30 11:07:10 -04:00
Yuri Tseretyan
8323b688c6
Alerting: Improve logging in scheduler and states (#91003)
* handle metadata map nil

* remove double context

* clean up logging in scheduler

* do not reuse loggers from previous ticks

* log the dropped tick

* log tick instead of ticknum

* replace with processing tick logs

* log sending notifications

* update logging in persister to fetch context

* logs to historian

moved them upstream to be able to log when store is overridden
2024-07-29 16:01:48 -04:00
Matthew Jacobson
62f67e38b8
Alerting: Implement receiver auth service (#90857) 2024-07-29 15:49:10 -04:00
Yuri Tseretyan
34dbfefc86
Alerting: Template service to check for provenance status of update\delete (#90688) 2024-07-29 14:10:03 -04:00
Matthew Jacobson
a1f0b599a7
Alerting: Refactor receiver_svc and provisioning config store into legacy_storage package (#90856)
* Add more receivers api tests

* Move provisioning config store to new legacy_storage package
2024-07-26 17:45:33 -04:00
Yuri Tseretyan
6b0d20c96a
Alerting: time interval service to support addressing intervals by Base64 encoded name (#90563)
* rename to getMuteTimingByName

* add UID to api model of MuteTiming

* update GetMuteTiming to search by UID

* update UpdateMuteTiming to support search by UID

* update DeleteMuteTiming to support uid

* make sure UID is populated

* update usages

* use base64 url-safe, no padding encoding for UID
2024-07-26 16:43:40 -04:00
Alexander Weaver
b7220b532e
Alerting: Fix bug where patching recording rule queries wouldn't apply (#91011)
* the fix

* tests
2024-07-26 11:02:54 -05:00
Ryan McKinley
be7b1ce2df
Chore: Replace appcontext.User(ctx) with identity.GetRequester(ctx) (#91030) 2024-07-26 16:39:23 +03:00
Ryan McKinley
9db3bc926e
Identity: Rename "namespace" to "type" in the requester interface (#90567) 2024-07-25 12:52:14 +03:00
Sven Grossmann
94dd4105e2
Loki: Allow alert headers to be forwarded (#90890)
* Loki: Allow alert headers to be forwarded

* Loki: fix tests

---------

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-07-25 07:39:34 +02:00
Santiago
b79b38f02c
Alertmanager: Support limits for silences (#90826)
* Alertmanager: support limits for silences

* update grafana/alerting to latest main
2024-07-24 14:22:29 +02:00
William Wernert
45f298120e
Alerting: Return error when writing recorded metrics instead of default writing NaN (#90743)
* Return error instead of default writing NaN
2024-07-22 15:47:02 -04:00
AvivGuiser
96c3e9c550
Alerting: Use stable identifier of a group when export to HCL (#90196)
* change the rule-group to be hashed when exporting to HCL

Signed-off-by: Aviv Guiser <avivguiser@gmail.com>

---------

Signed-off-by: Aviv Guiser <avivguiser@gmail.com>
2024-07-19 18:13:26 +03:00
Alexander Weaver
418b077c59
Alerting: Integration testing for recording rules including writes (#90390)
* Add success case and tests for writer using metrics

* Use testable version of clock

* Assert a specific series was written

* Fix linter

* Fix manually constructed writer
2024-07-18 17:14:49 -05:00
Alexander Weaver
0e269db8a9
Alerting: Expose recordingWriter on ngalert (#90573)
Expose recordingWriter on ngalert
2024-07-18 13:24:06 -05:00
Yuri Tseretyan
09e10ae9e0
Alerting: Update State history API Open API documentation (#89795) 2024-07-18 10:37:05 -04:00
Alexander Weaver
88ed77e7e8
Alerting: More graceful handling of NoData in recording rules (#90312)
* Handle NoData as its own case

* Debug

* Scalars parseable by CollectionReader

* fix linter

* Orgit add pkg/*git add pkg/* not and
2024-07-17 15:24:03 -05:00
Yuri Tseretyan
c3b9c9b239
Alerting: Send information about alert rule to data source in headers (#90344)
* add support of metadata to condition and adding it to request headers
* support for additional metadata when condition is built
* add additionall context to conditions: source and folder title
* add version
* use percent-encoding for header values
2024-07-17 22:55:12 +03:00
Yuri Tseretyan
970cafa20f
Alerting: Time interval Delete API to check for usages in alert rules (#90500)
* Check if a time interval is used in alert rules before deleting it
* Add time interval to parameters of ListAlertRulesQuery and ListNotificationSettings of DbStore

== Refacorings == 
* refactor isMuteTimeInUse to accept a single route
* update getMuteTiming to not return err
* update delete to get the mute timing from config first
2024-07-17 10:53:54 -04:00
Matthew Jacobson
b7f422b68d
Alerting: Receiver API Get+List+Delete (#90384) 2024-07-16 10:02:16 -04:00
Yuri Tseretyan
9c05b30489
Chore: Add more logs and tracing to hysteresis flows (#90369) 2024-07-15 13:38:20 -04:00
Santiago
e097ffc771
Alerting: Update grafana/alerting dependency (#90365)
* update grafana/alerting to latest main
* update alertmanager to  66ec17e3aa45
2024-07-12 14:05:17 -04:00
Matthew Jacobson
ba800692c6
Alerting: Persist AlertInstance ResolvedAt & LastSentAt (#89135)
* Alerting: Persist AlertInstance ResolvedAt & LastSentAt

* Fix test

* Modify existing tests

* Fix merge conflicts from nullable LastSentAt & ResolvedAt
2024-07-12 12:26:58 -04:00
Matthew Jacobson
b7767c79e7
Alerting: Fix contact point export 500 error and notifications/receivers missing settings (#90342)
* Regression test

* Fix 500 error when exporting redacted receivers

* Fix tests to check permissions
2024-07-12 11:42:22 -04:00
Kristin Laemmert
8a6107cd35
DashboardStore: Use ReplDB and get dashboard quotas from the ReadReplica (#90235)
* Use ReplDB in dashboard store and update all fixtures - no other changes

* just moving dashboard counts for now

* find the missing test fixture
2024-07-12 10:47:49 -04:00
Alexander Weaver
111ebd4fb2
Alerting: Create integration testing infra for recording rules (#90306)
* Create some integration testing infra for RRs

* whoops

* Require no error in responding

* fix linter

* Panic, no need to pass testing around

* Extend status test
2024-07-11 14:59:52 -05:00
Alexander Weaver
ab32183e18
Alerting: Track recording rule health and last eval info ephemerally (#90247)
* Track health and last eval info

* Read method for status

* Minor tests
2024-07-11 14:05:09 -05:00
Santiago
3bb861b9f0
Alerting: Remove empty/namespace labels when sending alerts to the remote Alertmanager (#90284)
* Alerting: Remove empty/namespace labels when sending alerts to the remote Alertmanager

* update tests

* fix typo in comment
2024-07-11 15:20:12 +02:00
Yuri Tseretyan
5ae5fa3a7a
Alerting: Support field selectors in time interval API (#90022)
* fix kind of TimeInterval
* register custom fields for selectors
* support field selectors in legacy storage
* support selectors in storage

===== Misc
* refactor conversions to build in one place
* hide implementation of provenance status behind accessors to use the key in selectors
* fix provenance error
2024-07-08 22:45:30 +03:00
Alexander Weaver
3b6a8775bb
Alerting: Fix stale values associated with states that have gone to NoData, unify values calculation (#89807)
* Unify values

* Fix with latest changes on main

* Fix up NaN test

* Keep refIDs with -1 as value

* Test that refIDs are preserved on Normal to Error transition

* Alerting to err test too

* Add a blurb to docs about this behavior
2024-07-08 12:30:23 -05:00
Steve Simpson
e9fd191065
Alerting: Fix some status codes returned from provisioning API. (#90117)
The contact point deletion API was returning 500 when it should have been
returning a 4xx error, when the contact point is in use:

- When in use by a notificiation policy, we were missing
  the `.Errorf("")` to convert `errutil.Base` into `errutil.Error`.
- When in use by an alert rule, an regular error was returned.
2024-07-05 19:06:37 +02:00
Alexander Zobnin
87d86e81ce
Zanzana: Evaluate permissions alongside with RBAC engine (#90064)
* Zanzana: Evaluate permissions if feature flag enabled

* Fix tests

* adjust logs

* fix spelling

* remove unused

* only evaluate implemented resources

* refactor
2024-07-05 11:31:23 +02:00
Yuri Tseretyan
411bab6d44
Alerting: Lower severity of logs about duplicates to debug (#89971)
lower severity of logs about duplicates to debug
2024-07-03 16:46:28 -04:00
Yuri Tseretyan
c3b5cabb14
Alerting: Refactor scheduler's rule evaluator to store rule key (#89925) 2024-07-01 16:43:23 -04:00
Yuri Tseretyan
655e477c20
Alerting: Fix flaky test in scheduler's tests (#89923) 2024-07-01 13:31:03 -04:00
Santiago
fce03cd724
Alerting: Send static headers to the remote Alertmanager (#89846) 2024-07-01 17:48:40 +02:00
Yuri Tseretyan
559738ce6a
Alerting: Fix flaky test in historian (#89913) 2024-07-01 16:59:06 +03:00
Gabriel MABILLE
71d31397e5
Fix flaky tests (#89910) 2024-07-01 14:39:51 +02:00
Alexander Akhmetov
68691c9386
Alerting: Add setting for maximum allowed rule evaluation results (#89468)
* Alerting: Add setting for maximum allowed rule evaluation results

Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
2024-06-27 09:45:15 +02:00
Yuri Tseretyan
06d5850396
Alerting: Update alerting state history API to authorize access using RBAC (#89579)
* add method CanReadAllRules to rule authorization service

* add alias type Namespace for Folder in ngalert's models package. It implements the Namespacer interface that is used by authz logic

* update state history's backends to authorize access to rules.
* update Loki to add folders UIDs to query. 
    * Update BuildLogQuery to drop filter by folders if it's too long and fall back to in-memory filtering.
2024-06-26 10:25:37 -04:00
Santiago
fe1309dd96
Alerting: Send external URL to the remote Alertmanager (#89701)
* Alerting: Send external URL to the remote Alertmanager

* test that the URL is sent to the remote Alertmanager

* AppURL -> ExternalURL
2024-06-26 14:02:02 +02:00
Nihal
b1ce7e8a83
Alerting: Reduce cyclomatic complexity of PrepareRuleGroupStatuses (#89649)
* reduce cyclomatic complexity of PrepareRuleGroupStatuses. see https://github.com/grafana/grafana/issues/88670

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

* reduce cyclomatic complexity for PrepareRuleGroupStatuses in pkg/services/ngalert/api/api_prometheus.go

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

* reduce cyclomatic complexity for PrepareRuleGroupStatuses in pkg/services/ngalert/api/api_prometheus.go

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

---------

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>
2024-06-26 10:21:37 +02:00
Matthew Jacobson
83d05ea777
Alerting: Fix broken state tests (#89712) 2024-06-25 14:19:46 -04:00
Matthew Jacobson
47c9259d75
Alerting: Ensure we update State.LastSentAt before persisting (#89427) 2024-06-25 13:01:26 -04:00
William Wernert
fcfa89f864
Alerting: Implement Prometheus remote write for recording rules (#89189)
* Fix timestamp recorded by rule

* Implement prometheus remote write

* Create http client instead of transport

* Address PR comments

* Remove status code label
2024-06-25 17:23:42 +03:00
Yuri Tseretyan
4a5aab54a5
Alerting: Add max limit for Loki query size in state history API (#89646)
* add setting for query limit

* update BuildLogQuery to return error if limit is exceeded

* move tests for BuildLogQuery to separate suite
2024-06-25 09:20:38 -04:00
Alexander Akhmetov
2035814059
Alerting: fix updating error in the alert rule state during error to error transitions and restarts (#89557)
Alerting: fix preserving errors in the alert rule state during error to error transitions

Alert state transition from one error to another did not update state.Error correctly.
The error in state.Error remained as the initial error encountered.
This led to another issue, where after a Grafana restart, the error was lost because
the state of the alert rule did not change, but the Error is not preserved in the database
between restarts.

This could happen if the expression service returned an error or the alert routine panicked
during querying.
2024-06-25 09:42:00 +02:00
Yuri Tseretyan
b075926202
Alerting: Time Intervals API (#88201)
* expose ngalert API to public
* add delete action to time-intervals
* introduce time-interval model generated by app-platform-sdk from CUE model the fields of the model are chosen to be compatible with the current model
* implement api server
* add feature flag alertingApiServer
---- Test Infra
* update helper to support creating custom users with enterprise permissions
* add generator for Interval model
2024-06-20 16:52:03 -04:00
Matthew Jacobson
3228b64fe6
Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
Yuri Tseretyan
92f10b73a8
Alerting: Move interface Namespaced from accesscontrol to models package (#89439)
move Namespaced interface from accesscontrol to models
2024-06-19 16:18:33 -04:00
Steve Simpson
8eabef1f91
Alerting: Update remote alertmanager to use extended receivers API. (#89253)
* Alerting: Update remote alertmanager to use extended receivers API.

* Update integration test and Mimir image

* Update Mimir image in more places
2024-06-19 12:40:22 +03:00
Steve Simpson
43a246f431
Alerting: Improve performance of /api/prometheus for large numbers of alerts. (#89268)
* Alerting: Optimize sorting alert instances.

* Also change other Labels fields for consistency
2024-06-17 12:25:47 +02:00
Alexander Weaver
8491e02caf
Alerting: Instrument outbound requests for Loki Historian and Remote Alertmanager with tracing (#89185)
* Add TracedClient

* Handle errors and status codes

* Wire up tracing to normal ASH and loki annotation mapping

* Add tracing to remote alertmanager

* one more spot

* and not or

* More consistency with other grafana traces, lower cardinality name
2024-06-14 13:24:12 -05:00
Dave Henderson
6262c56132
chore(perf): Pre-allocate where possible (enable prealloc linter) (#88952)
* chore(perf): Pre-allocate where possible (enable prealloc linter)

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* fix TestAlertManagers_buildRedactedAMs

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* prealloc a slice that appeared after rebase

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

---------

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
2024-06-14 14:16:36 -04:00
Steve Simpson
dd3c3b5857
Alerting: Update grafana/alerting. (#88914) 2024-06-14 09:19:04 +02:00
Ryan McKinley
99d8025829
Chore: Move identity and errutil to apimachinery module (#89116) 2024-06-13 07:11:35 +03:00
William Wernert
c62cc25513
Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
Santiago
b7120c5a30
Alerting: Fix missing argument in call to createRemoteAlertmanager() (#89101) 2024-06-12 11:57:22 +03:00
Santiago
12d5251c12
Alerting: Alertmanager configuration sync loop (#88822)
* make the config sync happen on each call to ApplyConfig(), fix tests

* send autogen config

* add fake autogen function for tests

* update stale comments, tidy things up, make linter happy

* add auto-gen routes only if the feature toggle is enabled

* remove unnecessary fake autogen function

* throttle configuration syncs

* restore pkg/services/store/entity/sqlstash/sql_storage_server.go

* test sync loop in ApplyConfig, skip invalid autogen routes

* restore conf/defaults.ini

* restore conf/defaults.ini

* avoid skipping invalid auto-gen routes in SaveAndApplyConfig

* test that autogenFn is called and its errors are returned

* add debug message about the sync interval not having elapsed

* collapse two log lines into one
2024-06-12 10:13:34 +02:00
Jacob Valdemar
eb76ea47a0
Alerting: Add ha_reconnect_timeout configuration option (#88823)
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2024-06-11 13:25:48 -04:00