* chore: Bump Go to 1.23.0
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
* update swagger files
* chore: update .bingo/README.md formatting to satisfy prettier
* chore(lint): Fix new lint errors found by golangci-lint 1.60.1 and Go 1.23
* keep golden file
* update openapi
* add name to expected output
* chore(lint): rearrange imports to a sensible order
---------
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
* add RenameTimeIntervalInNotificationSettings to storage
* update dependencies when the time interval is renamed
---------
Co-authored-by: William Wernert <william.wernert@grafana.com>
* Alerting: Fix duplicated silences in remote primary mode bug
* test that a new silence id returned by calling CreateSilence() on the internal Alertmanager is ignored
* Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#62361)
* Alerting: Delete rule group metrics when the rule group is deleted
This commit addresses the issue where the GroupRules metric (a GaugeVec)
keeps its value and is not deleted when an alert rule is removed from the rule registry.
Previously, when an alert rule with orgID=1 was active, the metric was:
grafana_alerting_rule_group_rules{org="1",state="active"} 1
However, after deleting this rule, subsequent calls to updateRulesMetrics
did not update the gauge value, causing the metric to incorrectly remain at 1.
The fix ensures that when updateRulesMetrics is called it
also deletes the group rule metrics with the corresponding label values if needed.
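The stale-gauge behavior described above can be illustrated with a minimal sketch. It uses a plain map as a stand-in for a labelled gauge vector; the `labels` and `gauge` types and this `updateRulesMetrics` signature are assumptions made for illustration, not the actual Grafana code.

```go
package main

import "fmt"

// labels is a hypothetical label set identifying one gauge series.
type labels struct {
	org   string
	state string
}

// gauge is a map-based stand-in for a labelled gauge vector.
type gauge map[labels]float64

// updateRulesMetrics sets the gauge from the current counts and deletes
// every series whose label values no longer appear in the counts.
func updateRulesMetrics(g gauge, counts map[labels]int) {
	for l := range g {
		if _, ok := counts[l]; !ok {
			delete(g, l) // without this step the series keeps its last value
		}
	}
	for l, n := range counts {
		g[l] = float64(n)
	}
}

func main() {
	g := gauge{}
	active := labels{org: "1", state: "active"}

	updateRulesMetrics(g, map[labels]int{active: 1})
	fmt.Println(len(g)) // one series while the rule exists

	updateRulesMetrics(g, map[labels]int{})
	fmt.Println(len(g)) // no series after the rule is deleted
}
```

With the Prometheus Go client, the deletion step would correspond to calling `DeleteLabelValues` on the `GaugeVec` with the label values of the removed group.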
* Refactor identity struct to store type in separate field
* Update ResolveIdentity to take string representation of typedID
* Add IsIdentityType to requester interface
* Use IsIdentityType from interface
* Remove usage of TypedID
* Remove typedID struct
* fix GetInternalID
* Handle namespace and group query string params in Ruler API
* Use the new namespace and group query params when names contain slashes
* Add validation, add group handling in GMA Api
* Move constants
* Use checkForPathSeparator function
* Fix linter issue
* support optimistic concurrency in template service
* update request handler to get version from query parameter
* return not found if a new template is set with version
* update PUT api to set version
* update documentation + for mute timings
---------
Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
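As a rough illustration of the optimistic concurrency described in this change, the sketch below rejects a write when the caller's version does not match the stored one, and returns not found when a version is supplied for a template that does not exist yet. The `store` type, the error values, and the integer-based version scheme are all hypothetical; the real service may derive versions differently.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

var (
	errNotFound = errors.New("template not found")
	errConflict = errors.New("template version conflict")
)

// template is a hypothetical stored template with a revision counter.
type template struct {
	content string
	rev     int
}

// version returns the opaque version string handed to clients.
func (t template) version() string { return strconv.Itoa(t.rev) }

// store is an in-memory stand-in for the template storage.
type store map[string]template

// put creates or updates a template. An empty version skips the check;
// a non-empty version must match the stored one.
func (s store) put(name, content, version string) (string, error) {
	cur, exists := s[name]
	switch {
	case !exists && version != "":
		return "", errNotFound // a new template cannot have a version yet
	case exists && version != "" && version != cur.version():
		return "", errConflict // someone else updated it in the meantime
	}
	next := template{content: content, rev: cur.rev + 1}
	s[name] = next
	return next.version(), nil
}

func main() {
	s := store{}
	v1, _ := s.put("greeting", "hello", "")
	_, err := s.put("greeting", "stale write", "0")
	fmt.Println(v1, err)
}
```

In an HTTP handler, the `version` argument would come from the query parameter mentioned above.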
* refactor `selectorString` and remove Selector struct
* move code from selector string to BuildLogQuery
* batch requests by folder UID
* update historian annotation store to handle multiple queries
* sort folder uids to make consistent queries
* add logs to loki http
* log batch size but not content. content is logged by the client
* handle metadata map nil
* remove double context
* clean up logging in scheduler
* do not reuse loggers from previous ticks
* log the dropped tick
* log tick instead of ticknum
* replace with processing tick logs
* log sending notifications
* update logging in persister to fetch context
* logs to historian
moved them upstream to be able to log when store is overridden
* rename to getMuteTimingByName
* add UID to api model of MuteTiming
* update GetMuteTiming to search by UID
* update UpdateMuteTiming to support search by UID
* update DeleteMuteTiming to support uid
* make sure UID is populated
* update usages
* use base64 url-safe, no padding encoding for UID
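For reference, URL-safe base64 without padding is available in the Go standard library as `base64.RawURLEncoding`. The sketch below assumes the UID is derived from the mute timing name; that derivation and the helper names `nameToUID` and `uidToName` are assumptions for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// nameToUID encodes a name with the URL-safe alphabet ('-' and '_' instead
// of '+' and '/') and without '=' padding, so it can be used in a URL path.
func nameToUID(name string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(name))
}

// uidToName reverses nameToUID.
func uidToName(uid string) (string, error) {
	b, err := base64.RawURLEncoding.DecodeString(uid)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	uid := nameToUID("weekends/nights")
	name, err := uidToName(uid)
	fmt.Println(uid, name, err)
}
```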
* change the rule-group to be hashed when exporting to HCL
Signed-off-by: Aviv Guiser <avivguiser@gmail.com>
---------
Signed-off-by: Aviv Guiser <avivguiser@gmail.com>
* Add success case and tests for writer using metrics
* Use testable version of clock
* Assert a specific series was written
* Fix linter
* Fix manually constructed writer
* add support of metadata to condition and adding it to request headers
* support for additional metadata when condition is built
* add additional context to conditions: source and folder title
* add version
* use percent-encoding for header values
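Percent-encoding keeps header values within printable ASCII even when a folder title contains spaces, slashes, or non-ASCII characters. A minimal sketch using the standard library follows; the `X-Example-` header prefix and the helper names are made up for illustration and are not the headers Grafana sends.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// metadataHeaders percent-encodes each metadata value and returns the
// result as HTTP headers. The "X-Example-" prefix is a hypothetical name.
func metadataHeaders(metadata map[string]string) http.Header {
	h := http.Header{}
	for key, value := range metadata {
		// PathEscape encodes a space as %20 and every non-ASCII byte as %XX.
		h.Set("X-Example-"+key, url.PathEscape(value))
	}
	return h
}

// decodeHeaderValue reverses the encoding on the receiving side.
func decodeHeaderValue(v string) (string, error) {
	return url.PathUnescape(v)
}

func main() {
	h := metadataHeaders(map[string]string{"Folder-Title": "Prod alerts/EU"})
	fmt.Println(h.Get("X-Example-Folder-Title"))
}
```

`url.PathEscape` is used here rather than `url.QueryEscape` because the latter encodes a space as `+`, which is form encoding rather than plain percent-encoding.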
* Check if a time interval is used in alert rules before deleting it
* Add time interval to parameters of ListAlertRulesQuery and ListNotificationSettings of DbStore
== Refactorings ==
* refactor isMuteTimeInUse to accept a single route
* update getMuteTiming to not return err
* update delete to get the mute timing from config first
* Create some integration testing infra for RRs
* whoops
* Require no error in responding
* fix linter
* Panic, no need to pass testing around
* Extend status test
* fix kind of TimeInterval
* register custom fields for selectors
* support field selectors in legacy storage
* support selectors in storage
===== Misc
* refactor conversions to build in one place
* hide implementation of provenance status behind accessors to use the key in selectors
* fix provenance error
* Unify values
* Fix with latest changes on main
* Fix up NaN test
* Keep refIDs with -1 as value
* Test that refIDs are preserved on Normal to Error transition
* Alerting to err test too
* Add a blurb to docs about this behavior
The contact point deletion API was returning 500 when it should have been
returning a 4xx error, when the contact point is in use:
- When in use by a notification policy, we were missing
the `.Errorf("")` to convert `errutil.Base` into `errutil.Error`.
- When in use by an alert rule, a regular error was returned.
* Alerting: Add setting for maximum allowed rule evaluation results
Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
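The limit check described above might look roughly like the following sketch. The function name, the error value, and the convention that a non-positive limit means "unlimited" are assumptions, not details taken from the change itself.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyResults is a hypothetical sentinel error.
var errTooManyResults = errors.New("alert rule evaluation produced too many results")

// checkResultLimit returns an error when an evaluation produced more results
// than the configured limit. A limit <= 0 is treated as "no limit" here.
func checkResultLimit(numResults, limit int) error {
	if limit > 0 && numResults > limit {
		return fmt.Errorf("%w: got %d, limit is %d", errTooManyResults, numResults, limit)
	}
	return nil
}

func main() {
	fmt.Println(checkResultLimit(5, 10))
	fmt.Println(checkResultLimit(11, 10))
}
```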
* add method CanReadAllRules to rule authorization service
* add alias type Namespace for Folder in ngalert's models package. It implements the Namespacer interface that is used by authz logic
* update state history's backends to authorize access to rules.
* update Loki to add folder UIDs to query.
* Update BuildLogQuery to drop filter by folders if it's too long and fall back to in-memory filtering.
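The fallback can be sketched as follows: build the query with a folder filter, and if it grows past a maximum length, drop the filter and filter the returned entries in memory instead. The query syntax, the `entry` type, and both function names are illustrative assumptions and do not reproduce the real `BuildLogQuery`; folder UIDs are assumed not to need escaping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// entry is a hypothetical log entry carrying the folder it belongs to.
type entry struct {
	folderUID string
	line      string
}

// buildQuery appends a folder filter to base. If the resulting query would
// be longer than maxLen, it returns base unchanged and true, signalling that
// the caller has to filter by folder in memory. An empty folder list is
// treated as "no filter" in this sketch.
func buildQuery(base string, folderUIDs []string, maxLen int) (string, bool) {
	if len(folderUIDs) == 0 {
		return base, false
	}
	uids := append([]string(nil), folderUIDs...)
	sort.Strings(uids) // the same set of folders always yields the same query
	q := fmt.Sprintf(`%s | folderUID=~"%s"`, base, strings.Join(uids, "|"))
	if len(q) > maxLen {
		return base, true
	}
	return q, false
}

// filterInMemory keeps only the entries whose folder is in allowed.
func filterInMemory(entries []entry, allowed []string) []entry {
	set := make(map[string]struct{}, len(allowed))
	for _, uid := range allowed {
		set[uid] = struct{}{}
	}
	var out []entry
	for _, e := range entries {
		if _, ok := set[e.folderUID]; ok {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	q, fallback := buildQuery(`{app="example"}`, []string{"b", "a"}, 64)
	fmt.Println(q, fallback)
}
```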
Alerting: fix preserving errors in the alert rule state during error to error transitions
Alert state transition from one error to another did not update state.Error correctly.
The error in state.Error remained as the initial error encountered.
This led to another issue: after a Grafana restart the error was lost, because
the state of the alert rule had not changed and the Error is not preserved in the
database between restarts.
This could happen if the expression service returned an error or the alert routine panicked
during querying.
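The intended behavior can be summarized in a small sketch: every transition into the error state overwrites the stored error, including when the rule is already in that state. The `ruleState` type and its fields are hypothetical and much simpler than the real state struct.

```go
package main

import (
	"errors"
	"fmt"
)

// Example errors used to drive the sketch.
var (
	errFirst  = errors.New("datasource timeout")
	errSecond = errors.New("expression failed")
)

// ruleState is a hypothetical, heavily simplified alert rule state.
type ruleState struct {
	name string
	err  error
}

// transitionToError records the latest evaluation error. The error is
// overwritten even when the state is already "Error".
func (s *ruleState) transitionToError(err error) {
	s.name = "Error"
	s.err = err
}

func main() {
	s := &ruleState{name: "Normal"}
	s.transitionToError(errFirst)
	s.transitionToError(errSecond)
	fmt.Println(s.name, s.err)
}
```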
* expose ngalert API to public
* add delete action to time-intervals
* introduce time-interval model generated by app-platform-sdk from CUE model; the fields of the model are chosen to be compatible with the current model
* implement api server
* add feature flag alertingApiServer
---- Test Infra
* update helper to support creating custom users with enterprise permissions
* add generator for Interval model
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be in line with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.
* Pointers for ResolvedAt & LastSentAt
To avoid awkward time.Time{}.Unix() defaults on persist
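The motivation for the pointer fields, together with the retention check, can be shown in a short sketch. The helper names, the choice to persist an unset time as 0, and the 15 minute retention value are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// zero is what a non-pointer time.Time field holds when it was never set.
var zero time.Time

// exampleRetention is an arbitrary value chosen for this sketch.
const exampleRetention = 15 * time.Minute

// unixOrZero converts an optional timestamp for persistence. A nil pointer
// becomes 0 instead of the large negative value of time.Time{}.Unix().
func unixOrZero(t *time.Time) int64 {
	if t == nil {
		return 0
	}
	return t.Unix()
}

// withinRetention reports whether an alert resolved at resolvedAt should
// still be sent as resolved at time now.
func withinRetention(resolvedAt *time.Time, now time.Time, retention time.Duration) bool {
	if resolvedAt == nil {
		return false // never resolved
	}
	return now.Sub(*resolvedAt) <= retention
}

func main() {
	fmt.Println(zero.Unix())     // -62135596800
	fmt.Println(unixOrZero(nil)) // 0

	resolved := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	now := resolved.Add(5 * time.Minute)
	fmt.Println(withinRetention(&resolved, now, exampleRetention)) // true
}
```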
* Add TracedClient
* Handle errors and status codes
* Wire up tracing to normal ASH and loki annotation mapping
* Add tracing to remote alertmanager
* one more spot
* and not or
* More consistency with other grafana traces, lower cardinality name
* chore(perf): Pre-allocate where possible (enable prealloc linter)
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
* fix TestAlertManagers_buildRedactedAMs
* prealloc a slice that appeared after rebase
---------
Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
* make the config sync happen on each call to ApplyConfig(), fix tests
* send autogen config
* add fake autogen function for tests
* update stale comments, tidy things up, make linter happy
* add auto-gen routes only if the feature toggle is enabled
* remove unnecessary fake autogen function
* throttle configuration syncs
* restore pkg/services/store/entity/sqlstash/sql_storage_server.go
* test sync loop in ApplyConfig, skip invalid autogen routes
* restore conf/defaults.ini
* avoid skipping invalid auto-gen routes in SaveAndApplyConfig
* test that autogenFn is called and its errors are returned
* add debug message about the sync interval not having elapsed
* collapse two log lines into one
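Throttling the sync so that it runs at most once per interval, even when the apply function is called more often, could be sketched like this. The `syncer` type, its fields, and the example values `t0` and `syncInterval` are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"time"
)

// Example values used to drive the sketch.
var t0 = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

const syncInterval = time.Minute

// syncer is a hypothetical holder of the throttling state.
type syncer struct {
	interval time.Duration
	lastSync time.Time
	syncs    int
}

// applyConfig performs a sync only if the interval has elapsed since the
// last one. It reports whether a sync happened.
func (s *syncer) applyConfig(now time.Time) bool {
	if !s.lastSync.IsZero() && now.Sub(s.lastSync) < s.interval {
		// A real implementation might log at debug level here that the
		// sync interval has not elapsed yet.
		return false
	}
	s.lastSync = now
	s.syncs++
	return true
}

func main() {
	s := &syncer{interval: syncInterval}
	fmt.Println(s.applyConfig(t0))                       // first call syncs
	fmt.Println(s.applyConfig(t0.Add(syncInterval / 2))) // too early, skipped
	fmt.Println(s.applyConfig(t0.Add(syncInterval)))     // interval elapsed
}
```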
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>