* AlertRule to return condition
* update ConditionEval to not return an error because it's always nil
* make getExprRequest private
* refactor executeCondition to just converter and move execution to the ConditionEval as this makes code more readable.
* log error if results have errors
* change signature of evaluate function to not return an error
* Introduce AlertsRouter in the sender package, and move all fields and methods related to notifications out of the scheduler to this router.
* Introduce a new interface AlertsSender in the schedule package and replace calls of anonymous function `notify` inside the ruleRoutine to calling methods of that interface.
* Rename interface Scheduler in api package to ExternalAlertmanagerProvider, and replace scheduler with AlertRouter as struct that implements the interface.
* Define query param and regenerate
* Add query struct for contact points
* Filter contact points by name in query
* Document that name filter is optional
* Define route and run codegen
* Wire up HTTP layer
* Update API layer and test fakes
* Implement reset of policy tree
* Implement service layer test and authorization bindings
* API layer testing
* Be more specific when injecting settings
* Alerting: validate that the receiver exist when updating routing tree
* rename interface
* add missing file
* change constructor
* fix e2e tests
* only import package once
* add unit test for bug
* wording
* close response body
* Update pkg/services/ngalert/api/tooling/definitions/alertmanager_validation.go
* refactor to remove database roundtrip
* make 'for' pointer to distinguish between missing field and 0
* set 'for' to -1 if the value is missing but not allow negative in the request + path -1 with the value from original rule
* update store validation to not allow negative 'for'
* update usages to use pointer
* chore/backend: move dashboard errors to dashboard service
Dashboard-related models are slowly moving out of the models package and into dashboard services. This commit moves dashboard-related errors; the rest will come in later commits.
There are no logical code changes, this is only a structural (package) move.
* lint lint lint
* Extend template and generate
* Generate and fix up alertmanager endpoints
* Prometheus routes
* fix up Testing endpoints
* touch up ruler API
* Update provisioning and fix 500
* Drop dead code
* Remove more dead code
* Resolve merge conflicts
Migrations:
* add a new column alert_group_idx to alert_rule table
* add a new column alert_group_idx to alert_rule_version table
* re-index existing rules during migration
API:
* set group index on update. Use the natural order of items in the array as group index
* sort rules in the group on GET
* update the version of all rules of all affected groups. This will make optimistic lock work in the case of multiple concurrent request touching the same groups.
UI:
* update UI to keep the order of alerts in a group
* Updates to all except alert rules
* Return 400 when rules fail to validate, add testinfra
* More sane package aliases
* More package alias renames
* One more bug in contact point validation
* remove unused function
Co-authored-by: Jean-Philippe Quémémer <jeanphilippe.quemener@grafana.com>
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
* Alerting: move group update to alert rule service
* rename validateAlertRuleInterval to validateRuleGroupInterval
* init baseinterval correctly
* add seconds suffix
* extract validation function for reusability
* add context to err message
* Alerting: decapitalize log lines and use "err" as the key for errors
Found using (logger|log).(Warn|Debug|Info|Error)\([A-Z] and (logger|log).(Warn|Debug|Info|Error)\(.+"error"
* Alerting: Remove double quotes from matchers
With #38629 a new Alertmanager configuration object was introduced with `object_matchers`, it was meant to circumvent around the fact that Prometheus label names don't support a set of characters that Grafana needs to support for alerts, silences, matchers, etc. (with a common example being elasticsearch's `.`).
This new object does not include the label of sanitzation or validation that its Prometheus equivalent supports in `matchers` and therefore are semantically not equivalent.
This triggered the problem that when the migration is run, we use `matchers` as the object to populate in configuration for routing policies, but when the UI does its first save this object is transformed to `object_matchers`.
Matchers that were previously running just fine would immediately stop working as soon as the configuration is saved.
This problem surfaced with the introduction of #49952 where we stopped stripping double quotes from matchers (not just regex but _all_ of them).
* Add comment explaining rationale and future removal
Co-authored-by: Alex Weaver <weaver.alex.d@gmail.com>
* update authz to exclude entire group if user does not have access to rule
* change rule update authz to not return changes because if user does not have access to any rule in group, they do not have access to the rule
* a new query that returns alerts in group by UID of alert that belongs to that group
* collect all affected groups during calculate changes
* update authorize to check access to groups
* update tests for calculateChanges to assert new fields
* add authorization tests
* Add validator for mute timing and make it provisionable
* Add tests to ensure prometheus validators are running and errors are propagated
* Internal API for manipulating mute timings
* Define and generate API layer
* Wire up generated code
* Implement API handlers
* Tests for golang layer
* Fix reference bug
* Fix linter and auth tests
* Resolve semantic errors and regenerate
* Remove pointless comment
* Extract out provisioning path param keys, simplify
* Expected number of paths
* Support for documenting stable vs unstable alerting routes
* empty commit, restart drone
* Touch-up references in root makefile and drop trailing escape newline
* Rebase and regenerate
* Extend README with docs for this change
This change adds a field to state.State and models.AlertInstance
that indicate the "Reason" that an instance has its current state. This
helps us account for cases where the state is "Normal" but the
underlying evaluation returned "NoData" or "Error", for example.
Fixes#42606
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
This change extracts screenshot data from alert messages via a private annotation `__alertScreenshotToken__` and attaches a URL to a Slack message or uploads the data to an image upload endpoint if needed.
This change also implements a few foundational functions for use in other notifiers.
* introduce a fallback handler that checks that role is Viewer.
* update UI nav links to allow alerting tabs for anonymous user
* update rule api to check for Viewer role instead of SignedIn when RBAC is disabled
* create AlertGroupKey structure
* update PrometheusSrv.
- extract creation of RuleGroup to a separate method. Use group key for grouping
* update RuleSrv
- update calculateChanges to use groupKey
- authorize to use groupkey
* Generate API for writing templates
* Persist templates app logic layer
* Validate templates
* Extract logic, make set and delete methods
* Drop post route for templates
* Fix response details, wire up remainder of API
* Authorize routes
* Mirror some existing tests on new APIs
* Generate mock for prov store
* Wire up prov store mock, add tests using it
* Cover cases for both storage paths
* Add happy path tests and fix bugs if file contains no template section
* Normalize template content with define statement
* Tests for deletion
* Fix linter error
* Move provenance field to DTO
* empty commit
* ID to name
* Fix in auth too
* Template service
* Add GET routes and implement them
* Generate mock for persist layer
* Unit tests for reading templates
* Set up composition root and get integration tests working
* Fix prealloc issue
* Extract setup boilerplate
* Update AuthorizationTest
* Rebase and resolve
* Fix linter error
Invalid PostableSilences could be passed to the Alerting API - if they
are passed all the way down into the alertmanager data layer, they can
cause a panic. This change adds validation to avoid a panic in the
alertmanager.
* wip: Implement kvstore for secrets
* wip: Refactor kvstore for secrets
* wip: Add format key function to secrets kvstore sql
* wip: Add migration for secrets kvstore
* Remove unused Key field from secrets kvstore
* Remove secret values from debug logs
* Integrate unified secrets with datasources
* Fix minor issues and tests for kvstore
* Create test service helper for secret store
* Remove encryption tests from datasources
* Move secret operations after datasources
* Fix datasource proxy tests
* Fix legacy data tests
* Add Name to all delete data source commands
* Implement decryption cache on sql secret store
* Fix minor issue with cache and tests
* Use secret type on secret store datasource operations
* Add comments to make create and update clear
* Rename itemFound variable to isFound
* Improve secret deletion and cache management
* Add base64 encoding to sql secret store
* Move secret retrieval to decrypted values function
* Refactor decrypt secure json data functions
* Fix expr tests
* Fix datasource tests
* Fix plugin proxy tests
* Fix query tests
* Fix metrics api tests
* Remove unused fake secrets service from query tests
* Add rename function to secret store
* Add check for error renaming secret
* Remove bus from tests to fix merge conflicts
* Add background secrets migration to datasources
* Get datasource secure json fields from secrets
* Move migration to secret store
* Revert "Move migration to secret store"
This reverts commit 7c3f872072.
* Add secret service to datasource service on tests
* Fix datasource tests
* Remove merge conflict on wire
* Add ctx to data source http transport on prometheus stats collector
* Add ctx to data source http transport on stats collector test
* Test composition simplification from last PR
* Policies use proper API model everywhere
* Expose policy provenance in API, miss some dep injection
* Complete injection
* fix args
* Tests for provenance value
* Extract test helpers so tests are very readable
* Single source adapter struct that was copied in 3 places
* Drop redundant test
* Resolve merge conflicts on changelog
* Refactor GET am config to be extensible
* Extract post config route
* Fix tests
* Remove temporary duplication
* Fix broken test due to layer shift
* Fix duplicated error message
* Properly return 400 on config rejection
* Revert weird half method extraction
* Move things to notifier package and avoid redundant interface
* Simplify documentation
* Split encryption service and depend on minimal abstractions
* Properly initialize things all the way up to the composition root
* Encryption -> Crypto
* Address misc feedback
* Missing docstring
* Few more simple polish improvements
* Unify on MultiOrgAlertmanager. Discover bug in existing test
* Fix rebase conflicts
* Misc feedback, renames, docs
* Access crypto hanging off MultiOrgAlertmanager rather than having a separate API to initialize
* Alerting: unwrap upsert into insert and update function
* add changelog entry
* remove changelog entry
* rename upsertrule to updaterule
* use directly alertrule model for inserts
* add test for updating a rule with a conflicting name
* add check for access to rule's data source in GET APIs
* use more general method GetAlertRules instead of GetNamespaceAlertRules.
* remove unused GetNamespaceAlertRules.
Tests:
* create a method to generate permissions for rules
* extract method to create RuleSrv
* add tests for RouteGetNamespaceRulesConfig
* move validation at the beginning of method
* remove usage of GetOrgRuleGroups because it is not necessary. All information is already available in memory.
* remove unused method
* Base-line API for provisioning notification policies
* Wire API up, some simple tests
* Return provenance status through API
* Fix missing call
* Transactions
* Clarity in package dependencies
* Unify receivers in definitions
* Fix issue introduced by receiver change
* Drop unused internal test implementation
* FGAC hooks for provisioning routes
* Polish, swap names
* Asserting on number of exposed routes
* Don't bubble up updated object
* Integrate with new concurrency token feature in store
* Back out duplicated changes
* Remove redundant tests
* Regenerate and create unit tests for API layer
* Integration tests for auth
* Address linter errors
* Put route behind toggle
* Use alternative store API and fix feature toggle in tests
* Fixes, polish
* Fix whitespace
* Re-kick drone
* Rename services to provisioning
* Alerting: Accurately set value for prom-compatible APIs
Sets the value fields for the prometheus compatible API based on a combination of condition `refID` and the values extracted from the different frames.
* Fix an extra test
* Ensure a consitent ordering
* Address review comments
* address review comments
* Add basic UI for custom ruler URL
* Add build info fetching for alerting data sources
* Add keeping data sources build info in the store
* Use data source build info to construct data source urls
* Remove unused code
* Add custom ruler support in prometheus api calls
* Migrate actions
* Use thunk condition to prevent multiple data source buildinfo fetches
* Unify prom and ruler rules loading
* Upgrade RuleEditor tests
* Upgrade RuleList tests
* Upgrade PanelAlertTab tests
* Upgrade actions tests
* Build info refactoring
* Get rid of lotex ruler support action
* Add prom ruler availability checking when the buildinfo is not available
* Add rulerUrlBuilder tests
* Improve prometheus data source validation, small build info refactoring
* Change prefix based on Prometheus subtype
* Use the correct path
* Revert config routing
* Add deprecation notice for /api/prom prefix
* Add tests to the datasource subtype
* Remove custom ruler support
* Remove deprecation notice
* Prevent fetching ruler rules when ruler api is not available
* Add build info tests
* Unify naming of ruler methods
* Fix test
* Change buildinfo data source validation
* Use strings for subtype params and unveil mimir
* organise imports
* frontend changes and wordsmithing
* fix test suite
* add a nicer verbose message for prometheus datasources
* detect Mimir datasource
* fix test
* fix buildinfo test for Mimir
* shrink vectors
* add some code documentation
* DRY prepareRulesFilterQueryParams
* clarify that Prometheus does not support managing rules
* Improve buildinfo error handling
Co-authored-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: gillesdemey <gilles.de.mey@gmail.com>
* make eval.Evaluator an interface
* inject Evaluator to TestingApiSrv
* move conditionEval to RouteTestGrafanaRuleConfig because it is the only place where it is used
* update rule test api to check data source permissions
* Use alert:create action for folder search with edit permissions. This matches the action that is used to query dashboards (the update will be addressed later)
* Update rule store to use FindDashboards instead of folder service to list folders the user has access to view alerts. Folder service does not support query type and additional filters.
* Do not check whether the user can save to folder if FGAC is enabled because it is checked on API level.
* require legacy Editor for post, put, delete endpoints
* require user to be signed in on group level because handler that checks that user has role Editor does not check it is signed in
* verify that the user has access to all data sources used by the rule that needs to be deleted from the group
* if a user is not authorized to access the rule, the rule is removed from the list to delete
* rename GetRuleGroupAlertRules to GetAlertRules
* make rule group optional in GetAlertRulesQuery
* simplify FakeStore. the current structure did not support optional rule group
update method getEvaluatorForAlertRule to accept permissions evaluator and exit on the first negative result, which is more effective than returning an evaluator that in fact is a bunch of slices.
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>