* Alerting: Attempt to retry retryable errors
Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.
I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.
There's two small differences between how retries work now and how they used to work in legacy alerting.
Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* first round of entityapi updates
- quote column names and clean up insert/update queries
- replace grn with guid
- streamline table structure
fixes
streamline entity history
move EntitySummary into proto
remove EntitySummary
add guid to json
fix tests
change DB_Uuid to DB_NVarchar
fix folder test
convert interface to any
more cleanup
start entity store under grafana-apiserver dskit target
CRUD working, kind of
rough cut of wiring entity api to kube-apiserver
fake grafana user in context
add key to entity
list working
revert unnecessary changes
move entity storage files to their own package, clean up
use accessor to read/write grafana annotations
implement separate Create and Update functions
* go mod tidy
* switch from Kind to resource
* basic grpc storage server
* basic support for grpc entity store
* don't connect to database unless it's needed, pass user identity over grpc
* support getting user from k8s context, fix some mysql issues
* assign owner to snowflake dependency
* switch from ulid to uuid for guids
* cleanup, rename Search to List
* remove entityListResult
* EntityAPI: remove extra user abstraction (#79033)
* remove extra user abstraction
* add test stub (but
* move grpc context setup into client wrapper, fix lint issue
* remove unused constants
* remove custom json stuff
* basic list filtering, add todo
* change target to storage-server, allow entityStore flag in prod mode
* fix issue with Update
* EntityAPI: make test work, need to resolve expected differences (#79123)
* make test work, need to resolve expected differences
* remove the fields not supported by legacy
* sanitize out the bits legacy does not support
* sanitize out the bits legacy does not support
---------
Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
* update feature toggle generated files
* remove unused http headers
* update feature flag strategy
* devmode
* update readme
* spelling
* readme
---------
Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
* Alerting: Attempt to retry retryable errors
Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Have the first iteration
* Prepare bench testing
* rename the test files
* Remove unnecessary test file
* Introduce influxqlStreamingParser feature flag
* Apply streaming parser feature flag
* Add new tests
* More tests
* return executedQueryString only in first frame
* add frame meta and config
* Update golden json files
* Support tags/labels
* more tests
* more tests
* Don't change original response_parser.go
* provide context
* create util package
* don't pass the row
* update converter with formatted frameName
* add executedQueryString info only to first frame
* update golden files
* rename
* update test file
* use pointer values
* update testdata
* update parsing
* update converter for null values
* prepare converter for table response
* clean up
* return timeField in fields
* handle no time column responses
* better nil field handling
* refactor the code
* add table tests
* fix config for table
* table response format
* fix value
* if there is no time column set name
* linting
* refactoring
* handle the status code
* add tracing
* Update pkg/tsdb/influxdb/influxql/converter/converter_test.go
Co-authored-by: İnanç Gümüş <m@inanc.io>
* fix import
* update test data
* sanity
* sanity
* linting
* simplicity
* return empty rsp
* rename to prevent confusion
* nullableJson field type for null values
* better handling null values
* remove duplicate test file
* fix healthcheck
* use util for pointer
* move bench test to root
* provide fake feature manager
* add more tests
* partial fix for null values in table response format
* handle partial null fields
* comments for easy testing
* move frameName allocation in readSeries
* one less append operation
* performance improvement by making string conversion once
pkg: github.com/grafana/grafana/pkg/tsdb/influxdb/influxql
│ stream2.txt │ stream3.txt │
│ sec/op │ sec/op vs base │
ParseJson-10 314.4m ± 1% 303.9m ± 1% -3.34% (p=0.000 n=10)
│ stream2.txt │ stream3.txt │
│ B/op │ B/op vs base │
ParseJson-10 425.2Mi ± 0% 382.7Mi ± 0% -10.00% (p=0.000 n=10)
│ stream2.txt │ stream3.txt │
│ allocs/op │ allocs/op vs base │
ParseJson-10 7.224M ± 0% 6.689M ± 0% -7.41% (p=0.000 n=10)
* add comment lines
---------
Co-authored-by: İnanç Gümüş <m@inanc.io>
* Folders: Show folders user has access to at the root level
* Refactor
* Refactor
* Hide parent folders user has no access to
* Skip expensive computation if possible
* Fix tests
* Fix potential nil access
* Fix duplicated folders
* Fix linter error
* Fix querying folders if no managed permissions set
* Update benchmark
* Add special shared with me folder and fetch available non-root folders on demand
* Fix parents query
* Improve db query for folders
* Reset benchmark changes
* Fix permissions for shared with me folder
* Simplify dedup
* Add option to include shared folder permission to user's permissions
* Fix nil UID
* Remove duplicated folders from shared list
* Folders: Fix fetching empty folder
* Nested folders: Show dashboards with directly assigned permissions
* Fix slow dashboards fetch
* Refactor
* Fix cycle dependencies
* Move shared folder to models
* Fix shared folder links
* Refactor
* Use feature flag for permissions
* Use feature flag
* Review comments
* Expose shared folder UID through frontend settings
* Add frontend type for sharedWithMeFolderUID option
* Refactor: apply review suggestions
* Fix parent uid for shared folder
* Fix listing shared dashboards for users with access to all folders
* Prevent creating folder with "shared" UID
* Add tests for shared folders
* Add test for shared dashboards
* Fix linter
* Add metrics for shared with me folder
* Add metrics for shared with me dashboards
* Fix tests
* Tests: add metrics as a dependency
* Fix access control metadata for shared with me folder
* Use constant for shared with me
* Optimize parent folders access check, fetch all folders in one query.
* Use labels for metrics
* Export Notification Policy correctly (#78020)
The JSON version of an exported Notification Policy now
inline correctly the policy in the same way the Yaml version
does.
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* ngalert `make`: Support GNU install on Darwin
Currently, the Makefile assumes that Darwin is using the Mac version of `sed`
I have the GNU version, so it failed. With this PR, it checks which version is installed
I also called `make` and there are some changes that came out of it
* swagger-gen
* Return data in camelCase from the OAuth fb strategy
* changes
* wip
* Add defaults for oauth fb strategy
* revert other changes
* Add tests
* Add Defaults to cfg and use it in OAuthStrategy
* Return *OAuthInfo from OAuthStrategy
* lint
* Remove unnecessary Defaults
* Introduce const for fields, fix import order
* Align failing tests
* clean up
* Changes requested by @gamab
* Update pkg/services/ssosettings/strategies/oauth_strategy_test.go
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
* Load data on startup
* Rename + simplify
---------
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
* Alerting: Only warm alert state cache if execute_alerts=true.
If the Grafana instance is not executing alerts, then Warm()-ing the state
manager is wasteful and could lead to misleading rule status queries, as the
status returned will be always based on the state loaded from the database at
startup, and not the most recent evaluation state.
* Move Warm() down to shared conditional.
* Alerting: Add clean_upgrade config and deprecate force_migration
Upgrading to UA and rolling back will no longer delete any data by default.
Instead, each set of tables will remain unchanged when switching between
legacy and UA. As such, the force_migration config has been deprecated
and no extra configuration is required to roll back to legacy anymore.
If clean_upgrade is set to true when upgrading from legacy alerting to Unified
Alerting, grafana will first delete all existing Unified Alerting resources,
thus re-upgrading all organizations from scratch. If false or unset,
organizations that have previously upgraded will not lose their existing Unified
Alerting data when switching between legacy and Unified Alerting.
Similar to force_migration, it should be kept false when not needed as it may
cause unintended data-loss if left enabled.
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
* Alerting: Keep track of individual org migration status
Save migration status per migrated org.
Change the meaning (and key/value) of the org_id=0 entry
to store the current (previous) config value used by alerting.
This is so we can know when to upgrade/downgrade by
comparing with the new config value in
UnifiedAlerting.IsEnabled.
* Alerting: Add a sync interval for ApplyConfig in remote secondary mode
* remove out of scope code
* remove parentheses after CleanUp for consistency in test comments
* Add comment to ApplyConfig
* Add anonymous stats and user table
- anonymous users users page
- add feature toggle `anonymousAccess`
- remove check for enterprise for `Device-Id` header in request
- add anonusers/device count to stats
* promise all, review comments
* make use of promise all settled
* refactoring: devices instead of users
* review comments, moved countdevices to httpserver
* fakeAnonService for tests and generate openapi spec
* do not commit openapi3 and api-merged
* add openapi
* Apply suggestions from code review
Co-authored-by: Alex Khomenko <Clarity-89@users.noreply.github.com>
* formatin
* precise anon devices to avoid confusion
---------
Co-authored-by: Alex Khomenko <Clarity-89@users.noreply.github.com>
Co-authored-by: jguer <me@jguer.space>
* refactor SSOSettings to use types
* test struct
* refactor SSOSettings struct to use types
* fix database tests
* fix populateSSOSettings() to accept an SSOSettings param
* fix all tests from the database layer
* handle errors for converting to/from SSOSettings
* add json tag on OAuthInfo fields
* use continue instead of if/else
* add the source field to SSOSettingsDTO conversion
* remove omitempty from json tags in OAuthInfo struct
* Alerting: In migration improve deduplication of title and group
This change improves alert titles generated in the legacy migration
that occur when we need to deduplicate titles. Now when duplicate
titles are detected we will first attempt to append a sequential index,
falling back to a random uid if none are unique within 10 attempts.
This should cause shorter and more easily readable deduplicated
titles in most cases.
In addition, groups are no longer deduplicated. Instead we set them
to a combination of truncated dashboard name and humanized alert
frequency. This way, alerts from the same dashboard share a group
if they have the same evaluation interval. In the event that truncation
causes overlap, it won't be a big issue as all alerts will still be in a
group with the correct evaluation interval.
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager
Mimir client that understands the new APIs developed for mimir. Very much a WIP still.
* more wip
* appease the linter
* more linting
* add more code
* get state from kvstore, encode, send
* send state to the remote Alertmanager, extract fullstate logic into its own function
* pass kvstore to remote.NewAlertmanager()
* refactor
* add fake kvstore to tests
* tests
* use FileStore to get state
* always log 'completed state upload'
* refactor compareRemoteConfig
* base64-encode the state in the file store
* export silences and nflog filenames, refactor
* log 'completed state/config upload...' regardless of outcome
* add values to the state store in tests
* address code review comments
* log error from filestore
---------
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* ExtSvcAuth: Assign roles locally
* Fix test
* HandlePluginStateChanged in the OrgID
* Remove Global from command
* Use AssignmentOrgID instead of OrgID
* Remove unecessary test case
* AuthN: Check API Key is not trying to access another organization
* Revert local change
* Add test
* Discussed with Kalle we should set r.OrgID
* Syntax sugar
* Suggestion org-mismatch
* Alerting: Apply query optimization to eval endpoints
Previously, query optimization was applied to alert queries when scheduled but
not when ran through `api/v1/eval` or `/api/v1/rule/test/grafana`. This could
lead to discrepancies between preview and scheduled alert results.
* Alerting: Add GetFullState method to FileStore
* make tests compile, create stateStore in NewAlertmanager
* return errors instead of logging, accept an arbitrary number of strings
* make NewAlertmanager() accept a stateStore
* Alerting: In migration, fallback to '1s' for malformed min interval
During legacy migration, when we encounter an alert datasource query
with a min interval (interval field in the query model) that is not
parseable, instead of failing the migration we fallback to a min interval
of 1s and continue.
The reason for this is a bug in legacy alerting (existing for a few major
versions) which allows arbitrary dashboard variables to be used as the
min interval, even though those variables do not work and will cause
the legacy alert to fail with `interval calculation failed: time: invalid
duration`.
* regression analysis first dragt
* Swap to better regression libraries
* fix name
* Interpolate x points instead of using source x points
* clean up ui and add feature toggle
* fix merge error
* change to loop for finding min max, rename resolution
* Add docs
* add docs and tests
* change name to regression analysis
* update docs
* Fix editor labels
* add regression images
* fix docs
* Remote Alertmanager(refactor): Only parse the URL once
Exactly what it says in the tin.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* use the existing tests
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Introduce a Mimir client as part of the Remote Alertmanager
This is our first attempt at making Grafana communicate use Mimir as a backend - it uses a new set of APIs that we've developed on the Mimir side to upload the grafana configuration and alertmanager state so that it can then be ported over.
Codewise, we've introduced a couple of things:
A client to isolate in its own package all the communication that happens with Mimir
A few changes to the remote/alertmanager to include uploading the configuration and state when it starts
A few refactors that align a bit better with the design approach that we're thinking
An integration tests again these newly developed APIs using a custom image
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
* Embed flame graph
* Update test
* Update test
* Use toggle
* Update test
* Add tests
* Use const
* Cleanup
* Update profile tag
* Move flame graph out of tags, remove request and other cleanup + tests
* Update test
* Set flame graph by profile id and simplify logic
* Cleanup and redrawListView
* Create/use feature toggle
* remove all the things
* fix OldFolderPicker tests
* i18n
* remove more unused code
* remove mutation of error object since it's now frozen in the redux state
* fix error handling
* remove use of SignedInUserCopies
* add extra safety to not cross assign permissions
unwind circular dependency
dashboardacl->dashboardaccess
fix missing import
* correctly set teams for permissions
* fix missing inits
* nit: check err
* exit early for api keys
* move the name finding logic for new data sources from frontend to backend
* cleanup and fix test
* linting
* change the way the number after the ds type is incremented - keep incrementing it without adding more hyphens
* enterprise spec updates (unrelated to the PR)
* Lock when creating external service
* Add local lock back
* Improve function signature
* Define lockName separately to make it more explicit
* Update pkg/infra/serverlock/serverlock.go
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
* Update pkg/infra/serverlock/serverlock.go
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
---------
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
* Move test to the db so we test the queries and not just testing the mock
* Remove unused function and dependencies
* Remove unused functions from the database
* Add some integration tests
* prepare backend for structured metadata
* add `lokiStructuredMetadata` feature toggle
* use `lokiStructuredMetadata` feature flag
* add field type check to `labelTypesField`
* remove fixme
* fix feature toggle
* add field in dataplane mode
* use `data.Labels` where possible
* adjust framing tests
* improve verbiage
* improve naming
* update tests to "attributes"
* change where folder checks are done for dash creation/updates
* add test for folder not being found
* test fixes
* more test fixes
* add nlint directive to where folder IDs are used
* fix bad merge
* fix test
* Plugin: Remove external service on plugin removal
* Early exit no service account
* Add log
* WIP
* Cable OAuth2Server client removal
* Move function lower
* Add function to test removal
* Add test to RemoveExternalService
* Test RemoveExtSvcAccount
* remove apostrophy in comment
* Add cfg to plugin installer to check features
* Add feature flag check in the service registration service
* Comments
* Move metrics Inc
* Initialize map
* Reorder
* Initialize mutex as well
* Add HasExternalService as suggested
* WIP: CleanUpOrphanedExternalServices
* Commit suggestion
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
* Nit on test.
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
* oauthserver return names
* Name is not Slug
* Use plugin ID not slug
* Add background job
* remove negation on feature check
* Add test to the CleanUp function
* Test GetExternalServiceNames
* rename test
* Add test for ExtSvcAccountsService_GetExternalServiceNames
* Add a todo
* Add todo
* Option based on mix
* Rewrite a bit the comment
* Opinionated choice use slugs instead of names everywhere
* Nit.
* Comments and re-ordering
* Comment
* Add log
* Add context
---------
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
* LogRow: detect text selection
* LogRow: refactor menu as component
* LogRow: add actions to menu
* LogRow: hack menu position
* Remove unsused imports
* LogRowMessage: remove popover code
* PopoverMenu: refactor
* LogRows: implement PopoverMenu at log rows level
* PopoverMenu: implement copy
* PopoverMenu: receive row model
* PopoverMenu: fix onClick capture issue
* Explore: add new filter methods and props for line filters
* PopoverMenu: use new filter props
* Explore: separate toggleable and non toggleable filters
* PopoverMenu: improve copy
* ModifyQuery: extend line filter with value argument
* PopoverMenu: close with escape
* Remove unused import
* Prettier
* PopoverMenu: remove label filter options
* LogRow: rename text selection handling prop
* Update test
* Remove unused import
* Popover menu: add unit test
* LogRows: update unit test
* Log row: hide the log row menu if the user is selecting text
* Log row: dont hide row menu if popover is not in scope
* Log rows: rename state variable
* Popover menu: allow menu to scroll
* Log rows: fix classname prop
* Log rows: close popover if mouse event comes from outside the log rows
* Declare new class using object style
* Fix style declaration
* Logs Popover Menu: add string filtering functions (#76757)
* Loki modifyQuery: add line does not contain query modification
* Elastic modifyQuery: implement line filters
* Modify query: change action name to not be loki specific
* Prettier
* Prettier
* Elastic: escape filter values
* Popover menu: create feature flag
* Log Rows: integrate logsRowsPopoverMenu flag
* Rename feature flag
* Popover menu: track interactions
* Prettier
* logRowsPopoverMenu: update stage
* Popover menu: add ds type to tracking data
* Log rows: move feature flag check
* Improve handle deselection
* Swagger: Fix listTokensResponse
It should return a list of Tokens, not a single one
Also regenerated the API spec from the latest changes + this branch
* Remove pointer
* [WIP] Lift RBAC from xorm store
* Cleanup RBAC, fix tests
* Use the scope type map as a map
* Remove dependency on dashboard service
* Make dashboards a map for constant time lookups (useful later)
---
* Lift RBAC tests into a new file to test at service level
* Add necessary access resource structs to xorm store tests
* Move authorization into separate service
* Pass features to searchstore.Builder
* Sort imports
* Code cleanup
* Remove useless scope type check
* Lift permission check into `Authorize()`
* Use clearer language when checking scope types
* Include dashboard permissions in test to ensure they're ignored
* Switch to errutil
* Cleanup sql.Cfg refs
* Add `isManaged` property to frontend model
* Remove enabled and token buttons for managed SA
* Replace trash icon for lock icon for managed SA
* Block the role picker for managed SA
* Filter SA list usiong the managed filter
* Rename external for managed
* Add only managed filter
* Toggle the enable buttons for managed sa
* Disable add token and delete token buttons
* Remove the edit name button
* Disable the Role picker for managed sa
* Hide the permissions section
* Add managed by row
---------
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
* Alerting: Add an empty Forked Alertmanager
* Alerting: Add methods for silences to the forked Alertmanager
* check for errors in tests
* make linter happy
* Alerting: Add methods for alerts to the forked Alertmanager
* Alerting: Add methods for receivers to the forked Alertmanager
* Alerting: Add TestTemplate method to the forked Alertmanager
* make linter happy
* separate into both forked AMs
* fix tests
* Alerting: Add lifecycle methods to the forked Alertmanager
* Plugin: Remove external service on plugin removal
* Add feature flag check in the service registration service
* Initialize map
* Add HasExternalService as suggested
* Commit suggestion
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
* Nit on test.
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
---------
Co-authored-by: linoman <2051016+linoman@users.noreply.github.com>
Check embedded errors in query data response for plugin metrics/logs status label.
Plugin Request Completed log messages are now logged with info level if status=ok,
otherwise error level.
Fixes#76769
* Pass OTEL sampling config to plugins
* fix capital letters
* Do not pass sampler env vars if sampling is not configured
* Add tests
* PR review feedback
* Simplify tracing env vars logic
* Update test to reflect pkg/infra/tracing behaviour