* Alerting: Add setting for maximum allowed rule evaluation results
Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
* add method CanReadAllRules to rule authorization service
* add alias type Namespace for Folder in ngalert's models package. It implements the Namespacer interface that is used by authz logic
* update state history's backends to authorize access to rules.
* update Loki to add folders UIDs to query.
* Update BuildLogQuery to drop filter by folders if it's too long and fall back to in-memory filtering.
Alerting: fix preserving errors in the alert rule state during error to error transitions
Alert state transition from one error to another did not update state.Error correctly.
The error in state.Error remained as the initial error encountered.
This led to another issue, where after a Grafana restart, the error was lost because
the state of the alert rule did not change, but the Error is not preserved in the database
between restarts.
This could happen if the expression service returned an error or the alert routine panicked
during querying.
* expose ngalert API to public
* add delete action to time-intervals
* introduce time-interval model generated by app-platform-sdk from CUE model the fields of the model are chosen to be compatible with the current model
* implement api server
* add feature flag alertingApiServer
---- Test Infra
* update helper to support creating custom users with enterprise permissions
* add generator for Interval model
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.
* Pointers for ResolvedAt & LastSentAt
To avoid awkward time.Time{}.Unix() defaults on persist
* Add TracedClient
* Handle errors and status codes
* Wire up tracing to normal ASH and loki annotation mapping
* Add tracing to remote alertmanager
* one more spot
* and not or
* More consistency with other grafana traces, lower cardinality name
* make the config sync happen on each call to ApplyConfig(), fix tests
* send autogen config
* add fake autogen function for tests
* update stale comments, tidy things up, make linter happy
* add auto-gen routes only if the feature toggle is enabled
* remove unnecessary fake autogen function
* throttle configuration syncs
* restore pkg/services/store/entity/sqlstash/sql_storage_server.go
* test sync loop in ApplyConfig, skip invalid autogen routes
* restore conf/defaults.ini
* restore conf/defaults.ini
* avoid skipping invalid auto-gen routes in SaveAndApplyConfig
* test that autogenFn is called and its errors are returned
* add debug message about the sync interval not having elapsed
* collapse two log lines into one
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration
---------
Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
* Make MakeDependencyError public for tests in another package
* Create tests for errors in eval results
* Extract logic to pull frame errors out into exported function
* Maybe we can drop cyclomatic complexity lint suppression now?
* extract frame errors and fail recording rules if frames contain error
* Fix up retry logic to actually work
* Do not retry non retryable errors
* add test for the bug
* remove unused struct
* update db store to post process filters by group using go-lang's case-sensitive string comparison
--------
Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
* add version to time-interval models
* set time interval fingerprint as version
* update to check provided version
* delete to check if version is provided in query parameter 'version'
* update integration tests
* update specs
* Support record struct in provisioning API
* Update api spec
* Use record field
* Restrict API endpoints following toggle
* Fix swagger spec
* Add recording rule validation to store validator
* Alerting: Update grafana/alerting
* make tests pass by implementing yaml unmarshallers and deleting fields with omitempty in their yaml tags
* go mod tidy
* fix tests by implementing not calling GettableApiAlertingConfig.UnmarshalYAML from GettableApiAlertingConfig.UnmarshalJSON
* cleanup, reduce diff
* fix more tests
* update grafana/alerting to latest commit, delete global section from configs in tests
* bring back YAML unmarshaller for GettableApiAlertingConfig
* update alerting package dependency to point to main
* skip test for sns notifier
* Placeholder commit with rule_uid change
* Add new filters to grafana rule state API
* Revert type change
* Split rule_group and rule_name params
* remove debug line
* Change how query params are parsed
* Comment
* Folders: Optionally include fullpath in service responses
* Alerting: Export folder fullpath instead of title
* Escape separator in folder title
* Add support for provisiong alret rules into subfolders
* Use FolderService for creating folders during provisioning
* Export WithFullpath() folder service function
---------
Co-authored-by: Tania B <yalyna.ts@gmail.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>