Commit Graph

1485 Commits

Author SHA1 Message Date
Yuri Tseretyan c3b5cabb14 Alerting: Refactor scheduler's rule evaluator to store rule key (#89925) 2024-07-01 16:43:23 -04:00
Yuri Tseretyan 655e477c20 Alerting: Fix flaky test in scheduler's tests (#89923) 2024-07-01 13:31:03 -04:00
Santiago fce03cd724 Alerting: Send static headers to the remote Alertmanager (#89846) 2024-07-01 17:48:40 +02:00
Yuri Tseretyan 559738ce6a Alerting: Fix flaky test in historian (#89913) 2024-07-01 16:59:06 +03:00
Gabriel MABILLE 71d31397e5 Fix flaky tests (#89910) 2024-07-01 14:39:51 +02:00
Alexander Akhmetov 68691c9386 Alerting: Add setting for maximum allowed rule evaluation results (#89468)
* Alerting: Add setting for maximum allowed rule evaluation results

Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
2024-06-27 09:45:15 +02:00
Yuri Tseretyan 06d5850396 Alerting: Update alerting state history API to authorize access using RBAC (#89579)
* add method CanReadAllRules to rule authorization service

* add alias type Namespace for Folder in ngalert's models package. It implements the Namespacer interface that is used by authz logic

* update state history's backends to authorize access to rules.
* update Loki to add folders UIDs to query. 
    * Update BuildLogQuery to drop filter by folders if it's too long and fall back to in-memory filtering.
2024-06-26 10:25:37 -04:00
Santiago fe1309dd96 Alerting: Send external URL to the remote Alertmanager (#89701)
* Alerting: Send external URL to the remote Alertmanager

* test that the URL is sent to the remote Alertmanager

* AppURL -> ExternalURL
2024-06-26 14:02:02 +02:00
Nihal b1ce7e8a83 Alerting: Reduce cyclomatic complexity of PrepareRuleGroupStatuses (#89649)
* reduce cyclomatic complexity of PrepareRuleGroupStatuses. see https://github.com/grafana/grafana/issues/88670

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

* reduce cyclomatic complexity for PrepareRuleGroupStatuses in pkg/services/ngalert/api/api_prometheus.go

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

* reduce cyclomatic complexity for PrepareRuleGroupStatuses in pkg/services/ngalert/api/api_prometheus.go

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>

---------

Signed-off-by: Syed Nihal <syed.nihal@nokia.com>
2024-06-26 10:21:37 +02:00
Matthew Jacobson 83d05ea777 Alerting: Fix broken state tests (#89712) 2024-06-25 14:19:46 -04:00
Matthew Jacobson 47c9259d75 Alerting: Ensure we update State.LastSentAt before persisting (#89427) 2024-06-25 13:01:26 -04:00
William Wernert fcfa89f864 Alerting: Implement Prometheus remote write for recording rules (#89189)
* Fix timestamp recorded by rule

* Implement prometheus remote write

* Create http client instead of transport

* Address PR comments

* Remove status code label
2024-06-25 17:23:42 +03:00
Yuri Tseretyan 4a5aab54a5 Alerting: Add max limit for Loki query size in state history API (#89646)
* add setting for query limit

* update BuildLogQuery to return error if limit is exceeded

* move tests for BuildLogQuery to separate suite
2024-06-25 09:20:38 -04:00
Alexander Akhmetov 2035814059 Alerting: fix updating error in the alert rule state during error to error transitions and restarts (#89557)
Alerting: fix preserving errors in the alert rule state during error to error transitions

Alert state transition from one error to another did not update state.Error correctly.
The error in state.Error remained as the initial error encountered.
This led to another issue, where after a Grafana restart, the error was lost because
the state of the alert rule did not change, but the Error is not preserved in the database
between restarts.

This could happen if the expression service returned an error or the alert routine panicked
during querying.
2024-06-25 09:42:00 +02:00
Yuri Tseretyan b075926202 Alerting: Time Intervals API (#88201)
* expose ngalert API to public
* add delete action to time-intervals
* introduce time-interval model generated by app-platform-sdk from CUE model the fields of the model are chosen to be compatible with the current model
* implement api server
* add feature flag alertingApiServer
---- Test Infra
* update helper to support creating custom users with enterprise permissions
* add generator for Interval model
2024-06-20 16:52:03 -04:00
Matthew Jacobson 3228b64fe6 Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
Yuri Tseretyan 92f10b73a8 Alerting: Move interface Namespaced from accesscontrol to models package (#89439)
move Namespaced interface from accesscontrol to models
2024-06-19 16:18:33 -04:00
Steve Simpson 8eabef1f91 Alerting: Update remote alertmanager to use extended receivers API. (#89253)
* Alerting: Update remote alertmanager to use extended receivers API.

* Update integration test and Mimir image

* Update Mimir image in more places
2024-06-19 12:40:22 +03:00
Steve Simpson 43a246f431 Alerting: Improve performance of /api/prometheus for large numbers of alerts. (#89268)
* Alerting: Optimize sorting alert instances.

* Also change other Labels fields for consistency
2024-06-17 12:25:47 +02:00
Alexander Weaver 8491e02caf Alerting: Instrument outbound requests for Loki Historian and Remote Alertmanager with tracing (#89185)
* Add TracedClient

* Handle errors and status codes

* Wire up tracing to normal ASH and loki annotation mapping

* Add tracing to remote alertmanager

* one more spot

* and not or

* More consistency with other grafana traces, lower cardinality name
2024-06-14 13:24:12 -05:00
Dave Henderson 6262c56132 chore(perf): Pre-allocate where possible (enable prealloc linter) (#88952)
* chore(perf): Pre-allocate where possible (enable prealloc linter)

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* fix TestAlertManagers_buildRedactedAMs

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* prealloc a slice that appeared after rebase

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

---------

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
2024-06-14 14:16:36 -04:00
Steve Simpson dd3c3b5857 Alerting: Update grafana/alerting. (#88914) 2024-06-14 09:19:04 +02:00
Ryan McKinley 99d8025829 Chore: Move identity and errutil to apimachinery module (#89116) 2024-06-13 07:11:35 +03:00
William Wernert c62cc25513 Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
Santiago b7120c5a30 Alerting: Fix missing argument in call to createRemoteAlertmanager() (#89101) 2024-06-12 11:57:22 +03:00
Santiago 12d5251c12 Alerting: Alertmanager configuration sync loop (#88822)
* make the config sync happen on each call to ApplyConfig(), fix tests

* send autogen config

* add fake autogen function for tests

* update stale comments, tidy things up, make linter happy

* add auto-gen routes only if the feature toggle is enabled

* remove unnecessary fake autogen function

* throttle configuration syncs

* restore pkg/services/store/entity/sqlstash/sql_storage_server.go

* test sync loop in ApplyConfig, skip invalid autogen routes

* restore conf/defaults.ini

* restore conf/defaults.ini

* avoid skipping invalid auto-gen routes in SaveAndApplyConfig

* test that autogenFn is called and its errors are returned

* add debug message about the sync interval not having elapsed

* collapse two log lines into one
2024-06-12 10:13:34 +02:00
Jacob Valdemar eb76ea47a0 Alerting: Add ha_reconnect_timeout configuration option (#88823)
* Docs: Update "Configure high availability" guide with ha_reconnect_timeout configuration

---------

Co-authored-by: Christopher Moyer <35463610+chri2547@users.noreply.github.com>
2024-06-11 13:25:48 -04:00
Alexander Akhmetov 667fea6623 Alerting: use hash of labels instead of labels string as the alert state cache key (#88956)
* Alerting: use hash instead of labels as the cache key
* Use data.Labels.Fingerprint to calculate the cache key
2024-06-11 18:34:58 +02:00
Alexander Weaver d004f8a98d Alerting: Recording rules understands errors embedded in dataframes (#88946)
* Make MakeDependencyError public for tests in another package

* Create tests for errors in eval results

* Extract logic to pull frame errors out into exported function

* Maybe we can drop cyclomatic complexity lint suppression now?

* extract frame errors and fail recording rules if frames contain error

* Fix up retry logic to actually work

* Do not retry non retryable errors
2024-06-11 10:37:10 -05:00
Steve Simpson d440d86bbb Alerting: Fix erroneous use of grafana-cli/logger. (#89037)
Can't see how this was intentional, likely just a typo.
2024-06-11 14:29:56 +02:00
Santiago cdbc9d801f Alerting: Use the internal Alertmanager to test templates and receivers (remote primary) (#88988) 2024-06-11 11:06:07 +02:00
Yuri Tseretyan d4b0ac5973 Alerting: Fix rule storage to filter by group names using case-sensitive comparison (#88992)
* add test for the bug
* remove unused struct
* update db store to post process filters by group using go-lang's case-sensitive string comparison

--------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2024-06-10 19:05:47 -04:00
Santiago 5f4d07bb75 Alerting: Enable remote primary mode using feature toggles (#88976) 2024-06-10 17:07:13 +02:00
Santiago e15e40fbd3 Alerting: Skip setting up clustering in remote primary/only modes (#88968)
* Alerting: Skip setting up clustering in remote primary mode

* Update pkg/services/ngalert/notifier/multiorg_alertmanager.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-06-10 13:51:11 +02:00
William Wernert 63e9969c1b Alerting: Recording rule mapping logic for data frames to Prometheus metrics (#88550)
* Add stub Prometheus writer with mapping logic

* Add tests
2024-06-07 20:00:22 +03:00
Alexander Weaver 58fdb24b0b Alerting: Recording rules appear as type=recording in Prometheus API + better abstraction for type (#88805)
* Wire status through to prom API

* Regenerate swagger
2024-06-07 11:24:06 -05:00
Yuri Tseretyan 32ea1801aa Alerting: Support AWS SNS integration in Grafana (#88867) 2024-06-07 11:49:49 -04:00
Alexander Weaver f1dc63565e Alerting: Fix go-swagger extraction and several embedded types from Alertmanager in Swagger docs (#88879)
Drop redundant swagger model comments
2024-06-07 10:47:47 -05:00
Yuri Tseretyan 003e3efce9 Alerting: Update mute timings provisioning API to support optimistic locking (#88731)
* add version to time-interval models
* set time interval fingerprint as version
* update to check provided version
* delete to check if version is provided in query parameter 'version'
* update integration tests
* update specs
2024-06-06 18:06:37 -04:00
Alexander Weaver a2e21d61f8 Alerting: Remove dead evalRunning guard in rule routine (#88312)
Remove dead guard
2024-06-06 16:15:01 -05:00
William Wernert d359591dac Alerting: Support recording rule struct in provisioning API (#87849)
* Support record struct in provisioning API

* Update api spec

* Use record field

* Restrict API endpoints following toggle

* Fix swagger spec

* Add recording rule validation to store validator
2024-06-06 21:05:02 +03:00
Alexander Weaver 820ee6e9db Alerting: Make all in api generator tooling now actually makes all (#88793)
* Make all now actually makes all

* Clean depends on clean-go
2024-06-05 11:52:31 -05:00
Fayzal Ghantiwala 80f54778f3 Alerting: Add option to use Redis in cluster mode for Alerting HA (#88696)
* Add config option to use Redis in cluster mode

* Use UniversalOptions
2024-06-05 17:02:25 +01:00
Dave Henderson df784917e4 Alerting: Improve performance of tupleLablesToLabels function (#88736)
* Alerting: Improve performance of tupleLablesToLabels function

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>

* use %s for string rather than %v

Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>

---------

Signed-off-by: Dave Henderson <dave.henderson@grafana.com>
Co-authored-by: Sofia Papagiannaki <1632407+papagian@users.noreply.github.com>
2024-06-05 16:19:09 +03:00
Santiago 9f9928d41a Alerting: Update grafana/alerting (#88363)
* Alerting: Update grafana/alerting

* make tests pass by implementing yaml unmarshallers and deleting fields with omitempty in their yaml tags

* go mod tidy

* fix tests by implementing not calling GettableApiAlertingConfig.UnmarshalYAML from GettableApiAlertingConfig.UnmarshalJSON

* cleanup, reduce diff

* fix more tests

* update grafana/alerting to latest commit, delete global section from configs in tests

* bring back YAML unmarshaller for GettableApiAlertingConfig

* update alerting package dependency to point to main

* skip test for sns notifier
2024-06-04 20:29:37 +02:00
Yuri Tseretyan a63ef42816 Alerting: Mute Timing service to prevent changing provenance status to none (#88462)
* use relaxed validation to not introduce breaking changes for now but to be able to use the service
in non-provisioning APIs.
2024-06-04 08:54:33 -04:00
Fayzal Ghantiwala b66cd7ef79 Alerting: Add filters for RouteGetRuleStatuses (#88295)
* Placeholder commit with rule_uid change

* Add new filters to grafana rule state API

* Revert type change

* Split rule_group and rule_name params

* remove debug line

* Change how query params are parsed

* Comment
2024-06-04 10:57:55 +01:00
Matthew Jacobson 31d5dd0a12 Alerting: Prevent updating rule uid matcher for silences (#88519)
Prevents updating the `__alert_rule_uid__` equality matcher (used for rule-specific silences) on existing silences
2024-06-03 17:39:06 -04:00
Fayzal Ghantiwala 67b9e3b269 Alerting: Update HA Redis TLS docs (#88538)
* Update HA Redis TLS doc

* Add test for regular TLS

* Update docs

* Update prom registry
2024-05-31 13:23:45 +01:00
Sofia Papagiannaki 17ca61d7f8 Alerting: Export and provisioning rules into subfolders (#77450)
* Folders: Optionally include fullpath in service responses
* Alerting: Export folder fullpath instead of title
* Escape separator in folder title
* Add support for provisiong alret rules into subfolders
* Use FolderService for creating folders during provisioning
* Export WithFullpath() folder service function

---------

Co-authored-by: Tania B <yalyna.ts@gmail.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-05-31 11:09:20 +03:00