
36 Commits

Author SHA1 Message Date
Alexander Akhmetov 100528e274 Alerting: Support retry with backoff in alert rule evaluation (#99710) 2025-09-04 13:56:03 +02:00
Fayzal Ghantiwala 3a054d5e00 Alerting: Add FiredAt field to State (#104046)
* Add FiredAt field to the State

* Update featuretoggle files

* Fix lint errors

* Fix test compilation

* Remove random print line + formatting

* Address PR comments
2025-04-22 12:16:38 +01:00
Yuri Tseretyan dc0083d879 Alerting: Sequential evaluation of rules in group (#98829)
* introduce RulesGroupComparer

* extract runJob method

* implement sequential evaluation

* Make sequence building testable & add comments

* Also run callback in recording rules + add tests

* Improve tests

* Address PR comments

---------

Co-authored-by: William Wernert <william.wernert@grafana.com>
2025-04-02 23:10:32 +03:00
Yuri Tseretyan 32fde6dba4 Alerting: Update scheduler to provide full specification to rule update channel (#101375)
Update the scheduler's alert rule to accept a regular Evaluation in the update channel

This makes it accept the full rule definition, which is required in reset state.
2025-02-26 14:39:39 -05:00
Alexander Akhmetov a0bf9202f5 Alerting: Clear the state cache when the alert routine stops (#99681) 2025-01-28 21:15:19 +02:00
Alexander Akhmetov a28328d764 Alerting: Call the deletion reason provider even if the rule is no longer scheduled (#99571)
Alerting: Call the deletion reason provider even if the rule is not scheduled anymore
2025-01-28 11:34:26 +01:00
Alexander Akhmetov 12bda63871 Alerting: Optional function to find the rule deletion reason (#99422) 2025-01-27 11:35:52 +01:00
Santiago ea6cb8f139 Alerting: Panic when rule being evaluated has unexpected key (#99002) 2025-01-15 14:59:50 +02:00
Santiago 86e8147df3 Alerting: Use AlertRuleKey for comparison before rule evaluation (#98808)
(WIP) Alerting: Use AlertRuleKey for comparison before rule evaluation
2025-01-10 15:31:03 +01:00
Yuri Tseretyan f851379f7d Alerting: Add traceID to rule evaluator logger (#98549) 2025-01-06 15:00:25 -05:00
Alexander Akhmetov 0a4e6ff86b Alerting: Add SaveAlertInstancesForRule instance store method (#94505)
Alerting: Add SaveAlertInstancesForRule method to the InstanceStore interface
2024-10-11 13:47:44 +02:00
Alexander Weaver 393faa8732 Alerting: Move rule evaluation status logic out of prometheus API and into scheduler (#89141)
* Add health fields to rules and an aggregator method to the scheduler

* Move health, last error, and last eval time in together to minimize state processing

* Wire up a readonly scheduler to prom api

* Extract to exported function

* Use health in api_prometheus and fix up tests

* Rename health struct to status

* Fix tests one more time

* Several new tests

* Handle inactive rules

* Push state mapping into state manager

* rename to StatusReader

* Rectify cyclo complexity rebase

* Convert existing package local status implementation to models one

* fix tests

* undo RuleDefs rename
2024-09-30 16:52:49 -05:00
Alexander Akhmetov 152d3540db Alerting: Log number of dimensions instead of all evaluation results (#92733) 2024-08-30 12:35:02 +02:00
Alexander Weaver 490d6ba2fd Alerting: Extend scheduler user with datasources:read (#92410)
Add permission
2024-08-26 10:59:54 -05:00
Alexander Weaver ac5ebe6e4d Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
Alexander Weaver 34ab5fe1f3 Alerting: Restart rule routines if the type changes (#90867)
* Restart when types change

* Wire up test hooks correctly

* testing
2024-08-14 14:57:47 -05:00
Yuri Tseretyan ee78bb653f Alerting: Log rule evaluation error in scheduler (#91585) 2024-08-06 19:27:02 +03:00
Yuri Tseretyan 8323b688c6 Alerting: Improve logging in scheduler and states (#91003)
* handle metadata map nil

* remove double context

* clean up logging in scheduler

* do not reuse loggers from previous ticks

* log the dropped tick

* log tick instead of ticknum

* replace with processing tick logs

* log sending notifications

* update logging in persister to fetch context

* logs to historian

moved them upstream to be able to log when store is overridden
2024-07-29 16:01:48 -04:00
Yuri Tseretyan c3b9c9b239 Alerting: Send information about alert rule to data source in headers (#90344)
* add support of metadata to condition and adding it to request headers
* support for additional metadata when condition is built
* add additional context to conditions: source and folder title
* add version
* use percent-encoding for header values
2024-07-17 22:55:12 +03:00
Yuri Tseretyan c3b5cabb14 Alerting: Refactor scheduler's rule evaluator to store rule key (#89925) 2024-07-01 16:43:23 -04:00
Yuri Tseretyan 655e477c20 Alerting: Fix flaky test in scheduler's tests (#89923) 2024-07-01 13:31:03 -04:00
Matthew Jacobson 47c9259d75 Alerting: Ensure we update State.LastSentAt before persisting (#89427) 2024-06-25 13:01:26 -04:00
Matthew Jacobson 3228b64fe6 Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be in line with the Prom ruler. However, the Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they chose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
William Wernert c62cc25513 Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
Alexander Weaver 58fdb24b0b Alerting: Recording rules appear as type=recording in Prometheus API + better abstraction for type (#88805)
* Wire status through to prom API

* Regenerate swagger
2024-06-07 11:24:06 -05:00
Alexander Weaver a2e21d61f8 Alerting: Remove dead evalRunning guard in rule routine (#88312)
Remove dead guard
2024-06-06 16:15:01 -05:00
William Wernert 5de7d4d06d Alerting: Create writer interface for recording rules (#88459)
* Create writer interface for recording rules

Also create fake impl + use it for stub in scheduler
2024-05-29 22:38:33 +03:00
Alexander Weaver b926b6336d Alerting: Scheduled recording rules execute their queries (#88309)
* Basic eval flow

* Wiring-up

* fix

* Extend todo

* Start with tests

* Include some relevant tests, skip ones that seem to have timing-based race conditions

* Some tests, touch up linter and todo

* Solve TODO

* Add tracing

* Tests to make sure an eval went through

* Wire up feature toggles

* Update pkg/services/ngalert/schedule/recording_rule.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-05-28 10:59:21 -05:00
Alexander Weaver 89b54d06e9 Alerting: Schedule a shim implementation for recording rules (#87939)
* Add shim rule implementation for recording rules

* Give ruleFactory access to the original rule definition

* Schedule shim implementation if the rule is a recording rule

* Fix or suppress linter

* Fix nolint
2024-05-21 16:42:58 -05:00
Steve Simpson ad7f804255 Alerting: Fix evaluation metrics to not count retries (#85873)
* Change evaluation metrics to only count once per eval, and add new metrics.

* Cosmetic: Move eval total Inc() to original place.
2024-04-12 16:20:46 +02:00
Alexander Weaver 6c5e94095d Alerting: Scheduler and registry handle rules by an interface (#84044)
* export Evaluation

* Export Evaluation

* Export RuleVersionAndPauseStatus

* export Eval, create interface

* Export update and add to interface

* Export Stop and Run and add to interface

* Registry and scheduler use rule by interface and not concrete type

* Update factory to use interface, update tests to work over public API rather than writing to channels directly

* Rename map in registry

* Rename getOrCreateInfo to not reference a specific implementation

* Genericize alertRuleInfoRegistry into ruleRegistry

* Rename alertRuleInfo to alertRule

* Comments on interface

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>

---------

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
2024-03-11 22:57:38 +02:00
Alexander Weaver 201f5d3ac9 Alerting: Extract large closures in ruleRoutine (#84035)
* extract notify

* extract resetState

* move evaluate metrics inside evaluate

* split out evaluate
2024-03-06 16:39:23 -06:00
Alexander Weaver d5fda06147 Alerting: Decouple rule routine from scheduler (#84018)
* create rule factory for more complicated dep injection into rules

* Rules get direct access to metrics, logs, traces utilities, use factory in tests

* Use clock internal to rule

* Use sender, statemanager, evalfactory directly

* evalApplied and stopApplied

* use schedulableAlertRules behind interface

* loaded metrics reader

* 3 relevant config options

* Drop unused scheduler parameter

* Rename ruleRoutine to run

* Update README

* Handle long parameter lists

* remove dead branch
2024-03-06 13:44:53 -06:00
Alexander Weaver 1bb38e8f95 Alerting: Move ruleRoutine to be a method on ruleInfo (#83866)
* Move ruleRoutine to ruleInfo file

* Move tests as well

* swap ruleInfo and scheduler parameters on ruleRoutine

* Fix linter complaint, receiver name
2024-03-04 17:15:55 -06:00
Alexander Weaver f2a9d0a89d Alerting: Refactor ruleRoutine to take an entire ruleInfo instance (#83858)
* Make stop a real method

* ruleRoutine takes a ruleInfo reference directly rather than pieces of it

* Fix whitespace
2024-03-04 15:15:01 -06:00
Alexander Weaver fa51724bc6 Alerting: Move alertRuleInfo and tests to new files (#83854)
Move ruleinfo and tests to new files
2024-03-04 11:24:49 -06:00