* Add FiredAt field to the State
* Update featuretoggle files
* Fix lint errors
* Fix test compilation
* Remove random print line + formatting
* Address PR comments
What is this feature?
This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).
keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again:
Before
+----------+ +----------+
| Alerting |---->| Normal |
+----------+ +----------+
-----
After
+----------+ +------------+ +----------+
| Alerting |----->| Recovering |---->| Normal |
+----------+ +------------+ +----------+
Why do we need this feature?
This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
update scheduler's aler rule to accept regular Evaluation in update channel
This makes it accept the full rule definition, which is required in reset state.
* Add health fields to rules and an aggregator method to the scheduler
* Move health, last error, and last eval time in together to minimize state processing
* Wire up a readonly scheduler to prom api
* Extract to exported function
* Use health in api_prometheus and fix up tests
* Rename health struct to status
* Fix tests one more time
* Several new tests
* Handle inactive rules
* Push state mapping into state manager
* rename to StatusReader
* Rectify cyclo complexity rebase
* Convert existing package local status implementation to models one
* fix tests
* undo RuleDefs rename
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.
* Pointers for ResolvedAt & LastSentAt
To avoid awkward time.Time{}.Unix() defaults on persist
* Basic eval flow
* Wiring-up
* fix
* Extend todo
* Start with tests
* Include some relevant tests, skip ones that seem to have timing-based race conditions
* Some tests, touch up linter and todo
* Solve TODO
* Add tracing
* Tests to make sure an eval went through
* Wire up feature toggles
* Update pkg/services/ngalert/schedule/recording_rule.go
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
* Update pkg/services/ngalert/schedule/recording_rule_test.go
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
* Update pkg/services/ngalert/schedule/recording_rule_test.go
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
* Update pkg/services/ngalert/schedule/recording_rule_test.go
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
---------
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
* Add shim rule implementation for recording rules
* Give ruleFactory access to the original rule definition
* Schedule shim implementation if the rule is a recording rule
* Fix or suppress linter
* Fix nolint
* export Evaluation
* Export Evaluation
* Export RuleVersionAndPauseStatus
* export Eval, create interface
* Export update and add to interface
* Export Stop and Run and add to interface
* Registry and scheduler use rule by interface and not concrete type
* Update factory to use interface, update tests to work over public API rather than writing to channels directly
* Rename map in registry
* Rename getOrCreateInfo to not reference a specific implementation
* Genericize alertRuleInfoRegistry into ruleRegistry
* Rename alertRuleInfo to alertRule
* Comments on interface
* Update pkg/services/ngalert/schedule/schedule.go
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
---------
Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
* create rule factory for more complicated dep injection into rules
* Rules get direct access to metrics, logs, traces utilities, use factory in tests
* Use clock internal to rule
* Use sender, statemanager, evalfactory directly
* evalApplied and stopApplied
* use schedulableAlertRules behind interface
* loaded metrics reader
* 3 relevant config options
* Drop unused scheduler parameter
* Rename ruleRoutine to run
* Update READMED
* Handle long parameter lists
* remove dead branch