grafana

Author	SHA1	Message	Date
Alexander Akhmetov	100528e274	Alerting: Support retry with backoff in alert rule evaluation (#99710 )	2025-09-04 13:56:03 +02:00
Fayzal Ghantiwala	3a054d5e00	Alerting: Add FiredAt field to State (#104046 ) * Add FiredAt field to the State * Update featuretoggle files * Fix lint errors * Fix test compilation * Remove random print line + formatting * Address PR comments	2025-04-22 12:16:38 +01:00
Mariell Hoversholm	757be6365a	CI: Bump golangci-lint to 2.0.2 (#103572 )	2025-04-10 14:42:23 +02:00
Yuri Tseretyan	dc0083d879	Alerting: Sequential evaluation of rules in group (#98829 ) * introduce RulesGroupComparer * extract runJob method * implement sequential evaluation * Make sequence building testable & add comments * Also run callback in recording rules + add tests * Improve tests * Address PR comments --------- Co-authored-by: William Wernert <william.wernert@grafana.com>	2025-04-02 23:10:32 +03:00
Alexander Akhmetov	695ac91290	Alerting: Add backend support for keep_firing_for (#100750 ) What is this feature? This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation). keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again. Before: Alerting -> Normal. After: Alerting -> Recovering -> Normal. Why do we need this feature? This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert	2025-03-18 11:24:48 +01:00
Yuri Tseretyan	32fde6dba4	Alerting: Update scheduler to provide full specification to rule update channel (#101375 ) update scheduler's alert rule to accept regular Evaluation in update channel. This makes it accept the full rule definition, which is required in reset state.	2025-02-26 14:39:39 -05:00
Alexander Akhmetov	a0bf9202f5	Alerting: Clear the state cache when the alert routine stops (#99681 )	2025-01-28 21:15:19 +02:00
Alexander Akhmetov	a28328d764	Alerting: Call the deletion reason provider even if the rule is no longer scheduled (#99571 ) Alerting: Call the deletion reason provider even if the rule is not scheduled anymore	2025-01-28 11:34:26 +01:00
Alexander Akhmetov	12bda63871	Alerting: Optional function to find the rule deletion reason (#99422 )	2025-01-27 11:35:52 +01:00
Santiago	ea6cb8f139	Alerting: Panic when rule being evaluated has unexpected key (#99002 )	2025-01-15 14:59:50 +02:00
Alexander Akhmetov	0a4e6ff86b	Alerting: Add SaveAlertInstancesForRule instance store method (#94505 ) Alerting: Add SaveAlertInstancesForRule method to the InstanceStore interface	2024-10-11 13:47:44 +02:00
Alexander Weaver	393faa8732	Alerting: Move rule evaluation status logic out of prometheus API and into scheduler (#89141 ) * Add health fields to rules and an aggregator method to the scheduler * Move health, last error, and last eval time in together to minimize state processing * Wire up a readonly scheduler to prom api * Extract to exported function * Use health in api_prometheus and fix up tests * Rename health struct to status * Fix tests one more time * Several new tests * Handle inactive rules * Push state mapping into state manager * rename to StatusReader * Rectify cyclo complexity rebase * Convert existing package local status implementation to models one * fix tests * undo RuleDefs rename	2024-09-30 16:52:49 -05:00
Alexander Weaver	ac5ebe6e4d	Alerting: Add enablement flag for recording rules (#92032 ) * Add enablement flag * Disable if toggle not enabled	2024-08-19 12:01:00 -05:00
Yuri Tseretyan	c3b5cabb14	Alerting: Refactor scheduler's rule evaluator to store rule key (#89925 )	2024-07-01 16:43:23 -04:00
Matthew Jacobson	47c9259d75	Alerting: Ensure we update State.LastSentAt before persisting (#89427 )	2024-06-25 13:01:26 -04:00
Matthew Jacobson	3228b64fe6	Alerting: Resend resolved notifications for ResolvedRetention duration (#88938 ) * Simple replace of State.Resolved with State.ResolvedAt * Retain ResolvedAt time between Normal->Normal transition * Introduce ResolvedRetention to keep sending recently resolved alerts * Make ResolvedRetention configurable with resolved_alert_retention * Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention * Do not reset ResolvedAt during Normal->Pending transition Initially this was done to be in line with Prom ruler. However, Prom ruler doesn't keep track of Inactive->Pending/Alerting using the same alert instance, so it's more understandable that they choose not to retain ResolvedAt. In our case, since we use the same cached instance to represent the transition, it makes more sense to retain it. This should help alleviate some odd situations where temporarily entering Pending will stop future resolved notifications that would have happened because of ResolvedRetention. * Pointers for ResolvedAt & LastSentAt To avoid awkward time.Time{}.Unix() defaults on persist	2024-06-20 16:33:03 -04:00
Alexander Akhmetov	667fea6623	Alerting: use hash of labels instead of labels string as the alert state cache key (#88956 ) * Alerting: use hash instead of labels as the cache key * Use data.Labels.Fingerprint to calculate the cache key	2024-06-11 18:34:58 +02:00
William Wernert	5de7d4d06d	Alerting: Create writer interface for recording rules (#88459 ) * Create writer interface for recording rules Also create fake impl + use it for stub in scheduler	2024-05-29 22:38:33 +03:00
Alexander Weaver	b926b6336d	Alerting: Scheduled recording rules execute their queries (#88309 ) * Basic eval flow * Wiring-up * fix * Extend todo * Start with tests * Include some relevant tests, skip ones that seem to have timing-based race conditions * Some tests, touch up linter and todo * Solve TODO * Add tracing * Tests to make sure an eval went through * Wire up feature toggles * Update pkg/services/ngalert/schedule/recording_rule.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> --------- Co-authored-by: Steve Simpson <steve.simpson@grafana.com>	2024-05-28 10:59:21 -05:00
Alexander Weaver	89b54d06e9	Alerting: Schedule a shim implementation for recording rules (#87939 ) * Add shim rule implementation for recording rules * Give ruleFactory access to the original rule definition * Schedule shim implementation if the rule is a recording rule * Fix or suppress linter * Fix nolint	2024-05-21 16:42:58 -05:00
Alexander Weaver	a6a9ab4008	Alerting: Do not store series values from past evaluations in state manager for no reason (#87525 ) Do not store previous execution results on states	2024-05-09 15:51:55 -05:00
Yuri Tseretyan	052082a927	Alerting: Refactor Alert Rule Generators (#86813 )	2024-04-29 21:52:15 -04:00
Steve Simpson	ad7f804255	Alerting: Fix evaluation metrics to not count retries (#85873 ) * Change evaluation metrics to only count once per eval, and add new metrics. * Cosmetic: Move eval total Inc() to original place.	2024-04-12 16:20:46 +02:00
Alexander Weaver	6c5e94095d	Alerting: Scheduler and registry handle rules by an interface (#84044 ) * export Evaluation * Export Evaluation * Export RuleVersionAndPauseStatus * export Eval, create interface * Export update and add to interface * Export Stop and Run and add to interface * Registry and scheduler use rule by interface and not concrete type * Update factory to use interface, update tests to work over public API rather than writing to channels directly * Rename map in registry * Rename getOrCreateInfo to not reference a specific implementation * Genericize alertRuleInfoRegistry into ruleRegistry * Rename alertRuleInfo to alertRule * Comments on interface * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com> --------- Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>	2024-03-11 22:57:38 +02:00
Alexander Weaver	d5fda06147	Alerting: Decouple rule routine from scheduler (#84018 ) * create rule factory for more complicated dep injection into rules * Rules get direct access to metrics, logs, traces utilities, use factory in tests * Use clock internal to rule * Use sender, statemanager, evalfactory directly * evalApplied and stopApplied * use schedulableAlertRules behind interface * loaded metrics reader * 3 relevant config options * Drop unused scheduler parameter * Rename ruleRoutine to run * Update README * Handle long parameter lists * remove dead branch	2024-03-06 13:44:53 -06:00
Alexander Weaver	1bb38e8f95	Alerting: Move ruleRoutine to be a method on ruleInfo (#83866 ) * Move ruleRoutine to ruleInfo file * Move tests as well * swap ruleInfo and scheduler parameters on ruleRoutine * Fix linter complaint, receiver name	2024-03-04 17:15:55 -06:00
Alexander Weaver	fa51724bc6	Alerting: Move alertRuleInfo and tests to new files (#83854 ) Move ruleinfo and tests to new files	2024-03-04 11:24:49 -06:00

27 Commits
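
The keep_firing_for commit (695ac91290) describes a transition Alerting -> Recovering -> Normal. The sketch below illustrates that description only and is not Grafana's implementation: the names `State`, `next`, and `simulate` are hypothetical, the Pending state and the `for` duration are left out, and the recovery period is assumed to start at the first evaluation where the condition is no longer met.

```go
package main

import (
	"fmt"
	"time"
)

// State is a simplified alert instance state (hypothetical; Pending is omitted).
type State int

const (
	Normal State = iota
	Alerting
	Recovering
)

func (s State) String() string {
	switch s {
	case Alerting:
		return "Alerting"
	case Recovering:
		return "Recovering"
	default:
		return "Normal"
	}
}

// next returns the state after one evaluation.
// firing reports whether the alert condition is met at this evaluation.
// inRecovery is how long the instance has already been in Recovering.
func next(cur State, firing bool, inRecovery, keepFiringFor time.Duration) State {
	if firing {
		// The condition is met again: go (back) to Alerting.
		return Alerting
	}
	switch cur {
	case Alerting:
		if keepFiringFor > 0 {
			return Recovering // do not resolve immediately
		}
		return Normal // the "before" behaviour: resolve right away
	case Recovering:
		if inRecovery >= keepFiringFor {
			return Normal // the recovery period has ended
		}
		return Recovering
	default:
		return Normal
	}
}

// simulate runs one evaluation per element of firing, spaced by interval,
// and returns the state after each evaluation.
func simulate(firing []bool, interval, keepFiringFor time.Duration) []State {
	states := make([]State, 0, len(firing))
	state := Normal
	var inRecovery time.Duration
	for _, f := range firing {
		prev := state
		if prev == Recovering {
			inRecovery += interval // time spent recovering so far
		}
		state = next(prev, f, inRecovery, keepFiringFor)
		if state != Recovering || prev != Recovering {
			inRecovery = 0 // entering or leaving Recovering restarts the clock
		}
		states = append(states, state)
	}
	return states
}

func main() {
	// One evaluation per minute, keep_firing_for of two minutes.
	firing := []bool{true, false, false, false}
	for i, s := range simulate(firing, time.Minute, 2*time.Minute) {
		fmt.Printf("t+%dm firing=%t state=%s\n", i, firing[i], s)
	}
}
```

With an evaluation interval longer than keep_firing_for, the same function moves the instance to Normal at the very next evaluation, which matches the parenthetical in the commit message.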