grafana

Author	SHA1	Message	Date
Seunghun Shin	512c292e04	Alerting: Add jitter support for periodic alert state storage to reduce database load spikes (#111357 ) What is this feature? This PR implements a jitter mechanism for periodic alert state storage to distribute database load over time instead of processing all alert instances simultaneously. When enabled via the state_periodic_save_jitter_enabled configuration option, the system spreads batch write operations across 85% of the save interval window, preventing database load spikes in high-cardinality alerting environments. Why do we need this feature? In production environments with high alert cardinality, the current periodic batch storage can cause database performance issues by processing all alert instances simultaneously at fixed intervals. Even when using periodic batch storage to improve performance, concentrating all database operations at a single point in time can overwhelm database resources, especially in resource-constrained environments. Rather than performing all INSERT operations at once during the periodic save, distributing these operations across the time window until the next save cycle can maintain more stable service operation within limited database resources. This approach prevents resource saturation by spreading the database load over the available time interval, allowing the system to operate more gracefully within existing resource constraints. For example, with 200,000 alert instances using a 5-minute interval and 4,000 batch size, instead of executing 50 batch operations simultaneously, the jitter mechanism distributes these operations across approximately 4.25 minutes (85% of 5 minutes), with each batch executed roughly every 5.2 seconds. This PR provides system-level protection against such load spikes by distributing operations across time, reducing peak resource usage while maintaining the benefits of periodic batch storage. 
The jitter mechanism is particularly valuable in resource-constrained environments where maintaining consistent database performance is more critical than precise timing of state updates.	2025-09-29 11:22:36 +02:00
Fayzal Ghantiwala	589046bcdc	Alerting: Persist alert instance FiredAt field (#105927 ) * Persist alert instance fired at * Update protos and tests	2025-05-27 10:04:26 +01:00
Yuri Tseretyan	807f94b2c7	Alerting: Remove feature toggle alertingNoNormalState (#99905 )	2025-02-03 17:32:50 +02:00
Alexander Akhmetov	d6c1e3bb45	Alerting: Use org store to read organization IDs (#99938 )	2025-02-03 15:38:16 +01:00
Alexander Akhmetov	cb43f4b696	Alerting: Add compressed protobuf-based alert state storage (#99193 )	2025-01-27 18:47:33 +01:00
Alexander Akhmetov	1f8f9a45d7	Alerting: Add state_periodic_save_batch_size config option (#98019 ) * Alerting: Add state_periodic_save_batch_size config option --------- Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>	2024-12-16 15:30:38 +01:00
Alexander Akhmetov	0b804e720f	Alerting: Add RuleGroup field to ListAlertInstancesQuery struct (#94615 ) Alerting: add RuleGroup field to ListAlertInstancesQuery struct	2024-10-18 09:44:16 +02:00
Alexander Akhmetov	0a4e6ff86b	Alerting: Add SaveAlertInstancesForRule instance store method (#94505 ) Alerting: Add SaveAlertInstancesForRule method to the InstanceStore interface	2024-10-11 13:47:44 +02:00
Matthew Jacobson	ba800692c6	Alerting: Persist AlertInstance ResolvedAt & LastSentAt (#89135 ) * Alerting: Persist AlertInstance ResolvedAt & LastSentAt * Fix test * Modify existing tests * Fix merge conflicts from nullable LastSentAt & ResolvedAt	2024-07-12 12:26:58 -04:00
Jean-Philippe Quéméner	eb7e1216a1	feat(alerting): add async state persister (#80763 )	2024-01-22 13:07:11 +01:00
Yuri Tseretyan	f6a46744a6	Alerting: Support hysteresis command expression (#75189 ) Backend: * Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command. - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels. - add rule_fingerprint column to alert_instance - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it. - add AlertingResultsFromRuleState that implements the new interface in eval package - update getExprRequest to patch the hysteresis command. * Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition. Frontend: * Add hysteresis option to Threshold in UI. It's called "Recovery Threshold" * Add test for getUnloadEvaluatorTypeFromCondition * Hide hysteresis in panel expressions * Refactor isInvalid and add test for it * Remove unnecessary React.memo * Add tests for updateEvaluatorConditions --------- Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>	2024-01-04 11:47:13 -05:00
Ryan McKinley	f69fd3726b	FeatureToggles: Add context and an explicit global check (#78081 )	2023-11-14 12:50:27 -08:00
Ryan McKinley	025b2f3011	Chore: use any rather than interface{} (#74066 )	2023-08-30 18:46:47 +03:00
Matthew Jacobson	63187fae0c	Alerting: Remove and revert flag alertingBigTransactions (#65976 ) * Alerting: Remove and revert flag alertingBigTransactions This is a partial revert of #56575 and a removal of the `alertingBigTransactions` flag. Real-world use has seen no clear performance incentive to maintain this flag. Lowered db connection count came at the cost of significant increase in CPU usage and query latency. * Fix lint backend * Removed last bits of alertingBigTransactions --------- Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>	2023-04-06 18:06:25 +02:00
Serge Zaitsev	0beb768427	Chore: Remove result fields from ngalert (#65410 ) * remove result fields from ngalert * remove duplicate imports	2023-03-28 10:34:35 +02:00
Yuri Tseretyan	9d57b1c72e	Alerting: Do not persist noop transition from Normal state. (#61201 ) * add feature flag `alertingNoNormalState` * update instance database to support exclusion of state in list operation * do not save normal state and delete transitions to normal * update get methods to filter out normal state	2023-01-13 18:29:29 -05:00
Kristin Laemmert	05709ce411	chore: remove sqlstore & mockstore dependencies from (most) packages (#57087 ) * chore: add alias for InitTestDB and Session Adds an alias for the sqlstore InitTestDB and Session, and updates tests using these to reduce dependencies on the sqlstore.Store. * next pass of removing sqlstore imports * last little bit * remove mockstore where possible	2022-10-19 09:02:15 -04:00
Kristin Laemmert	c61b5e85b4	chore: replace sqlstore.Store with db.DB (#57010 ) * chore: replace sqlstore.SQLStore with db.DB * more post-sqlstore.SQLStore cleanup	2022-10-14 15:33:06 -04:00
Joe Blubaugh	b476ae62fb	Alerting: Write and Delete multiple alert instances. (#55350 ) Prior to this change, all alert instance writes and deletes happened individually, in their own database transaction. This change batches up writes or deletes for a given rule's evaluation loop into a single transaction before applying it. These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions" Before: ``` goos: darwin goarch: arm64 pkg: github.com/grafana/grafana/pkg/services/ngalert/store BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op --- BENCH: BenchmarkAlertInstanceOperations-8 util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created PASS ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s ``` After: ``` goos: darwin goarch: arm64 pkg: github.com/grafana/grafana/pkg/services/ngalert/store BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op --- BENCH: BenchmarkAlertInstanceOperations-8 util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created PASS ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s ``` So we cut time by about 75% and memory allocations by about 60% when storing and deleting 100 instances.	2022-10-06 14:22:58 +08:00
Alexander Weaver	f11495a4c3	Alerting: Remove dead functionality from alert instance store (#55774 ) * Update tests to use ListAlertInstances * Drop the actual methods rather than just updating tests	2022-09-26 14:38:53 -05:00
Alexander Weaver	a00879ae21	Alerting: Refactor store to not export its own interface for InstanceStore, delete dead dependency injection (#55772 ) * Add consumer-side store interface to state manager * Remove dead dependency * Delete dead dependency in API struct * Delete store-layer InstanceStore interface * Move fake for state's InstanceStore interface to state package	2022-09-26 13:55:05 -05:00
Joe Blubaugh	22c937340e	Revert "Alerting: Write and Delete multiple alert instances. (#54072 )" (#54885 ) This reverts commit `5e4fd94413`.	2022-09-09 17:44:06 +02:00
Joe Blubaugh	5e4fd94413	Alerting: Write and Delete multiple alert instances. (#54072 ) Prior to this change, all alert instance writes and deletes happened individually, in their own database transaction. This change batches up writes or deletes for a given rule's evaluation loop into a single transaction before applying it. Before: ``` goos: darwin goarch: arm64 pkg: github.com/grafana/grafana/pkg/services/ngalert/store BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op --- BENCH: BenchmarkAlertInstanceOperations-8 util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created PASS ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s ``` After: ``` goos: darwin goarch: arm64 pkg: github.com/grafana/grafana/pkg/services/ngalert/store BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op --- BENCH: BenchmarkAlertInstanceOperations-8 util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created PASS ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s ``` So we cut time by about 75% and memory allocations by about 60% when storing and deleting 100 instances. This change also updates some of our tests so that they run successfully against postgreSQL - we were using random Int64s, but postgres integers, which our tables use, max out at 2^31-1	2022-09-02 11:17:20 +08:00
Yuriy Tseretyan	03e746d9df	Alerting: Delete state from the database on reset (#53919 ) * make ResetStatesByRuleUID return states * delete rule states when reset * rule eval routine to clean up the state only when rule is deleted	2022-08-25 14:12:22 -04:00
Joe Blubaugh	1cc034d960	Alerting: Add a "Reason" to Alert Instances to show underlying cause of state. (#49259 ) This change adds a field to state.State and models.AlertInstance that indicate the "Reason" that an instance has its current state. This helps us account for cases where the state is "Normal" but the underlying evaluation returned "NoData" or "Error", for example. Fixes #42606 Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>	2022-05-23 16:49:49 +08:00
George Robinson	67a3e1d6fd	Add context.Context to InstanceStore (#45049 )	2022-02-08 13:49:04 +00:00
David Parrott	b5f464412d	Alerting: automatically remove stale alerting states (#36767 ) * initial attempt at automatic removal of stale states * test case, need expected states * finish unit test * PR feedback * still multiply by time.second * pr feedback	2021-07-26 18:12:04 +02:00
Kyle Brandt	a735c51202	Alerting/Chore: Backend remove def_ columns from instance (#33875 ) rename def_uid and def_org_id to rule_uid and rule_org_id on the alert_instance table and drops the definition table.	2021-05-12 07:17:43 -04:00
Kyle Brandt	c1034f3118	Alerting: Create instanceStore (#33587 ) for https://github.com/grafana/alerting-squad/issues/129	2021-05-03 07:19:15 -04:00
Kyle Brandt	7823842c5d	Alerting: Load annotations from rule into State cache (#33542 ) for https://github.com/grafana/alerting-squad/issues/127	2021-04-30 20:23:12 +02:00
David Parrott	2a8446e435	Alerting: Persist alerts on evaluation and shutdown. Warm cache from DB on startup (#32576 ) * Initial commit for state tracking * basic state transition logic and tests * constructor. test and interface fixup * use new sig for sch.definitionRoutine() * test fixup * make the linter happy * more minor linting cleanup * Alerting: Send alerts from state tracker to notifier * Add evaluation time and test Add evaluation time and test * Add cleanup routine and logging * Pull in compact.go and reconcile differences * Save alert transitions and save all state on shutdown * pr feedback * WIP * WIP * Persist alerts on evaluation and shutdown. Warm cache on startup * Filter non-firing alerts before sending to notifier Co-authored-by: Josue Abreu <josue@grafana.com>	2021-04-02 08:11:33 -07:00
Sofia Papagiannaki	4ce0a49eac	AlertingNG: Split into several packages (#31719 ) * AlertingNG: Split into several packages * Move AlertQuery to models	2021-03-08 22:19:21 +02:00

32 Commits
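
The "about 75%" and "about 60%" figures in the batching commits (5e4fd94413 and b476ae62fb) can be checked against the benchmark lines quoted in those messages. The sketch below recomputes the relative reductions from the before and after values for ns/op, B/op, and allocs/op; the helper name `reductionPercent` is hypothetical and the input numbers are copied from the quoted benchmark output.

```go
package main

import "fmt"

// reductionPercent reports how much smaller after is than before, in percent.
// Hypothetical helper for illustration only.
func reductionPercent(before, after float64) float64 {
	return (1 - after/before) * 100
}

func main() {
	// Values copied from the BenchmarkAlertInstanceOperations-8 lines
	// quoted in the commit messages (before batching, then after).
	fmt.Printf("ns/op reduction:     %.1f%%\n", reductionPercent(2991381, 816484))
	fmt.Printf("B/op reduction:      %.1f%%\n", reductionPercent(1133537, 352297))
	fmt.Printf("allocs/op reduction: %.1f%%\n", reductionPercent(27703, 6529))
}
```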