343 Commits

Author SHA1 Message Date
Alexander Akhmetov 169bf2ce73 Alerting: Add feature toggle to use the old simplified routing hash generation (#111900)
* Revert "Alerting: Generate simplified routing routes with old fingerprint function (#111893)"

This reverts commit 0da9d49896.

* Add alertingUseNewSimplifiedRoutingHashAlgorithm flag

* Alerting: Add feature toggle to use the old simplified routing hash generation
2025-10-01 15:21:33 -04:00
Seunghun Shin 512c292e04 Alerting: Add jitter support for periodic alert state storage to reduce database load spikes (#111357)
What is this feature?

This PR implements a jitter mechanism for periodic alert state storage to distribute database load over time instead of processing all alert instances simultaneously. When enabled via the state_periodic_save_jitter_enabled configuration option, the system spreads batch write operations across 85% of the save interval window, preventing database load spikes in high-cardinality alerting environments.

Why do we need this feature?

In production environments with high alert cardinality, the current periodic batch storage can cause database performance issues by processing all alert instances simultaneously at fixed intervals. Even when using periodic batch storage to improve performance, concentrating all database operations at a single point in time can overwhelm database resources, especially in resource-constrained environments.

Rather than performing all INSERT operations at once during the periodic save, distributing these operations across the time window until the next save cycle can maintain more stable service operation within limited database resources. This approach prevents resource saturation by spreading the database load over the available time interval, allowing the system to operate more gracefully within existing resource constraints.

For example, with 200,000 alert instances, a 5-minute interval, and a batch size of 4,000, instead of executing 50 batch operations (200,000 / 4,000) simultaneously, the jitter mechanism distributes these operations across approximately 4.25 minutes (85% of 5 minutes, or 255 seconds), with each batch executed roughly every 5.2 seconds.

This PR provides system-level protection against such load spikes by distributing operations across time, reducing peak resource usage while maintaining the benefits of periodic batch storage. The jitter mechanism is particularly valuable in resource-constrained environments where maintaining consistent database performance is more critical than precise timing of state updates.
2025-09-29 11:22:36 +02:00
Santiago 345b72227f Alert State History: Remove redundant JSON serialization when merging Loki streams (#111443) 2025-09-23 20:56:37 +02:00
Santiago 04bc71fa6d Alert State History: Skip invalid entries when merging streams (#111387) 2025-09-22 12:29:39 +02:00
Vadim Stepanov d4bad37853 Alerting: Move notification historian to grafana/alerting (#109078)
* Move notification historian to grafana/alerting

* wip

* golangci-lint

* Revert "golangci-lint"

This reverts commit 10ccebad41.

* JSONEncoder

* alertingInstrument

* go mod tidy

* go.work.sum

* make update-workspace

* merge

* revert go.mod changes

* github.com/grafana/alerting

* make update-workspace

* update github.com/grafana/alerting

* merge
2025-09-15 15:23:51 +01:00
Ryan McKinley 9a54243f09 Chore: update golang.org/x/exp (#110980) 2025-09-11 22:13:07 +03:00
Peter Štibraný 7fd9ab9481 Replace check for integration tests. (#110707)
* Replace check for integration tests.
* Revert changes in pkg/tsdb/mysql packages.
* Fix formatting of few tests.
2025-09-08 15:49:49 +02:00
Alexander Akhmetov 8a7c1f595a Alerting: Backend state filtering for history UI (#109647) 2025-09-03 17:47:03 +02:00
Moustafa Baiou 16f8359d35 Alerting: Update Alert Rule to use int64 for MissingSeriesEvalsToResolve (#109306) 2025-08-06 21:45:48 -04:00
Serge Zaitsev a95fb3a37c Chore: Omit integration tests if short test flag is passed (#108777)
* omit integration tests if short test flag is passed

* Update pkg/services/ngalert/models/receivers_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/tests/api/alerting/api_ruler_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/tests/api/alerting/api_ruler_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/tests/api/alerting/api_ruler_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/tests/api/alerting/api_ruler_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/tests/api/alerting/api_ruler_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/services/ngalert/models/receivers_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/cmd/grafana-cli/commands/datamigrations/to_unified_storage_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* Update pkg/services/ngalert/models/receivers_test.go

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>

* fix the rest

* false positive

---------

Co-authored-by: Matheus Macabu <macabu@users.noreply.github.com>
2025-07-28 13:38:54 +02:00
Vadim Stepanov bccc980b90 Alerting: Notification history (#107644)
* Add unified_alerting.notification_history to ini files

* Parse notification history settings

* Move Loki client to a separate package

* Loki client: add params for metrics and traces

* add NotificationHistorian

* rm writeDuration

* remove RangeQuery stuff

* wip

* wip

* wip

* wip

* pass notification historian in tests

* unify loki settings

* unify loki settings

* add test

* update grafana/alerting

* make update-workspace

* add feature toggle

* fix configureNotificationHistorian

* Revert "add feature toggle"

This reverts commit de7af8f7

* add feature toggle

* more tests

* RuleUID

* fix metrics test

* met.Info.Set(0)
2025-07-17 14:26:26 +01:00
Serge Zaitsev f66a693438 Chore: Rename integration tests to follow the common convention (#105987)
* automatically rename integration tests to follow the common convention

* name tests differently

* alter column type to bigint

* update another column to bigint

* add another alter

* fix subquery for mysql
2025-06-29 16:56:24 +02:00
Alexander Akhmetov abbae41f60 Alerting: Sanitize Prometheus state history labels before writing (#107181) 2025-06-26 12:40:37 +02:00
Alexander Akhmetov 478f9bf597 Alerting: Emit metrics from prometheus state history backend (#107121)
Alerting: Emit metrics from prometheus historian backend
2025-06-24 18:27:52 +02:00
Alexander Akhmetov ac832c157e Alerting: Maintain endsAt for missing alert instances that are not stale yet (#107011) 2025-06-20 14:31:21 +02:00
Alexander Akhmetov ad683f83ff Alerting: Add state history backend to write ALERTS metric (#104361)
**What is this feature?**

This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources.

The metric includes a few additional labels:

* `grafana_alertstate`: Grafana's full alert state, more granular than Prometheus.
* `grafana_rule_uid`: Grafana's alert rule UID.

Grafana states are included in the `grafana_alertstate` label and are also mapped to Prometheus-compatible `alertstate` values:

| Grafana alert state | `alertstate`          | `grafana_alertstate`  |
|---------------------|-----------------------|-----------------------|
| `Alerting`          | `firing`              | `alerting`            |
| `Recovering`        | `firing`              | `recovering`          |
| `Pending`           | `pending`             | `pending`             |
| `Error`             | `firing`              | `error`               |
| `NoData`            | `firing`              | `nodata`              |
| `Normal`            | _(no metric emitted)_ | _(no metric emitted)_ |
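
The mapping in the table can be written as a small lookup. This is a minimal sketch: `promLabels` is a hypothetical function name, and treating unknown states like `Normal` (no metric) is an assumption rather than something the PR states.

```go
package main

import "fmt"

// promLabels maps a Grafana alert state to the values of the `alertstate`
// and `grafana_alertstate` labels, following the table above.
// ok is false when no ALERTS metric should be emitted.
// Hypothetical helper for illustration only.
func promLabels(grafanaState string) (alertstate, grafanaAlertstate string, ok bool) {
	switch grafanaState {
	case "Alerting":
		return "firing", "alerting", true
	case "Recovering":
		return "firing", "recovering", true
	case "Pending":
		return "pending", "pending", true
	case "Error":
		return "firing", "error", true
	case "NoData":
		return "firing", "nodata", true
	default:
		// Normal emits no metric; unknown states are handled the same way
		// here, which is an assumption of this sketch.
		return "", "", false
	}
}

func main() {
	for _, s := range []string{"Alerting", "Recovering", "Pending", "Error", "NoData", "Normal"} {
		a, g, ok := promLabels(s)
		fmt.Printf("%-10s emit=%-5t alertstate=%q grafana_alertstate=%q\n", s, ok, a, g)
	}
}
```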
2025-06-18 07:17:57 +02:00
Stephanie Hingtgen a8886ad5ec Annotations: Use dashboard uids instead of dashboard ids (#106676) 2025-06-13 13:59:24 -05:00
Fayzal Ghantiwala 85df859589 Alerting: Correctly persist FiredAt in SyncRuleStatePersister (#106658)
Correctly persist FiredAt
2025-06-12 18:07:16 +01:00
Alexander Akhmetov 1a75787e74 Alerting: Send notifications immediately on Error|NoData -> Normal transitions (#106421) 2025-06-10 16:36:30 +02:00
Alexander Akhmetov 3bb4c92028 Alerting: Fix resolved notifications for same-label Error to Normal transitions (#106210)
What is this feature?

Ensures that resolved notifications are sent when alert states transition from Error to Normal after the configured number of evaluation intervals (the Missing series evaluations to resolve setting).

Why do we need this feature?

Before this change, when an alert transitioned from Error to Normal and the new Normal alert instance had the same labels, Grafana did not send a resolved notification for the Error alert state. Instead, the alert was resolved automatically in the Alertmanager a few evaluation intervals later, following the endsAt.

With this change, the resolved notification is sent after the configured number of evaluation intervals (the Missing series evaluations to resolve setting).
2025-06-07 14:03:11 +02:00
Moustafa Baiou 0ce086bd2e Alerting: Ensure errors cleared when Alerting after error (#105246)
When a rule is configured with an `ExecErrState` of `Alerting` and one of its instances is Alerting, then hits a data source error, then evaluates successfully and remains Alerting, the cached instance keeps the error until it is no longer firing.

This is unexpected and leads to misleading results.
2025-06-04 12:16:14 +02:00
Alexander Akhmetov 4cde79e802 Alerting: Clean up join errors code (#106243) 2025-06-02 10:30:04 +02:00
Alexander Akhmetov eae77aa695 Alerting: Resend alerts for states that are missing in the eval results (#105965)
What is this feature?

This PR fixes the MissingSeriesEvalsToResolve behavior when it's set to more than 4 evaluation intervals.

Why do we need this feature?

The MissingSeriesEvalsToResolve setting was not working correctly due to alerts being auto-resolved by Alertmanager after 4 evaluation intervals (via the endsAt field).

Previously, the deleteStaleStatesFromCache method returned only the stale states that had to be resolved. Non-stale states for which the current evaluation had no series never had endsAt updated and were never resent to the Alertmanager, so they were automatically resolved after 4 evaluations regardless of the setting.

The new processMissingSeriesStates returns a state for each missing series on every evaluation and resolves the stale ones. This guarantees that alerts without series keep firing for the configured number of evaluations.
2025-05-29 23:22:35 +02:00
Alexander Akhmetov faeddf334a Alerting: Fix $value type when single data source is queried (#106080) 2025-05-27 21:04:07 +02:00
Fayzal Ghantiwala 589046bcdc Alerting: Persist alert instance FiredAt field (#105927)
* Persist alert instance fired at

* Update protos and tests
2025-05-27 10:04:26 +01:00
Alexander Akhmetov 0743689d42 Alerting: Add recovering state to the grafana_alerting_alerts metric (#104380) 2025-04-23 13:58:57 +02:00
Fayzal Ghantiwala 3a054d5e00 Alerting: Add FiredAt field to State (#104046)
* Add FiredAt field to the State

* Update featuretoggle files

* Fix lint errors

* Fix test compilation

* Remove random print line + formatting

* Address PR comments
2025-04-22 12:16:38 +01:00
Alexander Akhmetov acfd998fa6 Alerting: Send resolved notifications immediately for state deleted states (#103996)
What is this feature?

Send resolved notifications not only when an alert state becomes stale (series is missing) and transitions from Alerting to Normal, but also from Error, NoData and Recovering.

Why do we need this feature?

Previously, when an alert state became stale or was deleted, it would transition to Normal but wouldn't trigger resolved notifications to the Alertmanager. This meant we relied on the Alertmanager to send resolved notifications when the alert expires. However, if the Alertmanager state is lost, these resolved notifications would never be sent, leaving users with firing alerts in their notification channels. This PR ensures that any transition from a firing state (Alerting, Error, NoData, Recovering) to Normal triggers a resolved notification.
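
The rule stated in the last sentence can be expressed as a predicate. This is a minimal sketch with hypothetical function names and plain strings for states; the actual state manager uses its own types.

```go
package main

import "fmt"

// isFiring reports whether a state counts as firing for the purpose of
// resolved notifications, per the list in the description above.
func isFiring(state string) bool {
	switch state {
	case "Alerting", "Error", "NoData", "Recovering":
		return true
	}
	return false
}

// shouldSendResolved reports whether a transition should trigger a
// resolved notification: any firing state going to Normal.
// Hypothetical helper for illustration only.
func shouldSendResolved(prev, next string) bool {
	return next == "Normal" && isFiring(prev)
}

func main() {
	for _, prev := range []string{"Alerting", "Error", "NoData", "Recovering", "Pending", "Normal"} {
		fmt.Printf("%-10s -> Normal: resolved notification=%t\n", prev, shouldSendResolved(prev, "Normal"))
	}
}
```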
2025-04-14 21:40:44 +02:00
Mariell Hoversholm 757be6365a CI: Bump golangci-lint to 2.0.2 (#103572) 2025-04-10 14:42:23 +02:00
Alexander Akhmetov b49c532999 Alerting: Fix state transition from Recovering back to Alerting (#103286)
What is this feature?

This PR fixes a state transition issue where alerts transitioning from the Recovering state back to the Alerting state incorrectly entered the Pending state first if the rule had a For duration configured.

Why do we need this feature?

When an alert goes from Alerting to Recovering (which happens when Keep firing for is configured) and then back to Alerting, the existing logic would incorrectly put the alert into the Pending state, when it should stay in Alerting and keep sending notifications to the Alertmanager.
2025-04-03 11:40:45 +03:00
maicon d8c5c2d3b8 K8s: Folders: Modify GetChildren to return only Folder References (#103072)
* Return FolderReference instead of Folder on GetChildren

Signed-off-by: Maicon Costa <maiconscosta@gmail.com>

---------

Signed-off-by: Maicon Costa <maiconscosta@gmail.com>
2025-04-02 01:30:17 -03:00
Alexander Akhmetov c54da8f955 Alerting: Make $value return the query value in case when a single datasource is used (#102301)
What is this feature?

This PR changes the behavior of the $value and .Value variables in alerting templating to be more compatible with Prometheus templating. When a single datasource is used in the alerting rule, these variables will now return the numeric value from the query instead of the evaluation string.

Why do we need this feature?

It makes Grafana templating more compatible with Prometheus templates. In Prometheus, $value returns the numeric value of the query, but in Grafana it's the evaluation string: [ var='A' labels={instance=instance1} value=81.234 ]. This is because in Grafana multiple datasources can be used in the alert rule, and it's not always possible to get a single value.

This change makes Grafana's behavior consistent with Prometheus when a single datasource is used. When multiple datasources are used in the query, it keeps the old behaviour.

Neither $value nor .Value is recommended (see the documentation); it's better to use .Values instead.
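
The selection rule can be sketched as below. This is only an illustration under assumptions: `templateValue` is a hypothetical function, and representing the datasource query results as a map keyed by ref ID is a simplification of whatever the real templating code receives.

```go
package main

import "fmt"

// templateValue picks what $value / .Value would expand to.
// queryValues holds one numeric result per datasource query (keyed by a
// hypothetical ref ID); evalString is the legacy evaluation string.
// Illustration only; not Grafana's templating code.
func templateValue(queryValues map[string]float64, evalString string) any {
	if len(queryValues) == 1 {
		for _, v := range queryValues {
			return v // single datasource: return the number, as Prometheus does
		}
	}
	return evalString // several datasources: keep the old behaviour
}

func main() {
	evalString := "[ var='A' labels={instance=instance1} value=81.234 ]"

	single := templateValue(map[string]float64{"A": 81.234}, evalString)
	fmt.Printf("%T %v\n", single, single)

	multi := templateValue(map[string]float64{"A": 81.234, "B": 12}, evalString)
	fmt.Printf("%T %v\n", multi, multi)
}
```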
2025-03-26 10:31:38 +01:00
Alexander Akhmetov 695ac91290 Alerting: Add backend support for keep_firing_for (#100750)
What is this feature?

This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).

keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation, if the evaluation interval is longer than keep_firing_for), the alert transitions to "Normal" unless it starts alerting again:

Before                                          

+----------+     +----------+                    
| Alerting |---->|  Normal  |                    
+----------+     +----------+                    

-----
After

+----------+      +------------+     +----------+
| Alerting |----->| Recovering |---->|  Normal  |
+----------+      +------------+     +----------+                                                 

Why do we need this feature?

This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
2025-03-18 11:24:48 +01:00
Yuri Tseretyan e30034a42a Alerting: Remove feature flag alertingNoDataErrorExecution (#102156)
* remove feature flag

* remove feature flag in state manager

* make sure no data with empty results is handled

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>

---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-03-14 14:51:58 -04:00
Alexander Akhmetov 7dd6f52630 Alerting: Add MissingSeriesEvalsToResolve option to the AlertRule (#101184) 2025-03-11 22:12:06 +01:00
Yuri Tseretyan 67b44ad22a Alerting: Fix state reason (#101530)
---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-03-04 17:05:41 +02:00
Matthew Jacobson b78a63b0ad Alerting: Use new image TokenProvider and send image url in annotation (#99989)
* Send new annotation containing image url

* Use new image TokenProvider with TokenStore

New abstraction GetImage no longer needs to support parsing both token and
url from annotations, as remote AM will use the new URLProvider. Instead, we
use the new generic TokenProvider and give it a TokenStore backed by the
grafana database.

That means we revert to always using the token, which simplifies both the
code and the security considerations.

* Upgrade grafana/alerting to merged commit SHA
2025-02-20 12:47:40 -05:00
Yuri Tseretyan 9dd75aee32 Alerting: Refactor State Transition (part 2 of n) (#99985)
* split create to create and patch and move to state

patch will be refactored further

* move setNextState to state transition

* move tests

* split tests for patch function
2025-02-13 09:45:16 -05:00
Yuri Tseretyan 807f94b2c7 Alerting: Remove feature toggle alertingNoNormalState (#99905) 2025-02-03 17:32:50 +02:00
Alexander Akhmetov d6c1e3bb45 Alerting: Use org store to read organization IDs (#99938) 2025-02-03 15:38:16 +01:00
Alexander Akhmetov f45265b5f7 Alerting: Read from both proto and simple DB instance stores on startup (#99855) 2025-01-31 23:34:00 +01:00
Alexander Akhmetov a0bf9202f5 Alerting: Clear the state cache when the alert routine stops (#99681) 2025-01-28 21:15:19 +02:00
Alexander Akhmetov cb43f4b696 Alerting: Add compressed protobuf-based alert state storage (#99193) 2025-01-27 18:47:33 +01:00
Alexander Akhmetov 651430e34a Alerting: Add sync state persister to save entire state of the rule (#96628) 2025-01-20 12:12:27 +01:00
Yuri Tseretyan d025523a8b Alerting: Log reason for taking image. (#99036) 2025-01-15 16:11:38 -05:00
Matthew Jacobson fc90a446c6 Alerting: Ensure long-lived repeat alerts keep images after 24h expiry (#98993)
Ensures we retake images after expiration on long-lived repeat alerts.
Otherwise, logs would show "Image not found in database" and notifications
would cease to contain an image after 24h of continuous firing.
2025-01-15 11:45:43 -05:00
Yuri Tseretyan 4f62c8a160 Alerting: Update state manager to take image only once per rule evaluation (#98289)
* add test

* update state manager to take image only once per rule evaluation process execution

* update test
2025-01-09 12:57:58 -05:00
Alexander Akhmetov 1f8f9a45d7 Alerting: Add state_periodic_save_batch_size config option (#98019)
* Alerting: Add state_periodic_save_batch_size config option

---------

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
2024-12-16 15:30:38 +01:00
Yuri Tseretyan 420db99d16 Alerting: Update state manager to have immutable state in cache (#95985)
* create a new state and set at the end
* propagate labels datasource_uid and ref_id from current state if it's error
* copy the state when apply to all
2024-11-15 15:01:02 -05:00
Alexander Akhmetov 580d073b96 Alerting: Add context to the logger in state manager Warm (#96228) 2024-11-12 19:41:05 +01:00