* automatically rename integration tests to follow the common convention
* name tests differently
* alter column type to bigint
* update another column to bigint
* add another alter
* fix subquery for mysql
**What is this feature?**
This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources.
The metric includes a few additional labels:
* `grafana_alertstate`: Grafana's full alert state, which is more granular than the Prometheus `alertstate`.
* `grafana_rule_uid`: Grafana's alert rule UID.
Grafana states are carried in the `grafana_alertstate` label and are also mapped to Prometheus-compatible `alertstate` values:
| Grafana alert state | `alertstate` | `grafana_alertstate` |
|---------------------|-----------------------|-----------------------|
| `Alerting` | `firing` | `alerting` |
| `Recovering` | `firing` | `recovering` |
| `Pending` | `pending` | `pending` |
| `Error` | `firing` | `error` |
| `NoData` | `firing` | `nodata` |
| `Normal` | _(no metric emitted)_ | _(no metric emitted)_ |
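For illustration, a minimal Go sketch of the mapping in the table above; the `GrafanaState` type and `alertsLabels` function are hypothetical names, not the actual implementation:

```go
package main

import "fmt"

// GrafanaState is an illustrative stand-in for Grafana's alert state type.
type GrafanaState string

const (
	Alerting   GrafanaState = "Alerting"
	Recovering GrafanaState = "Recovering"
	Pending    GrafanaState = "Pending"
	Error      GrafanaState = "Error"
	NoData     GrafanaState = "NoData"
	Normal     GrafanaState = "Normal"
)

// alertsLabels maps a Grafana state to the Prometheus-compatible `alertstate`
// and the more granular `grafana_alertstate` label values. The boolean reports
// whether an ALERTS sample is emitted at all (Normal emits nothing).
func alertsLabels(s GrafanaState) (alertstate, grafanaAlertstate string, emit bool) {
	switch s {
	case Alerting:
		return "firing", "alerting", true
	case Recovering:
		return "firing", "recovering", true
	case Pending:
		return "pending", "pending", true
	case Error:
		return "firing", "error", true
	case NoData:
		return "firing", "nodata", true
	default: // Normal: no metric emitted
		return "", "", false
	}
}

func main() {
	for _, s := range []GrafanaState{Alerting, Recovering, Pending, Error, NoData, Normal} {
		as, gas, emit := alertsLabels(s)
		if !emit {
			fmt.Printf("%-10s -> (no metric emitted)\n", s)
			continue
		}
		fmt.Printf("%-10s -> alertstate=%q grafana_alertstate=%q\n", s, as, gas)
	}
}
```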
What is this feature?
Ensures that resolved notifications are sent when alert states transition from Error to Normal after the configured number of evaluation intervals ("Missing series evaluations to resolve").
Why do we need this feature?
Before this change, when an alert transitioned from Error to Normal and the labels on the new Normal alert instance were unchanged, Grafana would not send a resolved notification for the Error alert state. The alert was instead resolved automatically in the Alertmanager after a few evaluation intervals, once its endsAt passed.
With this change, the resolved notification is sent after the configured number of evaluation intervals ("Missing series evaluations to resolve").
When a rule with an `ExecErrState` of `Alerting` has an instance that is Alerting, then hits a data source error, and then evaluates successfully and continues to be Alerting, the cached instance keeps the old error until the instance is no longer firing.
This is unexpected and leads to misleading results.
What is this feature?
This PR fixes the MissingSeriesEvalsToResolve behavior when it's set to more than 4 evaluation intervals.
Why do we need this feature?
The MissingSeriesEvalsToResolve setting was not working correctly due to alerts being auto-resolved by Alertmanager after 4 evaluation intervals (via the endsAt field).
Previously, the deleteStaleStatesFromCache method returned only the stale states that had to be resolved. Non-stale states for which the current evaluation has no series never had endsAt updated and were never re-sent to the Alertmanager, so they were automatically resolved after 4 evaluations regardless of the setting.
The new processMissingSeriesStates returns a state for each missing series on every evaluation and resolves the stale ones. This guarantees that alerts without series keep firing for the configured number of evaluations.
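A rough sketch of the idea, with illustrative types rather than the real state package: every cached state with a missing series is returned on each evaluation (so its endsAt keeps being refreshed), and only states missing for at least MissingSeriesEvalsToResolve evaluations are resolved.

```go
// Sketch only: illustrative types, not Grafana's actual state package.
package sketch

type state struct {
	MissedEvals int  // consecutive evaluations without a matching series
	Resolved    bool // whether the alert is now considered resolved (stale)
}

// processMissingSeriesStates visits every cached state that the current
// evaluation produced no series for. Each one is returned so the scheduler
// keeps refreshing endsAt and re-sending it to the Alertmanager; only states
// missing for at least missingSeriesEvalsToResolve evaluations are resolved.
func processMissingSeriesStates(cached []*state, missingSeriesEvalsToResolve int) (missing []*state) {
	for _, s := range cached {
		s.MissedEvals++
		if s.MissedEvals >= missingSeriesEvalsToResolve {
			s.Resolved = true // stale: resolve it now
		}
		missing = append(missing, s) // returned on every evaluation, resolved or not
	}
	return missing
}
```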
* Add FiredAt field to the State
* Update featuretoggle files
* Fix lint errors
* Fix test compilation
* Remove random print line + formatting
* Address PR comments
What is this feature?
This PR fixes a state transition issue where alerts transitioning from the Recovering state back to the Alerting state incorrectly entered the Pending state first if the rule had a For duration configured.
Why do we need this feature?
When an alert goes from Alerting to Recovering (when using "Keep firing for") and then back to Alerting, the existing logic would incorrectly put the alert into the Pending state, when it should remain Alerting and keep sending notifications to the Alertmanager.
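A hedged sketch of the corrected transition check, using placeholder names rather than the real state manager: an instance coming out of Recovering has already satisfied the For duration, so it returns straight to Alerting.

```go
package sketch

import "time"

type stateKind int

const (
	Normal stateKind = iota
	Pending
	Alerting
	Recovering
)

// nextFiringState picks the state for an instance whose condition is firing
// again. An instance in Recovering (or already Alerting) has already waited
// out the For duration once, so it must not re-enter Pending; only instances
// coming from Normal/Pending wait out the For duration.
func nextFiringState(prev stateKind, forDuration, firingFor time.Duration) stateKind {
	switch prev {
	case Alerting, Recovering:
		return Alerting
	default:
		if firingFor >= forDuration {
			return Alerting
		}
		return Pending
	}
}
```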
What is this feature?
This PR introduces a new alert rule configuration option, `keep_firing_for` (see the Prometheus documentation).
`keep_firing_for` prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or at the next evaluation, if the evaluation interval is longer than `keep_firing_for`), the alert transitions to "Normal" unless it starts alerting again:
Before
+----------+ +----------+
| Alerting |---->| Normal |
+----------+ +----------+
-----
After
+----------+ +------------+ +----------+
| Alerting |----->| Recovering |---->| Normal |
+----------+ +------------+ +----------+
Why do we need this feature?
This feature prevents flapping alerts by adding a recovery period. It helps avoid false resolutions caused by brief recoveries in the alert condition.
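A rough sketch of the recovery-period behavior under assumed names (not the actual state manager code): when the condition stops firing, a previously firing instance enters Recovering, and only after keep_firing_for has elapsed since the last firing evaluation does it move to Normal.

```go
package sketch

import "time"

type stateKind int

const (
	Normal stateKind = iota
	Alerting
	Recovering
)

// nextStateAfterNormalResult decides where a previously firing instance goes
// once its condition evaluates to normal. While the recovery period
// (keep_firing_for) has not elapsed since the last firing evaluation, the
// instance stays in Recovering and is not reported as resolved.
func nextStateAfterNormalResult(prev stateKind, lastFiredAt, now time.Time, keepFiringFor time.Duration) stateKind {
	if prev != Alerting && prev != Recovering {
		return Normal
	}
	if now.Sub(lastFiredAt) < keepFiringFor {
		return Recovering
	}
	return Normal
}
```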
* remove feature flag
* remove feature flag in state manager
* make sure no data with empty results is handled
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
---------
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* Unify values
* Fix with latest changes on main
* Fix up NaN test
* Keep refIDs with -1 as value
* Test that refIDs are preserved on Normal to Error transition
* Alerting to err test too
* Add a blurb to docs about this behavior
* add method CanReadAllRules to rule authorization service
* add alias type Namespace for Folder in ngalert's models package. It implements the Namespacer interface that is used by authz logic
* update state history's backends to authorize access to rules.
* update Loki to add folder UIDs to the query.
* Update BuildLogQuery to drop filter by folders if it's too long and fall back to in-memory filtering.
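An illustrative sketch of that fallback; the query syntax, limit, and function name are assumptions rather than the real Loki client code:

```go
package sketch

import (
	"fmt"
	"strings"
)

const maxQueryLength = 5120 // assumed limit, for illustration only

// buildLogQuery appends a folder-UID selector to the base query when it fits.
// The second return value tells the caller whether it must instead filter the
// returned entries by folder in memory.
func buildLogQuery(base string, folderUIDs []string) (query string, filterInMemory bool) {
	if len(folderUIDs) == 0 {
		return base, false
	}
	withFolders := fmt.Sprintf(`%s | folderUID=~"%s"`, base, strings.Join(folderUIDs, "|"))
	if len(withFolders) > maxQueryLength {
		// Too many folders to encode in the query: fall back to filtering
		// the log lines after they are fetched.
		return base, true
	}
	return withFolders, false
}
```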
Alerting: fix preserving errors in the alert rule state during error to error transitions
An alert state transition from one error to another did not update state.Error correctly:
the error in state.Error remained the initial error encountered.
This led to a second issue: after a Grafana restart the error was lost, because
the state of the alert rule had not changed and the error is therefore not preserved
in the database between restarts.
This could happen if the expression service returned an error or the alert routine panicked
during querying.
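A minimal sketch of the intent of the fix, with simplified field names: the stored error is overwritten with the latest one even when the state stays Error, so it is persisted and survives restarts.

```go
package sketch

// state is a simplified stand-in for the cached alert instance state.
type state struct {
	IsError bool
	Error   error
}

// applyErrorResult records the latest evaluation error. Before the fix, the
// assignment was effectively skipped when the instance was already in the
// Error state, so state.Error kept the first error ever seen and was never
// written to the database again.
func applyErrorResult(s *state, evalErr error) {
	s.IsError = true
	s.Error = evalErr // always overwrite, even on Error -> Error transitions
}
```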
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be in line with the Prom ruler. However, the Prom ruler
doesn't track Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that it chooses not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending would stop future resolved notifications that ResolvedRetention would
otherwise have sent (a rough sketch of the retention check follows this list).
* Pointers for ResolvedAt & LastSentAt
To avoid awkward time.Time{}.Unix() defaults on persist
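A rough sketch of the retention check mentioned above, with assumed names rather than the actual state manager:

```go
package sketch

import "time"

// state is a simplified stand-in for the cached alert instance state.
type state struct {
	ResolvedAt *time.Time // nil until the alert has resolved
	LastSentAt *time.Time // last time a notification was sent
}

// shouldResend reports whether a resolved alert should be sent to the
// Alertmanager again: recently resolved alerts keep being re-sent (at most
// once per resendDelay) until resolvedRetention has passed.
func shouldResend(s *state, now time.Time, resendDelay, resolvedRetention time.Duration) bool {
	if s.ResolvedAt == nil || now.Sub(*s.ResolvedAt) > resolvedRetention {
		return false
	}
	return s.LastSentAt == nil || now.Sub(*s.LastSentAt) >= resendDelay
}
```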
* Implement keep last state for state transitions
* Respect For duration when keeping state
* Only keep transition from recording an annotation
* Add keep last state option for nodata/error in UI
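A small, hypothetical illustration of the "keep last state" option (placeholder names): when an evaluation returns no data and the option is enabled, the previous state is carried forward instead of applying the configured NoData mapping.

```go
package sketch

type stateKind int

const (
	Normal stateKind = iota
	Alerting
	NoData
)

// nextStateOnNoData decides the next state when an evaluation returns no data.
// With the new "keep last state" option the previous state is carried forward
// unchanged; otherwise the configured NoData mapping (Alerting, Normal or
// NoData) applies as before.
func nextStateOnNoData(keepLast bool, prev, configuredNoDataState stateKind) stateKind {
	if keepLast {
		return prev // keep whatever the instance was before the gap in data
	}
	return configuredNoDataState
}
```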
* Alerting: fix race condition in (*ngalert/sender.ExternalAlertmanager).Run
* Chore: Fix data races when accessing members of *ngalert/state.FakeInstanceStore
* Chore: Fix data races in tests in ngalert/schedule and enable some parallel tests
* Chore: fix linters
* Chore: add TODO comment to remove loopvar once we move to Go 1.22
* streamline initialization of test databases, support on-disk sqlite test db
* clean up test databases
* introduce testsuite helper
* use testsuite everywhere we use a test db
* update documentation
* improve error handling
* disable entity integration test until we can figure out locking error
Backend:
* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered "loaded" and are provided back to the command (a condensed sketch follows the backend list).
- add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
- add rule_fingerprint column to alert_instance
- update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
- add AlertingResultsFromRuleState that implements the new interface in eval package
- update getExprRequest to patch the hysteresis command.
* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.
Frontend:
* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions
* Refactor isInvalid and add test for it
* Remove unnecessary React.memo
* Add tests for updateEvaluatorConditions
---------
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
* Exclude mapped nodata transitions when nodata mapped to OK
* Fix processEvalResults test
* Don't check NoDataState when filtering transition
* Add comment to explain purpose of separate function
---------
Co-authored-by: William Wernert <william.wernert@grafana.com>
* Alerting: Don't use a separate collection system for metrics
The state package had a metric collection system that ran every 15s, updating the values of the metrics; there is a common pattern for this in the Prometheus ecosystem called "collectors".
I have removed the behaviour of using a time-based interval to "set" the metrics in favour of functions that compute each value when it is scraped.
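For example, with the Prometheus client library a gauge can be backed by a function that reads the current value from the state cache at scrape time; the cache type and metric name below are placeholders:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// cache is a placeholder for the state manager's cache.
type cache struct{ /* ... */ }

func (c *cache) countAlerting() int { return 0 } // placeholder

// registerMetrics registers a gauge whose value is computed when Prometheus
// scrapes, rather than being set by a background ticker.
func registerMetrics(reg prometheus.Registerer, c *cache) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Namespace: "grafana",
			Subsystem: "alerting",
			Name:      "alerts",
			Help:      "Number of alert instances in the Alerting state.",
		},
		func() float64 { return float64(c.countAlerting()) },
	))
}
```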
* add metrics and tracing to state manager
* propagate tracer to state manager
* add scheduler metrics
* fix backtesting
* add test for state metrics
* remove StateUpdateCount
* update docs
* metrics can be null
* add tracer to new tests
* calculate cacheID instead of literals
* use mocked clocks
* advance clocks with the eval results
* use clearer timestamp aliases
* make expected state labels be more clear to read
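A small illustration of the mocked-clock pattern from the bullets above, assuming a mock clock such as the one in github.com/benbjohnson/clock: the test advances the clock alongside each evaluation instead of sleeping, keeping timestamps deterministic.

```go
package sketch

import (
	"testing"
	"time"

	"github.com/benbjohnson/clock"
)

// TestAdvanceClockWithResults shows the pattern: inject a mocked clock and
// advance it together with each evaluation result, so state timestamps in the
// test are deterministic and easy to assert on.
func TestAdvanceClockWithResults(t *testing.T) {
	mock := clock.NewMock()
	mock.Set(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC))

	evalInterval := 10 * time.Second
	var evaluatedAt []time.Time
	for i := 0; i < 3; i++ {
		evaluatedAt = append(evaluatedAt, mock.Now()) // what the state manager would see
		mock.Add(evalInterval)                        // advance the clock with the eval result
	}

	if got := evaluatedAt[2].Sub(evaluatedAt[0]); got != 2*evalInterval {
		t.Fatalf("expected evaluations %v apart, got %v", 2*evalInterval, got)
	}
}
```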
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
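A hedged sketch of the approach; the worker limit, types, and function are illustrative, not the actual implementation: pending instances are written by a bounded number of goroutines, each issuing its own query.

```go
package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// alertInstance and store are simplified placeholders.
type alertInstance struct{ /* ... */ }

type store interface {
	SaveAlertInstance(ctx context.Context, inst alertInstance) error
}

// saveConcurrently writes instances using up to maxQueries parallel queries.
// With maxQueries = 1 it degenerates to the old serial behavior, which is why
// the feature can stay disabled by default.
func saveConcurrently(ctx context.Context, db store, instances []alertInstance, maxQueries int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxQueries)
	for _, inst := range instances {
		inst := inst // capture per iteration (pre-Go 1.22 loopvar semantics)
		g.Go(func() error {
			return db.SaveAlertInstance(ctx, inst)
		})
	}
	return g.Wait()
}
```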
* Alerting: Remove and revert flag alertingBigTransactions
This is a partial revert of #56575 and a removal of the `alertingBigTransactions` flag.
Real-world use has shown no clear performance incentive to maintain this flag. The lowered DB connection count
came at the cost of a significant increase in CPU usage and query latency.
* Fix lint backend
* Removed last bits of alertingBigTransactions
---------
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
* Alerting: Respect "For" Duration for NoData alerts
This change modifies `resultNoData` to be more in line with the logic of the other state handlers.
The main effects of this are:
1) NoData states with the NoDataState config set to Alerting will respect the "For" duration.
2) Prevents zero values in StartsAt and EndsAt for alerts that have only ever been in the Normal state. This includes state transitions from NoDataState=OK and ExecErrState=OK.
3) Better state transition logging.
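A simplified sketch of effect (1), with assumed names: a NoData result mapped to Alerting now waits out the For duration in Pending like a regular firing result.

```go
package sketch

import "time"

type stateKind int

const (
	Normal stateKind = iota
	Pending
	Alerting
)

// resultNoDataMappedToAlerting mirrors the other state handlers: a NoData
// result that is configured to alert first waits out the For duration in
// Pending instead of jumping straight to Alerting.
func resultNoDataMappedToAlerting(prev stateKind, pendingFor, forDuration time.Duration) stateKind {
	if prev == Alerting {
		return Alerting // already firing, stay firing
	}
	if pendingFor >= forDuration {
		return Alerting
	}
	return Pending
}
```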