Commit Graph

61 Commits

Author SHA1 Message Date
Alexander Akhmetov 3bb4c92028 Alerting: Fix resolved notifications for same-label Error to Normal transitions (#106210)
What is this feature?

Ensures that resolved notifications are sent when alert states transition from Error to Normal after the configured number of evaluation intervals: Missing series evaluations to resolve.

Why do we need this feature?

Before this change, when an alert was transitioning from Error to Normal, in case when the labels on the new Normal alert instance are the same, Grafana would not send resolved notifications for the Error alert state. The alert would be resolved after a few evaluation intervals automatically in the alertmanager, following the endsAt.

With this change the resolved notification is sent after the configured number of evaluation intervals: Missing series evaluations to resolve.
2025-06-07 14:03:11 +02:00
Alexander Akhmetov eae77aa695 Alerting: Resend alerts for states that are missing in the eval results (#105965)
What is this feature?

This PR fixes the MissingSeriesEvalsToResolve behavior when it's set to more than 4 evaluation intervals.

Why do we need this feature?

The MissingSeriesEvalsToResolve setting was not working correctly due to alerts being auto-resolved by Alertmanager after 4 evaluation intervals (via the endsAt field).

Before we had deleteStaleStatesFromCache method that was returning only stale states that had to be resolved. Non-stale states for which the current evaluation does not have a series never had endsAt updated and were never resend to the Alertmanager, so they were automatically resolved after 4 evaluations regardless of the setting.

The new processMissingSeriesStates returns state for each missing series on every evaluation, and resolves the stale ones. This guarantees that alerts without series still alert for the configured number of evaluations.
2025-05-29 23:22:35 +02:00
Fayzal Ghantiwala 589046bcdc Alerting: Persist alert instance FiredAt field (#105927)
* Persist alert instance fired at

* Update protos and tests
2025-05-27 10:04:26 +01:00
Alexander Akhmetov 0743689d42 Alerting: Add recovering state to the grafana_alerting_alerts metric (#104380) 2025-04-23 13:58:57 +02:00
Yuri Tseretyan 9dd75aee32 Alerting: Refactor State Transition (part 2 of n) (#99985)
* split create to create and patch and move to state

patch will be refactored further

* move setNextState to state transition

* move tests

* split tests for patch function
2025-02-13 09:45:16 -05:00
Yuri Tseretyan 807f94b2c7 Alerting: Remove feature toggle alertingNoNormalState (#99905) 2025-02-03 17:32:50 +02:00
Yuri Tseretyan 420db99d16 Alerting: Update state manager to have immutable state in cache (#95985)
* create a new state and set at the end
* propagate labels datasource_uid and ref_id from current state if it's error
* copy the state when apply to all
2024-11-15 15:01:02 -05:00
Alexander Akhmetov 305123df91 Alerting: Keep state manager cache during cache warm-up (#95727)
* Alerting: Keep state manager cache during cache warm-up

Instead of overwriting the state manager cache during warm-up,
we update the data in the cache if it is not there yet. If the cache
already contains a state entry with the same key, we do not overwrite it.
2024-11-04 18:26:52 +01:00
Alexander Akhmetov 9d182986f1 Alerting: make StatePersister more configurable to support custom rule-level state persisters (#94590) 2024-10-11 15:25:29 +02:00
Yuri Tseretyan 537f1fb857 Alerting: Fix persisting result fingerprint that is used by recovery threshold (#91224)
* fix persister to save result fingerprint

* revert change

* fmt
2024-07-30 18:07:13 -04:00
Matthew Jacobson ba800692c6 Alerting: Persist AlertInstance ResolvedAt & LastSentAt (#89135)
* Alerting: Persist AlertInstance ResolvedAt & LastSentAt

* Fix test

* Modify existing tests

* Fix merge conflicts from nullable LastSentAt & ResolvedAt
2024-07-12 12:26:58 -04:00
Alexander Weaver 3b6a8775bb Alerting: Fix stale values associated with states that have gone to NoData, unify values calculation (#89807)
* Unify values

* Fix with latest changes on main

* Fix up NaN test

* Keep refIDs with -1 as value

* Test that refIDs are preserved on Normal to Error transition

* Alerting to err test too

* Add a blurb to docs about this behavior
2024-07-08 12:30:23 -05:00
Yuri Tseretyan 411bab6d44 Alerting: Lower severity of logs about duplicates to debug (#89971)
lower severity of logs about duplicates to debug
2024-07-03 16:46:28 -04:00
Alexander Akhmetov 667fea6623 Alerting: use hash of labels instead of labels string as the alert state cache key (#88956)
* Alerting: use hash instead of labels as the cache key
* Use data.Labels.Fingerprint to calculate the cache key
2024-06-11 18:34:58 +02:00
Yuri Tseretyan 1eebd2a4de Alerting: Support for simplified notification settings in rule API (#81011)
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated

* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.

* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings

* update GET API so only admins can see auto-gen
2024-02-15 09:45:10 -05:00
Jean-Philippe Quéméner eb7e1216a1 feat(alerting): add async state persister (#80763) 2024-01-22 13:07:11 +01:00
Yuri Tseretyan f6a46744a6 Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
gotjosh e877174501 Alerting: Expose metrics for Alertmanager Alerts - grafana_alerting_alertmanager_alerts (#75802)
* Alerting: Expose metrics for Alertmanager Alerts

In Grafana, the alert evaluation and alert delivery are combined. We're always used a metric named `grafana_alerting_alerts` to get a sense of what are the alerts that are currently firing (these come from the evaluation side) and opted to not map the alertmanager alerts metric directly.

I think it's important that we make a disction between alerts that happen at evaluation vs alerts that are received for delivery by the internal Alertmanager as we have options to skip the delivery of these alerts to the internal alertmanager altogether.
2023-10-02 16:36:23 +01:00
gotjosh 59694fb2be Alerting: Don't use a separate collection system for metrics (#75296)
* Alerting: Don't use a separate collection system for metrics

The state package had a metric collection system that ran every 15s updating the values of the metrics - there is a common pattern for this in the Prometheus ecosystem called "collectors".

I have removed the behaviour of using a time-based interval to "set" the metrics in favour of a set of functions as the "value" that get called at scrape time.
2023-09-25 10:27:30 +01:00
SatVeer Singh 1bfa3a0f1e Chore: Replace go-multierror with errors package (#66432)
* code refactor and type assertions added to tests

* no-lint rule added for specific line
2023-06-19 12:29:45 +03:00
Yuri Tseretyan baffe83da6 Alerting: Improve performance of cache.getOrCreate (#63909)
* move expansion of labels and annotations outside of mutex lock
* propagate struct but not pointer
2023-06-15 09:37:47 -04:00
Matthew Jacobson b9dc04139a Alerting: Respect "For" Duration for NoData alerts (#65574)
* Alerting: Respect "For" Duration for NoData alerts

This change modifies `resultNoData` to be more inline with the logic of the other state handlers.

The main effects of this are:

1) NoData states with NoDataState config set to Alerting will respect "For" duration.
2) Prevents zero value in StartsAt and EndsAt for alerts that have only even been in normal state. This includes state transitions from NoDataState=OK and ExecErrState=OK.
3) Better state transition logging.
2023-03-31 19:05:15 +03:00
George Robinson 0c8876c3a2 Alerting: Return errors when expanding templates (#63662)
This commit changes the state package so that errors encountered while
expanding templates for custom labels and annotations are returned
from the function. This is not used at present, but will be used in the
future as we look at how to offer better feedback to users who don't
have access to logs, for example our customers who use Hosted Grafana.
2023-03-08 12:25:02 +00:00
George Robinson ed71012ced Alerting: Fix Classic Conditions $values variable (#64243)
This commit fixes a bug in the $values variable in notification
templates when using Classic Conditions. Since Classic Conditions
are not multi-dimensional, the values of each series that exceeded
the condition should be available as a RefID and offset. For example,
B0, B1, etc. However, this bug meant that instead just a single
condition would be printed as B, not B0.
2023-03-06 12:08:00 -05:00
George Robinson 0a01391ebe Alerting: Small readability improvements to template.go (#63422)
* Alerting: Small readability improvements to template.go

* Fix lint
2023-02-20 09:24:11 +00:00
George Robinson 9e86916d48 Alerting: Move templating to template package (#63347)
This commit moves templating from the state package to a sub-package
called template. This sub-package will be the logical package for
future ease-of-use improvements to templating custom annotations
and labels.
2023-02-16 17:16:36 +01:00
Steve Simpson 4d1a2c3370 Alerting: Move rule_groups_rules metric from State to Scheduler. (#63144)
The `rule_groups_rules` metric is currently defined and computed by `State`.
It makes more sense for this metric to be computed off of the configured rule
set, not based on the rule evaluation state. There could be an edge condition
where a rule does not have a state yet, and so is uncounted.

Additionally, we would like this metric (and others), to have a `rule_group`
label, and this is much easier to achieve if the metric is produced from the
`Scheduler` package.
2023-02-09 17:05:19 +01:00
Yuri Tseretyan 9d57b1c72e Alerting: Do not persist noop transition from Normal state. (#61201)
* add feature flag `alertingNoNormalState`
* update instance database to support exclusion of state in list operation
* do not save normal state and delete transitions to normal
* update get methods to filter out normal state
2023-01-13 18:29:29 -05:00
Denis Limarev 90badc8729 Performance: Add preallocation for some slices (#59593) 2023-01-11 18:03:37 +01:00
Yuri Tseretyan 3621cf5a12 Alerting: Update handling of stale state (#58276)
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
2022-11-07 11:03:53 -05:00
Alexander Weaver de46c1b002 Alerting: Improve logs in state manager and historian (#57374)
* Touch up log statements, fix casing, add and normalize contexts

* Dedicated logger for dashboard resolver

* Avoid injecting logger to historian

* More minor log touch-ups

* Dedicated logger for state manager

* Use rule context in annotation creator

* Rename base logger and avoid redundant contextual loggers
2022-10-21 16:16:51 -05:00
Alexander Weaver 3ddb28bad9 Find-and-replace 'err' logs to 'error' to match log search conventions (#57309) 2022-10-19 17:36:54 -04:00
George Robinson 52965de369 Alerting: Add doc comments to state struct and normalize fields (#56647) 2022-10-11 09:30:33 +01:00
George Robinson 802d67eeca Alerting: Support values in notification templates (#56457)
We have received a lot of feedback regarding the ValueString in alert notifications. Perhaps one of the most frequent complaints about ValueString is that it is difficult to read because it contains a lot of information, and the information is shown as a JSON-like string. Users have often asked how it can be templated and the answer is that it can't.

Until now users have been able to add custom annotations to their alert rules which contains values via the $values variable added in previous versions of Grafana. However, these custom annotations must be added for each of the user's alert rule, instead of once in a template that all of their alerts can be notified via.

This commit adds then the much requested feature to support values in notification templates. Users can then create a single template that prints the annotations, labels and values of their alerts in a format of their choice!
2022-10-10 13:40:21 +01:00
Yuriy Tseretyan e2f1201382 Alerting: Fix migration to not add label "alertname" (#56509)
* do not add label alertname because it is overridden in state manager anyway
* update state manager to not consider labels with same value as dupe
2022-10-07 15:06:53 -04:00
Yuriy Tseretyan 7b6437402a Alerting: Refactor state manager's cache (#56197)
* remove ResetAllStates because it's not used
* refactor cache to accept logs, metrics and url as method args
* update manager Warm method to set the entire state at once
* remove unused reset method
* introduce ruleStates
* change getOrCreate to belong to ruleStates
* update Get to not return error
2022-10-06 15:30:12 -04:00
Yuriy Tseretyan 03e746d9df Alerting: Delete state from the database on reset (#53919)
* make ResetStatesByRuleUID return states
* delete rule states when reset
* rule eval routine to clean up the state only when rule is deleted
2022-08-25 14:12:22 -04:00
Yuriy Tseretyan e5e8747ee9 Alerting: Update state manager to accept reserved labels (#52189)
* add tests for cache getOrCreate
* update ProcessEvalResults to accept extra lables
* extract to getRuleExtraLabels
* move populating of constant rule labels to extra labels
2022-07-14 15:59:59 -04:00
George Robinson 43358c7248 Alerting: Keep private annotations across evaluations (#49080) 2022-05-18 11:21:18 +02:00
idafurjes 56c3875bb9 Chore: Remove context.TODO (#43458)
* Remove context.TODO() from services

* Fix live test
2021-12-28 10:26:18 +01:00
Santiago 562cd9e44e Alerting template functions (#39261)
* Alerting: (wip) add template funcs

* Alerting: (wip) numeric template functions

* Alerting: (wip) template functions

* Test for the "args" function

* Alerting: (wip) Documentation for template functions

* Alerting: template functions - refactor

* code review changes

* disable linter error

* Use Prometheus implementation of TemplateExpander

* Update docs/sources/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

* change templateCaptureValue to support using template functions

* Update pkg/services/ngalert/state/template.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>

* Test and documentation added for reReplaceAll template function

* complete missing functions, documentation and tests

* Use the alert instance's evaluation time for expanding the template

* strvalue graphlink and tablelink functions

* delete duplicate test

* make strvalue return an empty string

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
2021-10-04 15:04:37 -03:00
Santiago c3cf95f383 Revert "Alerting: add template funcs (#38404)" (#39258)
This reverts commit d6fb0181fb.
2021-09-15 19:47:22 -03:00
Santiago d6fb0181fb Alerting: add template funcs (#38404)
* Alerting: (wip) add template funcs

* Alerting: (wip) numeric template functions

* Alerting: (wip) template functions

* Test for the "args" function

* Alerting: (wip) Documentation for template functions

* Alerting: template functions - refactor

* code review changes

* disable linter error

* Use Prometheus implementation of TemplateExpander

* Update docs/sources/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule.md

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>

Co-authored-by: achatterjee-grafana <70489351+achatterjee-grafana@users.noreply.github.com>
2021-09-15 18:48:29 -03:00
gotjosh a2f4344bf2 Alerting: Refactor & fix unified alerting metrics structure (#39151)
* Alerting: Refactor & fix unified alerting metrics structure

Fixes and refactors the metrics structure we have for the ngalert service. Now, each component has its own metric struct that includes the JUST the metrics it uses. Additionally, I have fixed the configuration metrics and added new metrics to determine if we have discovered and started all the necessary configurations of an instance.

This allows us to alert on `grafana_alerting_discovered_configurations - grafana_alerting_active_configurations != 0` to know whether an alertmanager instance did not start successfully.
2021-09-14 12:55:01 +01:00
George Robinson 5caf6cb369 Change templateCaptureValue to support using template functions (#38766)
* Change templateCaptureValue to support using template functions

This commit changes templateCaptureValue to use float64 for the value
instead of *float64. This change means that annotations and labels can
use the float64 value with functions such as printf and avoid having to
check for nil. It also means that absent values are now printed as 0.

* Use math.NaN() instead of 0 for absent value
2021-09-08 10:46:15 +01:00
David Parrott b5f464412d Alerting: automatically remove stale alerting states (#36767)
* initial attempt at automatic removal of stale states

* test case, need espected states

* finish unit test

* PR feedback

* still multiply by time.second

* pr feedback
2021-07-26 18:12:04 +02:00
George Robinson 2f4c893cf3 Expand the value string in annotations and labels of alerts (#37051)
This commit makes it possible to use the value string in
annotations and labels for alerts with "{{ $value }}"
2021-07-22 15:20:44 +01:00
George Robinson 456dac1303 Expand the value of math and reduce expressions in annotations and labels (#36611)
* Expand the value of math and reduce expressions in annotations and labels

This commit makes it possible to use the values of reduce and math
expressions in annotations and labels via their RefIDs. It uses the
Stringer interface to ensure that "{{ $values.A }}" still prints the
value in decimal format while also making the labels for each RefID
available with "{{ $values.A.Labels }}" and the float64 value with
"{{ $values.A.Value }}"
2021-07-15 13:10:56 +01:00
David Parrott 310d3ebe3d change template expansion missing value handling (#36679) 2021-07-13 06:57:18 -07:00
Ganesh Vernekar dcd4bf1615 Alerting: Fill the empty GeneratorURL (#35740)
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-06-16 15:34:12 +05:30