* automatically rename integration tests to follow the common convention
* name tests differently
* alter column type to bigint
* update another column to bigint
* add another alter
* fix subquery for mysql
**What is this feature?**
This PR implements a new Prometheus historian backend that allows Grafana alerting to write alert state history as Prometheus-compatible `ALERTS` metrics to remote Prometheus-compatible data sources.
The metric includes a few additional labels:
* `grafana_alertstate`: Grafana's full alert state, more granular than Prometheus.
* `grafana_rule_uid`: Grafana's alert rule UID.
Grafana states are included in the `grafana_alertstate` label also mapped to Prometheus-compatible `alertstate` values:
| Grafana alert state | `alertstate` | `grafana_alertstate` |
|---------------------|-----------------------|-----------------------|
| `Alerting` | `firing` | `alerting` |
| `Recovering` | `firing` | `recovering` |
| `Pending` | `pending` | `pending` |
| `Error` | `firing` | `error` |
| `NoData` | `firing` | `nodata` |
| `Normal` | _(no metric emitted)_ | _(no metric emitted)_ |
What is this feature?
Ensures that resolved notifications are sent when alert states transition from Error to Normal after the configured number of evaluation intervals: Missing series evaluations to resolve.
Why do we need this feature?
Before this change, when an alert was transitioning from Error to Normal, in case when the labels on the new Normal alert instance are the same, Grafana would not send resolved notifications for the Error alert state. The alert would be resolved after a few evaluation intervals automatically in the alertmanager, following the endsAt.
With this change the resolved notification is sent after the configured number of evaluation intervals: Missing series evaluations to resolve.
When a rule configured with `ExecErrState` state of `Alerting`, has an instance which is Alerting then has a data source error, then successfully evaluates and continues to be Alerting, the cached instance keeps the error cached until it is no longer firing.
This is unexpected and leads to misleading results.
What is this feature?
This PR fixes the MissingSeriesEvalsToResolve behavior when it's set to more than 4 evaluation intervals.
Why do we need this feature?
The MissingSeriesEvalsToResolve setting was not working correctly due to alerts being auto-resolved by Alertmanager after 4 evaluation intervals (via the endsAt field).
Before we had deleteStaleStatesFromCache method that was returning only stale states that had to be resolved. Non-stale states for which the current evaluation does not have a series never had endsAt updated and were never resend to the Alertmanager, so they were automatically resolved after 4 evaluations regardless of the setting.
The new processMissingSeriesStates returns state for each missing series on every evaluation, and resolves the stale ones. This guarantees that alerts without series still alert for the configured number of evaluations.
* Add FiredAt field to the State
* Update featuretoggle files
* Fix lint errors
* Fix test compilation
* Remove random print line + formatting
* Address PR comments
What is this feature?
Send resolved notifications not only when an alert state becomes stale (series is missing) and transitions from Alerting to Normal, but also from Error, NoData and Recovering.
Why do we need this feature?
Previously, when an alert state became stale or was deleted, it would transition to Normal but wouldn't trigger resolved notifications to the Alertmanager. This meant we relied on the Alertmanager to send resolved notifications when the alert expires. However, if the Alertmanager state is lost, these resolved notifications would never be sent, leaving users with firing alerts in their notification channels. This PR ensures that any transition from a firing state (Alerting, Error, NoData, Recovering) to Normal triggers a resolved notification.
What is this feature?
This PR fixes a state transition issue where alerts transitioning from the Recovering state back to the Alerting state incorrectly entered the Pending state first if the rule had a For duration configured.
Why do we need this feature?
When an alert goes from Alerting to Recovering (when using the Keep firing for) and then back to Alerting, the existing logic would incorrectly put the alert into Pending state while it should be alerting and still sending notifications to the Alertmanager.
What is this feature?
This PR changes the behavior of the $value and .Value variables in alerting templating to be more compatible with Prometheus templating. When a single datasource is used in the alerting rule, these variables will now return the numeric value from the query instead of the evaluation string.
Why do we need this feature?
It makes Grafana templating more compatible with Prometheus templates. In Prometheus, $value returns the numeric value of the query, but in Grafana it's the evaluation string: [ var='A' labels={instance=instance1} value=81.234 ]. This is because in Grafana multiple datasources can be used in the alert rule, and it's not always possible to get a single value.
This change makes Grafana's behavior consistent with Prometheus when a single datasource is used, and in case when multiple datasources are used in the query, it keeps the old behaviour.
Both $value and .Value are not recommended to use (documentation), and it's better to use .Values instead.
What is this feature?
This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).
keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again:
Before
+----------+ +----------+
| Alerting |---->| Normal |
+----------+ +----------+
-----
After
+----------+ +------------+ +----------+
| Alerting |----->| Recovering |---->| Normal |
+----------+ +------------+ +----------+
Why do we need this feature?
This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
* remove feature flag
* remove feature flag in state manager
* make sure no data with empty results is handled
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
---------
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* Send new annotation containing image url
* Use new image TokenProvider with TokenStore
New abstraction GetImage no longer needs to support parsing both token and
url from annotations, as remote AM will use the new URLProvider. Instead, we
use the new generic TokenProvider and give it a TokenStore backed by the
grafana database.
That means we revert back to always using token simplifying code and security
considerations.
* Upgrade grafana/alerting to merged commit SHA
* split create to create and patch and move to state
patch will be refactored further
* move setNextState to state transition
* move tests
* split tests for patch function
Ensures we retake images after expiration on long-lived repeat alerts.
Otherwise, logs would show "Image not found in database" and notifications
would cease to contain an image after 24h of continuous firing.
* create a new state and set at the end
* propagate labels datasource_uid and ref_id from current state if it's error
* copy the state when apply to all
* Alerting: Keep state manager cache during cache warm-up
Instead of overwriting the state manager cache during warm-up,
we update the data in the cache if it is not there yet. If the cache
already contains a state entry with the same key, we do not overwrite it.
* Add health fields to rules and an aggregator method to the scheduler
* Move health, last error, and last eval time in together to minimize state processing
* Wire up a readonly scheduler to prom api
* Extract to exported function
* Use health in api_prometheus and fix up tests
* Rename health struct to status
* Fix tests one more time
* Several new tests
* Handle inactive rules
* Push state mapping into state manager
* rename to StatusReader
* Rectify cyclo complexity rebase
* Convert existing package local status implementation to models one
* fix tests
* undo RuleDefs rename