Konrad Lalik 0a8dccc19a Alerting: New alert list filter improvements (#103107 )

* Move filtering code to generators for performance reasons

Discarding rules and groups early in the iterable chain limits the number of promises we need to wait for which improves performance significantly

* Add error handling for generators

* Add support for data source filter for GMA rules

* search WIP fix

* Fix datasource filter

* Move filtering back to filtered rules hook, use paged groups for improved performance

* Add queriedDatasources field to grafana managed rules and update filtering logic to rely on it

- Introduced a new field `queriedDatasources` in the AlertingRule struct to track data sources used in rules.
- Updated the Prometheus API to populate `queriedDatasources` when creating alerting rules.
- Modified filtering logic in the ruleFilter function to utilize the new `queriedDatasources` field for improved data source matching.
- Adjusted related tests to reflect changes in rule structure and filtering behavior.

* Add FilterView performance logging

* Improve GMA Prometheus types, rename queried datasources property

* Use custom generator helpers for flattening and filtering rule groups

* Fix lint errors, add missing translations

* Revert test condition

* Refactor api prom changes

* Fix lint errors

* Update backend tests

* Refactor rule list components to improve error handling and data source management

- Enhanced error handling in FilterViewResults by logging errors before returning an empty iterable.
- Simplified conditional rendering in GrafanaRuleLoader for better readability.
- Updated data source handling in PaginatedDataSourceLoader and PaginatedGrafanaLoader to use new individual rule group generator.
- Renamed toPageless function to toIndividualRuleGroups for clarity in prometheusGroupsGenerator.
- Improved filtering logic in useFilteredRulesIterator to utilize a dedicated function for data source type validation.
- Added isRulesDataSourceType utility function for better data source type checks.
- Removed commented-out code in PromRuleDTOBase for cleaner interface definition.

* Fix abort controller on FilterView

* Improve generators filtering

* fix abort controller

* refactor cancelSearch

* make states exclusive

* Load full page in one loadResultPage call

* Update tests, update translations

* Refactor filter status into separate component

* hoist hook

* Use the new function for supported rules source type

---------

Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>

2025-04-11 10:02:34 +02:00

accesscontrol

Alerting: Relax permissions for access a rule (#103664 )

2025-04-11 00:58:37 +01:00

api

Alerting: New alert list filter improvements (#103107 )

2025-04-11 10:02:34 +02:00

backtesting

Alerting: Send information about alert rule to data source in headers (#90344 )

2024-07-17 22:55:12 +03:00

client

Alerting: Instrument outbound requests for Loki Historian and Remote Alertmanager with tracing (#89185 )

2024-06-14 13:24:12 -05:00

eval

Alerting: Fix plugin not found error during condition validation (#102437 )

2025-04-04 10:37:55 +01:00

image

Chore: Remove public vars in setting package (#81018 )

2024-01-23 12:36:22 +01:00

metrics

Alerting: Metric to count imported from Prometheus rules (#100847 )

2025-03-05 14:02:28 +01:00

models

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

notifier

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

prom

Alerting: Allow importing Prometheus rules with keep_firing_for (#103557 )

2025-04-07 22:33:07 +02:00

provisioning

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

remote

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

schedule

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

sender

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

state

CI: Bump golangci-lint to 2.0.2 (#103572 )

2025-04-10 14:42:23 +02:00

store

Alerting: Relax permissions for access a rule (#103664 )

2025-04-11 00:58:37 +01:00

tests

K8s: Folders: Modify GetChildren to return only Folder References (#103072 )

2025-04-02 01:30:17 -03:00

testutil

App Platform: Remove mutable globals (#102962 )

2025-03-27 15:46:09 +01:00

writer

Alerting: handle mimir BadRequest write errors (#102027 )

2025-03-19 14:56:00 +01:00

accesscontrol.go

Alerting: Notifications Routes API (#91550 )

2024-10-24 13:53:03 -04:00

CHANGELOG.md

Update Alerting changelog (#56684 )

2022-10-11 10:55:18 +00:00

limits_test.go

Alerting: Decouple quota configuration logic from API interfaces and add tests (#78930 )

2023-12-01 10:47:19 -06:00

limits.go

Alerting: Guided legacy alerting upgrade dry-run (#80071 )

2024-01-05 18:19:12 -05:00

ngalert_test.go

K8s: Folders: Modify GetChildren to return only Folder References (#103072 )

2025-04-02 01:30:17 -03:00

ngalert.go

Alerting: Remove feature toggles relating to Loki Alert State History (#103540 )

2025-04-08 09:50:27 -04:00

README.md

Alerting: Decouple rule routine from scheduler (#84018 )

2024-03-06 13:44:53 -06:00

README.md

Next generation alerting (ngalert) in Grafana 8

Ngalert (Next generation alert) is the next generation of alerting in Grafana 8.

Overview

The ngalert package can be found in pkg/services/ngalert and has the following sub-packages:

- api
- eval
- logging
- metrics
- models
- notifier
- schedule
- sender
- state
- store
- tests

Scheduling and evaluation of alert rules

The scheduling of alert rules happens in the schedule package. This package is responsible for managing the evaluation of alert rules including checking for new alert rules and stopping the evaluation of deleted alert rules.

The scheduler runs at a fixed interval, called its heartbeat, in which it does a number of tasks:

- Fetch the alert rules for all organizations (excluding disabled)
- Start a goroutine (if this is a new alert rule or the scheduler has just started) to evaluate the alert rule
- Send an *evaluation event to the goroutine for each alert rule if its interval has elapsed
- Stop the goroutines for all alert rules that have been deleted since the last heartbeat
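
The heartbeat tasks listed above can be sketched roughly as follows. This is a simplified illustration and not the actual code of the schedule package: the rule, routine and scheduler types, the tick method and the assumption that every rule interval is a positive multiple of the heartbeat are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// rule is a hypothetical, stripped-down alert rule.
type rule struct {
	UID             string
	IntervalSeconds int64 // assumed to be a positive multiple of the heartbeat
}

// routine is the scheduler's handle on one rule's goroutine.
type routine struct {
	evalCh chan int64    // carries one evaluation event per elapsed interval
	stopCh chan struct{} // closed when the rule has been deleted
}

type scheduler struct {
	heartbeatSeconds int64
	routines         map[string]*routine
}

// run waits for evaluation events until the routine is stopped.
func run(rt *routine) {
	for {
		select {
		case <-rt.evalCh:
			// The rule would be evaluated here.
		case <-rt.stopCh:
			return
		}
	}
}

// tick performs heartbeat number n against the rules fetched from the
// database and reports which rule UIDs were started, evaluated and stopped.
func (s *scheduler) tick(n int64, rules []rule) (started, evaluated, stopped []string) {
	seen := make(map[string]bool, len(rules))
	for _, r := range rules {
		seen[r.UID] = true
		rt, ok := s.routines[r.UID]
		if !ok { // new rule, or the scheduler has just started
			rt = &routine{evalCh: make(chan int64), stopCh: make(chan struct{})}
			s.routines[r.UID] = rt
			go run(rt)
			started = append(started, r.UID)
		}
		if (n*s.heartbeatSeconds)%r.IntervalSeconds == 0 {
			rt.evalCh <- n // blocks until the goroutine picks the event up
			evaluated = append(evaluated, r.UID)
		}
	}
	for uid, rt := range s.routines {
		if !seen[uid] { // deleted since the last heartbeat
			close(rt.stopCh)
			delete(s.routines, uid)
			stopped = append(stopped, uid)
		}
	}
	sort.Strings(stopped) // map iteration order is random
	return started, evaluated, stopped
}

func main() {
	s := &scheduler{heartbeatSeconds: 10, routines: map[string]*routine{}}
	fmt.Println(s.tick(1, []rule{{UID: "a", IntervalSeconds: 10}, {UID: "b", IntervalSeconds: 20}}))
	fmt.Println(s.tick(2, []rule{{UID: "a", IntervalSeconds: 10}}))
}
```

The sketch uses an unbuffered channel so that each event is handed over before the heartbeat continues; a real scheduler would need to decide what to do when a rule's previous evaluation is still running.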

The function that evaluates each alert rule is called run. It waits for an *evaluation event, which is sent each time the rule's interval (in seconds, configurable per alert rule) has elapsed, and then evaluates the alert rule. To ensure that the scheduler is evaluating the latest version of the alert rule, it compares its local version of the alert rule with the version in the *evaluation event and fetches the latest version from the database if the version numbers do not match. It then invokes the Evaluator, which evaluates any queries, classic conditions or expressions in the alert rule and passes the results of this evaluation to the State Manager. An evaluation can return no results in the case of NoData or Error, a single result in the case of classic conditions, or more than one result if the alert rule is multi-dimensional (i.e. one result per label set). For multi-dimensional alert rules, the results of an evaluation should never contain more than one result per label set.
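
The version check described above might look something like the following. It is a minimal sketch under stated assumptions: alertRule, evaluation, ruleRoutine, ruleFor and the fetch callback are hypothetical names standing in for the real types and the database lookup.

```go
package main

import "fmt"

// alertRule is a hypothetical, stripped-down alert rule.
type alertRule struct {
	UID     string
	Version int64
	Query   string
}

// evaluation is the event sent by the scheduler; it carries the rule
// version the scheduler saw at the last heartbeat.
type evaluation struct {
	RuleVersion int64
}

// ruleRoutine holds the routine's local copy of the rule.
type ruleRoutine struct {
	local   alertRule
	fetch   func(uid string) alertRule // hypothetical database lookup
	fetches int                        // counts lookups, for illustration only
}

// ruleFor returns the rule to evaluate for this event, refreshing the
// local copy from the database only when the version numbers differ.
func (r *ruleRoutine) ruleFor(ev evaluation) alertRule {
	if ev.RuleVersion != r.local.Version {
		r.local = r.fetch(r.local.UID)
		r.fetches++
	}
	return r.local
}

func main() {
	db := map[string]alertRule{"a": {UID: "a", Version: 2, Query: "up == 0"}}
	r := &ruleRoutine{
		local: alertRule{UID: "a", Version: 1, Query: "up == 1"},
		fetch: func(uid string) alertRule { return db[uid] },
	}
	first := r.ruleFor(evaluation{RuleVersion: 2}) // stale copy: fetches
	fmt.Println(first.Query, r.fetches)
	second := r.ruleFor(evaluation{RuleVersion: 2}) // up to date: no fetch
	fmt.Println(second.Query, r.fetches)
}
```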

The State Manager is responsible for determining the current state of the alert rule (normal, pending, firing, etc.) by comparing each evaluation result to the previous evaluations of the same label set in the state cache. Given a label set, it updates the state cache with the new current state and the time of the current evaluation, and appends the current evaluation to the slice of previous evaluations. If the alert changes state (e.g. pending to firing), it also creates an annotation to mark the change on the dashboard and panel for this alert rule.

You might have noticed that so far we have avoided using the word "Alert" and instead talked about evaluation results and the current state of an alert rule. The reason is that, at this point in the evaluation of an alert rule, the State Manager does not know about alerts. It knows only, for each label set, the state of the alert rule, the current evaluation and the previous evaluations.

Notification of alerts

When an evaluation transitions the state of an alert rule for a given label set from pending to firing, or from firing to normal, the scheduler creates an alert instance and passes it to Alertmanager. When a label set transitions from pending to firing, the state of the alert instance is "Firing"; when it transitions from firing to normal, the state of the alert instance is "Normal".
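
The mapping from state transitions to alert instances can be expressed as a small function. The alertInstance type and the alertForTransition helper below are hypothetical and exist only to illustrate the two transitions named above.

```go
package main

import "fmt"

// alertInstance is a hypothetical stand-in for what is passed to Alertmanager.
type alertInstance struct {
	Labels map[string]string
	State  string
}

// alertForTransition returns an alert instance for the two transitions
// that produce one, and false for every other transition.
func alertForTransition(prev, next string, labels map[string]string) (alertInstance, bool) {
	switch {
	case prev == "pending" && next == "firing":
		return alertInstance{Labels: labels, State: "Firing"}, true
	case prev == "firing" && next == "normal":
		return alertInstance{Labels: labels, State: "Normal"}, true
	}
	return alertInstance{}, false
}

func main() {
	labels := map[string]string{"instance": "a"}
	for _, tr := range [][2]string{{"normal", "pending"}, {"pending", "firing"}, {"firing", "normal"}} {
		a, ok := alertForTransition(tr[0], tr[1], labels)
		fmt.Println(tr[0], "->", tr[1], ok, a.State)
	}
}
```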

Which Alertmanager?

In ngalert it is possible to send alerts to the internal Alertmanager, an external Alertmanager, or both.

The internal Alertmanager is called MultiOrgAlertmanager. It creates an Alertmanager for each organization in Grafana to keep organizations isolated from one another. The MultiOrgAlertmanager receives alerts from the scheduler and forwards each alert to the Alertmanager for the correct organization.
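
The per-organization isolation can be pictured as a map from organization ID to Alertmanager. The sketch below is not the MultiOrgAlertmanager API: multiOrg, orgAlertmanager, forward and putAlerts are hypothetical names used only to show the forwarding idea.

```go
package main

import (
	"errors"
	"fmt"
)

type alert struct{ Name string }

// orgAlertmanager is a hypothetical Alertmanager for a single organization.
type orgAlertmanager struct{ received []alert }

func (am *orgAlertmanager) putAlerts(alerts ...alert) {
	am.received = append(am.received, alerts...)
}

// multiOrg keeps one Alertmanager per organization ID.
type multiOrg struct {
	byOrg map[int64]*orgAlertmanager
}

var errNoAlertmanager = errors.New("no Alertmanager for organization")

// forward hands the alerts to the Alertmanager of the given organization
// only, so alerts never cross organization boundaries.
func (m *multiOrg) forward(orgID int64, alerts ...alert) error {
	am, ok := m.byOrg[orgID]
	if !ok {
		return errNoAlertmanager
	}
	am.putAlerts(alerts...)
	return nil
}

func main() {
	m := &multiOrg{byOrg: map[int64]*orgAlertmanager{1: {}, 2: {}}}
	fmt.Println(m.forward(1, alert{Name: "HighCPU"}))
	fmt.Println(len(m.byOrg[1].received), len(m.byOrg[2].received))
	fmt.Println(m.forward(3, alert{Name: "HighCPU"}))
}
```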

When Grafana is configured to send alerts to an external Alertmanager, it does so via the sender, which provides an abstraction over the notification of alerts and the discovery of external Alertmanagers in Prometheus. The sender receives alerts via the SendAlerts function and then passes them to Prometheus.

How does Alertmanager turn alerts into notifications?

Alertmanager receives alerts via the PutAlerts function. Each alert is validated and its annotations and labels are normalized, then the alerts are put in an in-memory structure. The dispatcher iterates over the alerts and matches each one to a route in the Alertmanager configuration.

The alert is then matched to an alert group depending on the configuration in the route. It is then sent through a number of stages, including silencing and inhibition (which is currently not supported), and finally the receiver, which can include wait, de-duplication and retry.
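
To make the flow concrete, here is a deliberately simplified sketch of route matching followed by a chain of stages. It does not reproduce Alertmanager's routing tree or its notification pipeline: the flat first-match route list, the label-based silence stage and the in-memory dedup stage are all assumptions chosen for illustration, and every name in the block is hypothetical.

```go
package main

import "fmt"

type alert struct {
	Labels map[string]string
}

// route matches an alert when every pair in Match is present on the alert.
// An empty Match acts as a catch-all.
type route struct {
	Match    map[string]string
	Receiver string
}

// matchRoute returns the first route whose matchers all apply to the alert.
func matchRoute(routes []route, a alert) (route, bool) {
	for _, r := range routes {
		ok := true
		for k, v := range r.Match {
			if a.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			return r, true
		}
	}
	return route{}, false
}

// stage takes a batch of alerts and returns the ones that pass through.
type stage func([]alert) []alert

// silence drops alerts carrying the given label value.
func silence(label, value string) stage {
	return func(in []alert) []alert {
		var out []alert
		for _, a := range in {
			if a.Labels[label] != value {
				out = append(out, a)
			}
		}
		return out
	}
}

// dedup lets only the first alert per value of the key label through.
func dedup(key string) stage {
	seen := map[string]bool{}
	return func(in []alert) []alert {
		var out []alert
		for _, a := range in {
			if !seen[a.Labels[key]] {
				seen[a.Labels[key]] = true
				out = append(out, a)
			}
		}
		return out
	}
}

// chain runs the stages in order, feeding each stage's output to the next.
func chain(stages ...stage) stage {
	return func(in []alert) []alert {
		for _, s := range stages {
			in = s(in)
		}
		return in
	}
}

func main() {
	routes := []route{
		{Match: map[string]string{"team": "db"}, Receiver: "db-team"},
		{Receiver: "default"},
	}
	a := alert{Labels: map[string]string{"alertname": "DiskFull", "team": "db", "env": "prod"}}
	r, _ := matchRoute(routes, a)
	out := chain(silence("env", "dev"), dedup("alertname"))([]alert{a, a})
	fmt.Println(r.Receiver, len(out))
}
```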

What are notification channels?

Notification channels receive alerts and turn them into notifications. A notification channel is often the last callback in the receiver, after wait, de-duplication and retry.