* Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#62361)
* Alerting: Delete rule group metrics when the rule group is deleted
This commit addresses the issue where the GroupRules metric (a GaugeVec)
keeps its value and is not deleted when an alert rule is removed from the rule registry.
Previously, when an alert rule with orgID=1 was active, the metric was:
grafana_alerting_rule_group_rules{org="1",state="active"} 1
However, after deleting this rule, subsequent calls to updateRulesMetrics
did not update the gauge value, causing the metric to incorrectly remain at 1.
The fix ensures that when updateRulesMetrics is called it
also deletes the group rule metrics with the corresponding label values if needed.
* Simple replace of State.Resolved with State.ResolvedAt
* Retain ResolvedAt time between Normal->Normal transition
* Introduce ResolvedRetention to keep sending recently resolved alerts
* Make ResolvedRetention configurable with resolved_alert_retention
* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention
* Do not reset ResolvedAt during Normal->Pending transition
Initially this was done to be inline with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.
This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.
* Pointers for ResolvedAt & LastSentAt
To avoid awkward time.Time{}.Unix() defaults on persist
* Alerting: fix race condition in (*ngalert/sender.ExternalAlertmanager).Run
* Chore: Fix data races when accessing members of *ngalert/state.FakeInstanceStore
* Chore: Fix data races in tests in ngalert/schedule and enable some parallel tests
* Chore: fix linters
* Chore: add TODO comment to remove loopvar once we move to Go 1.22
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
* Update API to call the scheduler to remove\update an alert rule. When a rule is updated by a user, the scheduler will remove the currently firing alert instances and clean up the state cache.
* Update evaluation loop in the scheduler to support one more channel that is used to communicate updates to it.
* Improved rule deletion from the internal registry.
* Move alert rule version from the internal registry (structure alertRuleInfo) closer rule evaluation loop (to evaluation task structure), which will make the registry values immutable.
* Extract notification code to a separate function to reuse in update flow.
* Fix Annotation creation
- Remove validation of panelID, now annotations are created irrespective on whether they're attached to a panel or not.
- Alwasy attach the annotation to an AlertID
* Fix annotation creation
* fix tests
* Chore: GetDashboardQuery should be dispatched using DispatchCtx
* Fix after merge
* Changes after review
* Various fixes
* Use GetDashboardCtx function instead of GetDashboard
Introduces org-level isolation for the Alertmanager and its components.
Silences, Alerts and Contact points are not separated by org and are not shared between them.
Co-authored with @davidmparrott and @papagian
* Alerting: Expose discovered and dropped Alertmanagers
Exposes the API for discovered and dropped Alertmanagers.
* make admin config poll interval configurable
* update after rebase
* wordsmith
* More wordsmithing
* change name of the config
* settings package too
* Alerting: Send alerts to external Alertmanager(s)
Within this PR we're adding support for registering or unregistering
sending to a set of external alertmanagers. A few of the things that are
going are:
- Introduce a new table to hold "admin" (either org or global)
configuration we can change at runtime.
- A new periodic check that polls for this configuration and adjusts the
"senders" accordingly.
- Introduces a new concept of "senders" that are responsible for
shipping the alerts to the external Alertmanager(s). In a nutshell,
this is the Prometheus notifier (the one in charge of sending the alert)
mapped to a multi-tenant map.
There are a few code movements here and there but those are minor, I
tried to keep things intact as much as possible so that we could have an
easier diff.