Commit Graph

271 Commits

Author SHA1 Message Date
Alexander Akhmetov d44728f4e5 Alerting: Metric to count imported from Prometheus rules (#100847) 2025-03-05 14:02:28 +01:00
Yuri Tseretyan 879b121136 Alerting: Add GUID to alert rule tables (#101321)
* add column guid to alert rule table and rule_guid to rule version table
+ populate the new field with UUID
* update storage and domain models
* patch GUID
* ignore GUID in fingerprint tests
2025-02-28 09:47:25 -05:00
Yuri Tseretyan 32fde6dba4 Alerting: Update scheduler to provide full specification to rule update channel (#101375)
update scheduler's alert rule to accept regular Evaluation in update channel

This makes it accept the full rule definition, which is required in reset state.
2025-02-26 14:39:39 -05:00
Yuri Tseretyan 4cac3158c7 Alerting: Fix alert rule copy to include metadata (#100212)
* copy metadata

* add tests for copy and generator

* extract copy rule to a production method and update usages

* fix tests
2025-02-11 09:46:02 -05:00
Yuri Tseretyan 33b11d5c76 Alerting: Remove ID and OrgID from hash calculation (#100140) 2025-02-05 14:15:02 -05:00
Alexander Akhmetov a0bf9202f5 Alerting: Clear the state cache when the alert routine stops (#99681) 2025-01-28 21:15:19 +02:00
Alexander Akhmetov a28328d764 Alerting: Call the deletion reason provider even if the rule is no longer scheduled (#99571)
Alerting: Call the deletion reason provider even if the rule is not scheduled anymore
2025-01-28 11:34:26 +01:00
Alexander Akhmetov 12bda63871 Alerting: Optional function to find the rule deletion reason (#99422) 2025-01-27 11:35:52 +01:00
Yuri Tseretyan 92d6762a3a Alerting: Store information about user that created\updated alert rule (#99395)
* introduce new fields created_by in rule tables
* update domain model and compat layer to support UpdatedBy
* add alert rule generator mutators for UpdatedBy
* ignore UpdatedBy in diff and hash calculation
* Add user context to alert rule insert/update operations
  Updated InsertAlertRules and UpdateAlertRules methods to accept a user context parameter. This change ensures auditability and better tracking of user actions when creating or updating alert rules. Adjusted all relevant calls and interfaces to pass the user context accordingly.

* set UpdatedBy in PreSave because this is where Updated is set
* Use nil userID for system-initiated updates
This ensures differentiation between system and user-initiated changes for better traceability and clarity in update origins.

---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-01-24 12:09:17 -05:00
Santiago ea6cb8f139 Alerting: Panic when rule being evaluated has unexpected key (#99002) 2025-01-15 14:59:50 +02:00
Santiago 86e8147df3 Alerting: Use AlertRuleKey for comparison before rule evaluation (#98808)
(WIP) Alerting: Use AlertRuleKey for comparison before rule evaluation
2025-01-10 15:31:03 +01:00
Yuri Tseretyan f851379f7d Alerting: Add traceID to rule evaluator logger (#98549) 2025-01-06 15:00:25 -05:00
Alexander Akhmetov bb713cf8e4 Alerting: Add simplified_notifications_section setting to grafana_alerting_simplified_editor_rules metric (#98053) 2024-12-17 11:13:31 +01:00
Alexander Akhmetov 324503ee8b Alerting: Add simplified_notifications_section field to the alert rule metadata (#95988) 2024-11-14 12:55:54 +01:00
Will Browne 25abd57029 Plugins: Update to latest go plugin SDK (0.256.0) (#95065)
* update to latest go plugin SDK

* make update-workspace

* update alerting tests
2024-10-22 15:44:53 +01:00
Alexander Akhmetov 0a4e6ff86b Alerting: Add SaveAlertInstancesForRule instance store method (#94505)
Alerting: Add SaveAlertInstancesForRule method to the InstanceStore interface
2024-10-11 13:47:44 +02:00
Alexander Weaver 393faa8732 Alerting: Move rule evaluation status logic out of prometheus API and into scheduler (#89141)
* Add health fields to rules and an aggregator method to the scheduler

* Move health, last error, and last eval time in together to minimize state processing

* Wire up a readonly scheduler to prom api

* Extract to exported function

* Use health in api_prometheus and fix up tests

* Rename health struct to status

* Fix tests one more time

* Several new tests

* Handle inactive rules

* Push state mapping into state manager

* rename to StatusReader

* Rectify cyclo complexity rebase

* Convert existing package local status implementation to models one

* fix tests

* undo RuleDefs rename
2024-09-30 16:52:49 -05:00
Alexander Akhmetov 0ed70d0b2f Alerting: Add a metric to track the number of rules with simplified editor settings (#93511)
* Alerting: Add a metric to track the number of rules with simplified editor settings
2024-09-20 17:56:40 +02:00
Alexander Akhmetov 9f5b05f936 Alerting: Add metadata field with editor_settings to alert rule (#93245) 2024-09-19 16:43:41 +02:00
William Wernert efe62086f9 Alerting: Add type label rule_group_rules metric (#91425)
* Add group and type labels to rule_group_rules metric

* Don't include group to avoid high cardinality

* Add comments

* Reset rule_group_rules before recording new values

* Edit description for rule_group_rules

* Include ruleGroup combo key in labels

* Fix lint
2024-09-12 17:27:09 +03:00
Alexander Akhmetov 152d3540db Alerting: Log number of dimensions instead of all evaluation results (#92733) 2024-08-30 12:35:02 +02:00
Alexander Weaver 490d6ba2fd Alerting: Extend scheduler user with datasources:read (#92410)
Add permission
2024-08-26 10:59:54 -05:00
Alexander Akhmetov d32e1e009b Alerting: Update prometheus/client_golang to v1.20 (#92070)
Update prometheus/client_golang to v1.20
2024-08-20 11:26:06 +02:00
Alexander Weaver ac5ebe6e4d Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
Alexander Weaver 34ab5fe1f3 Alerting: Restart rule routines if the type changes (#90867)
* Restart when types change

* Wire up test hooks correctly

* testing
2024-08-14 14:57:47 -05:00
Alexander Akhmetov 149f02aebe Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#88289)
* Alerting: Add rule_group label to grafana_alerting_rule_group_rules metric (#62361)

* Alerting: Delete rule group metrics when the rule group is deleted

This commit addresses the issue where the GroupRules metric (a GaugeVec)
keeps its value and is not deleted when an alert rule is removed from the rule registry.
Previously, when an alert rule with orgID=1 was active, the metric was:

  grafana_alerting_rule_group_rules{org="1",state="active"} 1

However, after deleting this rule, subsequent calls to updateRulesMetrics
did not update the gauge value, causing the metric to incorrectly remain at 1.

The fix ensures that when updateRulesMetrics is called it
also deletes the group rule metrics with the corresponding label values if needed.
2024-08-13 13:27:23 +02:00
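The stale-gauge problem described in commit 149f02aebe can be illustrated with a minimal sketch. It uses a map-based stand-in rather than the Prometheus client's GaugeVec, and the names groupKey, gaugeVec, and the shape of updateRulesMetrics are assumptions made for illustration, not the code from the commit. The idea is that each update first deletes every series whose label values no longer appear in the current rule set, then writes the fresh counts.

```go
package main

import "fmt"

// groupKey is a hypothetical label set: organization and rule group.
type groupKey struct {
	Org   string
	Group string
}

// gaugeVec is a simplified stand-in for a labeled gauge.
// It is not the Prometheus client type.
type gaugeVec struct {
	values map[groupKey]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{values: map[groupKey]float64{}}
}

// updateRulesMetrics records the current rule count per group and
// deletes series for groups that are no longer present, so a deleted
// group does not keep reporting its last value.
func (g *gaugeVec) updateRulesMetrics(counts map[groupKey]int) {
	// Drop stale series first. Deleting during range is safe in Go.
	for k := range g.values {
		if _, ok := counts[k]; !ok {
			delete(g.values, k)
		}
	}
	// Then write the fresh values.
	for k, n := range counts {
		g.values[k] = float64(n)
	}
}

func main() {
	g := newGaugeVec()
	k := groupKey{Org: "1", Group: "cpu"}

	g.updateRulesMetrics(map[groupKey]int{k: 1})
	fmt.Println("series after first update:", len(g.values))

	// The rule group was deleted, so the next update carries no entry for it.
	g.updateRulesMetrics(map[groupKey]int{})
	fmt.Println("series after deletion:", len(g.values))
}
```

Without the deletion loop, the second update would leave the series for org "1" at its previous value, which is the behavior the commit message reports.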
Yuri Tseretyan ee78bb653f Alerting: Log rule evaluation error in scheduler (#91585) 2024-08-06 19:27:02 +03:00
Alexander Weaver 72ecde5045 Alerting: Make orgID a direct arg of writer interface (#91422)
make orgID a direct arg of writer interface
2024-08-02 09:37:28 -05:00
William Wernert a1ee84f757 Alerting: Remove duplicate tracing middleware from prom writer (#91353)
Remove duplicate tracing middleware from prom writer
2024-08-01 11:57:14 -04:00
Alexander Weaver 4c71cadd5f Alerting: Detach condition validator from condition evaluator (#91150)
* Detach validator from evaluator

* Drop unnecessary interface and type
2024-07-30 10:55:37 -05:00
Yuri Tseretyan 8323b688c6 Alerting: Improve logging in scheduler and states (#91003)
* handle metadata map nil

* remove double context

* clean up logging in scheduler

* do not reuse loggers from previous ticks

* log the dropped tick

* log tick instead of ticknum

* replace with processing tick logs

* log sending notifications

* update logging in persister to fetch context

* logs to historian

moved them upstream to be able to log when store is overridden
2024-07-29 16:01:48 -04:00
William Wernert 45f298120e Alerting: Return error when writing recorded metrics instead of default writing NaN (#90743)
* Return error instead of default writing NaN
2024-07-22 15:47:02 -04:00
Alexander Weaver 418b077c59 Alerting: Integration testing for recording rules including writes (#90390)
* Add success case and tests for writer using metrics

* Use testable version of clock

* Assert a specific series was written

* Fix linter

* Fix manually constructed writer
2024-07-18 17:14:49 -05:00
Alexander Weaver 88ed77e7e8 Alerting: More graceful handling of NoData in recording rules (#90312)
* Handle NoData as its own case

* Debug

* Scalars parseable by CollectionReader

* fix linter

* Or not and
2024-07-17 15:24:03 -05:00
Yuri Tseretyan c3b9c9b239 Alerting: Send information about alert rule to data source in headers (#90344)
* add support for metadata in conditions and add it to request headers
* support for additional metadata when condition is built
* add additional context to conditions: source and folder title
* add version
* use percent-encoding for header values
2024-07-17 22:55:12 +03:00
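The last bullet of commit c3b9c9b239 mentions percent-encoding header values. A minimal sketch of why that matters: rule and folder titles can contain spaces, slashes, or non-ASCII characters, which are awkward to carry in an HTTP header. The header name and the helper functions below are hypothetical, and using url.PathEscape is an assumption about one reasonable way to do the encoding, not a statement of what the commit uses.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// ruleTitleHeader is a hypothetical header name used only for this illustration.
const ruleTitleHeader = "X-Example-Rule-Title"

// encodeHeaderValue percent-encodes a value so that it contains only
// ASCII characters that are safe to place in a header.
func encodeHeaderValue(v string) string {
	return url.PathEscape(v)
}

// decodeHeaderValue reverses encodeHeaderValue on the receiving side.
func decodeHeaderValue(v string) (string, error) {
	return url.PathUnescape(v)
}

func main() {
	h := http.Header{}
	h.Set(ruleTitleHeader, encodeHeaderValue("CPU usage / prod é"))
	fmt.Println("encoded:", h.Get(ruleTitleHeader))

	decoded, err := decodeHeaderValue(h.Get(ruleTitleHeader))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("decoded:", decoded)
}
```

url.PathEscape is used here instead of url.QueryEscape because the latter turns spaces into "+", which a receiver cannot distinguish from a literal plus sign without knowing the convention.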
Alexander Weaver 111ebd4fb2 Alerting: Create integration testing infra for recording rules (#90306)
* Create some integration testing infra for RRs

* whoops

* Require no error in responding

* fix linter

* Panic, no need to pass testing around

* Extend status test
2024-07-11 14:59:52 -05:00
Alexander Weaver ab32183e18 Alerting: Track recording rule health and last eval info ephemerally (#90247)
* Track health and last eval info

* Read method for status

* Minor tests
2024-07-11 14:05:09 -05:00
Yuri Tseretyan c3b5cabb14 Alerting: Refactor scheduler's rule evaluator to store rule key (#89925) 2024-07-01 16:43:23 -04:00
Yuri Tseretyan 655e477c20 Alerting: Fix flaky test in scheduler's tests (#89923) 2024-07-01 13:31:03 -04:00
Matthew Jacobson 47c9259d75 Alerting: Ensure we update State.LastSentAt before persisting (#89427) 2024-06-25 13:01:26 -04:00
William Wernert fcfa89f864 Alerting: Implement Prometheus remote write for recording rules (#89189)
* Fix timestamp recorded by rule

* Implement prometheus remote write

* Create http client instead of transport

* Address PR comments

* Remove status code label
2024-06-25 17:23:42 +03:00
Matthew Jacobson 3228b64fe6 Alerting: Resend resolved notifications for ResolvedRetention duration (#88938)
* Simple replace of State.Resolved with State.ResolvedAt

* Retain ResolvedAt time between Normal->Normal transition

* Introduce ResolvedRetention to keep sending recently resolved alerts

* Make ResolvedRetention configurable with resolved_alert_retention

* Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention

* Do not reset ResolvedAt during Normal->Pending transition

Initially this was done to be in line with Prom ruler. However, Prom ruler
doesn't keep track of Inactive->Pending/Alerting using the same alert instance,
so it's more understandable that they choose not to retain ResolvedAt. In our
case, since we use the same cached instance to represent the transition, it
makes more sense to retain it.

This should help alleviate some odd situations where temporarily entering
Pending will stop future resolved notifications that would have happened
because of ResolvedRetention.

* Pointers for ResolvedAt & LastSentAt

To avoid awkward time.Time{}.Unix() defaults on persist
2024-06-20 16:33:03 -04:00
William Wernert c62cc25513 Alerting: Configure recording rule writer from config.ini (#89056) 2024-06-12 16:04:46 -04:00
Alexander Akhmetov 667fea6623 Alerting: use hash of labels instead of labels string as the alert state cache key (#88956)
* Alerting: use hash instead of labels as the cache key
* Use data.Labels.Fingerprint to calculate the cache key
2024-06-11 18:34:58 +02:00
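Commit 667fea6623 switches the state cache key from a labels string to a hash of the labels. A minimal sketch of the idea, using only the standard library: the commit itself uses data.Labels.Fingerprint, whereas the FNV-based fingerprint function and the labels type below are illustrative assumptions and not that implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// labels is a hypothetical stand-in for an alert instance's label set.
type labels map[string]string

// fingerprint hashes the labels in sorted key order so that the result
// does not depend on map iteration order.
func fingerprint(l labels) uint64 {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		// 0xff never occurs in valid UTF-8, so it separates names and
		// values without ambiguity.
		h.Write([]byte(k))
		h.Write([]byte{0xff})
		h.Write([]byte(l[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	// The cache is keyed by a fixed-size integer instead of a long string.
	cache := map[uint64]string{}

	a := labels{"alertname": "HighCPU", "instance": "web-1"}
	cache[fingerprint(a)] = "Alerting"

	// The same labels in a different insertion order find the same entry.
	b := labels{"instance": "web-1", "alertname": "HighCPU"}
	state, ok := cache[fingerprint(b)]
	fmt.Println(state, ok)
}
```

A fixed-size key keeps lookups cheap when label sets are large, at the cost of a very small theoretical collision risk.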
Alexander Weaver d004f8a98d Alerting: Recording rules understands errors embedded in dataframes (#88946)
* Make MakeDependencyError public for tests in another package

* Create tests for errors in eval results

* Extract logic to pull frame errors out into exported function

* Maybe we can drop cyclomatic complexity lint suppression now?

* extract frame errors and fail recording rules if frames contain error

* Fix up retry logic to actually work

* Do not retry non retryable errors
2024-06-11 10:37:10 -05:00
Alexander Weaver 58fdb24b0b Alerting: Recording rules appear as type=recording in Prometheus API + better abstraction for type (#88805)
* Wire status through to prom API

* Regenerate swagger
2024-06-07 11:24:06 -05:00
Alexander Weaver a2e21d61f8 Alerting: Remove dead evalRunning guard in rule routine (#88312)
Remove dead guard
2024-06-06 16:15:01 -05:00
William Wernert 5de7d4d06d Alerting: Create writer interface for recording rules (#88459)
* Create writer interface for recording rules

Also create fake impl + use it for stub in scheduler
2024-05-29 22:38:33 +03:00
Alexander Weaver b926b6336d Alerting: Scheduled recording rules execute their queries (#88309)
* Basic eval flow

* Wiring-up

* fix

* Extend todo

* Start with tests

* Include some relevant tests, skip ones that seem to have timing-based race conditions

* Some tests, touch up linter and todo

* Solve TODO

* Add tracing

* Tests to make sure an eval went through

* Wire up feature toggles

* Update pkg/services/ngalert/schedule/recording_rule.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

* Update pkg/services/ngalert/schedule/recording_rule_test.go

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>

---------

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-05-28 10:59:21 -05:00
Alexander Weaver 89b54d06e9 Alerting: Schedule a shim implementation for recording rules (#87939)
* Add shim rule implementation for recording rules

* Give ruleFactory access to the original rule definition

* Schedule shim implementation if the rule is a recording rule

* Fix or suppress linter

* Fix nolint
2024-05-21 16:42:58 -05:00