Commit Graph

296 Commits

Author SHA1 Message Date
Alexander Akhmetov 100528e274 Alerting: Support retry with backoff in alert rule evaluation (#99710) 2025-09-04 13:56:03 +02:00
Yuri Tseretyan 7d32640179 Alerting: Fix ticker tests to not fail if channel is empty (#110538) 2025-09-03 16:21:47 -04:00
Yuri Tseretyan c5667476a7 Alerting: Update ticker to accept logger in the constructor (#110176)
* add logger to ticker
* move ticker to schedule
2025-08-26 12:17:48 -04:00
Gábor Farkas 2e5b55a855 datasources: querier: renamed the "mt" builder to "qs" builder (#109779) 2025-08-19 12:37:56 +02:00
Alexander Akhmetov 4e94e463cf Alerting: Fix private labels filtering test (#109393) 2025-08-08 14:08:36 +00:00
Alexander Akhmetov 89d6756c67 Alerting: Filter out private labels before writing recording rules (#109295) 2025-08-07 17:25:12 +02:00
Moustafa Baiou 16f8359d35 Alerting: Update Alert Rule to use int64 for MissingSeriesEvalsToResolve (#109306) 2025-08-06 21:45:48 -04:00
Sarah Zinger 3fad863fd1 Query Service: Combine SSE handling in single tenant and multi tenant paths (#108041)
* parse via sse

I need to figure out how to handle the pipeline.execute with our own
client. I think this is important for MT reasons, just like using our
own cache (via legacy) is important.

parsing is done though!

* WIP nonsense

* horrible code but i think it works

* Add support for sql expressions config settings

* Cleanup:
- remove spew from nodes.go
- uncomment out plugin context and use in single tenant flow
- make code more readable and add comments

* Cleanup:
- create separate file for mt ds client builder
- ensure error handling is the same for both expressions and regular queries
- other cleanup

* not working but good thoughts

* WIP, vector not working for non sse

* super hacky but i think vectors work now

* delete delete delete

* Comments for future ref

* break out query handling and start test

* add prom debugger

* clean up: remove comments and commented out bits

* fix query_test

* add prom debugger

* create table-driven tests with testdata files

* Fix test

* Add test

* go mod??

* idk

* Remove comment

* go enterprise issue maybe

* Fix codeowners

* Delete

* Remove test data

* Clean up

* logger

* Remove go changes hopefully

* idk go man

* sad

* idk i ran go mod tidy and this is what it wants

* Fix readme, with much help from adam

* some linting and testing errors

* lint

* fix lint

* fix lint register.go

* another lint

* address lint in test

* fix dead code and linters for query_test

* Go mod?

* Struggling with go mod

* Fix test

* Fix another test

* Revert headers change

* It's difficult to test this in OSS because it depends on functionality defined in enterprise; let's bring these tests back in some form in enterprise

* Fix codeowners

---------

Co-authored-by: Adam Simpson <adam@adamsimpson.net>
2025-07-17 17:22:55 -04:00
Ryan McKinley 3f502f305d Chore: Update mocks with recent mockery (#107816) 2025-07-09 09:15:34 +02:00
Alexander Akhmetov e92baba748 Alerting: Support PDC in Grafana-managed recording rules (#106677) 2025-06-17 11:46:34 +02:00
Tito Lins 7688089a57 alerting: stop using rule group idx to calculate alert fingerprint (#106407) 2025-06-11 11:49:45 +02:00
Alexander Akhmetov 82549ea8b3 Alerting: Add state label to prometheus_imported_rules metric (#106365) 2025-06-05 14:24:48 +02:00
Alexander Akhmetov da88e5912f Alerting: Evaluate all imported from Prometheus rules sequentially (#106295)
What is this feature?

Makes all alert rules imported from a Prometheus YAML or Prometheus-compatible data source evaluate sequentially.

Why do we need this feature?

Currently only alert rules [imported via the API](https://grafana.com/docs/grafana-cloud/alerting-and-irm/alerting/alerting-rules/alerting-migration/migration-api/) are evaluated sequentially, because only they have the original alert rule definition in YAML. But alert rules can also be imported [in the UI, and from a YAML file](https://grafana.com/docs/grafana-cloud/alerting-and-irm/alerting/alerting-rules/alerting-migration/), and those are not evaluated sequentially, which can lead to issues with recording rules.
2025-06-05 12:08:44 +02:00
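The sequential evaluation described in the commit above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Grafana's Go implementation; the `Rule` type, its `group` and `index` fields, and both function names are hypothetical. The idea it shows is only that rules in the same group run one after another in their group order, so that an alert rule depending on a recording rule sees fresh data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple


class Rule(NamedTuple):
    group: str   # hypothetical group key
    index: int   # hypothetical position of the rule within its group
    title: str


def build_sequences(rules: Iterable[Rule]) -> Dict[str, List[Rule]]:
    """Bucket rules by group and order each bucket by position."""
    groups: Dict[str, List[Rule]] = defaultdict(list)
    for rule in rules:
        groups[rule.group].append(rule)
    return {g: sorted(rs, key=lambda r: r.index) for g, rs in groups.items()}


def evaluate_sequentially(
    rules: Iterable[Rule], evaluate: Callable[[Rule], None]
) -> None:
    """Evaluate each group's rules one after another, in group order."""
    for sequence in build_sequences(rules).values():
        for rule in sequence:
            evaluate(rule)
```

In this sketch a group is processed only after the previous one finishes; a real scheduler would more likely run separate groups concurrently and keep only the order within a group.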
Alexander Akhmetov 6ff67722b8 Alerting: Include rules imported in the UI into prometheus_imported_rules metric (#106229) 2025-06-02 12:47:09 +02:00
Alexander Akhmetov e256f2d5e2 Alerting: Enable recording rules by default (#105603) 2025-06-02 10:56:05 +02:00
Fayzal Ghantiwala d94a59cd08 Alerting: Fix flaky test (#104450)
Fix flaky test
2025-04-24 12:28:28 +01:00
Fayzal Ghantiwala 3a054d5e00 Alerting: Add FiredAt field to State (#104046)
* Add FiredAt field to the State

* Update featuretoggle files

* Fix lint errors

* Fix test compilation

* Remove random print line + formatting

* Address PR comments
2025-04-22 12:16:38 +01:00
Mariell Hoversholm 757be6365a CI: Bump golangci-lint to 2.0.2 (#103572) 2025-04-10 14:42:23 +02:00
Yuri Tseretyan dc0083d879 Alerting: Sequential evaluation of rules in group (#98829)
* introduce RulesGroupComparer

* extract runJob method

* implement sequential evaluation

* Make sequence building testable & add comments

* Also run callback in recording rules + add tests

* Improve tests

* Address PR comments

---------

Co-authored-by: William Wernert <william.wernert@grafana.com>
2025-04-02 23:10:32 +03:00
Alexander Akhmetov 695ac91290 Alerting: Add backend support for keep_firing_for (#100750)
What is this feature?

This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).

keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation, if the evaluation interval is longer than keep_firing_for), the alert transitions to "Normal" unless it starts alerting again:

Before

+----------+     +----------+
| Alerting |---->|  Normal  |
+----------+     +----------+

-----
After

+----------+      +------------+     +----------+
| Alerting |----->| Recovering |---->|  Normal  |
+----------+      +------------+     +----------+

Why do we need this feature?

This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
2025-03-18 11:24:48 +01:00
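The Alerting, Recovering, Normal transition described in the commit above can be sketched as a small state function. This is an illustrative Python sketch under stated assumptions, not the code from the PR: the function name `next_state`, its parameters, and the `recovering_since` bookkeeping are hypothetical, and the pending ("for") period and all other alert states are deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

ALERTING, RECOVERING, NORMAL = "Alerting", "Recovering", "Normal"


def next_state(
    current: str,
    firing: bool,
    now: datetime,
    recovering_since: Optional[datetime],
    keep_firing_for: timedelta,
) -> Tuple[str, Optional[datetime]]:
    """Return the new state and the time recovery started (if any)."""
    if firing:
        # The condition is met again: back to Alerting, recovery is cancelled.
        return ALERTING, None
    if current == ALERTING:
        # The condition just returned to normal.
        if keep_firing_for > timedelta(0):
            return RECOVERING, now
        return NORMAL, None  # no keep_firing_for: resolve immediately
    if current == RECOVERING and recovering_since is not None:
        # Stay in Recovering until the recovery period has elapsed.
        if now - recovering_since < keep_firing_for:
            return RECOVERING, recovering_since
    return NORMAL, None
```

Because the function only runs at evaluation time, an evaluation interval longer than `keep_firing_for` means the first check after recovery started already finds the period elapsed, which matches the "after the next evaluation" case in the description.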
Alexander Akhmetov 7dd6f52630 Alerting: Add MissingSeriesEvalsToResolve option to the AlertRule (#101184) 2025-03-11 22:12:06 +01:00
Steve Simpson bbab62ce39 Alerting: Select remote write path dependent on metrics backend type. (#101891)
The remote write path differs based on whether the data source is actually
Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want
users to have to specify the path, so this change determines the path as
best it can.

In the future we may have to make this configurable per-datasource
to cater for setups where it's impossible to determine the correct path.
2025-03-11 13:45:16 +01:00
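A minimal sketch of choosing a remote write path from the backend type, as the commit above describes. This is Python for illustration, not the Grafana code. The backend labels, the `remote_write_url` helper, and the fallback to the Prometheus path are hypothetical, and the paths themselves are assumptions based on the endpoints these projects commonly document; the commit message does not list them.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed remote write paths per backend type (not taken from the commit).
REMOTE_WRITE_PATHS = {
    "prometheus": "/api/v1/write",
    "mimir": "/api/v1/push",
    "cortex": "/api/v1/push",
    "cortex-legacy": "/api/prom/push",  # older Cortex versions
}


def remote_write_url(base_url: str, backend_type: str) -> str:
    """Append the remote write path for the given backend to base_url.

    Unknown backend types fall back to the Prometheus path, which is an
    assumption made for this sketch.
    """
    path = REMOTE_WRITE_PATHS.get(
        backend_type.lower(), REMOTE_WRITE_PATHS["prometheus"]
    )
    parts = urlsplit(base_url)
    # Keep any existing path prefix and avoid a double slash.
    full_path = parts.path.rstrip("/") + path
    return urlunsplit((parts.scheme, parts.netloc, full_path, "", ""))
```

A lookup table keeps the mapping easy to override, which fits the commit's note that the path may need to become configurable per data source.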
Steve Simpson cc80681beb Alerting: Extend recording rules test to exercise writing with data sources. (#101775)
The change to use WriteDatasource was done in a previous commit; this adds a
test case using DatasourceWriter, in addition to the one using PrometheusWriter.
2025-03-07 13:51:50 +01:00
Steve Simpson eed07cf503 Alerting: Refactor NewPrometheusWriter function. (#101706)
* Alerting: Refactor NewPrometheusWriter function.

In order to re-use PrometheusWriter, change the function to take a
PrometheusWriterConfig instead of RecordingRulesSettings, and adapt the old
interface onto the new interface.

* Make linter happy
2025-03-06 16:13:22 +01:00
Steve Simpson b7dcfcedcb Alerting: Extend recording rule definitions/interfaces with data source. (#101678)
Extend the recording rule definition to include the target data source, allowing
configuration of where the output of the recording rule is written to. Also
extends the relevant interfaces in preparation for the next set of changes.
2025-03-06 14:09:17 +01:00
Alexander Akhmetov d44728f4e5 Alerting: Metric to count imported from Prometheus rules (#100847) 2025-03-05 14:02:28 +01:00
Yuri Tseretyan 879b121136 Alerting: Add GUID to alert rule tables (#101321)
* add column guid to alert rule table and rule_guid to rule version table
+ populate the new field with UUID
* update storage and domain models
* patch GUID
* ignore GUID in fingerprint tests
2025-02-28 09:47:25 -05:00
Yuri Tseretyan 32fde6dba4 Alerting: Update scheduler to provide full specification to rule update channel (#101375)
update scheduler's alert rule to accept regular Evaluation in update channel

This makes it accept the full rule definition, which is required in reset state.
2025-02-26 14:39:39 -05:00
Yuri Tseretyan 4cac3158c7 Alerting: Fix alert rule copy to include metadata (#100212)
* copy metadata

* add tests for copy and generator

* extract copy rule to a production method and update usages

* fix tests
2025-02-11 09:46:02 -05:00
Yuri Tseretyan 33b11d5c76 Alerting: Remove ID and OrgID from hash calculation (#100140) 2025-02-05 14:15:02 -05:00
Alexander Akhmetov a0bf9202f5 Alerting: Clear the state cache when the alert routine stops (#99681) 2025-01-28 21:15:19 +02:00
Alexander Akhmetov a28328d764 Alerting: Call the deletion reason provider even if the rule is no longer scheduled (#99571)
Alerting: Call the deletion reason provider even if the rule is not scheduled anymore
2025-01-28 11:34:26 +01:00
Alexander Akhmetov 12bda63871 Alerting: Optional function to find the rule deletion reason (#99422) 2025-01-27 11:35:52 +01:00
Yuri Tseretyan 92d6762a3a Alerting: Store information about user that created/updated alert rule (#99395)
* introduce new fields created_by in rule tables
* update domain model and compat layer to support UpdatedBy
* add alert rule generator mutators for UpdatedBy
* ignore UpdatedBy in diff and hash calculation
* Add user context to alert rule insert/update operations
  Updated InsertAlertRules and UpdateAlertRules methods to accept a user context parameter. This change ensures auditability and better tracking of user actions when creating or updating alert rules. Adjusted all relevant calls and interfaces to pass the user context accordingly.

* set UpdatedBy in PreSave because this is where Updated is set
* Use nil userID for system-initiated updates
This ensures differentiation between system and user-initiated changes for better traceability and clarity in update origins.

---------

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2025-01-24 12:09:17 -05:00
Santiago ea6cb8f139 Alerting: Panic when rule being evaluated has unexpected key (#99002) 2025-01-15 14:59:50 +02:00
Santiago 86e8147df3 Alerting: Use AlertRuleKey for comparison before rule evaluation (#98808)
(WIP) Alerting: Use AlertRuleKey for comparison before rule evaluation
2025-01-10 15:31:03 +01:00
Yuri Tseretyan f851379f7d Alerting: Add traceID to rule evaluator logger (#98549) 2025-01-06 15:00:25 -05:00
Alexander Akhmetov bb713cf8e4 Alerting: Add simplified_notifications_section setting to grafana_alerting_simplified_editor_rules metric (#98053) 2024-12-17 11:13:31 +01:00
Alexander Akhmetov 324503ee8b Alerting: Add simplified_notifications_section field to the alert rule metadata (#95988) 2024-11-14 12:55:54 +01:00
Will Browne 25abd57029 Plugins: Update to latest go plugin SDK (0.256.0) (#95065)
* update to latest go plugin SDK

* make update-workspace

* update alerting tests
2024-10-22 15:44:53 +01:00
Alexander Akhmetov 0a4e6ff86b Alerting: Add SaveAlertInstancesForRule instance store method (#94505)
Alerting: Add SaveAlertInstancesForRule method to the InstanceStore interface
2024-10-11 13:47:44 +02:00
Alexander Weaver 393faa8732 Alerting: Move rule evaluation status logic out of prometheus API and into scheduler (#89141)
* Add health fields to rules and an aggregator method to the scheduler

* Move health, last error, and last eval time in together to minimize state processing

* Wire up a readonly scheduler to prom api

* Extract to exported function

* Use health in api_prometheus and fix up tests

* Rename health struct to status

* Fix tests one more time

* Several new tests

* Handle inactive rules

* Push state mapping into state manager

* rename to StatusReader

* Rectify cyclo complexity rebase

* Convert existing package local status implementation to models one

* fix tests

* undo RuleDefs rename
2024-09-30 16:52:49 -05:00
Alexander Akhmetov 0ed70d0b2f Alerting: Add a metric to track the number of rules with simplified editor settings (#93511)
* Alerting: Add a metric to track the number of rules with simplified editor settings
2024-09-20 17:56:40 +02:00
Alexander Akhmetov 9f5b05f936 Alerting: Add metadata field with editor_settings to alert rule (#93245) 2024-09-19 16:43:41 +02:00
William Wernert efe62086f9 Alerting: Add type label rule_group_rules metric (#91425)
* Add group and type labels to rule_group_rules metric

* Don't include group to avoid high cardinality

* Add comments

* Reset rule_group_rules before recording new values

* Edit description for rule_group_rules

* Include ruleGroup combo key in labels

* Fix lint
2024-09-12 17:27:09 +03:00
Alexander Akhmetov 152d3540db Alerting: Log number of dimensions instead of all evaluation results (#92733) 2024-08-30 12:35:02 +02:00
Alexander Weaver 490d6ba2fd Alerting: Extend scheduler user with datasources:read (#92410)
Add permission
2024-08-26 10:59:54 -05:00
Alexander Akhmetov d32e1e009b Alerting: Update prometheus/client_golang to v1.20 (#92070)
Update prometheus/client_golang to v1.20
2024-08-20 11:26:06 +02:00
Alexander Weaver ac5ebe6e4d Alerting: Add enablement flag for recording rules (#92032)
* Add enablement flag

* Disable if toggle not enabled
2024-08-19 12:01:00 -05:00
Alexander Weaver 34ab5fe1f3 Alerting: Restart rule routines if the type changes (#90867)
* Restart when types change

* Wire up test hooks correctly

* testing
2024-08-14 14:57:47 -05:00