Commit Graph

191 Commits

Author SHA1 Message Date
Alexander Weaver d8f07f40ef [v10.2.x] Alerting: Add setting to distribute rule group evaluations over time (#81404)
* Alerting: Add setting to distribute rule group evaluations over time (#80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels

* Fix API change in registry.go

* empty commit to kick build
2024-02-28 13:29:30 +01:00
gotjosh c631261681 Alerting: Attempt to retry retryable errors (#79161)
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 20:45:08 +00:00
gotjosh 07915703fe Revert "Alerting: Attempt to retry retryable errors" (#79158)
Revert "Alerting: Attempt to retry retryable errors (#79037)"

This reverts commit 3e51cf0949.
2023-12-06 19:12:01 +00:00
gotjosh 3e51cf0949 Alerting: Attempt to retry retryable errors (#79037)
* Alerting: Attempt to retry retryable errors

Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 16:35:22 +00:00
Santiago 61cb26711e Alerting: Fetch alerts from a remote Alertmanager (#75844)
* Alerting: post alerts to the remote Alertmanager and fetch them

* fix broken tests

* Alerting: Add Mimir Backend image to devenv (blocks)

* add alerting as code owner for mimir_backend block

* Alerting: Use Mimir image to run integration tests for the remote Alertmanager

* skip integration test when running all tests

* skipping integration test when no Alertmanager URL is provided

* fix bad host for mimir_backend

* remove basic auth testing until we have an nginx image in our CI

* add integration tests for alerts

* fix tests

* change SendCtx -> Send, add context.Context to Send, fix CI

* add reover() for functions from the Prometheus Alertmanager HTTP client that could panic

* add TODO to implement PutAlerts in a way that mimicks what Prometheus does

* fix log format
2023-10-19 11:27:37 +02:00
Marcus Efraimsson e4c1a7a141 Tracing: Standardize on otel tracing (#75528) 2023-10-03 14:54:20 +02:00
Steve Simpson 894f420014 Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158) 2023-09-20 15:07:02 +02:00
Will Browne e855efb13d Plugins: Move store and plugin dto to pluginsintegration (#74655)
move store and plugin dto
2023-09-11 13:59:24 +02:00
Ryan McKinley 025b2f3011 Chore: use any rather than interface{} (#74066) 2023-08-30 18:46:47 +03:00
Yuri Tseretyan 938e26b59f Alerting: Add new metrics and tracings to state manager and scheduler (#71398)
* add metrics and tracing to state manager

* propagate tracer to state manager

* add scheduler metrics

* fix backtesting

* add test for state metrics

* remove StateUpdateCount

* update docs

* metrics can be null

* add tracer to new tests
2023-08-16 09:04:18 +02:00
Yuri Tseretyan c7598cc6fb Alerting: Add ability to control scheduler tick interval via config (#71980)
* add ability to control scheduler interval via config
* add feature flag `configurableSchedulerTick`
2023-07-26 12:44:12 -04:00
Will Browne a8577c21ba Plugins: Migrate PluginStore mock to pre-existing fakes package (#71664)
* migrate to existing fakes package

* fix imports
2023-07-17 10:21:44 +00:00
Kyle Brandt f6a28cadbc Alerting: (Chore/Instrumentation) Add traceID to logs with contextual logger (#71289)
Alerting: (Chore) Add traceID to logs with contextual logger
2023-07-11 10:59:52 +02:00
Yuri Tseretyan ada325de2a Alerting: Use unsafe.Slice for hashing a string during rule fingerprint calculation (#71000) 2023-06-30 14:58:23 -04:00
George Robinson 7edbe72483 Alerting: Support concurrent queries for saving alert instances (#70525)
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
2023-06-23 11:36:07 +01:00
SatVeer Singh 1bfa3a0f1e Chore: Replace go-multierror with errors package (#66432)
* code refactor and type assertions added to tests

* no-lint rule added for specific line
2023-06-19 12:29:45 +03:00
Matthew Jacobson ba3994d338 Alerting: Repurpose rule testing endpoint to return potential alerts (#69755)
* Alerting: Repurpose rule testing endpoint to return potential alerts

This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations.

The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent.

This means that the endpoint will, among other things:

- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
2023-06-08 18:59:54 -04:00
Yuri Tseretyan 9eb10bee1f Alerting: Scheduler use rule fingerprint instead of version (#66531)
* implement calculation of fingerprint for ruleWithFolder
* update scheduler to use fingerprint instead of rule's version
2023-04-28 10:42:16 -04:00
Santiago b0881daf23 Alerting: Use URLs in image annotations (#66804)
* use tokens or urls in image annotations

* improve tests, fix some comments

* fix empty tokens

* code review changes, check for url before checking for token (support old token formats)
2023-04-26 13:06:18 -03:00
Kyle Brandt 840fb32ad8 SSE: (Instrumentation) Add Tracing (#66700)
spans are prefixed `SSE.`
2023-04-18 08:04:51 -04:00
Kyle Brandt 2f13c851e4 SSE: (Chore/Instrumentation) Add ds_queries_total metric and move met… (#66695)
* SSE: (Chore/Instrumentation) Add ds_queries_total metric and move metrics to service
2023-04-17 16:12:44 -07:00
Kyle Brandt e78be44e1a SSE: Dataplane Compliance (#65927)
Takes a specific code path for data that identifies itself as dataplane instead of "guessing" what the data is.

The data must identify itself by being in the dataplane by having both the following frame metadata properties:

- TypeVersion property that is greater than 0.0
- 'Type' property

The flag is disableSSEDataplane and disables this functionality and uses the old code for all queries regardless.

See https://github.com/grafana/grafana-plugin-sdk-go/blob/main/data/contract_docs/contract.md for dataplane details.
2023-04-12 12:24:34 -04:00
gotjosh 1c3ce0735f Alerting: Tiny refactor on the eval and schedule packages (#66130)
* Alerting: Tiny refactor on the eval and schedule packages

two very small things:

- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2023-04-06 16:02:28 +01:00
Alexander Weaver 9bcf8819d3 Alerting: Handful of small adjustments to log levels and parameters (#64572)
Calculate duration earlier in scheduler
2023-03-17 12:15:49 +00:00
Yuri Tseretyan 85a954cd81 Alerting: Update scheduler to get updates only from database (#64635)
* stop using the scheduler's Update and Delete methods all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
2023-03-14 18:02:51 -04:00
Alex Moreno f60dc4441f Alerting: Add status label to GroupRules metric (#63454)
* Add status label to GroupRules metric

* Add state (active and paused) label to GrouRules

* Add active/paused metrics tests
2023-02-23 12:38:27 +01:00
Steve Simpson 4d1a2c3370 Alerting: Move rule_groups_rules metric from State to Scheduler. (#63144)
The `rule_groups_rules` metric is currently defined and computed by `State`.
It makes more sense for this metric to be computed off of the configured rule
set, not based on the rule evaluation state. There could be an edge condition
where a rule does not have a state yet, and so is uncounted.

Additionally, we would like this metric (and others), to have a `rule_group`
label, and this is much easier to achieve if the metric is produced from the
`Scheduler` package.
2023-02-09 17:05:19 +01:00
Yuri Tseretyan f066e8cdcd Alerting: Update to alerting 20230203015918-0e4e2675d7aa (after refactoring) (#62823)
* add alerting prefix to some packages from alerting that have similar names in prometheus alertmanager
2023-02-03 11:36:49 -05:00
ismail simsek 91221bc436 Expressions: Fixes the issue showing expressions editor (#62510)
* Use suggested value for uid

* update the snapshot

* use __expr__

* replace all -100 with __expr__

* update snapshot

* more changes

* revert redundant change

* Use expr.DatasourceUID where it's possible

* generate files
2023-01-31 18:50:10 +01:00
Alex Moreno 7a465f42a6 Alerting: Allow pausing alerts from provisioning (#62263)
* Allow pausing alerts from provisioning

* Update swagger

* Add IsPaused to provision export endpoints

* Add pause field in sample.yml

* Add exception for reset state in first loop iteration of scheduler if rule is paused

* Update provision definition and swagger docs

* Fix provisioning export tests

* Suggestion: Simplify if condition

* Add more context to a comment
2023-01-30 16:29:05 +01:00
Serge Zaitsev d6d4097567 Chore: Fix goimports grouping in alerting (#62424)
* fix goimports

* fix goimports order
2023-01-30 09:55:35 +01:00
Yuri Tseretyan 05bf241952 Alerting: Update state manager to return StateTransitions when Delete or Reset (#62264)
* update Delete and Reset methods to return state transitions

this will be used by notifier code to decide whether alert needs to be sent or not.

* update scheduler to provide reason to delete states and use transitions

* update FromAlertsStateToStoppedAlert to accept StateTransition and filter by old state

* fixup

* fix tests
2023-01-27 09:46:21 +01:00
Alex Moreno 531b439cf1 Alerting: Add alert pausing feature (#60734)
* Add field in alert_rule model, add state to alert_instance model, and state to eval

* Remove paused state from eval package

* Skip paused alert rules in scheduler

* Add migration to add is_paused field to alert_rule table

* Convert to postable alerts only if not normal, pernding, or paused

* Handle paused eval results in state manager

* Add Paused state to eval package

* Add paused alerts logic in scheduler

* Skip alert on scheduler

* Remove paused status from eval package

* Apply suggestions from code review

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Remove state

* Rethink schedule and manager for paused alerts

* Change return to continue

* Remove unused var

* Rethink alert pausing

* Paused alerts storing annotations

* Only add one state transition

* Revert boolean method renaming refactor

* Revert take image refactor

* Make registry errors public

* Revert method extraction for getting a folder title

* Revert variable renaming refactor

* Undo unnecessary changes

* Revert changes in test

* Remove IsPause check in PatchPartiLAlertRule function

* Use SetNormal to set state

* Fix text by returning to old behaviour on alert rule deletion

* Add test in schedule_unit_test.go to test ticks with paused alerts

* Add coment to clarify usage of context.Background()

* Add comment to clarify resetStateByRuleUID method usage

* Move rule get to a more limited scope

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* rum gofmt on pkg/services/ngalert/schedule/schedule.go

* Remove defer cancel for context

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/testing.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* skip scheduler rule state clean up on paused alert rule

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Fix mock in test

* Add (hopefully) final suggestions

* Use error channel from recordAnnotationsSync to cancel context

* Run make gen-cue

* Place pause alert check in channel update after version check

* Reduce branching un update channel select

* Add if for error and move code inside if in state manager ResetStateByRuleUID

* Add reason to logs

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Do not delete alert rule routine, just exit on eval if is paused

* Reduce branching and create-close a channel to avoid deadlocks

* Separate state deletion and state reset (includes history saving)

* Add current pause state in rule route in scheduler

* Split clearState and bring errCh closer to RecordStatesAsync call

* Change rule to ruleMeta in RecordStatesAsync

* copy state to be able to modify it

* Add timeout to context creation

* Shorten the timeout

* Use resetState is rule is paused and deleteState if rule is not paused

* Remove Empty state reason

* Save every rule change in historian

* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID

* Remove useless line

* Remove outdated comment

Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
2023-01-26 18:29:10 +01:00
Santiago b5fa9e3501 Chore: Fix "manger" typo (#61649)
fix mangers -> managers
2023-01-17 23:13:27 +00:00
Yuri Tseretyan 86b5fbbf60 Alerting: Introduce state manager config structure (#61249) 2023-01-10 16:26:15 -05:00
George Robinson 2a291afbae Alerting: Use consts from alerting package (#61241) 2023-01-10 19:59:13 +00:00
Yuri Tseretyan da18c89e91 Alerting: Scheduler to call DeleteAlertRule once when it stops deleted rules (#61189)
scheduler to call DeleteAlertRule once when it stops deleted rules
2023-01-09 14:39:32 -05:00
Yuri Tseretyan 48f1db63ff Alerting: Add support for tracing to alerting scheduler (#61057) 2023-01-06 21:21:43 -05:00
Yuri Tseretyan c5ee4e4ae1 Alerting: Improve rule validation to check if rule uses backend datasources (#58986)
* validate if rule uses backend datasources

* add backend datasource to test

* fix tests

* another forgotten import

* remove unused var
2022-12-08 10:44:02 +01:00
Yuri Tseretyan abb49d96b5 Alerting: update state manager to return StateTransition instead of State (#58867)
* improve test for stale states
* update state manager return StateTransition
* update scheduler to accept state transitions
2022-12-06 13:07:39 -05:00
Alexander Weaver 9977c7ea43 Alerting: Simplify scheduler configuration and remove dependency on Grafana-wide settings (#59735)
* Make scheduler not depend directly on grafana-wide settings

* Re-add missing interval
2022-12-02 16:02:07 -06:00
Alexander Weaver 2bfdda5b68 Alerting: Break dependency between state and image packages (#58381)
* Refactor state and manager to not depend directly on image interface

* Move generic errors to models package

* Move NotAvailableImageService to state as its only references are in state tests

* Move NoopImageService to state package

* Move mock to state package

* Fix linter error

* Fix comment styling

* Fix a couple added references introduced by rebase

* Empty commit to kick build
2022-11-09 15:06:49 -06:00
Yuri Tseretyan bad4f28d0d Alerting: update test TestAlertingTicker to not rely on clock (#58544)
* extract method processTick
* make processTick return scheduled rules
* move state manager tests to state manager
* update test
* move all tests into one file
* remove unused fields
2022-11-09 15:08:57 -05:00
Yuri Tseretyan 3621cf5a12 Alerting: Update handling of stale state (#58276)
* delete all stale states in one lock
* do not use touched states to detect stale rely only on LastEvaluationTime maintained correctly
* fix tests to use correct eval time
* delete unused method
2022-11-07 11:03:53 -05:00
Neel db1fd10ff1 Alerting: Append org ID to alert notification URLs (#57123) 2022-11-07 16:03:25 +00:00
Yuri Tseretyan 978f1119d7 Alerting: Run state manager as regular sub-service (#58246) 2022-11-04 17:06:47 -04:00
Ryan McKinley e6a9fa1cf9 ServiceAccounts: enable service accounts after IsRealUser change (#58263)
* suppor service accounts

* add: IsServiceAccount to scheduleUser in scheduler

Co-authored-by: eleijonmarck <eric.leijonmarck@gmail.com>
2022-11-04 15:53:35 -04:00
Yuri Tseretyan dce8879145 Alerting: Update state manager to accept rule store as Warm method argument (#58244) 2022-11-04 14:23:08 -04:00
Eric Leijonmarck 72d0c6b428 Auth: add IsServiceAccount to IsRealUser (#58015)
* add: IsServiceAccount to SignedInUser and IsRealUser

* fix: linting error

* refactor: add function IsServiceAccountUser()

By adding the function IsServiceAccountUser() we use it to identify for
ServiceAccounts in the HasUniqueID() since caching is built up on having
a uniqueID, see comment: https://github.com/grafana/grafana/pull/58015#discussion_r1011361880
2022-11-04 12:39:54 +00:00
Yuriy Tseretyan e3a4bde622 Alerting: Condition evaluator with cached pipeline (#57479)
* create rule evaluator
* load header from the context
* init one factory
* update scheduler
2022-11-02 10:13:39 -04:00