Commit Graph

218 Commits

Author SHA1 Message Date
grafana-delivery-bot[bot] 152cce216a [v11.0.x] Alerting: Fix scheduler to sort rules before evaluation (#88021)
* Alerting: Fix scheduler to sort rules before evaluation (#88006)

sort rules scheduled for evaluation to make sure that the order is stable between evaluations.
This is especially important in HA mode.

(cherry picked from commit 05d6813a09)

* use old generators

---------

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-05-20 17:51:31 +03:00
grafana-delivery-bot[bot] 74b0f223f3 [v11.0.x] Alerting: use logger with same context within rule scheduling loop (#87936)
Alerting: use logger with same context within rule scheduling loop (#87934)

(cherry picked from commit f410c7fca1)

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
2024-05-15 16:33:32 -04:00
grafana-delivery-bot[bot] e5237e5ac2 [v11.0.x] Alerting: Do not store series values from past evaluations in state manager for no reason (#87845)
Alerting: Do not store series values from past evaluations in state manager for no reason (#87525)

Do not store previous execution results on states

(cherry picked from commit a6a9ab4008)

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2024-05-14 12:56:14 -05:00
grafana-delivery-bot[bot] 681554c376 [v11.0.x] Alerting: Reduce set of fields that could trigger alert state change (#86266)
Alerting: Reduce set of fields that could trigger alert state change (#83496)

We want to avoid too much change of alert state based on change on
alert's fields. For that we ignore some fields from the diff.

(cherry picked from commit 6f38ac6615)

Co-authored-by: Benoit Tigeot <benoittgt@users.noreply.github.com>
2024-04-16 10:35:59 +01:00
Ashley Harrison 20d6961030 [v11.0.x] Chore: Bump golangci-lint v1.57.1 (#86131)
Chore: Bump golangci-lint v1.57.1 (#84998)

* bump golangci-lint v1.57.1

* update setting

* remove goconst

* fix linting issues

* prettier

* fix G601

* go mod tidy
go work sync

(cherry picked from commit 6137c4e0a6)

Co-authored-by: ismail simsek <ismailsimsek09@gmail.com>
2024-04-15 14:40:35 +01:00
grafana-delivery-bot[bot] 56ea839a82 [v11.0.x] Alerting: Fix evaluation metrics to not count retries (#86059)
Alerting: Fix evaluation metrics to not count retries (#85873)

* Change evaluation metrics to only count once per eval, and add new metrics.

* Cosmetic: Move eval total Inc() to orginal place.

(cherry picked from commit ad7f804255)

Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
2024-04-12 19:49:57 +02:00
Alexander Weaver 6c5e94095d Alerting: Scheduler and registry handle rules by an interface (#84044)
* export Evaluation

* Export Evaluation

* Export RuleVersionAndPauseStatus

* export Eval, create interface

* Export update and add to interface

* Export Stop and Run and add to interface

* Registry and scheduler use rule by interface and not concrete type

* Update factory to use interface, update tests to work over public API rather than writing to channels directly

* Rename map in registry

* Rename getOrCreateInfo to not reference a specific implementation

* Genericize alertRuleInfoRegistry into ruleRegistry

* Rename alertRuleInfo to alertRule

* Comments on interface

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>

---------

Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>
2024-03-11 22:57:38 +02:00
Alexander Weaver 201f5d3ac9 Alerting: Extract large closures in ruleRoutine (#84035)
* extract notify

* extract resetState

* move evaluate metrics inside evaluate

* split out evaluate
2024-03-06 16:39:23 -06:00
Alexander Weaver 7a171fd14a Regenerate openapidocs at 1.21.8 to match ci (#84037)
* Regenerate openapidocs at 1.21.8 to match ci

* Adjust trigger to work on the actual outputted files

* Also put go.mod and go.sum in the triggers

* manually fix

* Make an arbitrary change rather than touching the trigger to force a run

* Drop all triggers - run all the time

* Print diff - taken from @papagian's PR

* Manual fixes to swagger doc

---------

Co-authored-by: Ryan McKinley <ryantxu@gmail.com>
2024-03-06 16:08:45 -06:00
Alexander Weaver d5fda06147 Alerting: Decouple rule routine from scheduler (#84018)
* create rule factory for more complicated dep injection into rules

* Rules get direct access to metrics, logs, traces utilities, use factory in tests

* Use clock internal to rule

* Use sender, statemanager, evalfactory directly

* evalApplied and stopApplied

* use schedulableAlertRules behind interface

* loaded metrics reader

* 3 relevant config options

* Drop unused scheduler parameter

* Rename ruleRoutine to run

* Update READMED

* Handle long parameter lists

* remove dead branch
2024-03-06 13:44:53 -06:00
Alexander Weaver 1bb38e8f95 Alerting: Move ruleRoutine to be a method on ruleInfo (#83866)
* Move ruleRoutine to ruleInfo file

* Move tests as well

* swap ruleInfo and scheduler parameters on ruleRoutine

* Fix linter complaint, receiver name
2024-03-04 17:15:55 -06:00
Alexander Weaver f2a9d0a89d Alerting: Refactor ruleRoutine to take an entire ruleInfo instance (#83858)
* Make stop a real method

* ruleRoutine takes a ruleInfo reference directly rather than pieces of it

* Fix whitespace
2024-03-04 15:15:01 -06:00
Alexander Weaver fa51724bc6 Alerting: Move alertRuleInfo and tests to new files (#83854)
Move ruleinfo and tests to new files
2024-03-04 11:24:49 -06:00
William Wernert fabaff9a24 Alerting: Create metric for rules using simple notifications (#82904)
---------

Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
2024-02-16 19:01:49 +02:00
Yuri Tseretyan 1eebd2a4de Alerting: Support for simplified notification settings in rule API (#81011)
* Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping
* Support validation of notification settings when rules are updated

* Implement route generator for Alertmanager configuration. That fetches all notification settings.
* Update multi-tenant Alertmanager to run the generator before applying the configuration.

* Add notification settings labels to state calculation
* update the Multi-tenant Alertmanager to provide validation for notification settings

* update GET API so only admins can see auto-gen
2024-02-15 09:45:10 -05:00
Alexander Weaver d4ae10ecc6 Alerting: Small refactor, move unrelated functions out of fetcher (#82459)
Move unrelated functions out of fetcher
2024-02-14 20:01:32 +02:00
Diego Augusto Molina ff08c0a790 Chore: improve test readability in ngalert/schedule (#82453)
Chore: improve test readability
2024-02-14 14:53:32 -03:00
Diego Augusto Molina 9c29e1a783 Alerting: Fix data races and improve testing (#81994)
* Alerting: fix race condition in (*ngalert/sender.ExternalAlertmanager).Run

* Chore: Fix data races when accessing members of *ngalert/state.FakeInstanceStore

* Chore: Fix data races in tests in ngalert/schedule and enable some parallel tests

* Chore: fix linters

* Chore: add TODO comment to remove loopvar once we move to Go 1.22
2024-02-14 12:45:39 -03:00
Alexander Weaver 5bbe9c6e61 Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212)
* remove jitter feature flag

* Add an out so users can manually disable jitter

* Pass in cfg

* Add TODO to remove knob in future
2024-02-09 15:53:58 -06:00
Alexander Weaver 843c477899 Alerting: Add exported API to scheduler to access currently loaded rules (#82031)
* Add exported API to fetch rule definitions from scheduler

* Add comment
2024-02-07 09:31:22 -06:00
Ashley Harrison 39057552dc QueryField: Handle autocomplete better (#81484)
* extract out function + add unit tests

* add feature toggle and default it to on
2024-01-31 10:01:20 +00:00
Yuri Tseretyan 131c72d655 Alerting: Fix scheduler to group folders by the unique key (orgID and UID) (#81303) 2024-01-30 17:14:11 -05:00
Alexander Weaver 18b9c8fd5f Alerting: Nilcheck JitterStrategyFrom so it can be used in contexts without feature toggles (#80841)
Nilcheck so tests can have a nil feature toggles
2024-01-18 15:43:41 -06:00
Alexander Weaver 00a260effa Alerting: Add setting to distribute rule group evaluations over time (#80766)
* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
2024-01-18 12:48:11 -06:00
Jean-Philippe Quéméner 82638d059f feat(alerting): add state persister interface (#80384) 2024-01-17 13:33:13 +01:00
Alexander Weaver 3c796ecc8f Alerting: Add metric counting rule groups per org (#80669)
* Refactor, fix bad map hint

* Count groups per org
2024-01-16 16:35:56 -06:00
Alexander Weaver 542741f748 Alerting: Log scheduler maxAttempts, guard against invalid retry counts, log retry errors (#80234)
* Log maxAttempts, add guard, log retry errors

* fix whitespace

* Initialize evaluator in TestProcessTicks
2024-01-09 13:19:37 -06:00
Yuri Tseretyan f6a46744a6 Alerting: Support hysteresis command expression (#75189)
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
2024-01-04 11:47:13 -05:00
gotjosh c631261681 Alerting: Attempt to retry retryable errors (#79161)
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 20:45:08 +00:00
gotjosh 07915703fe Revert "Alerting: Attempt to retry retryable errors" (#79158)
Revert "Alerting: Attempt to retry retryable errors (#79037)"

This reverts commit 3e51cf0949.
2023-12-06 19:12:01 +00:00
gotjosh 3e51cf0949 Alerting: Attempt to retry retryable errors (#79037)
* Alerting: Attempt to retry retryable errors

Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-12-06 16:35:22 +00:00
Santiago 61cb26711e Alerting: Fetch alerts from a remote Alertmanager (#75844)
* Alerting: post alerts to the remote Alertmanager and fetch them

* fix broken tests

* Alerting: Add Mimir Backend image to devenv (blocks)

* add alerting as code owner for mimir_backend block

* Alerting: Use Mimir image to run integration tests for the remote Alertmanager

* skip integration test when running all tests

* skipping integration test when no Alertmanager URL is provided

* fix bad host for mimir_backend

* remove basic auth testing until we have an nginx image in our CI

* add integration tests for alerts

* fix tests

* change SendCtx -> Send, add context.Context to Send, fix CI

* add reover() for functions from the Prometheus Alertmanager HTTP client that could panic

* add TODO to implement PutAlerts in a way that mimicks what Prometheus does

* fix log format
2023-10-19 11:27:37 +02:00
Marcus Efraimsson e4c1a7a141 Tracing: Standardize on otel tracing (#75528) 2023-10-03 14:54:20 +02:00
Steve Simpson 894f420014 Alerting: Pass loggers into SchedulerCfg and ManagerCfg. (#75158) 2023-09-20 15:07:02 +02:00
Will Browne e855efb13d Plugins: Move store and plugin dto to pluginsintegration (#74655)
move store and plugin dto
2023-09-11 13:59:24 +02:00
Ryan McKinley 025b2f3011 Chore: use any rather than interface{} (#74066) 2023-08-30 18:46:47 +03:00
Yuri Tseretyan 938e26b59f Alerting: Add new metrics and tracings to state manager and scheduler (#71398)
* add metrics and tracing to state manager

* propagate tracer to state manager

* add scheduler metrics

* fix backtesting

* add test for state metrics

* remove StateUpdateCount

* update docs

* metrics can be null

* add tracer to new tests
2023-08-16 09:04:18 +02:00
Yuri Tseretyan c7598cc6fb Alerting: Add ability to control scheduler tick interval via config (#71980)
* add ability to control scheduler interval via config
* add feature flag `configurableSchedulerTick`
2023-07-26 12:44:12 -04:00
Will Browne a8577c21ba Plugins: Migrate PluginStore mock to pre-existing fakes package (#71664)
* migrate to existing fakes package

* fix imports
2023-07-17 10:21:44 +00:00
Kyle Brandt f6a28cadbc Alerting: (Chore/Instrumentation) Add traceID to logs with contextual logger (#71289)
Alerting: (Chore) Add traceID to logs with contextual logger
2023-07-11 10:59:52 +02:00
Yuri Tseretyan ada325de2a Alerting: Use unsafe.Slice for hashing a string during rule fingerprint calculation (#71000) 2023-06-30 14:58:23 -04:00
George Robinson 7edbe72483 Alerting: Support concurrent queries for saving alert instances (#70525)
This commit adds support for concurrent queries when saving alert
instances to the database. This is an experimental feature in
response to some customers experiencing delays between rule evaluation
and sending alerts to Alertmanager, resulting in flapping. It is
disabled by default.
2023-06-23 11:36:07 +01:00
SatVeer Singh 1bfa3a0f1e Chore: Replace go-multierror with errors package (#66432)
* code refactor and type assertions added to tests

* no-lint rule added for specific line
2023-06-19 12:29:45 +03:00
Matthew Jacobson ba3994d338 Alerting: Repurpose rule testing endpoint to return potential alerts (#69755)
* Alerting: Repurpose rule testing endpoint to return potential alerts

This feature replaces the existing no-longer in-use grafana ruler testing API endpoint /api/v1/rule/test/grafana. The new endpoint returns a list of potential alerts created by the given alert rule, including built-in + interpolated labels and annotations.

The key priority of this endpoint is that it is intended to be as true as possible to what would be generated by the ruler except that the resulting alerts are not filtered to only Resolved / Firing and ready to be sent.

This means that the endpoint will, among other things:

- Attach static annotations and labels from the rule configuration to the alert instances.
- Attach dynamic annotations from the datasource to the alert instances.
- Attach built-in labels and annotations created by the Grafana Ruler (such as alertname and grafana_folder) to the alert instances.
- Interpolate templated annotations / labels and accept allowed template functions.
2023-06-08 18:59:54 -04:00
Yuri Tseretyan 9eb10bee1f Alerting: Scheduler use rule fingerprint instead of version (#66531)
* implement calculation of fingerprint for ruleWithFolder
* update scheduler to use fingerprint instead of rule's version
2023-04-28 10:42:16 -04:00
Santiago b0881daf23 Alerting: Use URLs in image annotations (#66804)
* use tokens or urls in image annotations

* improve tests, fix some comments

* fix empty tokens

* code review changes, check for url before checking for token (support old token formats)
2023-04-26 13:06:18 -03:00
Kyle Brandt 840fb32ad8 SSE: (Instrumentation) Add Tracing (#66700)
spans are prefixed `SSE.`
2023-04-18 08:04:51 -04:00
Kyle Brandt 2f13c851e4 SSE: (Chore/Instrumentation) Add ds_queries_total metric and move met… (#66695)
* SSE: (Chore/Instrumentation) Add ds_queries_total metric and move metrics to service
2023-04-17 16:12:44 -07:00
Kyle Brandt e78be44e1a SSE: Dataplane Compliance (#65927)
Takes a specific code path for data that identifies itself as dataplane instead of "guessing" what the data is.

The data must identify itself by being in the dataplane by having both the following frame metadata properties:

- TypeVersion property that is greater than 0.0
- 'Type' property

The flag is disableSSEDataplane and disables this functionality and uses the old code for all queries regardless.

See https://github.com/grafana/grafana-plugin-sdk-go/blob/main/data/contract_docs/contract.md for dataplane details.
2023-04-12 12:24:34 -04:00
gotjosh 1c3ce0735f Alerting: Tiny refactor on the eval and schedule packages (#66130)
* Alerting: Tiny refactor on the eval and schedule packages

two very small things:

- We had a constructor on something called a `Context` which is not a `context.Context` so let's just name that constructor `NewContext`
- The user that we use to run query evaluations is the same (with some variation) abstract it to a function so that it can be re-used when necessary.

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>

---------

Co-authored-by: Alexander Weaver <weaver.alex.d@gmail.com>
2023-04-06 16:02:28 +01:00