Commit Graph

134 Commits

Author SHA1 Message Date
Steve Simpson 9effb9a708 Alerting: Allow hooking into request handler functions. (#67000)
* Alerting: Allow hooking into request handler functions.

Adds a facility to AlertNG for hooking into API handlers, allowing the
replacement of request handlers for specific paths. One of goals of this
approach was to allow hooking as late as possible in the request, e.g.
after all middleware has been applied, to simplfiy usage.

* Update pkg/services/ngalert/api/hooks.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>

* Update pkg/services/ngalert/api/hooks.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>

* Update pkg/services/ngalert/ngalert.go

Co-authored-by: gotjosh <josue.abreu@gmail.com>

* Fixes to review comments

* Fix passing logger in

---------

Co-authored-by: gotjosh <josue.abreu@gmail.com>
2023-04-24 18:18:44 +02:00
Alexander Weaver da4832724e Alerting: Delete stub for SQL alert state history backend (#65667)
Delete stub for SQL backend
2023-03-31 11:15:56 -05:00
Alexander Weaver b2abb63286 Alerting: Introduce proper feature toggles for common state history backend combinations (#65497)
* define 3 feature toggles for rollout phases

* Pass feature toggles along

* Implement first feature toggle

* Try a different strategy with fall-throughs to specific configurations

* Apply toggle overrides once outside of backend composition

* Emit log messages when we coerce backends

* Run code generator for feature toggle files

* Improve wording in flag descs

* Re-run generator

* Use code-generated constants instead of plain strings

* Use converted enum values rather than strings for pre-parsing
2023-03-30 13:53:21 -05:00
Alexander Weaver a31672fa40 Alerting: Create new state history "fanout" backend that dispatches to multiple other backends at once (#64774)
* Rename RecordStatesAsync to Record

* Rename QueryStates to Query

* Implement fanout writes

* Implement primary queries

* Simplify error joining

* Add test for query path

* Add tests for writes and error propagation

* Allow fanout backend to be configured

* Touch up log messages and config validation

* Consistent documentation for all backend structs

* Parse and normalize backend names more consistently against an enum

* Touch-ups to documentation

* Improve clarity around multi-record blocking

* Keep primary and secondaries more distinct

* Rename fanout backend to multiple backend

* Simplify config keys for multi backend mode
2023-03-17 12:41:18 -05:00
gotjosh 02a8f62021 Alerting: Fix stats that display alert count when using unified alerting (#64852)
* Alerting: Fix stats when using unified alerting
2023-03-17 11:19:18 +00:00
Yuri Tseretyan 85a954cd81 Alerting: Update scheduler to get updates only from database (#64635)
* stop using the scheduler's Update and Delete methods all communication must be via the database
* update scheduler's registry to calculate diff before re-setting the cache
* update fetcher to return the diff generated by registry
* update processTick to update rule eval routine if the rule was updated and it is not going to be evaluated at this tick.
* remove references to the scheduler from api package
* remove unused methods in the scheduler
2023-03-14 18:02:51 -04:00
Alexander Weaver faef3a8258 Alerting: Log error but don't fail initialization if state history connection test fails (#64699)
Don't return init error if ping fails, add tests
2023-03-13 15:54:46 -05:00
Alexander Weaver 19d01dff91 Alerting: Expose Prometheus metrics for persisting state history (#63157)
* Create historian metrics and dependency inject

* Record counter for total number of state transitions logged

* Track write failures

* Track current number of active write goroutines

* Record histogram of how long it takes to write history data

* Don't copy the registerer

* Adjust naming of write failures metric

* Introduce WritesTotal to complement WritesFailedTotal

* Measure TransitionsFailedTotal to complement TransitionsTotal

* Rename all to state_history

* Remove redundant Total suffix

* Increment totals all the time, not just on success

* Drop ActiveWriteGoroutines

* Drop PersistDuration in favor of WriteDuration

* Drop unused gauge

* Make writes and writesFailed per org

* Add metric indicating backend and a spot for future metadata

* Drop _batch_ from names and update help

* Add metric for bytes written

* Better pairing of total + failure metric updates

* Few tweaks to wording and naming

* Record info metric during composition

* Create fakeRequester and simple happy path test using it

* Blocking test for the full historian and test for happy path metrics

* Add tests for failure case metrics

* Smoke test for full annotation persistence

* Create test for metrics on annotation persistence, both happy and failing paths

* Address linter complaints

* More linter complaints

* Remove unnecessary whitespace

* Consistency improvements to help texts

* Update tests to match new descs
2023-03-06 10:40:37 -06:00
Alexander Weaver e77621649d Alerting: Instrument outgoing state history requests using weaveworks/common (#63600)
* Loki backend and client depend on a requester

* Instrument all requests to loki using weaveworks TimedClient

* Construct collector in metrics package
2023-02-23 17:52:02 -06:00
Alexander Weaver 6ad1cfef38 Alerting: Add endpoint for querying state history (#62166)
* Define endpoint and generate

* Wire up and register endpoint

* Cleanup, define authorization

* Forgot the leading slash

* Wire up query and SignedInUser

* Wire up timerange query params

* Add todo for label queries

* Drop comment

* Update path to rules subtree
2023-02-02 11:34:00 -06:00
Alexander Weaver e7ace4ed62 Alerting: Allow separate read and write path URLs for Loki state history (#62268)
Extract config parsing and add tests
2023-01-30 16:30:05 -06:00
Matthew Jacobson c006df375a Alerting: Create endpoints for exporting in provisioning file format (#58623)
This adds provisioning endpoints for downloading alert rules and alert rule groups in a 
format that is compatible with file provisioning. Each endpoint supports both json and 
yaml response types via Accept header as well as a query parameter 
download=true/false that will set Content-Disposition to recommend initiating a download 
or inline display.

This also makes some package changes to keep structs with potential to drift closer 
together. Eventually, other alerting file structs should also move into this new file 
package, but the rest require some refactoring that is out of scope for this PR.
2023-01-27 11:39:16 -05:00
Alex Moreno 531b439cf1 Alerting: Add alert pausing feature (#60734)
* Add field in alert_rule model, add state to alert_instance model, and state to eval

* Remove paused state from eval package

* Skip paused alert rules in scheduler

* Add migration to add is_paused field to alert_rule table

* Convert to postable alerts only if not normal, pernding, or paused

* Handle paused eval results in state manager

* Add Paused state to eval package

* Add paused alerts logic in scheduler

* Skip alert on scheduler

* Remove paused status from eval package

* Apply suggestions from code review

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Remove state

* Rethink schedule and manager for paused alerts

* Change return to continue

* Remove unused var

* Rethink alert pausing

* Paused alerts storing annotations

* Only add one state transition

* Revert boolean method renaming refactor

* Revert take image refactor

* Make registry errors public

* Revert method extraction for getting a folder title

* Revert variable renaming refactor

* Undo unnecessary changes

* Revert changes in test

* Remove IsPause check in PatchPartiLAlertRule function

* Use SetNormal to set state

* Fix text by returning to old behaviour on alert rule deletion

* Add test in schedule_unit_test.go to test ticks with paused alerts

* Add coment to clarify usage of context.Background()

* Add comment to clarify resetStateByRuleUID method usage

* Move rule get to a more limited scope

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* rum gofmt on pkg/services/ngalert/schedule/schedule.go

* Remove defer cancel for context

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/testing.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/schedule/schedule_unit_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Update pkg/services/ngalert/models/instance_test.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* skip scheduler rule state clean up on paused alert rule

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>

* Fix mock in test

* Add (hopefully) final suggestions

* Use error channel from recordAnnotationsSync to cancel context

* Run make gen-cue

* Place pause alert check in channel update after version check

* Reduce branching un update channel select

* Add if for error and move code inside if in state manager ResetStateByRuleUID

* Add reason to logs

* Update pkg/services/ngalert/schedule/schedule.go

Co-authored-by: George Robinson <george.robinson@grafana.com>

* Do not delete alert rule routine, just exit on eval if is paused

* Reduce branching and create-close a channel to avoid deadlocks

* Separate state deletion and state reset (includes history saving)

* Add current pause state in rule route in scheduler

* Split clearState and bring errCh closer to RecordStatesAsync call

* Change rule to ruleMeta in RecordStatesAsync

* copy state to be able to modify it

* Add timeout to context creation

* Shorten the timeout

* Use resetState is rule is paused and deleteState if rule is not paused

* Remove Empty state reason

* Save every rule change in historian

* Add tests for DeleteStateByRuleUID and ResetStateByRuleUID

* Remove useless line

* Remove outdated comment

Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: Armand Grillet <2117580+armandgrillet@users.noreply.github.com>
2023-01-26 18:29:10 +01:00
George Robinson a7eab8e46e Alerting: Support context.Context in Loki interface (#61979)
This commit adds support for canceleable contexts in the Loki
interface.
2023-01-26 09:31:20 +00:00
Alexander Weaver c10713ea76 Alerting: Create query interface for state history along with annotation-based implementation (#61646) 2023-01-19 10:45:31 +01:00
Jean-Philippe Quéméner 44b11d3228 Alerting: support basic auth for the state history loki client (#61696) 2023-01-18 20:24:40 +01:00
Alexander Weaver 1ac89ea040 Alerting: Add client configuration for remote Loki historian backend and test connection (#61114)
* Create loki client type and ping method

* Expose TestConnection on client

* Configure and ping Loki URL

* Close response body reader if present

* Add 30 second timeout

* Remove duplicate close
2023-01-17 13:58:52 -06:00
Yuri Tseretyan 9d57b1c72e Alerting: Do not persist noop transition from Normal state. (#61201)
* add feature flag `alertingNoNormalState`
* update instance database to support exclusion of state in list operation
* do not save normal state and delete transitions to normal
* update get methods to filter out normal state
2023-01-13 18:29:29 -05:00
Yuri Tseretyan 86b5fbbf60 Alerting: Introduce state manager config structure (#61249) 2023-01-10 16:26:15 -05:00
Yuri Tseretyan 48f1db63ff Alerting: Add support for tracing to alerting scheduler (#61057) 2023-01-06 21:21:43 -05:00
Alexander Weaver eb960d9725 Alerting: Add un-documented toggle for changing state history backend, add shells for remote loki and sql (#61072)
* Add toggle for state history backend and shells

* Extract some shared logic and add tests
2023-01-06 12:06:01 -06:00
Alexander Weaver 8c3a5f6da0 Alerting: Allow state history to be disabled through configuration (#61006)
* Add configuration option for if state history should be enabled

* Inject no-op when history is disabled
2023-01-05 12:21:07 -06:00
Yuri Tseretyan ad09feed83 Alerting: rule backtesting API (#57318)
* Implement backtesting engine that can process regular rule specification (with queries to datasource) as well as special kind of rules that have data frame instead of query.
* declare a new API endpoint and model
* add feature toggle `alertingBacktesting`
2022-12-14 09:44:14 -05:00
Yuri Tseretyan c5ee4e4ae1 Alerting: Improve rule validation to check if rule uses backend datasources (#58986)
* validate if rule uses backend datasources

* add backend datasource to test

* fix tests

* another forgotten import

* remove unused var
2022-12-08 10:44:02 +01:00
Alexander Weaver 9977c7ea43 Alerting: Simplify scheduler configuration and remove dependency on Grafana-wide settings (#59735)
* Make scheduler not depend directly on grafana-wide settings

* Re-add missing interval
2022-12-02 16:02:07 -06:00
Sofia Papagiannaki 9855e74b92 Chore: Refactor quota service (#58643)
Chore: Refactor quota service (#57586)

* Chore: refactore quota service

* Apply suggestions from code review
2022-11-14 21:08:10 +02:00
Sofia Papagiannaki 96cdf77995 Revert "Chore: Refactor quota service (#57586)" (#58394)
This reverts commit 326ea86a57.
2022-11-08 11:52:07 +02:00
Sofia Papagiannaki 326ea86a57 Chore: Refactor quota service (#57586)
* Chore: refactore quota service

* Apply suggestions from code review
2022-11-08 10:25:34 +02:00
Yuri Tseretyan 978f1119d7 Alerting: Run state manager as regular sub-service (#58246) 2022-11-04 17:06:47 -04:00
Yuri Tseretyan dce8879145 Alerting: Update state manager to accept rule store as Warm method argument (#58244) 2022-11-04 14:23:08 -04:00
Yuriy Tseretyan e3a4bde622 Alerting: Condition evaluator with cached pipeline (#57479)
* create rule evaluator
* load header from the context
* init one factory
* update scheduler
2022-11-02 10:13:39 -04:00
Yuriy Tseretyan 0a4121cef8 Alerting: Contextual log provider for rule key (#57476)
* create contextual log context provider
* use contextual provider in scheduler
* init logger in the package
* use context for log context
* use context in state manager
2022-10-26 19:16:02 -04:00
Alexander Weaver de46c1b002 Alerting: Improve logs in state manager and historian (#57374)
* Touch up log statements, fix casing, add and normalize contexts

* Dedicated logger for dashboard resolver

* Avoid injecting logger to historian

* More minor log touch-ups

* Dedicated logger for state manager

* Use rule context in annotation creator

* Rename base logger and avoid redundant contextual loggers
2022-10-21 16:16:51 -05:00
Yuriy Tseretyan f3c219a980 Alerting: update format of logs in scheduler (#57302)
* Change the severity level of the log messages
2022-10-20 13:43:48 -04:00
Alexander Weaver 3ddb28bad9 Find-and-replace 'err' logs to 'error' to match log search conventions (#57309) 2022-10-19 17:36:54 -04:00
Kristin Laemmert 05709ce411 chore: remove sqlstore & mockstore dependencies from (most) packages (#57087)
* chore: add alias for InitTestDB and Session

Adds an alias for the sqlstore InitTestDB and Session, and updates tests using these to reduce dependencies on the sqlstore.Store.

* next pass of removing sqlstore imports
* last little bit
* remove mockstore where possible
2022-10-19 09:02:15 -04:00
Kristin Laemmert c61b5e85b4 chore: replace sqlstore.Store with db.DB (#57010)
* chore: replace sqlstore.SQLStore with db.DB

* more post-sqlstore.SQLStore cleanup
2022-10-14 15:33:06 -04:00
Serge Zaitsev 53baecd71f Chore: Move folder service into a separate package (#56591)
* Chore: move folder service interface into a separate package

* copy implementation into a standalone package

* move implementation and tests to the new folder package

* remove leftovers from wire

* add test doubles for folder service

* fix tests in library panels/elements

* fix provideservice in ngalert
2022-10-10 21:47:53 +02:00
Joe Blubaugh b476ae62fb Alerting: Write and Delete multiple alert instances. (#55350)
Prior to this change, all alert instance writes and deletes happened
individually, in their own database transaction. This change batches up
writes or deletes for a given rule's evaluation loop into a single
transaction before applying it.

These new transactions are off by default, guarded by the feature toggle "alertingBigTransactions"

Before:

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8           398           2991381 ns/op         1133537 B/op      27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.619s
```

After:

```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8          1440            816484 ns/op          352297 B/op       6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
    util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
    util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok      github.com/grafana/grafana/pkg/services/ngalert/store   1.383s
```

So we cut time by about 75% and memory allocations by about 60% when
storing and deleting 100 instances.
2022-10-06 14:22:58 +08:00
Alexander Weaver 8df830557a Alerting: Move annotation functionality behind a history persistence interface (#56133)
* Move annotation functionality behind a history persistence interface

* Rename to RecordState

* Fix lint error in import aliasing

* One more import linter error
2022-10-05 15:32:20 -05:00
Alexander Weaver d17ab82b98 Alerting: Break up store.RuleStore interface, delete dead code (#55776)
* Refactor state manager to not depend on rule store interface

* Refactor grafana and proxied ruler APIs to not depend on store.RuleStore

* Refactor folder subscription logic to not use store.RuleStore

* Delete dead code

* Delete store.RuleStore
2022-09-27 08:56:30 -05:00
Alexander Weaver a00879ae21 Alerting: Refactor store to not export its own interface for InstanceStore, delete dead dependency injection (#55772)
* Add consumer-side store interface to state manager

* Remove dead dependency

* Delete dead dependency in API struct

* Delete store-layer InstanceStore interface

* Move fake for state's InstanceStore interface to state package
2022-09-26 13:55:05 -05:00
George Robinson bad4f7fec5 Alerting: Change screenshots to use components (#55156)
* Alerting: Change screenshots to use components

This commit changes screenshots to use a number of components instead of a set of functional wrappers.

It moves the uploading of screenshots from the screenshot package to the image package so we can re-use the same code for both uploading screenshots and server-side images; SingleFlight from the screenshot package to the image package so we can use it for both taking and uploading the screenshot, where as before it was used just for taking the screenshot; and it also removes the use of a cache because we know that screenshots can be taken at most once per tick of the scheduler.
2022-09-21 10:25:07 +01:00
Sofia Papagiannaki 754eea20b3 Chore: SQL store split for annotations (#55089)
* Chore: SQL store split for annotations

* Apply suggestion from code review
2022-09-19 10:54:37 +03:00
Marcus Efraimsson 87afd9cadc Plugins: Remove various custom headers logic (#54146)
Removes various custom headers logic sprinkled around in the backend. 
It should automatically be applied to outgoing HTTP requests via the 
CustomHeadersMiddleware.
This also removes decryption of SecureJSONData to populate custom 
headers in ngalert which seemed to have caused a ton of CPU usage.
2022-08-26 11:56:10 +02:00
Karl Persson 5a1b9d2283 RBAC: Remove DeclareFixedRoles wrapper on Access control and inject service (#54153)
* RBAC: Remove DeclareFixedRoles wrapper on Access control and inject service when needed
2022-08-26 09:59:34 +02:00
Yuriy Tseretyan 5fb778814c Alerting: Update rules version when folder title is updated (#53013)
* remove support for bus from scheduler
* rename event to FolderTitleUpdated and fire only if title has changed
* add method to increase version of all rules that belong to a folder
* update ngalert service to subscribe to folder title change event call data store and update scheduler
* add tests
2022-08-01 19:28:38 -04:00
Jean-Philippe Quéméner 50ae42130b Alerting: take datasources as external alertmanagers into consideration (#52534) 2022-07-20 16:50:49 +02:00
Yuriy Tseretyan 6e1e4a4215 Alerting: Update DbStore to use disabled orgs from the config (#52156)
* update DbStore to use UnifiedAlerting settings
* remove disabled orgs from scheduler and use config in db store instead
* remove test
2022-07-15 14:13:30 -04:00
idafurjes 17ec9cac83 Add delete user from other services/stores (#51912)
* Remove user from preferences, stars, orguser, team member

* Fix lint

* Add Delete user from org and dashboard acl

* Delete user from user auth

* Add DeleteUser to quota

* Add test files and adjust user auth store

* Rename package in wire for user auth

* Import Quota Service interface in other services

* do the same in tests

* fix lint tests

* Fix tests

* Add some tests

* Rename InsertUser and DeleteUser to InsertOrgUser and DeleteOrgUser

* Rename DeleteUser to DeleteByUser in quota

* changing a method name in few additional places

* Fix in other places

* Fix lint

* Fix tests

* Rename DeleteOrgUser to DeleteUserFromAll

* Update pkg/services/org/orgimpl/org_test.go

Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>

* Update pkg/services/preference/prefimpl/inmemory_test.go

Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>

* Rename Acl to ACL

* Fix wire after merge with main

* Move test to uni test

Co-authored-by: Emil Tullstedt <emil.tullstedt@grafana.com>
2022-07-15 18:06:44 +02:00