grafana

Author	SHA1	Message	Date
William Wernert	45f298120e	Alerting: Return error when writing recorded metrics instead of default writing NaN (#90743 ) * Return error instead of default writing NaN	2024-07-22 15:47:02 -04:00
Alexander Weaver	418b077c59	Alerting: Integration testing for recording rules including writes (#90390 ) * Add success case and tests for writer using metrics * Use testable version of clock * Assert a specific series was written * Fix linter * Fix manually constructed writer	2024-07-18 17:14:49 -05:00
Alexander Weaver	88ed77e7e8	Alerting: More graceful handling of NoData in recording rules (#90312 ) * Handle NoData as its own case * Debug * Scalars parseable by CollectionReader * fix linter * Orgit add pkg/git add pkg/ not and	2024-07-17 15:24:03 -05:00
Yuri Tseretyan	c3b9c9b239	Alerting: Send information about alert rule to data source in headers (#90344 ) * add support of metadata to condition and adding it to request headers * support for additional metadata when condition is built * add additionall context to conditions: source and folder title * add version * use percent-encoding for header values	2024-07-17 22:55:12 +03:00
Alexander Weaver	111ebd4fb2	Alerting: Create integration testing infra for recording rules (#90306 ) * Create some integration testing infra for RRs * whoops * Require no error in responding * fix linter * Panic, no need to pass testing around * Extend status test	2024-07-11 14:59:52 -05:00
Alexander Weaver	ab32183e18	Alerting: Track recording rule health and last eval info ephemerally (#90247 ) * Track health and last eval info * Read method for status * Minor tests	2024-07-11 14:05:09 -05:00
Yuri Tseretyan	c3b5cabb14	Alerting: Refactor scheduler's rule evaluator to store rule key (#89925 )	2024-07-01 16:43:23 -04:00
Yuri Tseretyan	655e477c20	Alerting: Fix flaky test in scheduler's tests (#89923 )	2024-07-01 13:31:03 -04:00
Matthew Jacobson	47c9259d75	Alerting: Ensure we update State.LastSentAt before persisting (#89427 )	2024-06-25 13:01:26 -04:00
William Wernert	fcfa89f864	Alerting: Implement Prometheus remote write for recording rules (#89189 ) * Fix timestamp recorded by rule * Implement prometheus remote write * Create http client instead of transport * Address PR comments * Remove status code label	2024-06-25 17:23:42 +03:00
Matthew Jacobson	3228b64fe6	Alerting: Resend resolved notifications for ResolvedRetention duration (#88938 ) * Simple replace of State.Resolved with State.ResolvedAt * Retain ResolvedAt time between Normal->Normal transition * Introduce ResolvedRetention to keep sending recently resolved alerts * Make ResolvedRetention configurable with resolved_alert_retention * Tick-based LastSentAt for testing of ResendDelay and ResolvedRetention * Do not reset ResolvedAt during Normal->Pending transition Initially this was done to be inline with Prom ruler. However, Prom ruler doesn't keep track of Inactive->Pending/Alerting using the same alert instance, so it's more understandable that they choose not to retain ResolvedAt. In our case, since we use the same cached instance to represent the transition, it makes more sense to retain it. This should help alleviate some odd situations where temporarily entering Pending will stop future resolved notifications that would have happened because of ResolvedRetention. * Pointers for ResolvedAt & LastSentAt To avoid awkward time.Time{}.Unix() defaults on persist	2024-06-20 16:33:03 -04:00
William Wernert	c62cc25513	Alerting: Configure recording rule writer from config.ini (#89056 )	2024-06-12 16:04:46 -04:00
Alexander Akhmetov	667fea6623	Alerting: use hash of labels instead of labels string as the alert state cache key (#88956 ) * Alerting: use hash instead of labels as the cache key * Use data.Labels.Fingerprint to calculate the cache key	2024-06-11 18:34:58 +02:00
Alexander Weaver	d004f8a98d	Alerting: Recording rules understands errors embedded in dataframes (#88946 ) * Make MakeDependencyError public for tests in another package * Create tests for errors in eval results * Extract logic to pull frame errors out into exported function * Maybe we can drop cyclomatic complexity lint suppression now? * extract frame errors and fail recording rules if frames contain error * Fix up retry logic to actually work * Do not retry non retryable errors	2024-06-11 10:37:10 -05:00
Alexander Weaver	58fdb24b0b	Alerting: Recording rules appear as type=recording in Prometheus API + better abstraction for type (#88805 ) * Wire status through to prom API * Regenerate swagger	2024-06-07 11:24:06 -05:00
Alexander Weaver	a2e21d61f8	Alerting: Remove dead `evalRunning` guard in rule routine (#88312 ) Remove dead guard	2024-06-06 16:15:01 -05:00
William Wernert	5de7d4d06d	Alerting: Create writer interface for recording rules (#88459 ) * Create writer interface for recording rules Also create fake impl + use it for stub in scheduler	2024-05-29 22:38:33 +03:00
Alexander Weaver	b926b6336d	Alerting: Scheduled recording rules execute their queries (#88309 ) * Basic eval flow * Wiring-up * fix * Extend todo * Start with tests * Include some relevant tests, skip ones that seem to have timing-based race conditions * Some tests, touch up linter and todo * Solve TODO * Add tracing * Tests to make sure an eval went through * Wire up feature toggles * Update pkg/services/ngalert/schedule/recording_rule.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> * Update pkg/services/ngalert/schedule/recording_rule_test.go Co-authored-by: Steve Simpson <steve.simpson@grafana.com> --------- Co-authored-by: Steve Simpson <steve.simpson@grafana.com>	2024-05-28 10:59:21 -05:00
Alexander Weaver	89b54d06e9	Alerting: Schedule a shim implementation for recording rules (#87939 ) * Add shim rule implementation for recording rules * Give ruleFactory access to the original rule definition * Schedule shim implementation if the rule is a recording rule * Fix or suppress linter * Fix nolint	2024-05-21 16:42:58 -05:00
Yuri Tseretyan	05d6813a09	Alerting: Fix scheduler to sort rules before evaluation (#88006 ) sort rules scheduled for evaluation to make sure that the order is stable between evaluations. This is especially important in HA mode.	2024-05-17 11:38:19 -04:00
Yuri Tseretyan	f410c7fca1	Alerting: use logger with same context within rule scheduling loop (#87934 )	2024-05-15 15:38:00 -04:00
Alexander Weaver	a6a9ab4008	Alerting: Do not store series values from past evaluations in state manager for no reason (#87525 ) Do not store previous execution results on states	2024-05-09 15:51:55 -05:00
Alexander Weaver	36ef611cf4	Alerting: Add database migration for recording rule fields (#87012 ) * Create recording rule fields in model * Add migration * Write to database, support in version table * extend fingerprint * Force fields to be empty on validate * Another storage spot, tests for fingerprint * Explicitly set defaults in provisioning API * Tests for main API validation * Add diff tests even though fields are unpopulated for now * Use struct tag approach instead of FromDB/ToDB hooks as it better handles nulls when deserializing * test for deser * Backout RecordTo for now since it's not decided in the doc * back out of migration too * Drop datasourceref for now * address linter complaints * Try a single outer struct with all fields embedded	2024-05-09 12:12:44 -05:00
Yuri Tseretyan	052082a927	Alerting: Refactor Alert Rule Generators (#86813 )	2024-04-29 21:52:15 -04:00
Steve Simpson	ad7f804255	Alerting: Fix evaluation metrics to not count retries (#85873 ) * Change evaluation metrics to only count once per eval, and add new metrics. * Cosmetic: Move eval total Inc() to orginal place.	2024-04-12 16:20:46 +02:00
Dave Henderson	5687243d0b	Feature Flags: use FeatureToggles interface where possible (#85131 ) * Feature Flags: use FeatureToggles interface where possible Signed-off-by: Dave Henderson <dave.henderson@grafana.com> * Replace TestFeatureToggles with existing WithFeatures Signed-off-by: Dave Henderson <dave.henderson@grafana.com> --------- Signed-off-by: Dave Henderson <dave.henderson@grafana.com>	2024-04-04 12:22:31 -04:00
Benoit Tigeot	6f38ac6615	Alerting: Reduce set of fields that could trigger alert state change (#83496 ) We want to avoid too much change of alert state based on change on alert's fields. For that we ignore some fields from the diff.	2024-03-26 12:35:30 -04:00
ismail simsek	6137c4e0a6	Chore: Bump golangci-lint v1.57.1 (#84998 ) * bump golangci-lint v1.57.1 * update setting * remove goconst * fix linting issues * prettier * fix G601 * go mod tidy go work sync	2024-03-25 15:28:24 +01:00
Alexander Weaver	6c5e94095d	Alerting: Scheduler and registry handle rules by an interface (#84044 ) * export Evaluation * Export Evaluation * Export RuleVersionAndPauseStatus * export Eval, create interface * Export update and add to interface * Export Stop and Run and add to interface * Registry and scheduler use rule by interface and not concrete type * Update factory to use interface, update tests to work over public API rather than writing to channels directly * Rename map in registry * Rename getOrCreateInfo to not reference a specific implementation * Genericize alertRuleInfoRegistry into ruleRegistry * Rename alertRuleInfo to alertRule * Comments on interface * Update pkg/services/ngalert/schedule/schedule.go Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com> --------- Co-authored-by: Jean-Philippe Quéméner <JohnnyQQQQ@users.noreply.github.com>	2024-03-11 22:57:38 +02:00
Alexander Weaver	201f5d3ac9	Alerting: Extract large closures in ruleRoutine (#84035 ) * extract notify * extract resetState * move evaluate metrics inside evaluate * split out evaluate	2024-03-06 16:39:23 -06:00
Alexander Weaver	7a171fd14a	Regenerate openapidocs at 1.21.8 to match ci (#84037 ) * Regenerate openapidocs at 1.21.8 to match ci * Adjust trigger to work on the actual outputted files * Also put go.mod and go.sum in the triggers * manually fix * Make an arbitrary change rather than touching the trigger to force a run * Drop all triggers - run all the time * Print diff - taken from @papagian's PR * Manual fixes to swagger doc --------- Co-authored-by: Ryan McKinley <ryantxu@gmail.com>	2024-03-06 16:08:45 -06:00
Alexander Weaver	d5fda06147	Alerting: Decouple rule routine from scheduler (#84018 ) * create rule factory for more complicated dep injection into rules * Rules get direct access to metrics, logs, traces utilities, use factory in tests * Use clock internal to rule * Use sender, statemanager, evalfactory directly * evalApplied and stopApplied * use schedulableAlertRules behind interface * loaded metrics reader * 3 relevant config options * Drop unused scheduler parameter * Rename ruleRoutine to run * Update READMED * Handle long parameter lists * remove dead branch	2024-03-06 13:44:53 -06:00
Alexander Weaver	1bb38e8f95	Alerting: Move ruleRoutine to be a method on ruleInfo (#83866 ) * Move ruleRoutine to ruleInfo file * Move tests as well * swap ruleInfo and scheduler parameters on ruleRoutine * Fix linter complaint, receiver name	2024-03-04 17:15:55 -06:00
Alexander Weaver	f2a9d0a89d	Alerting: Refactor ruleRoutine to take an entire ruleInfo instance (#83858 ) * Make stop a real method * ruleRoutine takes a ruleInfo reference directly rather than pieces of it * Fix whitespace	2024-03-04 15:15:01 -06:00
Alexander Weaver	fa51724bc6	Alerting: Move alertRuleInfo and tests to new files (#83854 ) Move ruleinfo and tests to new files	2024-03-04 11:24:49 -06:00
William Wernert	fabaff9a24	Alerting: Create metric for rules using simple notifications (#82904 ) --------- Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>	2024-02-16 19:01:49 +02:00
Yuri Tseretyan	1eebd2a4de	Alerting: Support for simplified notification settings in rule API (#81011 ) * Add notification settings to storage\domain and API models. Settings are a slice to workaround XORM mapping * Support validation of notification settings when rules are updated * Implement route generator for Alertmanager configuration. That fetches all notification settings. * Update multi-tenant Alertmanager to run the generator before applying the configuration. * Add notification settings labels to state calculation * update the Multi-tenant Alertmanager to provide validation for notification settings * update GET API so only admins can see auto-gen	2024-02-15 09:45:10 -05:00
Alexander Weaver	d4ae10ecc6	Alerting: Small refactor, move unrelated functions out of fetcher (#82459 ) Move unrelated functions out of fetcher	2024-02-14 20:01:32 +02:00
Diego Augusto Molina	ff08c0a790	Chore: improve test readability in ngalert/schedule (#82453 ) Chore: improve test readability	2024-02-14 14:53:32 -03:00
Diego Augusto Molina	9c29e1a783	Alerting: Fix data races and improve testing (#81994 ) * Alerting: fix race condition in (ngalert/sender.ExternalAlertmanager).Run Chore: Fix data races when accessing members of ngalert/state.FakeInstanceStore Chore: Fix data races in tests in ngalert/schedule and enable some parallel tests * Chore: fix linters * Chore: add TODO comment to remove loopvar once we move to Go 1.22	2024-02-14 12:45:39 -03:00
Alexander Weaver	5bbe9c6e61	Alerting: Enable group-level rule evaluation jittering by default, remove feature toggle (#82212 ) * remove jitter feature flag * Add an out so users can manually disable jitter * Pass in cfg * Add TODO to remove knob in future	2024-02-09 15:53:58 -06:00
Alexander Weaver	843c477899	Alerting: Add exported API to scheduler to access currently loaded rules (#82031 ) * Add exported API to fetch rule definitions from scheduler * Add comment	2024-02-07 09:31:22 -06:00
Ashley Harrison	39057552dc	QueryField: Handle autocomplete better (#81484 ) * extract out function + add unit tests * add feature toggle and default it to on	2024-01-31 10:01:20 +00:00
Yuri Tseretyan	131c72d655	Alerting: Fix scheduler to group folders by the unique key (orgID and UID) (#81303 )	2024-01-30 17:14:11 -05:00
Alexander Weaver	18b9c8fd5f	Alerting: Nilcheck JitterStrategyFrom so it can be used in contexts without feature toggles (#80841 ) Nilcheck so tests can have a nil feature toggles	2024-01-18 15:43:41 -06:00
Alexander Weaver	00a260effa	Alerting: Add setting to distribute rule group evaluations over time (#80766 ) * Simple, per-base-interval jitter * Add log just for test purposes * Add strategy approach, allow choosing between group or rule * Add flag to jitter rules * Add second toggle for jittering within a group * Wire up toggles to strategy * Slightly improve comment ordering * Add tests for offset generation * Rename JitterStrategyFrom * Improve debug log message * Use grafana SDK labels rather than prometheus labels	2024-01-18 12:48:11 -06:00
Jean-Philippe Quéméner	82638d059f	feat(alerting): add state persister interface (#80384 )	2024-01-17 13:33:13 +01:00
Alexander Weaver	3c796ecc8f	Alerting: Add metric counting rule groups per org (#80669 ) * Refactor, fix bad map hint * Count groups per org	2024-01-16 16:35:56 -06:00
Alexander Weaver	542741f748	Alerting: Log scheduler maxAttempts, guard against invalid retry counts, log retry errors (#80234 ) * Log maxAttempts, add guard, log retry errors * fix whitespace * Initialize evaluator in TestProcessTicks	2024-01-09 13:19:37 -06:00
Yuri Tseretyan	f6a46744a6	Alerting: Support hysteresis command expression (#75189 ) Backend: * Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command. - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels. - add rule_fingerprint column to alert_instance - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it. - add AlertingResultsFromRuleState that implements the new interface in eval package - update getExprRequest to patch the hysteresis command. * Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition. Frontend: * Add hysteresis option to Threshold in UI. It's called "Recovery Threshold" * Add test for getUnloadEvaluatorTypeFromCondition * Hide hysteresis in panel expressions * Refactor isInvalid and add test for it * Remove unnecesary React.memo * Add tests for updateEvaluatorConditions --------- Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>	2024-01-04 11:47:13 -05:00

240 Commits