2027 Commits

Author SHA1 Message Date
Yuri Tseretyan
6603abc873 Alerting: Support for imported receivers in API (#112138)
* add support for converting Mimir integrations to Integration
* implement imported config revision
* update service to load staged receivers if configured
* make sure non-Grafana origin cannot be mutated
* set access control metadata for imported origin
* set includeImported from feature flag. Disabled for service used by provisioning
* add tests for new functionality
* add snapshot-based integration test
2025-11-13 15:35:21 +00:00
Alexander Akhmetov
44a92d252b Alerting: Support rule title search on the backend (#113738) 2025-11-13 15:52:14 +01:00
Moustafa Baiou
559dab8b1b Alerting: Fix error when updating Alertmanager config with autogenerated receivers (#113710)
If an alert rule with an invalid receiver is created, it breaks the entire Alertmanager configuration instead of the save being rejected.

This fixes the issue by erroring on save and apply, and logging invalid receivers only when applying the config after an update.

Introduced in #111838
2025-11-11 16:53:36 +00:00
Seunghun Shin
c784de6ef5 Alerting: Add compressed periodic save for alert instances (#111803)
What is this feature?

This PR implements compressed periodic save for alert state storage, providing a more efficient alternative to regular periodic saves by grouping alert instances by rule UID and storing them using protobuf and snappy compression. When enabled via the state_compressed_periodic_save_enabled configuration option, the system groups alert instances by their alert rule, compresses each group using protobuf serialization and snappy compression, and processes all rules within a single database transaction at specified intervals instead of syncing after every alert evaluation cycle.
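
The group-then-compress step described above can be sketched as follows. This is a minimal illustration and not the PR's code: the `alertInstance` type and all function names are hypothetical, and `encoding/json` with `compress/gzip` stand in for protobuf and snappy so that the example runs with the standard library alone.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// alertInstance is a hypothetical, simplified stand-in for a stored alert instance.
type alertInstance struct {
	RuleUID string `json:"rule_uid"`
	Labels  string `json:"labels"`
	State   string `json:"state"`
}

// groupByRule buckets instances by the UID of the rule that produced them.
func groupByRule(instances []alertInstance) map[string][]alertInstance {
	groups := make(map[string][]alertInstance)
	for _, in := range instances {
		groups[in.RuleUID] = append(groups[in.RuleUID], in)
	}
	return groups
}

// compressGroup serializes one rule's instances and compresses the result.
// JSON + gzip are stand-ins here; the PR describes protobuf + snappy.
func compressGroup(group []alertInstance) ([]byte, error) {
	raw, err := json.Marshal(group)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressGroup reverses compressGroup.
func decompressGroup(blob []byte) ([]alertInstance, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	var group []alertInstance
	if err := json.Unmarshal(raw, &group); err != nil {
		return nil, err
	}
	return group, nil
}

func main() {
	instances := []alertInstance{
		{RuleUID: "rule-a", Labels: "host=1", State: "Alerting"},
		{RuleUID: "rule-b", Labels: "host=1", State: "Normal"},
		{RuleUID: "rule-a", Labels: "host=2", State: "Normal"},
	}
	// One compressed blob per rule; a real store would write all blobs
	// inside a single transaction at the configured interval.
	for uid, group := range groupByRule(instances) {
		blob, err := compressGroup(group)
		if err != nil {
			panic(err)
		}
		fmt.Printf("rule %s: %d instances, %d bytes\n", uid, len(group), len(blob))
	}
}
```

Grouping by rule means each save writes one row per rule rather than one per instance, which is where the reduction in write volume would come from under these assumptions.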

Why do we need this feature?

During discussions in PR #111357, we identified the need for a compressed approach to periodic alert state storage that could further reduce database load beyond the jitter mechanism. While the jitter feature distributes database operations over time, this compressed periodic save approach reduces the frequency of database operations by batching alert state updates at explicitly declared intervals rather than syncing after every alert evaluation cycle.
This approach provides several key benefits:

- Reduced Database Frequency: Instead of frequent sync operations tied to alert evaluation cycles, updates occur only at configured intervals
- Storage Efficiency: Rule-based grouping with protobuf and snappy compression significantly reduces storage requirements

The compressed periodic save complements the existing jitter mechanism by providing an alternative strategy focused on reducing overall database interaction frequency while maintaining data integrity through compression and batching.

Who is this feature for?

- Platform/Infrastructure teams managing large-scale Grafana deployments with high alert cardinality
- Organizations looking to optimize storage costs and database performance for alerting workloads
- Production environments with 1000+ alert rules where database write frequency is a concern
2025-11-07 11:51:48 +01:00
Moustafa Baiou
acf0da9b80 Make the ordering of test on case-sensitivity consistent across databases and charsets 2025-11-03 11:36:18 -05:00
Moustafa Baiou
6f7c525213 Alerting: Ensure case-sensitive ordering for alert rule group column
The query which fetches alert rules in a paginated manner ordered by `rule_group` can result in strange and inconsistent ordering when the database uses a case-insensitive collation for the `rule_group` column. This can lead to scenarios where rules from different groups are interleaved in the results, making pagination unreliable and the returned number of rule_groups incorrect.
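
A small sketch of the interleaving problem, with hypothetical types and data. It assumes that ties between group names are broken by rule UID; the real query may use a different secondary order. Under a case-insensitive comparison the groups `cpu` and `CPU` compare equal, so their rules interleave and a reader that counts runs of identical group names sees too many groups.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rule is a hypothetical, minimal view of an alert rule row.
type rule struct {
	Group string
	UID   string
}

// countRuns counts runs of consecutive identical group names, which is what
// a paginating reader that assumes grouped ordering effectively does.
func countRuns(rules []rule) int {
	runs := 0
	for i, r := range rules {
		if i == 0 || r.Group != rules[i-1].Group {
			runs++
		}
	}
	return runs
}

// sortedCopy returns a sorted copy of rules using the given ordering.
func sortedCopy(rules []rule, less func(a, b rule) bool) []rule {
	out := append([]rule(nil), rules...)
	sort.SliceStable(out, func(i, j int) bool { return less(out[i], out[j]) })
	return out
}

// caseInsensitive mimics a case-insensitive collation on the group column.
func caseInsensitive(a, b rule) bool {
	ga, gb := strings.ToLower(a.Group), strings.ToLower(b.Group)
	if ga != gb {
		return ga < gb
	}
	return a.UID < b.UID // assumed tie-breaker
}

// caseSensitive compares group names byte by byte.
func caseSensitive(a, b rule) bool {
	if a.Group != b.Group {
		return a.Group < b.Group
	}
	return a.UID < b.UID // assumed tie-breaker
}

func main() {
	rules := []rule{
		{Group: "cpu", UID: "r1"},
		{Group: "CPU", UID: "r2"},
		{Group: "cpu", UID: "r3"},
		{Group: "CPU", UID: "r4"},
	}
	fmt.Println("case-insensitive runs:", countRuns(sortedCopy(rules, caseInsensitive)))
	fmt.Println("case-sensitive runs:  ", countRuns(sortedCopy(rules, caseSensitive)))
}
```

With this data there are two distinct groups, yet the case-insensitive order yields four runs; the case-sensitive order yields two.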

Related to #88990
2025-11-03 11:36:18 -05:00
Yuri Tseretyan
a4df6c8bb9 Alerting: Prohibit receivers with empty name (#113064) 2025-10-29 16:30:38 -04:00
William Wernert
75fb832826 Alerting: Ensure state history client has external labels set (#113101)
* Ensure state history client has external labels set

* Run `make update-workspace`

* Add dep owner
2025-10-28 11:35:54 -04:00
Moustafa Baiou
ce246936c4 Alerting: Surface remote AM silence creation errors properly
When creating silences in remote Alertmanager instances, all 4xx errors were treated as 500s.

This change ensures that 4xx errors are properly surfaced as bad payload errors, allowing callers to handle them appropriately.
2025-10-27 14:21:46 -04:00
Yuri Tseretyan
5673d0b532 Alerting: Skip logging in case of invalid receivers during auto generating policies (#111838)
* skip logging of invalid receivers during autogen
* log warn instead of error
2025-10-27 11:03:06 -04:00
Denis Vodopianov
81683d554d Chore: Deprecating FeatureToggles.IsEnabledGlobally (#112885)
* add deprecation on featuremgmt.IsEnabledGlobally

* add nolint reason

* add reasonable deprecation message

* remove junk edits

* add more nolints

* addressing review comments

* Update pkg/services/featuremgmt/models.go

Co-authored-by: Dave Henderson <dave.henderson@grafana.com>

---------

Co-authored-by: Dave Henderson <dave.henderson@grafana.com>
2025-10-24 12:02:53 -04:00
Yuri Tseretyan
8b7f119cad Alerting: Provisioning to fix contact point type on save (#112246)
fix contact point type on create/update
2025-10-23 11:11:36 -04:00
Yuri Tseretyan
5f9a51418c Alerting: Fix unmarshalling of GettableStatus to include time intervals (#112602)
* move test files into test-data

* add test for the bug

* populate time-intervals of gettableStatus config
2025-10-21 09:28:04 -04:00
Ieva
acbbfde256 AuthZ service: Expand the logic to also evaluate action sets (#112124)
* expand AuthZ service logic to also evaluate action sets

* handle folder creation

* fix test

* simplify mapper code

Co-authored-by: gamab <gabi.mabs@gmail.com>

* more accurate variable name

Co-authored-by: gamab <gabi.mabs@gmail.com>

* break alerting import cycle

* Apply suggestion from @gamab

---------

Co-authored-by: gamab <gabi.mabs@gmail.com>
Co-authored-by: Gabriel MABILLE <gamab@users.noreply.github.com>
2025-10-08 13:37:12 +01:00
Santiago
3f4c9879c9 Remote Alertmanager: Add timeout to the remoteClient (#112157) 2025-10-08 11:13:02 +00:00
Yuri Tseretyan
7d1c6b6bd2 Alerting: Replace IntegrationConfig with IntegrationSchemaVersion (#112010)
* remove unused compat functions

* update to alerting module from pr

* replace IntegrationConfig with IntegrationSchemaVersion

* safely resolve a string into integration type

* change usages of integration config
2025-10-07 11:08:16 -04:00
Tito Lins
7e63a01a79 alerting: omit optional notification settings fields (#112049) 2025-10-06 14:23:21 +02:00
Alexander Akhmetov
cd889fef9b Alerting: Keep extra configurations on main config update (#106958) 2025-10-06 09:28:38 +02:00
Yuri Tseretyan
d0f79ee60d Alerting: Update alerting module + refactor (#111761)
* update alerting module
* replace compat with ones from alerting
* update type references Receiver and Integration to *Status
* update route in provisioning test that is invalid after recent change
* use right type for LINE integration
2025-10-03 10:37:49 -04:00
Yuri Tseretyan
22173da78d Alerting: Use empty feature manager for creating test state (#111964) 2025-10-02 19:46:59 +00:00
Alexander Akhmetov
169bf2ce73 Alerting: Add feature toggle to use the old simplified routing hash generation (#111900)
* Revert "Alerting: Generate simplified routing routes with old fingerprint function (#111893)"

This reverts commit 0da9d49896.

* Add alertingUseNewSimplifiedRoutingHashAlgorithm flag

* Alerting: Add feature toggle to use the old simplified routing hash generation
2025-10-01 15:21:33 -04:00
Alexander Akhmetov
0da9d49896 Alerting: Generate simplified routing routes with old fingerprint function (#111893) 2025-10-01 18:45:36 +02:00
Seunghun Shin
512c292e04 Alerting: Add jitter support for periodic alert state storage to reduce database load spikes (#111357)
What is this feature?

This PR implements a jitter mechanism for periodic alert state storage to distribute database load over time instead of processing all alert instances simultaneously. When enabled via the state_periodic_save_jitter_enabled configuration option, the system spreads batch write operations across 85% of the save interval window, preventing database load spikes in high-cardinality alerting environments.

Why do we need this feature?

In production environments with high alert cardinality, the current periodic batch storage can cause database performance issues by processing all alert instances simultaneously at fixed intervals. Even when using periodic batch storage to improve performance, concentrating all database operations at a single point in time can overwhelm database resources, especially in resource-constrained environments.

Rather than performing all INSERT operations at once during the periodic save, distributing these operations across the time window until the next save cycle can maintain more stable service operation within limited database resources. This approach prevents resource saturation by spreading the database load over the available time interval, allowing the system to operate more gracefully within existing resource constraints.

For example, with 200,000 alert instances using a 5-minute interval and 4,000 batch size, instead of executing 50 batch operations simultaneously, the jitter mechanism distributes these operations across approximately 4.25 minutes (85% of 5 minutes), with each batch executed roughly every 5.2 seconds.

This PR provides system-level protection against such load spikes by distributing operations across time, reducing peak resource usage while maintaining the benefits of periodic batch storage. The jitter mechanism is particularly valuable in resource-constrained environments where maintaining consistent database performance is more critical than precise timing of state updates.
2025-09-29 11:22:36 +02:00
Yuri Tseretyan
b8f23eacd4 Alerting: Migrate to integration schema (#111643)
* update tests to assert against snapshot
* remove channel_config package replaced by schemas from alerting module
* update references to use new schema
2025-09-26 09:31:50 -04:00
Yuri Tseretyan
24c10b4fb9 Alerting: Remove usages of ReceiverType (#111508)
* remove usages of ReceiverType
2025-09-25 16:09:54 -04:00
Santiago
dab39c873f Remote Alertmanager: Use the correct OrgID when creating the store (#111634)
* Remote Alertmanager: Use the correct OrgID when creating the store

* fix test
2025-09-25 16:53:07 +00:00
Santiago
345b72227f Alert State History: Remove redundant JSON serialization when merging Loki streams (#111443) 2025-09-23 20:56:37 +02:00
Santiago
04bc71fa6d Alert State History: Skip invalid entries when merging streams (#111387) 2025-09-22 12:29:39 +02:00
Yuri Tseretyan
f166968357 Alerting: Refactoring ConfigRevision methods (#111192)
* make validateReceiver private

* make functions and type alias private

* move EncryptedReceivers and DecryptedReceivers to notifier package

to reduce exposure of definitions package via legacy_storage

* return receivers with Grafana origin after create/update

* add tests for ConfigRevision methods
2025-09-19 09:46:35 -04:00
Santiago
8f9d8f1154 Remote Alertmanager: Fix log line in the Mimir client (#111293) 2025-09-18 10:07:16 +00:00
Yuri Tseretyan
c36b2ae191 Alerting: v0 schema for integrations (mimir) (#110908)
* generate schema for mimir integrations from schema on front-end
* review and fix the settings
* Update GetAvailableNotifiersV2 to return mimir as v0
* add version argument to GetSecretKeysForContactPointType
* update TestGetSecretKeysForContactPointType to include v0
* add type alias field to contain alternate types that differ from Grafana's
* add support for msteamsv2
* update ConfigForIntegrationType to look for alternate type
* update IntegrationConfigFromType to use new result of ConfigForIntegrationType
* add reference to parent plugin to NotifierPluginVersion to allow getting plugin type by its alias
* add tests to ensure consistency
* make API response stable
* add tests against snapshot + omit optional fields
2025-09-17 09:25:56 -04:00
Yuri Tseretyan
356521c9b9 Alerting: Annotation CanUse for receiver resource (#110839)
* add origin to receiver
* populate origin of the receiver
* set CanUse to false if origin is not Grafana
* set provenance if origin is imported
* set Grafana origin by default in conversion API
* set canUse annotation
* reject update/delete operations on resources with origin other than Grafana
* fail to create with wrong origin
2025-09-16 09:32:04 -04:00
Vadim Stepanov
d4bad37853 Alerting: Move notification historian to grafana/alerting (#109078)
* Move notification historian to grafana/alerting

* wip

* golangci-lint

* Revert "golangci-lint"

This reverts commit 10ccebad41.

* JSONEncoder

* alertingInstrument

* go mod tidy

* go.work.sum

* make update-workspace

* merge

* revert go.mod changes

* github.com/grafana/alerting

* make update-workspace

* update github.com/grafana/alerting

* merge
2025-09-15 15:23:51 +01:00
Ryan McKinley
afc08dbbbc Chore: go.mod updates (#110957) 2025-09-15 09:01:45 +00:00
Ryan McKinley
9a54243f09 Chore: update golang.org/x/exp (#110980) 2025-09-11 22:13:07 +03:00
Alexander Akhmetov
fc3636acf2 Alerting: Fix bug where rules with identical mute/active intervals produced conflicting routes (#110935)
Alerting: Fix hash collision in NotificationSettings fingerprint
2025-09-11 13:44:06 +02:00
Moustafa Baiou
f65e219b21 Alerting: Update prometheus api to reuse list query logic
This lets the prometheus api respect NoGroup query logic and treat non-grouped rules consistently.

Co-authored-by: William Wernert <william.wernert@grafana.com>
2025-09-10 09:30:56 -04:00
Moustafa Baiou
ca8324e62a Alerting: Add support for alpha rules apis in legacy storage
Rules created through the new API have no group in the database, but they are returned by the old group API with a sentinel group name formatted with the rule UID, for compatibility with the old API.
This lets both the UI and the ruler continue to work with rules that have no group.

Rules are not allowed to be created in the provisioning api with a NoGroup sentinel mask, but NoGroup rules can be manipulated through both the new and old apis.

Co-authored-by: William Wernert <william.wernert@grafana.com>
2025-09-10 09:30:56 -04:00
Peter Štibraný
c32650e9d8 Replace remaining calls to testing.Short where possible. (#110765)
* Replace remaining calls to testing.Short where possible.
* Update style guide.
* Revert change in TestAlertmanager_ExtraDedupStage, as it doesn't work.
* Make TestAlertRulePostExport into integration test.
2025-09-09 08:16:12 +00:00
William Wernert
61adae16f2 Alerting: Ensure failed query validation returns the proper error code (#110717)
Ensure presave error is a validation error
2025-09-08 13:51:22 -04:00
Ryan McKinley
7c95d3c8a9 Folders: Split legacy out of folder.Service (and remove folder.FolderStore) (#110734) 2025-09-08 18:27:49 +03:00
Fayzal Ghantiwala
22ed5499a2 Alerting: Check if TimeInterval is used in ActiveTimings when deleting (#110691)
* check for active timing in route

* Update test

* Add integration test
2025-09-08 15:04:40 +01:00
Peter Štibraný
7fd9ab9481 Replace check for integration tests. (#110707)
* Replace check for integration tests.
* Revert changes in pkg/tsdb/mysql packages.
* Fix formatting of few tests.
2025-09-08 15:49:49 +02:00
Matthew Jacobson
d21178e348 Alerting: Fix field names on webhook HMAC/TLS config HCL export (#110722)
tlsConfig -> tls_config
hmacConfig -> hmac_config

tls_config export still does not match the TF provider, as the provider currently
treats tls_config as a schemaless map. Once this is improved, they will
match.
2025-09-05 19:58:11 -04:00
Moustafa Baiou
a459d43746 Alerting: Refactor prometheus api functions
Make state and health filters public

Co-authored-by: William Wernert <william.wernert@grafana.com>
Co-authored-by: Fayzal Ghantiwala <fayzal.ghantiwala@grafana.com>
2025-09-05 10:59:16 -04:00
Yuri Tseretyan
ce55d70fa5 Alerting: Refactor notification legacy storage (#110619)
* make legacy store expose only model.Receiver
* use integration as provenance type provider
* use revision RenameReceiverInRoutes
* introduce function GetReceiversNames in config revision

---------

Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
2025-09-05 14:46:46 +00:00
Alexander Akhmetov
100528e274 Alerting: Support retry with backoff in alert rule evaluation (#99710) 2025-09-04 13:56:03 +02:00
Yuri Tseretyan
7d32640179 Alerting: Fix ticker tests to not fail if channel is empty (#110538) 2025-09-03 16:21:47 -04:00
Yuri Tseretyan
1e0aaa29af Alerting: Comprehensive payload for Alertmanager convert API tests (#110485)
* do not remove global config
* create more comprehensive payload for mimir alertmanager testing
2025-09-03 12:11:55 -04:00
Alexander Akhmetov
8a7c1f595a Alerting: Backend state filtering for history UI (#109647) 2025-09-03 17:47:03 +02:00