* parse via sse
I need to figure out how to handle the pipeline.execute with our own
client. I think this is important for MT reasons, just like using our
own cache (via legacy) is important.
parsing is done though!
* WIP nonsense
* horrible code but i think it works
* Add support for sql expressions config settings
* Cleanup:
- remove spew from nodes.go
- uncomment out plugin context and use in single tenant flow
- make code more readable and add comments
* Cleanup:
- create separate file for mt ds client builder
- ensure error handling is the same for both expressions and regular queries
- other cleanup
* not working but good thoughts
* WIP, vector not working for non sse
* super hacky but i think vectors work now
* delete delete delete
* Comments for future ref
* break out query handling and start test
* add prom debugger
* clean up: remove comments and commented out bits
* fix query_test
* add prom debugger
* create table-driven tests with testsdata files
* Fix test
* Add test
* go mod??
* idk
* Remove comment
* go enterprise issue maybe
* Fix codeowners
* Delete
* Remove test data
* Clean up
* logger
* Remove go changes hopefully
* idk go man
* sad
* idk i ran go mod tidy and this is what it wants
* Fix readme, with much help from adam
* some linting and testing errors
* lint
* fix lint
* fix lint register.go
* another lint
* address lint in test
* fix dead code and linters for query_test
* Go mod?
* Struggling with go mod
* Fix test
* Fix another test
* Revert headers change
* Its difficult to test this in OSS as it depends on functionality defined in enterprise, let's bring these tests back in some form in enterprise
* Fix codeowners
---------
Co-authored-by: Adam Simpson <adam@adamsimpson.net>
* Add FiredAt field to the State
* Update featuretoggle files
* Fix lint errors
* Fix test compilation
* Remove random print line + formatting
* Address PR comments
What is this feature?
This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).
keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again:
Before
+----------+ +----------+
| Alerting |---->| Normal |
+----------+ +----------+
-----
After
+----------+ +------------+ +----------+
| Alerting |----->| Recovering |---->| Normal |
+----------+ +------------+ +----------+
Why do we need this feature?
This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
The remote write path differs based on whether the data source is actually
Prometheus, Mimir, Cortex, or an older version of Cortex. We do not want
users to have to specify the path, so this change determines the path as
best it can.
It may be in the future we have to make this configurable per-datasource
to cater for setups where it's impossible to determine the correct path.
The change to use WriteDatasource was done in a previous commit, this adds a
test case using DatasourceWriter, in addition to the one using PrometheusWriter.
* Alerting: Refactor NewPrometheusWriter function.
In order to re-use PrometheusWriter, changing the function take a
PrometheusWriterConfig instead of RecordingRulesSettings, and adapt the old
interface onto the new interface.
* Make linter happy
Extend the recording rule definition to include the target data source, allowing
configuration of where the output of the recording rule is written to. Also
extends the relevant interfaces in preparation for the next set of changes.
* add column guid to alert rule table and rule_guid to rule version table
+ populate the new field with UUID
* update storage and domain models
* patch GUID
* ignore GUID in fingerprint tests
update scheduler's aler rule to accept regular Evaluation in update channel
This makes it accept the full rule definition, which is required in reset state.
* introduce new fields created_by in rule tables
* update domain model and compat layer to support UpdatedBy
* add alert rule generator mutators for UpdatedBy
* ignore UpdatedBy in diff and hash calculation
* Add user context to alert rule insert/update operations
Updated InsertAlertRules and UpdateAlertRules methods to accept a user context parameter. This change ensures auditability and better tracking of user actions when creating or updating alert rules. Adjusted all relevant calls and interfaces to pass the user context accordingly.
* set UpdatedBy in PreSave because this is where Updated is set
* Use nil userID for system-initiated updates
This ensures differentiation between system and user-initiated changes for better traceability and clarity in update origins.
---------
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
* Add health fields to rules and an aggregator method to the scheduler
* Move health, last error, and last eval time in together to minimize state processing
* Wire up a readonly scheduler to prom api
* Extract to exported function
* Use health in api_prometheus and fix up tests
* Rename health struct to status
* Fix tests one more time
* Several new tests
* Handle inactive rules
* Push state mapping into state manager
* rename to StatusReader
* Rectify cyclo complexity rebase
* Convert existing package local status implementation to models one
* fix tests
* undo RuleDefs rename
* Add group and type labels to rule_group_rules metric
* Don't include group to avoid high cardinality
* Add comments
* Reset rule_group_rules before recording new values
* Edit description for rule_group_rules
* Include ruleGroup combo key in labels
* Fix lint