* Capture error_type label on metrics/traces
* Make error messages more helpful to user
* Use errutil, categorized errors, and tie them to error_type (category in code)
* Misc trace fixes
* Add metric to track SQL input conversion
* parse via sse
I need to figure out how to handle the pipeline.execute with our own
client. I think this is important for MT reasons, just like using our
own cache (via legacy) is important.
parsing is done though!
* WIP nonsense
* horrible code but i think it works
* Add support for sql expressions config settings
* Cleanup:
- remove spew from nodes.go
- uncomment out plugin context and use in single tenant flow
- make code more readable and add comments
* Cleanup:
- create separate file for mt ds client builder
- ensure error handling is the same for both expressions and regular queries
- other cleanup
* not working but good thoughts
* WIP, vector not working for non sse
* super hacky but i think vectors work now
* delete delete delete
* Comments for future ref
* break out query handling and start test
* add prom debugger
* clean up: remove comments and commented out bits
* fix query_test
* add prom debugger
* create table-driven tests with testsdata files
* Fix test
* Add test
* go mod??
* idk
* Remove comment
* go enterprise issue maybe
* Fix codeowners
* Delete
* Remove test data
* Clean up
* logger
* Remove go changes hopefully
* idk go man
* sad
* idk i ran go mod tidy and this is what it wants
* Fix readme, with much help from adam
* some linting and testing errors
* lint
* fix lint
* fix lint register.go
* another lint
* address lint in test
* fix dead code and linters for query_test
* Go mod?
* Struggling with go mod
* Fix test
* Fix another test
* Revert headers change
* Its difficult to test this in OSS as it depends on functionality defined in enterprise, let's bring these tests back in some form in enterprise
* Fix codeowners
---------
Co-authored-by: Adam Simpson <adam@adamsimpson.net>
What is this feature?
This PR changes the behavior of the $value and .Value variables in alerting templating to be more compatible with Prometheus templating. When a single datasource is used in the alerting rule, these variables will now return the numeric value from the query instead of the evaluation string.
Why do we need this feature?
It makes Grafana templating more compatible with Prometheus templates. In Prometheus, $value returns the numeric value of the query, but in Grafana it's the evaluation string: [ var='A' labels={instance=instance1} value=81.234 ]. This is because in Grafana multiple datasources can be used in the alert rule, and it's not always possible to get a single value.
This change makes Grafana's behavior consistent with Prometheus when a single datasource is used, and in case when multiple datasources are used in the query, it keeps the old behaviour.
Both $value and .Value are not recommended to use (documentation), and it's better to use .Values instead.
What is this feature?
This PR introduces a new alert rule configuration option, keep_firing_for (Prometheus documentation).
keep_firing_for prevents alerts from resolving immediately after the alert condition returns to normal. Instead, they transition into a "Recovering" state and are not considered resolved by the Alertmanager. Once the recovery period ends (or after the next evaluation if it is bigger than keep_firing_for), the alert transitions to "Normal" if it doesn't start alerting again:
Before
+----------+ +----------+
| Alerting |---->| Normal |
+----------+ +----------+
-----
After
+----------+ +------------+ +----------+
| Alerting |----->| Recovering |---->| Normal |
+----------+ +------------+ +----------+
Why do we need this feature?
This feature prevents flapping alerts by adding a recovery period. This helps avoid false resolutions caused by brief alert
When you use a math expression with out any operators, the dataFrame pointer is identical between the expression result and the input query/expression.
This was resulting in the values returned from an evaluation overshadowing each other, depending on the order of the processing of the result map.
For example:
```
A: some_metric
B: reduce of A
C: math expression -> "${B}"
D: Threshold evaluation of C -> "C > 0"
```
With a value of 1 for `some_metric`, might result in a evaluation result of one of the following (somewhat at random):
1. { B: 1, D: 1 }
2. { C: 1, D: 1}
While you would expect to see:
{ B: 1, C: 1, D: 1 }
* handle metadata map nil
* remove double context
* clean up logging in scheduler
* do not reuse loggers from previous ticks
* log the dropped tick
* log tick instead of ticknum
* replace with processing tick logs
* log sending notifications
* update logging in persister to fetch context
* logs to historian
moved them upstream to be able to log when store is overridden
* add support of metadata to condition and adding it to request headers
* support for additional metadata when condition is built
* add additionall context to conditions: source and folder title
* add version
* use percent-encoding for header values
* Unify values
* Fix with latest changes on main
* Fix up NaN test
* Keep refIDs with -1 as value
* Test that refIDs are preserved on Normal to Error transition
* Alerting to err test too
* Add a blurb to docs about this behavior
* Alerting: Add setting for maximum allowed rule evaluation results
Added a new configuration setting `quota.alerting_rule_evaluation_results` to set the maximum number of alert rule evaluation results per rule. If the limit is exceeded, the evaluation will result in an error.
* Make MakeDependencyError public for tests in another package
* Create tests for errors in eval results
* Extract logic to pull frame errors out into exported function
* Maybe we can drop cyclomatic complexity lint suppression now?
* extract frame errors and fail recording rules if frames contain error
* Fix up retry logic to actually work
* Do not retry non retryable errors
* Move scope type vars to testutil package
* Expose parts of state historian for use in annotation backend
* Implement Loki ASH Annotation store
This store will only implement the `Get` method of a RepositoryImpl since alert state history
writes to Loki elsewhere.
* Use interface for Loki HTTP Client
* Add tests for Loki ASH Annotation store
* Add missing test
* Fix lint
* Organize tests
* Add filter tests
* Improve tests
* Move filter logic into outer function
* Fix lint
* Add comment
* Fix tests
* Fix lint
* Rename historian store + refactor
* Cleanup historian store
* Fix tests
* Minor cleanup
* Use new `ShouldRecordAnnotation` filter
* Fix logic and add tests for this check
* Fix typos, remove unused variables, `< 1` -> `== 0`
* More closely mimic RBAC filter from xorm to ensure correct logic
* Move off weaveworks client
* Address PR comments
Backend:
* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
- add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
- add rule_fingerprint column to alert_instance
- update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
- add AlertingResultsFromRuleState that implements the new interface in eval package
- update getExprRequest to patch the hysteresis command.
* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.
Frontend:
* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions
* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions
---------
Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
* Alerting: Attempt to retry retryable errors
Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.
I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.
There's two small differences between how retries work now and how they used to work in legacy alerting.
Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Alerting: Attempt to retry retryable errors
Currently in a draft state, but this was the minimal diff I could put together to exemplify how could achieve this.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Changes SSE to not always fail all queries when one fails. Now only the query itself, and nodes that depend on it will error.
---------
Co-authored-by: Gilles De Mey <gilles.de.mey@gmail.com>
* calculate cacheID instead of literals
* use mocked clocks
* advance clocks with the eval results
* use clearer timestamp aliases
* make expected state labels be more clear to read
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
This commit updates eval.go to improve the performance of matching
captures in the general case. In some cases we have reduced the
runtime of the function from 10s of minutes to a couple 100ms.
In the case where no capture matches the exact labels, we revert to
the current subset/superset match, but with a reduced search space
due to grouping captures.
This commit changes extractEvalString to sort NumberCaptureValues
in ascending order of Var before building the output string. This
means that users will see EvaluationString in a consistent order,
but also make it possible to assert its output in tests.
* introduce a new node-type ML and implement a command outlier that uses ML plugin as a source of data.
* add feature flag mlExpressions that guards the feature
* add NodeTypeFromDatasourceUID and DataSourceModelFromNodeType()
* deprecate expr.DataSourceModel
* replace usages of IsDataSource to NodeTypeFromDatasourceUID
* replace usages of DataSourceModel to DataSourceModelFromNodeType()
This commit fixes a bug where DatasourceUID and RefID annotations are
missing for DatasourceNoData alerts in Grafana 9.5. This bug affects
datasource plugins that have moved to using the data plane contract.