Larissa Wandzura 4e89729e2f ran prettier

2026-01-05 11:31:50 -06:00

title: Expressions examples
description: Practical expression examples from basic to advanced for common monitoring scenarios
labels: products: cloud, enterprise, oss
refs: grafana-expressions, grafana-alerting

pattern	destination
/docs/grafana/	/docs/grafana/<GRAFANA_VERSION>/visualizations/panels-visualizations/query-transform-data/expression-queries/
/docs/grafana-cloud/	/docs/grafana/<GRAFANA_VERSION>/visualizations/panels-visualizations/query-transform-data/expression-queries/
/docs/grafana/	/docs/grafana/<GRAFANA_VERSION>/alerting/
/docs/grafana-cloud/	/docs/grafana/<GRAFANA_VERSION>/alerting/

Expressions examples

This document provides practical expression examples for common monitoring and visualization scenarios. Examples progress from basic to advanced, showing you how to solve real-world problems with Grafana Expressions.

For foundational concepts, refer to Grafana expressions.

Basic examples

Start here if you're new to expressions. These examples demonstrate fundamental patterns you'll use frequently.

Convert units

Scenario: Your metrics are in bytes, but you want to display them in gigabytes.

Setup:

Query A (Prometheus): node_memory_MemTotal_bytes
Expression B (Math): $A / 1024 / 1024 / 1024

Result: Memory values converted from bytes to gigabytes (strictly, gibibytes, because the divisor is 1024 rather than 1000).

Variations:

Bytes to megabytes: $A / 1024 / 1024
Bytes to terabytes: $A / 1024 / 1024 / 1024 / 1024
Milliseconds to seconds: $A / 1000
Celsius to Fahrenheit: $A * 9 / 5 + 32
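
As a quick cross-check of the arithmetic, the conversions above can be sketched in Python. The function names are illustrative only and are not part of Grafana; in a panel you would still use the Math expressions shown above.

```python
# Illustrative helpers mirroring the Math expressions above (names are hypothetical).

def bytes_to_gib(value: float) -> float:
    """Same arithmetic as `$A / 1024 / 1024 / 1024`."""
    return value / 1024 / 1024 / 1024


def ms_to_seconds(value: float) -> float:
    """Same arithmetic as `$A / 1000`."""
    return value / 1000


def celsius_to_fahrenheit(value: float) -> float:
    """Same arithmetic as `$A * 9 / 5 + 32`."""
    return value * 9 / 5 + 32


# Example inputs: 8 GiB expressed in bytes, 1500 ms, and 100 degrees Celsius.
print(bytes_to_gib(8 * 1024**3))
print(ms_to_seconds(1500))
print(celsius_to_fahrenheit(100))
```

Running a known value through the same arithmetic is a cheap way to confirm that a divisor or factor is not off by a power of 1024.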

Calculate a simple percentage

Scenario: Show what percentage of total memory is being used.

Setup:

Query A (Prometheus): node_memory_MemTotal_bytes
Query B (Prometheus): node_memory_MemAvailable_bytes
Expression C (Math): ($A - $B) / $A * 100

Result: Memory usage as a percentage (0-100).

Tip: This pattern works for any "used / total * 100" calculation.

Get the current (latest) value

Scenario: Display the most recent temperature reading in a stat panel.

Setup:

Query A (InfluxDB): Temperature sensor time series data
Expression B (Reduce): Input $A, Function: Last

Result: Single number showing the most recent value from the time series.

When to use: Stat panels, gauges, or any visualization that needs a single current value.

Calculate an average over time

Scenario: Show the average CPU usage over the dashboard time range.

Setup:

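Query A (Prometheus): 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Expression B (Reduce): Input $A, Function: Mean

Result: Average CPU usage percentage across the selected time range.

Note: Each series (here, each host) produces its own average, preserving labels. The raw node_cpu_seconds_total counter only ever increases, so convert it to a usage percentage with rate() before averaging.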
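
To make the per-series behavior concrete, here is a minimal Python sketch of what a Mean reduction does conceptually. The sample data and the `reduce_mean` helper are hypothetical, and this is not Grafana's implementation.

```python
from statistics import mean

# Hypothetical sample data: one list of points per labeled series.
series = {
    'instance="web-1"': [40.0, 50.0, 60.0],
    'instance="web-2"': [10.0, 20.0, 30.0],
}


def reduce_mean(frames: dict[str, list[float]]) -> dict[str, float]:
    """Collapse each series to one number while keeping its label key."""
    return {labels: mean(points) for labels, points in frames.items()}


# Two series in, two numbers out: the labels survive the reduction.
reduced = reduce_mean(series)
print(reduced)
```

The point of the sketch is that the reduction runs once per series. It collapses the time dimension and never merges one series into another.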

Find maximum or minimum values

Scenario: Identify the peak memory usage in the last 24 hours.

Setup:

Query A (Prometheus): node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes (last 24 hours)
Expression B (Reduce): Input $A, Function: Max

Result: Peak memory usage value for each host.

Variations:

Use Min to find the lowest value
Use Count to see how many data points exist

Simple threshold check

Scenario: Create a binary indicator showing whether CPU is above 80%.

Setup:

Query A (Prometheus): 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Expression B (Math): $A > 80

Result: Returns 1 when CPU exceeds 80%, 0 otherwise. Useful for alerting or status indicators.

Intermediate examples

These examples combine multiple operations and handle more complex scenarios.

Calculate error rate percentage

Scenario: Display HTTP error rate as a percentage of total requests.

Setup:

Query A (Prometheus): sum(rate(http_requests_total{status=~"5.."}[5m]))
Query B (Prometheus): sum(rate(http_requests_total[5m]))
Expression C (Math): $A / $B * 100

Result: Error rate percentage across all endpoints.

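Handling division by zero: If there are zero requests, the division produces infinity (or NaN when both values are zero). Math expressions don't provide a ternary (condition ? a : b) operator, but comparisons return 1 or 0, so you can build the guard arithmetically:

Expression C (Math): ($B > 0) * $A / ($B + ($B == 0)) * 100

When $B is 0, the denominator becomes 1 and the numerator is multiplied by 0, so the expression returns 0 instead of infinity. When $B is greater than 0, the expression reduces to $A / $B * 100.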
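
The zero-traffic guard described above can be sketched outside Grafana as follows. This is plain Python, not Grafana expression syntax, and the function name is hypothetical. Note that Python raises `ZeroDivisionError` on a float division by zero rather than returning infinity, so the guard is required there too.

```python
def error_rate_percent(errors: float, total: float) -> float:
    """Return errors as a percentage of total, or 0.0 when there is no traffic."""
    if total > 0:
        return errors / total * 100
    # No requests at all: report 0 rather than dividing by zero.
    return 0.0


print(error_rate_percent(25, 50))  # normal traffic
print(error_rate_percent(0, 0))    # idle period
```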

Calculate available disk space

Scenario: Show available disk space as a percentage for capacity planning.

Setup:

Query A (Prometheus): node_filesystem_size_bytes{mountpoint="/"}
Query B (Prometheus): node_filesystem_avail_bytes{mountpoint="/"}
Expression C (Math): $B / $A * 100

Result: Percentage of disk space available (not used) for each host's root filesystem.

For alerting: Add an alert when available space drops below 10%:

Expression D (Math): $C < 10

Aggregate across multiple servers

Scenario: Calculate total requests per second across all web servers.

Setup:

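Query A (Prometheus): sum(rate(http_requests_total{job="webservers"}[5m]))
Expression B (Reduce): Input $A, Function: Last

Result: Total requests per second across all servers as a single value.

Note: Reduce collapses each series over time and returns one value per series; it doesn't combine separate series. Aggregate across servers in the query (here with sum()), then reduce the resulting single series.

Alternative: To get the average per server instead, change the aggregation in the query:

Query A (Prometheus): avg(rate(http_requests_total{job="webservers"}[5m]))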

Combine metrics from different data sources

Scenario: Calculate efficiency by dividing application throughput (Prometheus) by infrastructure cost metric (CloudWatch).

Setup:

Query A (Prometheus): sum(rate(processed_jobs_total[5m]))
Query B (CloudWatch): EC2 instance cost metric
Expression C (Resample): Input $A, Resample to: 1m, Downsample: Mean
Expression D (Resample): Input $B, Resample to: 1m, Downsample: Mean
Expression E (Math): $C / $D

Result: Jobs processed per dollar (or cost unit), showing application efficiency.

Why resample: Different data sources often have different collection intervals. Resampling ensures timestamps align for math operations.
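
The idea behind resampling can be illustrated with a small Python sketch: group points into fixed windows, average each window, then divide values that share a timestamp. The sample data and the `downsample_mean` helper are hypothetical, and Grafana's own window alignment may differ from this simplified version.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points: list[tuple[int, float]], interval_s: int) -> list[tuple[int, float]]:
    """Group (unix_seconds, value) points into fixed windows and average each window."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % interval_s)  # align to the start of the window
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical inputs: a 15-second series and a 60-second series.
jobs = [(0, 10.0), (15, 12.0), (30, 14.0), (45, 16.0), (60, 20.0)]
cost = [(0, 2.0), (60, 4.0)]

jobs_1m = dict(downsample_mean(jobs, 60))
cost_1m = dict(downsample_mean(cost, 60))

# With aligned timestamps, the division is a simple per-timestamp operation.
efficiency = {ts: jobs_1m[ts] / cost_1m[ts] for ts in jobs_1m if ts in cost_1m}
print(efficiency)
```

Without the alignment step, most timestamps in the 15-second series would have no counterpart in the 60-second series, and the division would have nothing to pair them with.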

Compare hosts to fleet average

Scenario: Identify hosts performing worse than the fleet average.

Setup:

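Query A (Prometheus): node_cpu_usage_percent (returns one series per host)
Query B (Prometheus): avg(node_cpu_usage_percent) (fleet average, a single series without host labels)
Expression C (Math): $A - $B

Result: Each host shows how far above or below the fleet average it is. Positive values indicate above-average CPU usage.

Note: The fleet average is calculated in the query because Reduce works on each series separately. Reducing $A with Mean would return each host's own average over time, not an average across hosts.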
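
The intended calculation, each host's value minus one fleet-wide average, can be sketched in Python as follows. The host names and values are made up for illustration.

```python
from statistics import mean

# Hypothetical latest CPU percentage per host.
cpu_by_host = {"web-1": 70.0, "web-2": 50.0, "web-3": 30.0}

# One average across all hosts (the "fleet average").
fleet_average = mean(cpu_by_host.values())

# Positive numbers mean the host is busier than the fleet as a whole.
deviation = {host: value - fleet_average for host, value in cpu_by_host.items()}
print(deviation)
```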

Filter invalid data

Scenario: Calculate average response time, ignoring any null or NaN values in the data.

Setup:

Query A (Time series): Response time data with occasional gaps
Expression B (Reduce): Input $A, Function: Mean, Mode: Drop non-numeric

Result: Clean average that ignores invalid data points.

Alternative modes:

Strict: Returns NaN if any value is invalid (use when data quality matters)
Replace non-numeric: Substitutes a specific value for invalid data points
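
The difference between the three modes is easiest to see on a small sample. The following Python sketch imitates the behavior described above. The mode names passed to the hypothetical `reduce_mean` helper are shorthand for this example and are not Grafana identifiers.

```python
import math
from statistics import mean


def reduce_mean(values, mode="strict", replacement=0.0):
    """Average `values` under one of three illustrative modes."""
    def is_valid(v):
        return v is not None and not math.isnan(v) and not math.isinf(v)

    if mode == "strict":
        # Any invalid point makes the whole result NaN.
        return mean(values) if all(is_valid(v) for v in values) else math.nan
    if mode == "drop":
        # Invalid points are removed before averaging.
        kept = [v for v in values if is_valid(v)]
        return mean(kept) if kept else math.nan
    if mode == "replace":
        # Invalid points are swapped for a fixed value before averaging.
        return mean([v if is_valid(v) else replacement for v in values])
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical response times in ms with two gaps.
samples = [100.0, None, 300.0, math.nan]
print(reduce_mean(samples, "strict"))
print(reduce_mean(samples, "drop"))
print(reduce_mean(samples, "replace", 0.0))
```

Note how the replace mode pulls the average down when the replacement value is 0, which is why dropping is usually the better fit for gaps in latency data.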

Calculate rate of change

Scenario: Show how quickly memory usage is increasing or decreasing.

Setup:

Query A (Prometheus): node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
Query B (Prometheus): (node_memory_MemTotal_bytes offset 5m) - (node_memory_MemAvailable_bytes offset 5m)
Expression C (Math): $A - $B

Result: Bytes of memory change over the last 5 minutes. Positive = increasing, negative = decreasing.

As a percentage change:

Expression C (Math): ($A - $B) / $B * 100

Advanced examples

These examples demonstrate complex multi-step calculations and sophisticated alerting patterns.

Compare current value to 24-hour average

Scenario: Highlight when current traffic is significantly above or below the daily norm.

Setup:

Query A (Prometheus): sum(rate(http_requests_total[24h])) (historical average)
Query B (Prometheus): sum(rate(http_requests_total[5m])) (current rate)
Expression C (Reduce): Input $A, Function: Mean
Expression D (Math): ($B - $C) / $C * 100

Result: Percentage difference from the 24-hour average. +50 means 50% above normal, -30 means 30% below normal.

Use cases:

Detect traffic anomalies
Identify unusual load patterns
Trigger alerts for significant deviations

Calculate service level indicator (SLI)

Scenario: Calculate the percentage of requests meeting your latency target (under 200ms).

Setup:

Query A (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
Query B (Prometheus): sum(rate(http_request_duration_seconds_count[5m]))
Expression C (Math): $A / $B * 100

Result: Percentage of requests completing in under 200ms (your SLI).

For SLO alerting: Alert when SLI drops below 99%:

Expression D (Reduce): Input $C, Function: Mean
Expression E (Math): $D < 99

Multi-host alerts with reduction

Scenario: Alert when average CPU across all production servers exceeds 80%.

Setup:

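Query A (Prometheus): 100 - (avg(rate(node_cpu_seconds_total{mode="idle",env="production"}[5m])) * 100)
Expression B (Reduce): Input $A, Function: Mean (average over the query time range)
Expression C (Math): $B > 80

Result: Single alert that fires when the fleet average crosses the threshold, not individual host alerts. The averaging across hosts happens in the query (avg() without a by clause), because Reduce only collapses each series over time and keeps one value per series.

Alternative - alert on any host:

Query A (Prometheus): max(100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle",env="production"}[5m])) * 100))

This alerts when any single host exceeds 80%. To get a separate alert instance per host instead, keep avg by(instance) in the query and drop the outer max().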

Calculate compound metrics

Scenario: Calculate Apdex score (Application Performance Index) from response time buckets.

Setup:

Query A (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) (satisfied: <500ms)
Query B (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) (tolerating: <2s)
Query C (Prometheus): sum(rate(http_request_duration_seconds_count[5m])) (total)
Expression D (Math): ($A + ($B - $A) / 2) / $C

Result: Apdex score from 0 to 1, where 1 is perfect user satisfaction.

Formula explained: Apdex = (Satisfied + Tolerating/2) / Total
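
A worked example may help here, because Prometheus histogram buckets are cumulative: the `le="2.0"` bucket already contains every request counted in the `le="0.5"` bucket. The Python sketch below applies the same formula to made-up request rates; the function and parameter names are hypothetical.

```python
def apdex(satisfied_bucket: float, tolerating_bucket: float, total: float) -> float:
    """Apdex from cumulative histogram buckets.

    `tolerating_bucket` is cumulative, so the satisfied requests are
    subtracted to get the requests that were only tolerable.
    """
    tolerating_only = tolerating_bucket - satisfied_bucket
    return (satisfied_bucket + tolerating_only / 2) / total


# Hypothetical rates: 80 req/s under 500ms, 90 req/s under 2s, 100 req/s total.
score = apdex(80, 90, 100)
print(score)
```

With these inputs, 80 requests count fully and 10 count as half, so the score is (80 + 5) / 100.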

Detect sustained conditions

Scenario: Alert only when CPU has been high for at least 5 minutes, not just a brief spike.

Setup:

Query A (Prometheus): avg_over_time(node_cpu_usage_percent[5m])
Expression B (Reduce): Input $A, Function: Mean
Expression C (Math): $B > 80

Result: Alerts only fire when the 5-minute average exceeds the threshold, filtering out brief spikes.

Alternative approach using count:

Query A: node_cpu_usage_percent
Expression B (Math): $A > 80
Expression C (Reduce): Input $B, Function: Sum (counts "1" values where condition is true)
Expression D (Math): $C > 5

This alerts when more than 5 data points in the range exceed the threshold.
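
The count-based approach can be traced step by step in Python. This is a sketch with made-up data points, and the `sustained` helper is hypothetical; it mirrors the Math, Reduce, and Math steps in order.

```python
def sustained(points: list[float], threshold: float, min_points: int) -> bool:
    """Return True when more than `min_points` samples exceed `threshold`."""
    # Step 1 (Math: $A > 80): turn each point into 1 or 0.
    flags = [1 if p > threshold else 0 for p in points]
    # Step 2 (Reduce: Sum): count the 1s.
    count = sum(flags)
    # Step 3 (Math: $C > 5): compare the count with the minimum.
    return count > min_points


# Hypothetical CPU samples: six of the eight points are above 80.
busy = [85.0, 90.0, 70.0, 88.0, 92.0, 95.0, 81.0, 60.0]
print(sustained(busy, threshold=80, min_points=5))
```

Because the check counts points rather than time, the duration it represents depends on the query interval: six points at a 1-minute interval cover more time than six points at 15 seconds.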

Correlate metrics across systems

Scenario: Calculate orders processed per database query to measure backend efficiency.

Setup:

Query A (Prometheus - App metrics): sum(rate(orders_processed_total[5m]))
Query B (MySQL data source): Database queries per second from performance schema
Expression C (Resample): Input $A, Resample to: 30s, Downsample: Mean
Expression D (Resample): Input $B, Resample to: 30s, Downsample: Mean
Expression E (Math): $C / $D

Result: Orders per database query, showing how efficiently your backend processes orders.

Higher is better: More orders per query means each order needs fewer database queries, which indicates more efficient database usage.

Ratio-based alerts with baseline

Scenario: Alert when error ratio increases by more than 2x compared to yesterday's baseline.

Setup:

Query A (Prometheus): sum(rate(http_errors_total[5m])) (current errors)
Query B (Prometheus): sum(rate(http_requests_total[5m])) (current requests)
Query C (Prometheus): sum(rate(http_errors_total[5m] offset 24h)) (yesterday's errors)
Query D (Prometheus): sum(rate(http_requests_total[5m] offset 24h)) (yesterday's requests)
Expression E (Math): $A / $B (current error rate)
Expression F (Math): $C / $D (baseline error rate)
Expression G (Reduce): Input $E, Function: Mean
Expression H (Reduce): Input $F, Function: Mean
Expression I (Math): $G / $H > 2

Result: Alerts when today's error rate is more than double yesterday's rate.

Why this matters: Absolute thresholds don't account for normal variation. Ratio-based alerting adapts to your system's baseline behavior.
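
The comparison logic can be sketched in Python with made-up numbers. The function and parameter names are hypothetical. The sketch assumes yesterday's request and error rates are both non-zero; a zero baseline would need its own guard.

```python
def error_ratio_exceeds_baseline(
    cur_errors: float,
    cur_requests: float,
    base_errors: float,
    base_requests: float,
    factor: float = 2.0,
) -> bool:
    """Return True when the current error ratio is more than `factor` times the baseline."""
    current = cur_errors / cur_requests      # Expression E
    baseline = base_errors / base_requests   # Expression F
    return current / baseline > factor       # Expression I


# 5% errors today vs. 2% yesterday is a 2.5x increase.
print(error_ratio_exceeds_baseline(5, 100, 2, 100))
# 3% errors today vs. 2% yesterday is only a 1.5x increase.
print(error_ratio_exceeds_baseline(3, 100, 2, 100))
```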

Calculate percentile-based thresholds

Scenario: Alert when response time exceeds the 95th percentile baseline.

Setup:

Query A (Prometheus): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Query B (Prometheus): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Expression C (Reduce): Input $A, Function: Last (current p95)
Expression D (Reduce): Input $B, Function: Mean (baseline p95)
Expression E (Math): $C > $D * 1.5

Result: Alerts when current p95 latency exceeds 1.5x the hourly baseline.

Weighted scores across metrics

Scenario: Create a composite health score from multiple metrics (CPU, memory, disk, network).

Setup:

Query A: CPU usage percentage (0-100)
Query B: Memory usage percentage (0-100)
Query C: Disk usage percentage (0-100)
Query D: Network saturation percentage (0-100)
Expression E (Reduce): Input $A, Function: Mean
Expression F (Reduce): Input $B, Function: Mean
Expression G (Reduce): Input $C, Function: Mean
Expression H (Reduce): Input $D, Function: Mean
Expression I (Math): ($E * 0.3) + ($F * 0.25) + ($G * 0.25) + ($H * 0.2)

Result: Weighted health score from 0-100 where lower is healthier. Weights reflect relative importance (CPU 30%, Memory 25%, Disk 25%, Network 20%).

For alerting:

Expression J (Math): $I > 70

Alert when composite score indicates degraded health.
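
As an illustration of the weighting, the following Python sketch computes the same score from hypothetical usage values. The weights and the threshold of 70 come from the example above; the dictionary keys and function name are assumptions made for this sketch.

```python
# Weights from the example above; they must add up to 1.0 for a 0-100 score.
WEIGHTS = {"cpu": 0.30, "memory": 0.25, "disk": 0.25, "network": 0.20}


def health_score(usage: dict[str, float]) -> float:
    """Weighted sum of usage percentages (0-100 each); lower is healthier."""
    return sum(usage[name] * weight for name, weight in WEIGHTS.items())


# Hypothetical host: busy CPU, moderate memory and disk, quiet network.
usage = {"cpu": 90.0, "memory": 60.0, "disk": 60.0, "network": 20.0}
score = health_score(usage)
print(score, score > 70)
```

A host with one saturated resource can still score below the threshold, so a composite score works best alongside per-metric alerts rather than as a replacement for them.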

Conditional logic with fallbacks

Scenario: Show error rate, but display 0 instead of infinity when there are no requests.

Setup:

Query A (Prometheus): sum(rate(http_errors_total[5m]))
Query B (Prometheus): sum(rate(http_requests_total[5m]))
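Expression C (Math): ($B > 0) * $A / ($B + ($B == 0)) * 100

Result: Error rate percentage that safely handles zero-request periods.

Conditional pattern: Math expressions don't provide a condition ? value_if_true : value_if_false operator. Comparisons return 1 (true) or 0 (false), so you can multiply each branch by its condition and add the results: (condition) * value_if_true + (opposite condition) * value_if_false. Both branches are always evaluated, which is why the example above also changes the denominator to 1 when $B is 0.

More examples:

Cap values at 100: ($A > 100) * 100 + ($A <= 100) * $A
Convert negative to zero: ($A + abs($A)) / 2
Binary classification: $A > threshold (the comparison already returns 1 or 0)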

Time-window comparison for trend detection

Scenario: Detect if metrics are trending up or down by comparing recent data to slightly older data.

Setup:

Query A (Prometheus): sum(rate(http_requests_total[5m]))
Query B (Prometheus): sum(rate(http_requests_total[5m] offset 5m))
Expression C (Reduce): Input $A, Function: Mean
Expression D (Reduce): Input $B, Function: Mean
Expression E (Math): ($C - $D) / $D * 100

Result: Percentage change in request rate between the last 5 minutes and the previous 5-minute window. Because http_requests_total is a counter, rate() is used to turn it into requests per second before comparing.

Interpretation:

Positive values: Traffic increasing
Negative values: Traffic decreasing
Values near 0: Traffic stable

Use case: Detect rapid traffic changes that might indicate problems or attacks.
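
The percentage-change formula and its interpretation can be sketched in Python. The `tolerance` band used to decide what counts as "near 0" is an assumption for this example; the original text does not define one.

```python
def percent_change(current: float, previous: float) -> float:
    """Same arithmetic as `($C - $D) / $D * 100`."""
    return (current - previous) / previous * 100


def classify(change: float, tolerance: float = 5.0) -> str:
    """Label a percentage change; `tolerance` is a hypothetical 'stable' band."""
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "stable"


# Hypothetical request rates: 150 req/s now vs. 100 req/s in the previous window.
change = percent_change(150.0, 100.0)
print(change, classify(change))
```

As with the other ratio examples, a previous-window rate of zero needs separate handling before the division.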

Tips for expression development

Follow these best practices to build reliable, maintainable expressions in your visualizations and alerts.

Start simple and iterate

Begin with basic operations and verify each step works before adding complexity. Use the Query Inspector to see intermediate results.

Name your queries clearly

While RefIDs default to letters, you can use descriptive names. Referencing ${errors} and ${total_requests} is clearer than $A and $B.

Test with realistic time ranges

Expressions may behave differently with various time ranges. Test with the same ranges you'll use in production dashboards or alerts.

Handle edge cases

Consider what happens when:

Data is missing (NoData)
Values are zero (division by zero)
Metrics haven't been collected yet
Time series have different numbers of points

Document complex expressions

Add panel descriptions or annotation text explaining what complex expressions calculate and why.

Monitor expression performance

If dashboards become slow, check if expressions are processing too much data. Consider moving heavy aggregations to recording rules or data source queries.
