Expressions examples
This document provides practical expression examples for common monitoring and visualization scenarios. Examples progress from basic to advanced, showing you how to solve real-world problems with Grafana Expressions.
For foundational concepts, refer to Grafana expressions.
Basic examples
Start here if you're new to expressions. These examples demonstrate fundamental patterns you'll use frequently.
Convert units
Scenario: Your metrics are in bytes, but you want to display them in gigabytes.
Setup:
- Query A (Prometheus): node_memory_MemTotal_bytes
- Expression B (Math): $A / 1024 / 1024 / 1024
Result: Memory values converted from bytes to gigabytes (binary units, where one gigabyte is 1024^3 bytes).
Variations:
- Bytes to megabytes: $A / 1024 / 1024
- Bytes to terabytes: $A / 1024 / 1024 / 1024 / 1024
- Milliseconds to seconds: $A / 1000
- Celsius to Fahrenheit: $A * 9 / 5 + 32
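The conversions above are plain arithmetic applied to every sample. As a quick sanity check outside Grafana, the same formulas can be sketched in Python; the function names are hypothetical and exist only for this illustration.

```python
# Hypothetical helpers that mirror the Math expressions above.
def bytes_to_gigabytes(value: float) -> float:
    # Same as: $A / 1024 / 1024 / 1024
    return value / 1024 / 1024 / 1024


def milliseconds_to_seconds(value: float) -> float:
    # Same as: $A / 1000
    return value / 1000


def celsius_to_fahrenheit(value: float) -> float:
    # Same as: $A * 9 / 5 + 32
    return value * 9 / 5 + 32


print(bytes_to_gigabytes(8 * 1024**3))
print(celsius_to_fahrenheit(100))
```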
Calculate a simple percentage
Scenario: Show what percentage of total memory is being used.
Setup:
- Query A (Prometheus): node_memory_MemTotal_bytes
- Query B (Prometheus): node_memory_MemAvailable_bytes
- Expression C (Math): ($A - $B) / $A * 100
Result: Memory usage as a percentage (0-100).
Tip: This pattern works for any "used / total * 100" calculation.
Get the current (latest) value
Scenario: Display the most recent temperature reading in a stat panel.
Setup:
- Query A (InfluxDB): Temperature sensor time series data
- Expression B (Reduce): Input $A, Function: Last
Result: Single number showing the most recent value from the time series.
When to use: Stat panels, gauges, or any visualization that needs a single current value.
Calculate an average over time
Scenario: Show the average CPU usage over the dashboard time range.
Setup:
- Query A (Prometheus): node_cpu_seconds_total{mode="idle"}
- Expression B (Reduce): Input $A, Function: Mean
Result: Average CPU value across the selected time range.
Note: Each series (each CPU core, each host) produces its own average, preserving labels.
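The per-series behavior described in the note can be pictured with a small Python sketch. It only models the idea (one output value per input series, labels kept); the dictionary layout and the reduce_mean name are assumptions made for this illustration, not Grafana internals.

```python
from statistics import mean

# Each series is keyed by its label set; the list holds its samples over the time range.
series = {
    ("host=a", "cpu=0"): [10.0, 20.0, 30.0],
    ("host=a", "cpu=1"): [40.0, 60.0],
}


def reduce_mean(frames):
    """Collapse every series to one number while keeping its labels."""
    return {labels: mean(values) for labels, values in frames.items()}


reduced = reduce_mean(series)
for labels, value in reduced.items():
    print(labels, value)
```

Two series go in and two numbers come out. Reduce never merges separate series into one.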
Find maximum or minimum values
Scenario: Identify the peak memory usage in the last 24 hours.
Setup:
- Query A (Prometheus): node_memory_MemUsed_bytes (last 24 hours)
- Expression B (Reduce): Input $A, Function: Max
Result: Peak memory usage value for each host.
Variations:
- Use Min to find the lowest value
- Use Count to see how many data points exist
Simple threshold check
Scenario: Create a binary indicator showing whether CPU is above 80%.
Setup:
- Query A (Prometheus): 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Expression B (Math): $A > 80
Result: Returns 1 when CPU exceeds 80%, 0 otherwise. Useful for alerting or status indicators.
Intermediate examples
These examples combine multiple operations and handle more complex scenarios.
Calculate error rate percentage
Scenario: Display HTTP error rate as a percentage of total requests.
Setup:
- Query A (Prometheus): sum(rate(http_requests_total{status=~"5.."}[5m]))
- Query B (Prometheus): sum(rate(http_requests_total[5m]))
- Expression C (Math): $A / $B * 100
Result: Error rate percentage across all endpoints.
Handling division by zero: If there are zero requests, the division produces infinity (or NaN when the error count is also zero). To handle this:
- Expression C (Math): $B > 0 ? ($A / $B * 100) : 0
This returns 0 when there are no requests instead of infinity.
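The guard can be checked with a few sample values. The sketch below restates the same conditional in Python; error_rate_percent is a hypothetical name. Python raises ZeroDivisionError on a zero divisor rather than returning infinity, so the guard matters here as well.

```python
def error_rate_percent(errors: float, total: float) -> float:
    # Same shape as: $B > 0 ? ($A / $B * 100) : 0
    return errors / total * 100 if total > 0 else 0.0


print(error_rate_percent(5, 200))  # normal traffic
print(error_rate_percent(0, 0))    # no requests: falls back to 0
```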
Calculate available disk space
Scenario: Show available disk space as a percentage for capacity planning.
Setup:
- Query A (Prometheus): node_filesystem_size_bytes{mountpoint="/"}
- Query B (Prometheus): node_filesystem_avail_bytes{mountpoint="/"}
- Expression C (Math): $B / $A * 100
Result: Percentage of disk space available (not used) for each host's root filesystem.
For alerting: Add an alert when available space drops below 10%:
- Expression D (Math): $C < 10
Aggregate across multiple servers
Scenario: Calculate total requests per second across all web servers.
Setup:
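- Query A (Prometheus): sum(rate(http_requests_total{job="webservers"}[5m]))
- Expression B (Reduce): Input $A, Function: Last
Result: Total requests per second across all servers combined into a single value.
Note: Reduce collapses each series over time and keeps its labels; it does not merge separate series. The sum() in Query A is what combines the servers.
Alternative: To get the average per server instead, aggregate with avg() in the query:
- Query A (Prometheus): avg(rate(http_requests_total{job="webservers"}[5m]))
- Expression B (Reduce): Input $A, Function: Last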
Combine metrics from different data sources
Scenario: Calculate efficiency by dividing application throughput (Prometheus) by infrastructure cost metric (CloudWatch).
Setup:
- Query A (Prometheus): sum(rate(processed_jobs_total[5m]))
- Query B (CloudWatch): EC2 instance cost metric
- Expression C (Resample): Input $A, Resample to: 1m, Downsample: Mean
- Expression D (Resample): Input $B, Resample to: 1m, Downsample: Mean
- Expression E (Math): $C / $D
Result: Jobs processed per dollar (or cost unit), showing application efficiency.
Why resample: Different data sources often have different collection intervals. Resampling ensures timestamps align for math operations.
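To see why aligned timestamps matter, the following Python sketch downsamples two series with different intervals to a common 60-second grid using a mean, then divides them point by point. It is a simplified model under stated assumptions: resample_mean and the sample data are hypothetical, and Grafana's exact window alignment may differ.

```python
from collections import defaultdict
from statistics import mean


def resample_mean(points, interval_s):
    """Group (timestamp_s, value) points into fixed windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval_s].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


# Jobs arrive every 15 s, cost every 60 s (made-up sample data).
jobs = [(0, 10.0), (15, 14.0), (30, 12.0), (45, 16.0), (60, 20.0), (75, 24.0)]
cost = [(0, 2.0), (60, 4.0)]

jobs_1m = resample_mean(jobs, 60)
cost_1m = resample_mean(cost, 60)

# Math is only possible where both series have a value at the same timestamp.
jobs_per_cost = {ts: jobs_1m[ts] / cost_1m[ts] for ts in jobs_1m if ts in cost_1m}
print(jobs_per_cost)
```

Without the resample step, only the timestamps 0 and 60 would match and the other job samples would have no partner to divide by.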
Compare hosts to fleet average
Scenario: Identify hosts performing worse than the fleet average.
Setup:
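- Query A (Prometheus): node_cpu_usage_percent (returns one series per host)
- Query B (Prometheus): avg(node_cpu_usage_percent) (fleet average, a single series without host labels)
- Expression C (Math): $A - $B
Result: Each host shows how far above or below the fleet average it is. Positive values indicate above-average CPU usage. The fleet average is computed in the query because Reduce averages each host's series over time rather than across hosts.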
Filter invalid data
Scenario: Calculate average response time, ignoring any null or NaN values in the data.
Setup:
- Query A (Time series): Response time data with occasional gaps
- Expression B (Reduce): Input $A, Function: Mean, Mode: Drop non-numeric
Result: Clean average that ignores invalid data points.
Alternative modes:
- Strict: Returns NaN if any value is invalid (use when data quality matters)
- Replace non-numeric: Substitutes a specific value for invalid data points
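The three modes differ only in how they treat invalid points before the reduction runs. The Python sketch below models that difference for a mean; the mode strings, the replacement parameter, and the choice to treat only None and NaN as invalid are assumptions made for this illustration.

```python
import math
from statistics import mean


def is_numeric(value):
    # Treats None (null) and NaN as invalid, matching the scenario above.
    return value is not None and not math.isnan(value)


def reduce_mean(values, mode="strict", replacement=0.0):
    if mode == "drop":
        values = [v for v in values if is_numeric(v)]
    elif mode == "replace":
        values = [v if is_numeric(v) else replacement for v in values]
    elif any(not is_numeric(v) for v in values):
        return math.nan  # strict: one invalid point invalidates the result
    return mean(values) if values else math.nan


samples = [120.0, None, 80.0, float("nan"), 100.0]
print(reduce_mean(samples, mode="drop"))
print(reduce_mean(samples, mode="replace", replacement=0.0))
print(reduce_mean(samples, mode="strict"))
```

Replacing invalid points with 0 pulls the average down, so that mode fits best when a gap really does mean zero.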
Calculate rate of change
Scenario: Show how quickly memory usage is increasing or decreasing.
Setup:
- Query A (Prometheus): node_memory_MemUsed_bytes
- Query B (Prometheus): node_memory_MemUsed_bytes offset 5m
- Expression C (Math): $A - $B
Result: Bytes of memory change over the last 5 minutes. Positive = increasing, negative = decreasing.
As a percentage change:
- Expression C (Math): ($A - $B) / $B * 100
Advanced examples
These examples demonstrate complex multi-step calculations and sophisticated alerting patterns.
Compare current value to 24-hour average
Scenario: Highlight when current traffic is significantly above or below the daily norm.
Setup:
- Query A (Prometheus): sum(rate(http_requests_total[24h])) (historical average)
- Query B (Prometheus): sum(rate(http_requests_total[5m])) (current rate)
- Expression C (Reduce): Input $A, Function: Mean
- Expression D (Math): ($B - $C) / $C * 100
Result: Percentage difference from the 24-hour average. +50 means 50% above normal, -30 means 30% below normal.
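The +50 and -30 readings follow directly from the formula. A short Python check, using the hypothetical name percent_deviation and made-up request rates:

```python
def percent_deviation(current: float, baseline: float) -> float:
    # Same as: ($B - $C) / $C * 100
    return (current - baseline) / baseline * 100


print(percent_deviation(150.0, 100.0))  # traffic above the baseline
print(percent_deviation(70.0, 100.0))   # traffic below the baseline
```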
Use cases:
- Detect traffic anomalies
- Identify unusual load patterns
- Trigger alerts for significant deviations
Calculate service level indicator (SLI)
Scenario: Calculate the percentage of requests meeting your latency target (under 200ms).
Setup:
- Query A (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
- Query B (Prometheus): sum(rate(http_request_duration_seconds_count[5m]))
- Expression C (Math): $A / $B * 100
Result: Percentage of requests completing in under 200ms (your SLI).
For SLO alerting: Alert when SLI drops below 99%:
- Expression D (Reduce): Input $C, Function: Mean
- Expression E (Math): $D < 99
Multi-host alerts with reduction
Scenario: Alert when average CPU across all production servers exceeds 80%.
Setup:
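- Query A (Prometheus): 100 - (avg(rate(node_cpu_seconds_total{mode="idle",env="production"}[5m])) * 100) (one series for the whole fleet)
- Expression B (Reduce): Input $A, Function: Mean (average over the time range)
- Expression C (Math): $B > 80
Result: Single alert that fires when the fleet average crosses the threshold, not individual host alerts.
Alternative - alert on any host: Keep one series per host in the query and reduce each host separately:
- Query A (Prometheus): 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle",env="production"}[5m])) * 100)
- Expression B (Reduce): Input $A, Function: Max
Because Reduce keeps one value per host, this alerts when any single host exceeds 80%.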
Calculate compound metrics
Scenario: Calculate Apdex score (Application Performance Index) from response time buckets.
Setup:
- Query A (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) (satisfied: <500ms)
- Query B (Prometheus): sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) (tolerating: <2s)
- Query C (Prometheus): sum(rate(http_request_duration_seconds_count[5m])) (total)
- Expression D (Math): ($A + ($B - $A) / 2) / $C
Result: Apdex score from 0 to 1, where 1 is perfect user satisfaction.
Formula explained: Apdex = (Satisfied + Tolerating/2) / Total
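Prometheus histogram buckets are cumulative, so the le="2.0" bucket already includes every request counted in le="0.5". That is why the expression subtracts $A from $B before halving. A Python sketch with made-up request rates (the apdex function name is hypothetical):

```python
def apdex(satisfied_bucket: float, tolerating_bucket: float, total: float) -> float:
    # Cumulative buckets: requests that are only "tolerating" are those between the two bounds.
    tolerating_only = tolerating_bucket - satisfied_bucket
    return (satisfied_bucket + tolerating_only / 2) / total


# 80 req/s under 500 ms, 95 req/s under 2 s, 100 req/s in total.
score = apdex(80.0, 95.0, 100.0)
print(score)
```

With these numbers, 15 req/s count as tolerating and contribute half their weight, giving (80 + 7.5) / 100.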
Detect sustained conditions
Scenario: Alert only when CPU has been high for at least 5 minutes, not just a brief spike.
Setup:
- Query A (Prometheus): avg_over_time(node_cpu_usage_percent[5m])
- Expression B (Reduce): Input $A, Function: Mean
- Expression C (Math): $B > 80
Result: Alerts only fire when the 5-minute average exceeds the threshold, filtering out brief spikes.
Alternative approach using count:
- Query A: node_cpu_usage_percent
- Expression B (Math): $A > 80
- Expression C (Reduce): Input $B, Function: Sum (counts "1" values where condition is true)
- Expression D (Math): $C > 5
This alerts when more than 5 data points in the range exceed the threshold.
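The counting approach works because the comparison turns every sample into 1 or 0, so summing the result counts the points above the threshold. A Python sketch with made-up CPU samples:

```python
# Made-up CPU usage samples for one host over the evaluation range.
samples = [72.0, 85.0, 91.0, 88.0, 79.0, 95.0, 83.0, 90.0]

flags = [1 if value > 80 else 0 for value in samples]  # Expression B: $A > 80
count = sum(flags)                                      # Expression C: Reduce, Sum
alert = count > 5                                       # Expression D: $C > 5

print(flags, count, alert)
```

How long "more than 5 points" lasts depends on the query interval, so pick the count with the scrape or step interval in mind.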
Correlate metrics across systems
Scenario: Calculate orders processed per database query to measure backend efficiency.
Setup:
- Query A (Prometheus - App metrics): sum(rate(orders_processed_total[5m]))
- Query B (MySQL data source): Database queries per second from performance schema
- Expression C (Resample): Input $A, Resample to: 30s, Downsample: Mean
- Expression D (Resample): Input $B, Resample to: 30s, Downsample: Mean
- Expression E (Math): $C / $D
Result: Orders per database query, showing how efficiently your backend processes orders.
Higher is better: More orders per database query means fewer queries per order and more efficient database usage.
Ratio-based alerts with baseline
Scenario: Alert when error ratio increases by more than 2x compared to yesterday's baseline.
Setup:
- Query A (Prometheus): sum(rate(http_errors_total[5m])) (current errors)
- Query B (Prometheus): sum(rate(http_requests_total[5m])) (current requests)
- Query C (Prometheus): sum(rate(http_errors_total[5m] offset 24h)) (yesterday's errors)
- Query D (Prometheus): sum(rate(http_requests_total[5m] offset 24h)) (yesterday's requests)
- Expression E (Math): $A / $B (current error rate)
- Expression F (Math): $C / $D (baseline error rate)
- Expression G (Reduce): Input $E, Function: Mean
- Expression H (Reduce): Input $F, Function: Mean
- Expression I (Math): $G / $H > 2
Result: Alerts when today's error rate is more than double yesterday's rate.
Why this matters: Absolute thresholds don't account for normal variation. Ratio-based alerting adapts to your system's baseline behavior.
Calculate percentile-based thresholds
Scenario: Alert when response time exceeds the 95th percentile baseline.
Setup:
- Query A (Prometheus): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
- Query B (Prometheus): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
- Expression C (Reduce): Input $A, Function: Last (current p95)
- Expression D (Reduce): Input $B, Function: Mean (baseline p95)
- Expression E (Math): $C > $D * 1.5
Result: Alerts when current p95 latency exceeds 1.5x the hourly baseline.
Weighted scores across metrics
Scenario: Create a composite health score from multiple metrics (CPU, memory, disk, network).
Setup:
- Query A: CPU usage percentage (0-100)
- Query B: Memory usage percentage (0-100)
- Query C: Disk usage percentage (0-100)
- Query D: Network saturation percentage (0-100)
- Expression E (Reduce): Input $A, Function: Mean
- Expression F (Reduce): Input $B, Function: Mean
- Expression G (Reduce): Input $C, Function: Mean
- Expression H (Reduce): Input $D, Function: Mean
- Expression I (Math): ($E * 0.3) + ($F * 0.25) + ($G * 0.25) + ($H * 0.2)
Result: Weighted health score from 0-100 where lower is healthier. Weights reflect relative importance (CPU 30%, Memory 25%, Disk 25%, Network 20%).
For alerting:
- Expression J (Math): $I > 70
Alert when composite score indicates degraded health.
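Because the weights sum to 1, the score stays on the same 0-100 scale as its inputs. A Python sketch of the weighted sum with made-up usage values (WEIGHTS and health_score are hypothetical names):

```python
# Weights from the example above; they sum to 1.0.
WEIGHTS = {"cpu": 0.3, "memory": 0.25, "disk": 0.25, "network": 0.2}


def health_score(usage: dict) -> float:
    """Weighted sum of usage percentages (0-100); lower is healthier."""
    return sum(usage[name] * weight for name, weight in WEIGHTS.items())


score = health_score({"cpu": 90.0, "memory": 60.0, "disk": 80.0, "network": 50.0})
alert = score > 70  # Expression J: $I > 70
print(score, alert)
```

Here the score works out to 27 + 15 + 20 + 10 = 72, which crosses the alert threshold of 70.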
Conditional logic with fallbacks
Scenario: Show error rate, but display 0 instead of infinity when there are no requests.
Setup:
- Query A (Prometheus): sum(rate(http_errors_total[5m]))
- Query B (Prometheus): sum(rate(http_requests_total[5m]))
- Expression C (Math): $B > 0 ? ($A / $B * 100) : 0
Result: Error rate percentage that safely handles zero-request periods.
Conditional syntax: condition ? value_if_true : value_if_false
More examples:
- Cap values at 100: $A > 100 ? 100 : $A
- Convert negative to zero: $A < 0 ? 0 : $A
- Binary classification: $A > threshold ? 1 : 0
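Each conditional above selects one of two values per sample. For readers who want to verify the intended results on sample inputs, the same logic can be written in Python; the function names are hypothetical and the code only mirrors the expressions' intent.

```python
def cap_at_100(value: float) -> float:
    # Same shape as: $A > 100 ? 100 : $A
    return 100 if value > 100 else value


def negative_to_zero(value: float) -> float:
    # Same shape as: $A < 0 ? 0 : $A
    return 0 if value < 0 else value


def classify(value: float, threshold: float) -> int:
    # Same shape as: $A > threshold ? 1 : 0
    return 1 if value > threshold else 0


print(cap_at_100(130.0), negative_to_zero(-4.0), classify(85.0, 80.0))
```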
Time-window comparison for trend detection
Scenario: Detect if metrics are trending up or down by comparing recent data to slightly older data.
Setup:
- Query A (Prometheus): avg_over_time(http_requests_total[5m])
- Query B (Prometheus): avg_over_time(http_requests_total[5m] offset 10m)
- Expression C (Reduce): Input $A, Function: Mean
- Expression D (Reduce): Input $B, Function: Mean
- Expression E (Math): ($C - $D) / $D * 100
Result: Percentage change in requests between the last 5 minutes and the 5-minute window that ended 10 minutes ago.
Interpretation:
- Positive values: Traffic increasing
- Negative values: Traffic decreasing
- Values near 0: Traffic stable
Use case: Detect rapid traffic changes that might indicate problems or attacks.
Tips for expression development
Follow these best practices to build reliable, maintainable expressions in your visualizations and alerts.
Start simple and iterate
Begin with basic operations and verify each step works before adding complexity. Use the Query Inspector to see intermediate results.
Name your queries clearly
While RefIDs default to letters, you can use descriptive names. Referencing ${errors} and ${total_requests} is clearer than $A and $B.
Test with realistic time ranges
Expressions may behave differently with various time ranges. Test with the same ranges you'll use in production dashboards or alerts.
Handle edge cases
Consider what happens when:
- Data is missing (NoData)
- Values are zero (division by zero)
- Metrics haven't been collected yet
- Time series have different numbers of points
Document complex expressions
Add panel descriptions or annotation text explaining what complex expressions calculate and why.
Monitor expression performance
If dashboards become slow, check if expressions are processing too much data. Consider moving heavy aggregations to recording rules or data source queries.