From 877e97f3be3a01fc05c5115e38d5041cd78376b3 Mon Sep 17 00:00:00 2001 From: Catherine Luse Date: Thu, 9 Apr 2020 12:11:43 -0700 Subject: [PATCH] Fix formatting on Prometheus expressions page --- .../en/cluster-admin/tools/alerts/_index.md | 7 + .../tools/monitoring/expression/_index.md | 561 ++++++++++-------- 2 files changed, 315 insertions(+), 253 deletions(-) diff --git a/content/rancher/v2.x/en/cluster-admin/tools/alerts/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/alerts/_index.md index 7fdca61df02..1b0b9685295 100644 --- a/content/rancher/v2.x/en/cluster-admin/tools/alerts/_index.md +++ b/content/rancher/v2.x/en/cluster-admin/tools/alerts/_index.md @@ -16,6 +16,7 @@ For details about what triggers the predefined alerts, refer to the [documentati This section covers the following topics: - [Alert event examples](#alert-event-examples) + - [Prometheus queries](#prometheus-queries) - [Urgency levels](#urgency-levels) - [Scope of alerts](#scope-of-alerts) - [Adding cluster alerts](#adding-cluster-alerts) @@ -30,6 +31,12 @@ Some examples of alert events are: - A scheduled deployment taking place as planned. - A node's hardware resources becoming overstressed. +### Prometheus Queries + +> **Prerequisite:** Monitoring must be [enabled]({{}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring) before you can trigger alerts with custom Prometheus queries or expressions. + +When you edit an alert rule, you will have the opportunity to configure the alert to be triggered based on a Prometheus expression. For examples of expressions, refer to [this page.]({{}}/rancher/v2.x/en/cluster-admin/tools/monitoring/expression) + # Urgency Levels You can set an urgency level for each alert. This urgency appears in the notification you receive, helping you to prioritize your response actions. For example, if you have an alert configured to inform you of a routine deployment, no action is required. These alerts can be assigned a low priority level. However, if a deployment fails, it can critically impact your organization, and you need to react quickly. Assign these alerts a high priority level. diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md index a667264c69c..9f5170c9779 100644 --- a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md +++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md @@ -1,375 +1,430 @@ --- -title: Expression +title: Prometheus Expressions weight: 4 --- -## In This Document +The PromQL expressions in this doc can be used to configure [alerts.]({{}}/rancher/v2.x/en/cluster-admin/tools/alerts/) + +> Before expression can be used in alerts, monitoring must be enabled. For more information, refer to the documentation on enabling monitoring [at the cluster level]({{}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring) or [at the project level.]({{}}/rancher/v2.x/en/project-admin/tools/monitoring/#enabling-project-monitoring) + +For more information about querying Prometheus, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/) - [Cluster Metrics](#cluster-metrics) - + [Node Metrics](#node-metrics) + - [Cluster CPU Utilization](#cluster-cpu-utilization) + - [Cluster Load Average](#cluster-load-average) + - [Cluster Memory Utilization](#cluster-memory-utilization) + - [Cluster Disk Utilization](#cluster-disk-utilization) + - [Cluster Disk I/O](#cluster-disk-i-o) + - [Cluster Network Packets](#cluster-network-packets) + - [Cluster Network I/O](#cluster-network-i-o) +- [Node Metrics](#node-metrics) + - [Node CPU Utilization](#node-cpu-utilization) + - [Node Load Average](#node-load-average) + - [Node Memory Utilization](#node-memory-utilization) + - [Node Disk Utilization](#node-disk-utilization) + - [Node Disk I/O](#node-disk-i-o) + - [Node Network Packets](#node-network-packets) + - [Node Network I/O](#node-network-i-o) - [Etcd Metrics](#etcd-metrics) + - [Etcd Has a Leader](#etcd-has-a-leader) + - [Number of Times the Leader Changes](#number-of-times-the-leader-changes) + - [Number of Failed Proposals](#number-of-failed-proposals) + - [GRPC Client Traffic](#grpc-client-traffic) + - [Peer Traffic](#peer-traffic) + - [DB Size](#db-size) + - [Active Streams](#active-streams) + - [Raft Proposals](#raft-proposals) + - [RPC Rate](#rpc-rate) + - [Disk Operations](#disk-operations) + - [Disk Sync Duration](#disk-sync-duration) - [Kubernetes Components Metrics](#kubernetes-components-metrics) + - [API Server Request Latency](#api-server-request-latency) + - [API Server Request Rate](#api-server-request-rate) + - [Scheduling Failed Pods](#scheduling-failed-pods) + - [Controller Manager Queue Depth](#controller-manager-queue-depth) + - [Scheduler E2E Scheduling Latency](#scheduler-e2e-scheduling-latency) + - [Scheduler Preemption Attempts](#scheduler-preemption-attempts) + - [Ingress Controller Connections](#ingress-controller-connections) + - [Ingress Controller Request Process Time](#ingress-controller-request-process-time) - [Rancher Logging Metrics](#rancher-logging-metrics) + - [Fluentd Buffer Queue Rate](#fluentd-buffer-queue-rate) + - [Fluentd Input Rate](#fluentd-input-rate) + - [Fluentd Output Errors Rate](#fluentd-output-errors-rate) + - [Fluentd Output Rate](#fluentd-output-rate) - [Workload Metrics](#workload-metrics) - + [Pod Metrics](#pod-metrics) - + [Container Metrics](#container-metrics) + - [Workload CPU Utilization](#workload-cpu-utilization) + - [Workload Memory Utilization](#workload-memory-utilization) + - [Workload Network Packets](#workload-network-packets) + - [Workload Network I/O](#workload-network-i-o) + - [Workload Disk I/O](#workload-disk-i-o) +- [Pod Metrics](#pod-metrics) + - [Pod CPU Utilization](#pod-cpu-utilization) + - [Pod Memory Utilization](#pod-memory-utilization) + - [Pod Network Packets](#pod-network-packets) + - [Pod Network I/O](#pod-network-i-o) + - [Pod Disk I/O](#pod-disk-i-o) +- [Container Metrics](#container-metrics) + - [Container CPU Utilization](#container-cpu-utilization) + - [Container Memory Utilization](#container-memory-utilization) + - [Container Disk I/O](#container-disk-i-o) -## Cluster Metrics +# Cluster Metrics -- **CPU Utilization** +### Cluster CPU Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))` | - | Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))` | +| Catalog | Expression | +| --- | --- | +| Detail | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))` | +| Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))` | -- **Load Average** +### Cluster Load Average - | Catalog | Expression | - | --- | --- | - | Detail |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
| - | Summary |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})`
| +| Catalog | Expression | +| --- | --- | +| Detail |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
| +| Summary |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})`
| -- **Memory Utilization** +### Cluster Memory Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)` | - | Summary | `1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)` | +| Catalog | Expression | +| --- | --- | +| Detail | `1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)` | +| Summary | `1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)` | -- **Disk Utilization** +### Cluster Disk Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)` | - | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})` | +| Catalog | Expression | +| --- | --- | +| Detail | `(sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)` | +| Summary | `(sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})` | -- **Disk I/O** +### Cluster Disk I/O - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(node_disk_read_bytes_total[5m])) by (instance)`
written`sum(rate(node_disk_written_bytes_total[5m])) by (instance)`
| - | Summary |
read`sum(rate(node_disk_read_bytes_total[5m]))`
written`sum(rate(node_disk_written_bytes_total[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
read`sum(rate(node_disk_read_bytes_total[5m])) by (instance)`
written`sum(rate(node_disk_written_bytes_total[5m])) by (instance)`
| +| Summary |
read`sum(rate(node_disk_read_bytes_total[5m]))`
written`sum(rate(node_disk_written_bytes_total[5m]))`
| -- **Network Packets** +### Cluster Network Packets - | Catalog | Expression | - | --- | --- | - | Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| - | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| +| Catalog | Expression | +| --- | --- | +| Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| +| Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| -- **Network I/O** +### Cluster Network I/O - | Catalog | Expression | - | --- | --- | - | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| - | Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| +| Catalog | Expression | +| --- | --- | +| Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| +| Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| -### Node Metrics +# Node Metrics -- **CPU Utilization** +### Node CPU Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)` | - | Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))` | +| Catalog | Expression | +| --- | --- | +| Detail | `avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)` | +| Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))` | -- **Load Average** +### Node Load Average - | Catalog | Expression | - | --- | --- | - | Detail |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| - | Summary |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| +| Catalog | Expression | +| --- | --- | +| Detail |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| +| Summary |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| -- **Memory Utilization** +### Node Memory Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` | - | Summary | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"}) ` | +| Catalog | Expression | +| --- | --- | +| Detail | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` | +| Summary | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"}) ` | -- **Disk Utilization** +### Node Disk Utilization - | Catalog | Expression | - | --- | --- | - | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)` | - | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})` | +| Catalog | Expression | +| --- | --- | +| Detail | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)` | +| Summary | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})` | -- **Disk I/O** +### Node Disk I/O - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| - | Summary |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| +| Summary |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| -- **Network Packets** +### Node Network Packets - | Catalog | Expression | - | --- | --- | - | Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| - | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| +| Catalog | Expression | +| --- | --- | +| Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| +| Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| -- **Network I/O** +### Node Network I/O - | Catalog | Expression | - | --- | --- | - | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| - | Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| +| Catalog | Expression | +| --- | --- | +| Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| +| Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| -## Etcd Metrics +# Etcd Metrics -- **Etcd has a leader** +### Etcd Has a Leader - `max(etcd_server_has_leader)` +`max(etcd_server_has_leader)` -- **Number of leader changes** +### Number of Times the Leader Changes - `max(etcd_server_leader_changes_seen_total)` +`max(etcd_server_leader_changes_seen_total)` -- **Number of failed proposals** +### Number of Failed Proposals - `sum(etcd_server_proposals_failed_total)` +`sum(etcd_server_proposals_failed_total)` -- **GRPC Client Traffic** +### GRPC Client Traffic - | Catalog | Expression | - | --- | --- | - | Detail |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)`
| - | Summary |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)`
| +| Summary |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))`
| -- **Peer Traffic** +### Peer Traffic - | Catalog | Expression | - | --- | --- | - | Detail |
in`sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)`
| - | Summary |
in`sum(rate(etcd_network_peer_received_bytes_total[5m]))`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
in`sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)`
| +| Summary |
in`sum(rate(etcd_network_peer_received_bytes_total[5m]))`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m]))`
| -- **DB Size** +### DB Size - | Catalog | Expression | - | --- | --- | - | Detail | `sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)` | - | Summary | `sum(etcd_debugging_mvcc_db_total_size_in_bytes)` | +| Catalog | Expression | +| --- | --- | +| Detail | `sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)` | +| Summary | `sum(etcd_debugging_mvcc_db_total_size_in_bytes)` | -- **Active Streams** +### Active Streams - | Catalog | Expression | - | --- | --- | - | Detail |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance)`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance)`
| - | Summary |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`
| +| Catalog | Expression | +| --- | --- | +| Detail |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance)`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance)`
| +| Summary |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`
| -- **Raft Proposals** +### Raft Proposals - | Catalog | Expression | - | --- | --- | - | Detail |
applied`sum(increase(etcd_server_proposals_applied_total[5m])) by (instance)`
committed`sum(increase(etcd_server_proposals_committed_total[5m])) by (instance)`
pending`sum(increase(etcd_server_proposals_pending[5m])) by (instance)`
failed`sum(increase(etcd_server_proposals_failed_total[5m])) by (instance)`
| - | Summary |
applied`sum(increase(etcd_server_proposals_applied_total[5m]))`
committed`sum(increase(etcd_server_proposals_committed_total[5m]))`
pending`sum(increase(etcd_server_proposals_pending[5m]))`
failed`sum(increase(etcd_server_proposals_failed_total[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
applied`sum(increase(etcd_server_proposals_applied_total[5m])) by (instance)`
committed`sum(increase(etcd_server_proposals_committed_total[5m])) by (instance)`
pending`sum(increase(etcd_server_proposals_pending[5m])) by (instance)`
failed`sum(increase(etcd_server_proposals_failed_total[5m])) by (instance)`
| +| Summary |
applied`sum(increase(etcd_server_proposals_applied_total[5m]))`
committed`sum(increase(etcd_server_proposals_committed_total[5m]))`
pending`sum(increase(etcd_server_proposals_pending[5m]))`
failed`sum(increase(etcd_server_proposals_failed_total[5m]))`
| -- **RPC Rate** +### RPC Rate - | Catalog | Expression | - | --- | --- | - | Detail |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)`
| - | Summary |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)`
| +| Summary |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))`
| -- **Disk Operations** +### Disk Operations - | Catalog | Expression | - | --- | --- | - | Detail |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)`
| - | Summary |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)`
| +| Summary |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))`
| -- **Disk Sync Duration** +### Disk Sync Duration - | Catalog | Expression | - | --- | --- | - | Detail |
wal`histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))`
db`histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))`
| - | Summary |
wal`sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))`
db`sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
wal`histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))`
db`histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))`
| +| Summary |
wal`sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))`
db`sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))`
| -## Kubernetes Components Metrics +# Kubernetes Components Metrics -- **API Server Request Latency** +### API Server Request Latency - | Catalog | Expression | - | --- | --- | - | Detail | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06` | - | Summary | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06` | +| Catalog | Expression | +| --- | --- | +| Detail | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06` | +| Summary | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06` | -- **API Server Request Rate** +### API Server Request Rate - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(apiserver_request_count[5m])) by (instance, code)` | - | Summary | `sum(rate(apiserver_request_count[5m])) by (instance)` | +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(apiserver_request_count[5m])) by (instance, code)` | +| Summary | `sum(rate(apiserver_request_count[5m])) by (instance)` | -- **Scheduling Failed Pods** +### Scheduling Failed Pods - | Catalog | Expression | - | --- | --- | - | Detail | `sum(kube_pod_status_scheduled{condition="false"})` | - | Summary | `sum(kube_pod_status_scheduled{condition="false"})` | +| Catalog | Expression | +| --- | --- | +| Detail | `sum(kube_pod_status_scheduled{condition="false"})` | +| Summary | `sum(kube_pod_status_scheduled{condition="false"})` | -- **Controller Manager Queue Depth** +### Controller Manager Queue Depth - | Catalog | Expression | - | --- | --- | - | Detail |
volumes`sum(volumes_depth) by instance`
deployment`sum(deployment_depth) by instance`
replicaset`sum(replicaset_depth) by instance`
service`sum(service_depth) by instance`
serviceaccount`sum(serviceaccount_depth) by instance`
endpoint`sum(endpoint_depth) by instance`
daemonset`sum(daemonset_depth) by instance`
statefulset`sum(statefulset_depth) by instance`
replicationmanager`sum(replicationmanager_depth) by instance`
| - | Summary |
volumes`sum(volumes_depth)`
deployment`sum(deployment_depth)`
replicaset`sum(replicaset_depth)`
service`sum(service_depth)`
serviceaccount`sum(serviceaccount_depth)`
endpoint`sum(endpoint_depth)`
daemonset`sum(daemonset_depth)`
statefulset`sum(statefulset_depth)`
replicationmanager`sum(replicationmanager_depth)`
| +| Catalog | Expression | +| --- | --- | +| Detail |
volumes`sum(volumes_depth) by instance`
deployment`sum(deployment_depth) by instance`
replicaset`sum(replicaset_depth) by instance`
service`sum(service_depth) by instance`
serviceaccount`sum(serviceaccount_depth) by instance`
endpoint`sum(endpoint_depth) by instance`
daemonset`sum(daemonset_depth) by instance`
statefulset`sum(statefulset_depth) by instance`
replicationmanager`sum(replicationmanager_depth) by instance`
| +| Summary |
volumes`sum(volumes_depth)`
deployment`sum(deployment_depth)`
replicaset`sum(replicaset_depth)`
service`sum(service_depth)`
serviceaccount`sum(serviceaccount_depth)`
endpoint`sum(endpoint_depth)`
daemonset`sum(daemonset_depth)`
statefulset`sum(statefulset_depth)`
replicationmanager`sum(replicationmanager_depth)`
| -- **Scheduler E2E Scheduling Latency** +### Scheduler E2E Scheduling Latency - | Catalog | Expression | - | --- | --- | - | Detail | `histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06` | - | Summary | `sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)` | +| Catalog | Expression | +| --- | --- | +| Detail | `histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06` | +| Summary | `sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)` | -- **Scheduler Preemption Attempts** +### Scheduler Preemption Attempts - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)` | - | Summary | `sum(rate(scheduler_total_preemption_attempts[5m]))` | +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)` | +| Summary | `sum(rate(scheduler_total_preemption_attempts[5m]))` | -- **Ingress Controller Connections** +### Ingress Controller Connections - | Catalog | Expression | - | --- | --- | - | Detail |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)`
accepted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)`
| - | Summary |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"})`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"})`
accepted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))`
| +| Catalog | Expression | +| --- | --- | +| Detail |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)`
accepted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)`
| +| Summary |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"})`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"})`
accepted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))`
| -- **Ingress Controller Request Process Time** +### Ingress Controller Request Process Time - | Catalog | Expression | - | --- | --- | - | Detail | `topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | - | Summary | `topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | +| Catalog | Expression | +| --- | --- | +| Detail | `topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | +| Summary | `topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | -## Rancher Logging Metrics +# Rancher Logging Metrics -- **Fluentd Buffer Queue Rate** - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_buffer_queue_length[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_output_status_buffer_queue_length[5m]))` | +### Fluentd Buffer Queue Rate -- **Fluentd Input Rate** +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(fluentd_output_status_buffer_queue_length[5m])) by (instance)` | +| Summary | `sum(rate(fluentd_output_status_buffer_queue_length[5m]))` | - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_input_status_num_records_total[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_input_status_num_records_total[5m]))` | +### Fluentd Input Rate -- **Fluentd Output Errors Rate** +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(fluentd_input_status_num_records_total[5m])) by (instance)` | +| Summary | `sum(rate(fluentd_input_status_num_records_total[5m]))` | - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_num_errors[5m])) by (type)` | - | Summary | `sum(rate(fluentd_output_status_num_errors[5m]))` | +### Fluentd Output Errors Rate -- **Fluentd Output Rate** +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(fluentd_output_status_num_errors[5m])) by (type)` | +| Summary | `sum(rate(fluentd_output_status_num_errors[5m]))` | - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_num_records_total[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_output_status_num_records_total[5m]))` | +### Fluentd Output Rate -## Workload Metrics +| Catalog | Expression | +| --- | --- | +| Detail | `sum(rate(fluentd_output_status_num_records_total[5m])) by (instance)` | +| Summary | `sum(rate(fluentd_output_status_num_records_total[5m]))` | -- **CPU Utilization** +# Workload Metrics - | Catalog | Expression | - | --- | --- | - | Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| - | Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| +### Workload CPU Utilization -- **Memory Utilization** +| Catalog | Expression | +| --- | --- | +| Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| +| Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)` | - | Summary | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})` | +### Workload Memory Utilization -- **Network Packets** +| Catalog | Expression | +| --- | --- | +| Detail | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)` | +| Summary | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})` | - | Catalog | Expression | - | --- | --- | - | Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| - | Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| +### Workload Network Packets -- **Network I/O** +| Catalog | Expression | +| --- | --- | +| Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| +| Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| - | Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| +### Workload Network I/O -- **Disk I/O** +| Catalog | Expression | +| --- | --- | +| Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| +| Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| - | Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| +### Workload Disk I/O -### Pod Metrics +| Catalog | Expression | +| --- | --- | +| Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| +| Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| -- **CPU Utilization** +# Pod Metrics - | Catalog | Expression | - | --- | --- | - | Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
| - | Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
| +### Pod CPU Utilization -- **Memory Utilization** +| Catalog | Expression | +| --- | --- | +| Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
| +| Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)` | - | Summary | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})` | +### Pod Memory Utilization -- **Network Packets** +| Catalog | Expression | +| --- | --- | +| Detail | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)` | +| Summary | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})` | - | Catalog | Expression | - | --- | --- | - | Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| - | Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| +### Pod Network Packets -- **Network I/O** +| Catalog | Expression | +| --- | --- | +| Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| +| Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| - | Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| +### Pod Network I/O -- **Disk I/O** +| Catalog | Expression | +| --- | --- | +| Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| +| Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
| - | Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| +### Pod Disk I/O -### Container Metrics +| Catalog | Expression | +| --- | --- | +| Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
| +| Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| -- **CPU Utilization** +# Container Metrics - | Catalog | Expression | - | --- | --- | - | cfs throttled seconds | `sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - | usage seconds | `sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - | system seconds | `sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - | user seconds | `sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +### Container CPU Utilization -- **Memory Utilization** +| Catalog | Expression | +| --- | --- | +| cfs throttled seconds | `sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +| usage seconds | `sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +| system seconds | `sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +| user seconds | `sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})` +### Container Memory Utilization -- **Disk IO** +`sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})` - | Catalog | Expression | - | --- | --- | - | read | `sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - | write | `sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +### Container Disk I/O + +| Catalog | Expression | +| --- | --- | +| read | `sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | +| write | `sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |