mirror of
https://github.com/rancher/rancher-docs.git
synced 2026-05-15 01:23:21 +00:00
Merge pull request #2884 from catherineluse/monitoring
Document configuration for routes, receivers, and rules in Rancher UI
@@ -13,8 +13,8 @@ For information on configuring custom scrape targets and rules for Prometheus, p
- [Configuring Targets with ServiceMonitors and PodMonitors](#configuring-targets-with-servicemonitors-and-podmonitors)
- [ServiceMonitors](#servicemonitors)
- [PodMonitors](#podmonitors)
- [PrometheusRules](#prometheusrules)
- [Alertmanager Config](#alertmanager-config)
- [PrometheusRules](#prometheusrules)
- [Alertmanager Config](#alertmanager-config)
- [Trusted CA for Notifiers](#trusted-ca-for-notifiers)
- [Additional Scrape Configurations](#additional-scrape-configurations)
- [Examples](#examples)
@@ -45,40 +45,15 @@ For more information about how ServiceMonitors work, refer to the [Prometheus Op

This CRD declaratively specifies how a group of pods should be monitored. Any pods in your cluster that match the labels located within the PodMonitor `selector` field will be monitored based on the `podMetricsEndpoints` specified on the PodMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitorspec) provided by Prometheus Operator.
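As a sketch of the fields described above, a PodMonitor that scrapes a named `metrics` port on pods labeled `app: example` might look like this (the label, port name, namespace, and interval are illustrative, not taken from this page):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-podmonitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: example       # pods matching this label are monitored
  podMetricsEndpoints:
  - port: metrics        # named container port to scrape
    interval: 30s
```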

### PrometheusRules
# PrometheusRules

This CRD defines a group of Prometheus alerting and/or recording rules.

To add a group of alerting / recording rules, you should create a PrometheusRule CR that defines a RuleGroup with your desired rules, each specifying:
For information on configuring Prometheus rules, refer to [this page.](./prometheusrules)

- The name of the new alert / record
- A PromQL expression for the new alert / record
- Labels that should be attached to the alert / record that identify it (e.g. cluster name or severity)
- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules.
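A minimal sketch of such a PrometheusRule, where the alert name, threshold, labels, and annotations are illustrative values rather than anything prescribed by this page:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: example-group
    rules:
    - alert: HighClusterCPU                 # name of the new alert
      # PromQL expression for the alert (illustrative threshold)
      expr: 1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
      for: 5m
      labels:
        severity: warning                   # identifying labels
      annotations:
        summary: Cluster CPU usage is above 90%   # shown on the notification
```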
# Alertmanager Config

For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec)

### Alertmanager Config

The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) Secret contains the configuration of an Alertmanager instance that sends out notifications based on alerts it receives from Prometheus.

By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret. As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster that can all be managed using the same underlying Alertmanager Config Secret.

This Secret should be updated or modified any time you want to:

- Add in new notifiers or receivers
- Change the alerts that should be sent to specific notifiers or receivers
- Change the group of alerts that are sent out

> By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster. The Alertmanager Config Secret created by Rancher will never be modified or deleted on an upgrade or uninstall of the `rancher-monitoring` chart, to prevent users from losing or overwriting their alerting configuration when executing operations on the chart.

For more information on what fields can be specified in this Secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/)

The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file)

The notification integrations are configured with the `receiver`, which is documented [here.](https://prometheus.io/docs/alerting/latest/configuration/#receiver)

For more information, refer to the [official Prometheus documentation about configuring routes.](https://www.prometheus.io/docs/alerting/latest/configuration/#route)
For information on configuring the Alertmanager, refer to [this page.](./alertmanager)

# Trusted CA for Notifiers

@@ -114,21 +89,4 @@ Prometheus rule files are held in PrometheusRule custom resources. Use the label

### Alertmanager Config

To set up notifications via Slack, the following Alertmanager Config YAML should be placed into the `alertmanager.yaml` key of the Alertmanager Config Secret, where the `api_url` should be updated to use your Webhook URL from Slack:

```yaml
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - send_resolved: true
    text: '{{ template "slack.rancher.text" . }}'
    api_url: <user-provided slack webhook url here>
templates:
- /etc/alertmanager/config/*.tmpl
```
For an example configuration, refer to [this section.](./alertmanager/#example-alertmanager-config)

@@ -0,0 +1,202 @@
---
title: Alertmanager
weight: 1
---

The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) Secret contains the configuration of an Alertmanager instance that sends out notifications based on alerts it receives from Prometheus.

- [Overview](#overview)
- [Creating Receivers in the Rancher UI](#creating-receivers-in-the-rancher-ui)
- [Receiver Configuration](#receiver-configuration)
- [Slack](#slack)
- [Email](#email)
- [PagerDuty](#pagerduty)
- [Opsgenie](#opsgenie)
- [Webhook](#webhook)
- [Custom](#custom)
- [Route Configuration](#route-configuration)
- [Receiver](#receiver)
- [Grouping](#grouping)
- [Matching](#matching)
- [Example Alertmanager Config](#example-alertmanager-config)
# Overview

By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret. As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster that can all be managed using the same underlying Alertmanager Config Secret.

This Secret should be updated or modified any time you want to:

- Add in new notifiers or receivers
- Change the alerts that should be sent to specific notifiers or receivers
- Change the group of alerts that are sent out

> By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster. The Alertmanager Config Secret created by Rancher will never be modified or deleted on an upgrade or uninstall of the `rancher-monitoring` chart, to prevent users from losing or overwriting their alerting configuration when executing operations on the chart.

For more information on what fields can be specified in this Secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/)

The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file)

For more information, refer to the [official Prometheus documentation about configuring routes.](https://www.prometheus.io/docs/alerting/latest/configuration/#route)

# Creating Receivers in the Rancher UI
_Available as of v2.5.4_

> **Prerequisite:** The monitoring application needs to be installed.

To create notification receivers in the Rancher UI,

1. Click **Cluster Explorer > Monitoring** and click **Receiver.**
2. Enter a name for the receiver.
3. Configure one or more providers for the receiver. For help filling out the forms, refer to the configuration options below.
4. Click **Create.**

**Result:** Alerts can be configured to send notifications to the receiver(s).

# Receiver Configuration

The notification integrations are configured with the `receiver`, which is explained in the [Prometheus documentation.](https://prometheus.io/docs/alerting/latest/configuration/#receiver)

Rancher v2.5.4 introduced the capability to configure receivers by filling out forms in the Rancher UI.

{{% tabs %}}
{{% tab "Rancher v2.5.4+" %}}

The following types of receivers can be configured in the Rancher UI:

- <a href="#slack">Slack</a>
- <a href="#email">Email</a>
- <a href="#pagerduty">PagerDuty</a>
- <a href="#opsgenie">Opsgenie</a>
- <a href="#webhook">Webhook</a>
- <a href="#custom">Custom</a>

The custom receiver option can be used to configure any receiver in YAML that cannot be configured by filling out the other forms in the Rancher UI.

### Slack

| Field | Type | Description |
|------|--------------|------|
| URL | String | Enter your Slack webhook URL. For instructions to create a Slack webhook, see the [Slack documentation.](https://get.slack.help/hc/en-us/articles/115005265063-Incoming-WebHooks-for-Slack) |
| Default Channel | String | Enter the name of the channel that you want to send alert notifications to, in the following format: `#<channelname>` |
| Proxy URL | String | Proxy for the webhook notifications. |
| Enable send resolved alerts | Bool | When true, you will receive alerts through the notifier even if the alert condition is no longer true. For example, if an alert is triggered because your CPU is too high, you will still receive the alert after CPU goes back to normal levels. |

### Email

| Field | Type | Description |
|------|--------------|------|
| Default Recipient Address | String | The email address that will receive notifications. |
| Enable send resolved alerts | Bool | When true, you will receive alerts through the notifier even if the alert condition is no longer true. For example, if an alert is triggered because your CPU is too high, you will still receive the alert after CPU goes back to normal levels. |

SMTP options:

| Field | Type | Description |
|------|--------------|------|
| Sender | String | Enter an email address available on your SMTP mail server that you want to send the notification from. |
| Host | String | Enter the IP address or hostname for your SMTP server. Example: `smtp.email.com` |
| Use TLS | Bool | Use TLS for encryption. |
| Username | String | Enter a username to authenticate with the SMTP server. |
| Password | String | Enter a password to authenticate with the SMTP server. |

### PagerDuty

| Field | Type | Description |
|------|------|-------|
| Integration Type | String | Events API v2 or Prometheus. |
| Default Integration Key | String | For instructions to get an integration key, see the [PagerDuty documentation.](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) |
| Proxy URL | String | Proxy for the PagerDuty notifications. |
| Enable send resolved alerts | Bool | When true, you will receive alerts through the notifier even if the alert condition is no longer true. For example, if an alert is triggered because your CPU is too high, you will still receive the alert after CPU goes back to normal levels. |

### Opsgenie

| Field | Description |
|------|-------------|
| API Key | For instructions to get an API key, refer to the [Opsgenie documentation.](https://docs.opsgenie.com/docs/api-key-management) |
| Proxy URL | Proxy for the Opsgenie notifications. |
| Enable send resolved alerts | When true, you will receive alerts through the notifier even if the alert condition is no longer true. For example, if an alert is triggered because your CPU is too high, you will still receive the alert after CPU goes back to normal levels. |

Opsgenie Responders:

| Field | Type | Description |
|-------|------|--------|
| Type | String | Schedule, Team, User, or Escalation. For more information on alert responders, refer to the [Opsgenie documentation.](https://docs.opsgenie.com/docs/alert-recipients-and-teams) |
| Send To | String | Id, Name, or Username of the Opsgenie recipient. |

### Webhook

| Field | Description |
|-------|--------------|
| URL | Webhook URL for the app of your choice. |
| Proxy URL | Proxy for the webhook notifications. |
| Enable send resolved alerts | When true, you will receive alerts through the notifier even if the alert condition is no longer true. For example, if an alert is triggered because your CPU is too high, you will still receive the alert after CPU goes back to normal levels. |

### Custom

The YAML provided here will be directly appended to your receiver within the Alertmanager Config Secret.
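For example, a provider without a dedicated form could be supplied as a YAML fragment. The sketch below uses a VictorOps integration (`victorops_configs` is a standard Alertmanager receiver field; the key values are placeholders):

```yaml
victorops_configs:
- api_key: <api key here>           # placeholder
  routing_key: <routing key here>   # placeholder
  message_type: CRITICAL
```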

{{% /tab %}}
{{% tab "Rancher v2.5.0-2.5.3" %}}
The Alertmanager must be configured in YAML, as shown in this [example.](#example-alertmanager-config)
{{% /tab %}}
{{% /tabs %}}

# Route Configuration

{{% tabs %}}
{{% tab "Rancher v2.5.4+" %}}

### Receiver
The route needs to refer to a [receiver](#receiver-configuration) that has already been configured.

### Grouping

| Field | Default | Description |
|-------|--------------|---------|
| Group By | N/a | The labels by which incoming alerts are grouped together, for example, `[ group_by: '[' <labelname>, ... ']' ]`. Multiple alerts coming in for labels such as `cluster=A` and `alertname=LatencyHigh` can be batched into a single group. To aggregate by all possible labels, use the special value `'...'` as the sole label name, for example: `group_by: ['...']`. Grouping by `...` effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping. |
| Group Wait | 30s | How long to wait to buffer alerts of the same group before sending initially. |
| Group Interval | 5m | How long to wait before sending an alert that has been added to a group of alerts for which an initial notification has already been sent. |
| Repeat Interval | 4h | How long to wait before re-sending a given alert that has already been sent. |
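Taken together, the grouping fields above map onto a route block like the following sketch (the receiver name and `group_by` labels are illustrative):

```yaml
route:
  receiver: 'slack-notifications'     # must refer to an already-configured receiver
  group_by: ['cluster', 'alertname']  # labels to group incoming alerts by
  group_wait: 30s                     # buffer before the first notification for a group
  group_interval: 5m                  # wait before notifying about additions to a group
  repeat_interval: 4h                 # wait before re-sending an already-sent alert
```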

### Matching

The **Match** field refers to a set of equality matchers that an alert has to fulfill to match the node. When you add key-value pairs in the Rancher UI, they correspond to YAML in this format:

```yaml
match:
  [ <labelname>: <labelvalue>, ... ]
```

The **Match Regex** field refers to a set of regex matchers that an alert has to fulfill to match the node. When you add key-value pairs in the Rancher UI, they correspond to YAML in this format:

```yaml
match_re:
  [ <labelname>: <regex>, ... ]
```
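For instance, a route that matches alerts whose `severity` label equals `critical`, and whose `alertname` matches one of two illustrative names, would combine the two fields like this:

```yaml
match:
  severity: critical                      # equality matcher
match_re:
  alertname: HighNodeCPU|HighNodeMemory   # regex matcher; names are illustrative
```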

{{% /tab %}}
{{% tab "Rancher v2.5.0-2.5.3" %}}
The Alertmanager must be configured in YAML, as shown in this [example.](#example-alertmanager-config)
{{% /tab %}}
{{% /tabs %}}

# Example Alertmanager Config

To set up notifications via Slack, the following Alertmanager Config YAML can be placed into the `alertmanager.yaml` key of the Alertmanager Config Secret, where the `api_url` should be updated to use your Webhook URL from Slack:

```yaml
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - send_resolved: true
    text: '{{ template "slack.rancher.text" . }}'
    api_url: <user-provided slack webhook url here>
templates:
- /etc/alertmanager/config/*.tmpl
```

@@ -0,0 +1,432 @@
---
title: Prometheus Expressions
weight: 4
aliases:
- /rancher/v2.x/en/project-admin/tools/monitoring/expression
- /rancher/v2.x/en/cluster-admin/tools/monitoring/expression
- /rancher/v2.x/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/expression
---

The PromQL expressions in this doc can be used to configure alerts.

For more information about querying Prometheus, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/)

<!-- TOC -->

- [Cluster Metrics](#cluster-metrics)
- [Cluster CPU Utilization](#cluster-cpu-utilization)
- [Cluster Load Average](#cluster-load-average)
- [Cluster Memory Utilization](#cluster-memory-utilization)
- [Cluster Disk Utilization](#cluster-disk-utilization)
- [Cluster Disk I/O](#cluster-disk-i-o)
- [Cluster Network Packets](#cluster-network-packets)
- [Cluster Network I/O](#cluster-network-i-o)
- [Node Metrics](#node-metrics)
- [Node CPU Utilization](#node-cpu-utilization)
- [Node Load Average](#node-load-average)
- [Node Memory Utilization](#node-memory-utilization)
- [Node Disk Utilization](#node-disk-utilization)
- [Node Disk I/O](#node-disk-i-o)
- [Node Network Packets](#node-network-packets)
- [Node Network I/O](#node-network-i-o)
- [Etcd Metrics](#etcd-metrics)
- [Etcd Has a Leader](#etcd-has-a-leader)
- [Number of Times the Leader Changes](#number-of-times-the-leader-changes)
- [Number of Failed Proposals](#number-of-failed-proposals)
- [GRPC Client Traffic](#grpc-client-traffic)
- [Peer Traffic](#peer-traffic)
- [DB Size](#db-size)
- [Active Streams](#active-streams)
- [Raft Proposals](#raft-proposals)
- [RPC Rate](#rpc-rate)
- [Disk Operations](#disk-operations)
- [Disk Sync Duration](#disk-sync-duration)
- [Kubernetes Components Metrics](#kubernetes-components-metrics)
- [API Server Request Latency](#api-server-request-latency)
- [API Server Request Rate](#api-server-request-rate)
- [Scheduling Failed Pods](#scheduling-failed-pods)
- [Controller Manager Queue Depth](#controller-manager-queue-depth)
- [Scheduler E2E Scheduling Latency](#scheduler-e2e-scheduling-latency)
- [Scheduler Preemption Attempts](#scheduler-preemption-attempts)
- [Ingress Controller Connections](#ingress-controller-connections)
- [Ingress Controller Request Process Time](#ingress-controller-request-process-time)
- [Rancher Logging Metrics](#rancher-logging-metrics)
- [Fluentd Buffer Queue Rate](#fluentd-buffer-queue-rate)
- [Fluentd Input Rate](#fluentd-input-rate)
- [Fluentd Output Errors Rate](#fluentd-output-errors-rate)
- [Fluentd Output Rate](#fluentd-output-rate)
- [Workload Metrics](#workload-metrics)
- [Workload CPU Utilization](#workload-cpu-utilization)
- [Workload Memory Utilization](#workload-memory-utilization)
- [Workload Network Packets](#workload-network-packets)
- [Workload Network I/O](#workload-network-i-o)
- [Workload Disk I/O](#workload-disk-i-o)
- [Pod Metrics](#pod-metrics)
- [Pod CPU Utilization](#pod-cpu-utilization)
- [Pod Memory Utilization](#pod-memory-utilization)
- [Pod Network Packets](#pod-network-packets)
- [Pod Network I/O](#pod-network-i-o)
- [Pod Disk I/O](#pod-disk-i-o)
- [Container Metrics](#container-metrics)
- [Container CPU Utilization](#container-cpu-utilization)
- [Container Memory Utilization](#container-memory-utilization)
- [Container Disk I/O](#container-disk-i-o)

<!-- /TOC -->

# Cluster Metrics

### Cluster CPU Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))` |
| Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))` |

### Cluster Load Average

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>load1</td><td>`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`</td></tr><tr><td>load5</td><td>`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`</td></tr><tr><td>load15</td><td>`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>load1</td><td>`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})`</td></tr><tr><td>load5</td><td>`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})`</td></tr><tr><td>load15</td><td>`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})`</td></tr></table> |

### Cluster Memory Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)` |
| Summary | `1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)` |

### Cluster Disk Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `(sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)` |
| Summary | `(sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})` |

### Cluster Disk I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>read</td><td>`sum(rate(node_disk_read_bytes_total[5m])) by (instance)`</td></tr><tr><td>written</td><td>`sum(rate(node_disk_written_bytes_total[5m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>read</td><td>`sum(rate(node_disk_read_bytes_total[5m]))`</td></tr><tr><td>written</td><td>`sum(rate(node_disk_written_bytes_total[5m]))`</td></tr></table> |
### Cluster Network Packets
| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive-dropped</td><td><code>sum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>receive-errs</td><td><code>sum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>receive-packets</td><td><code>sum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>transmit-dropped</td><td><code>sum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>transmit-errs</td><td><code>sum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>transmit-packets</td><td><code>sum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr></table> |
| Summary | <table><tr><td>receive-dropped</td><td><code>sum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>receive-errs</td><td><code>sum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>receive-packets</td><td><code>sum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>transmit-dropped</td><td><code>sum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>transmit-errs</td><td><code>sum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>transmit-packets</td><td><code>sum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr></table> |
### Cluster Network I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive</td><td><code>sum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr><tr><td>transmit</td><td><code>sum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)</code></td></tr></table> |
| Summary | <table><tr><td>receive</td><td><code>sum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr><tr><td>transmit</td><td><code>sum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))</code></td></tr></table> |
# Node Metrics

### Node CPU Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)` |
| Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))` |

### Node Load Average

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>load1</td><td>`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr><tr><td>load5</td><td>`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr><tr><td>load15</td><td>`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr></table> |
| Summary | <table><tr><td>load1</td><td>`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr><tr><td>load5</td><td>`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr><tr><td>load15</td><td>`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`</td></tr></table> |

### Node Memory Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` |
| Summary | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` |
### Node Disk Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)` |
| Summary | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})` |

### Node Disk I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>read</td><td>`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`</td></tr><tr><td>written</td><td>`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`</td></tr></table> |
| Summary | <table><tr><td>read</td><td>`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`</td></tr><tr><td>written</td><td>`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`</td></tr></table> |

### Node Network Packets

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive-dropped</td><td><code>sum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>receive-errs</td><td><code>sum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>receive-packets</td><td><code>sum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>transmit-dropped</td><td><code>sum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>transmit-errs</td><td><code>sum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>transmit-packets</td><td><code>sum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr></table> |
| Summary | <table><tr><td>receive-dropped</td><td><code>sum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>receive-errs</td><td><code>sum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>receive-packets</td><td><code>sum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>transmit-dropped</td><td><code>sum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>transmit-errs</td><td><code>sum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>transmit-packets</td><td><code>sum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr></table> |
|
||||
|
||||
### Node Network I/O
|
||||
|
||||
| Catalog | Expression |
|
||||
| --- | --- |
|
||||
| Detail | <table><tr><td>receive</td><td><code>sum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr><tr><td>transmit</td><td><code>sum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)</code></td></tr></table> |
|
||||
| Summary | <table><tr><td>receive</td><td><code>sum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr><tr><td>transmit</td><td><code>sum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))</code></td></tr></table> |
|
||||
|
||||
# Etcd Metrics

### Etcd Has a Leader

`max(etcd_server_has_leader)`

### Number of Times the Leader Changes

`max(etcd_server_leader_changes_seen_total)`

### Number of Failed Proposals

`sum(etcd_server_proposals_failed_total)`

### GRPC Client Traffic

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>in</td><td>`sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)`</td></tr><tr><td>out</td><td>`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>in</td><td>`sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))`</td></tr><tr><td>out</td><td>`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))`</td></tr></table> |

### Peer Traffic

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>in</td><td>`sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)`</td></tr><tr><td>out</td><td>`sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>in</td><td>`sum(rate(etcd_network_peer_received_bytes_total[5m]))`</td></tr><tr><td>out</td><td>`sum(rate(etcd_network_peer_sent_bytes_total[5m]))`</td></tr></table> |

### DB Size

| Catalog | Expression |
| --- | --- |
| Detail | `sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)` |
| Summary | `sum(etcd_debugging_mvcc_db_total_size_in_bytes)` |

### Active Streams

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>lease-watch</td><td>`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance)`</td></tr><tr><td>watch</td><td>`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>lease-watch</td><td>`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})`</td></tr><tr><td>watch</td><td>`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`</td></tr></table> |

### Raft Proposals

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>applied</td><td>`sum(increase(etcd_server_proposals_applied_total[5m])) by (instance)`</td></tr><tr><td>committed</td><td>`sum(increase(etcd_server_proposals_committed_total[5m])) by (instance)`</td></tr><tr><td>pending</td><td>`sum(increase(etcd_server_proposals_pending[5m])) by (instance)`</td></tr><tr><td>failed</td><td>`sum(increase(etcd_server_proposals_failed_total[5m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>applied</td><td>`sum(increase(etcd_server_proposals_applied_total[5m]))`</td></tr><tr><td>committed</td><td>`sum(increase(etcd_server_proposals_committed_total[5m]))`</td></tr><tr><td>pending</td><td>`sum(increase(etcd_server_proposals_pending[5m]))`</td></tr><tr><td>failed</td><td>`sum(increase(etcd_server_proposals_failed_total[5m]))`</td></tr></table> |

### RPC Rate

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>total</td><td>`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)`</td></tr><tr><td>fail</td><td>`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>total</td><td>`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))`</td></tr><tr><td>fail</td><td>`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))`</td></tr></table> |

### Disk Operations

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>commit-called-by-backend</td><td>`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)`</td></tr><tr><td>fsync-called-by-wal</td><td>`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>commit-called-by-backend</td><td>`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))`</td></tr><tr><td>fsync-called-by-wal</td><td>`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))`</td></tr></table> |

### Disk Sync Duration

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>wal</td><td>`histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))`</td></tr><tr><td>db</td><td>`histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))`</td></tr></table> |
| Summary | <table><tr><td>wal</td><td>`sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))`</td></tr><tr><td>db</td><td>`sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))`</td></tr></table> |
# Kubernetes Components Metrics

### API Server Request Latency

| Catalog | Expression |
| --- | --- |
| Detail | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06` |
| Summary | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06` |

### API Server Request Rate

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(apiserver_request_count[5m])) by (instance, code)` |
| Summary | `sum(rate(apiserver_request_count[5m])) by (instance)` |

### Scheduling Failed Pods

| Catalog | Expression |
| --- | --- |
| Detail | `sum(kube_pod_status_scheduled{condition="false"})` |
| Summary | `sum(kube_pod_status_scheduled{condition="false"})` |

### Controller Manager Queue Depth

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>volumes</td><td>`sum(volumes_depth) by (instance)`</td></tr><tr><td>deployment</td><td>`sum(deployment_depth) by (instance)`</td></tr><tr><td>replicaset</td><td>`sum(replicaset_depth) by (instance)`</td></tr><tr><td>service</td><td>`sum(service_depth) by (instance)`</td></tr><tr><td>serviceaccount</td><td>`sum(serviceaccount_depth) by (instance)`</td></tr><tr><td>endpoint</td><td>`sum(endpoint_depth) by (instance)`</td></tr><tr><td>daemonset</td><td>`sum(daemonset_depth) by (instance)`</td></tr><tr><td>statefulset</td><td>`sum(statefulset_depth) by (instance)`</td></tr><tr><td>replicationmanager</td><td>`sum(replicationmanager_depth) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>volumes</td><td>`sum(volumes_depth)`</td></tr><tr><td>deployment</td><td>`sum(deployment_depth)`</td></tr><tr><td>replicaset</td><td>`sum(replicaset_depth)`</td></tr><tr><td>service</td><td>`sum(service_depth)`</td></tr><tr><td>serviceaccount</td><td>`sum(serviceaccount_depth)`</td></tr><tr><td>endpoint</td><td>`sum(endpoint_depth)`</td></tr><tr><td>daemonset</td><td>`sum(daemonset_depth)`</td></tr><tr><td>statefulset</td><td>`sum(statefulset_depth)`</td></tr><tr><td>replicationmanager</td><td>`sum(replicationmanager_depth)`</td></tr></table> |

### Scheduler E2E Scheduling Latency

| Catalog | Expression |
| --- | --- |
| Detail | `histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06` |
| Summary | `sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)` |

### Scheduler Preemption Attempts

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)` |
| Summary | `sum(rate(scheduler_total_preemption_attempts[5m]))` |

### Ingress Controller Connections

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>reading</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)`</td></tr><tr><td>waiting</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)`</td></tr><tr><td>writing</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)`</td></tr><tr><td>accepted</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)`</td></tr><tr><td>active</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)`</td></tr><tr><td>handled</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)`</td></tr></table> |
| Summary | <table><tr><td>reading</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="reading"})`</td></tr><tr><td>waiting</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})`</td></tr><tr><td>writing</td><td>`sum(nginx_ingress_controller_nginx_process_connections{state="writing"})`</td></tr><tr><td>accepted</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))`</td></tr><tr><td>active</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))`</td></tr><tr><td>handled</td><td>`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))`</td></tr></table> |

### Ingress Controller Request Process Time

| Catalog | Expression |
| --- | --- |
| Detail | `topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` |
| Summary | `topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` |
# Rancher Logging Metrics

### Fluentd Buffer Queue Rate

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(fluentd_output_status_buffer_queue_length[5m])) by (instance)` |
| Summary | `sum(rate(fluentd_output_status_buffer_queue_length[5m]))` |

### Fluentd Input Rate

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(fluentd_input_status_num_records_total[5m])) by (instance)` |
| Summary | `sum(rate(fluentd_input_status_num_records_total[5m]))` |

### Fluentd Output Errors Rate

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(fluentd_output_status_num_errors[5m])) by (type)` |
| Summary | `sum(rate(fluentd_output_status_num_errors[5m]))` |

### Fluentd Output Rate

| Catalog | Expression |
| --- | --- |
| Detail | `sum(rate(fluentd_output_status_num_records_total[5m])) by (instance)` |
| Summary | `sum(rate(fluentd_output_status_num_records_total[5m]))` |
# Workload Metrics

### Workload CPU Utilization

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>cfs throttled seconds</td><td>`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>user seconds</td><td>`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>system seconds</td><td>`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>usage seconds</td><td>`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr></table> |
| Summary | <table><tr><td>cfs throttled seconds</td><td>`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>user seconds</td><td>`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>system seconds</td><td>`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>usage seconds</td><td>`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr></table> |

### Workload Memory Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)` |
| Summary | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})` |

### Workload Network Packets

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive-packets</td><td>`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>receive-dropped</td><td>`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>receive-errors</td><td>`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>transmit-packets</td><td>`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>transmit-dropped</td><td>`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>transmit-errors</td><td>`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr></table> |
| Summary | <table><tr><td>receive-packets</td><td>`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-dropped</td><td>`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-errors</td><td>`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-packets</td><td>`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-dropped</td><td>`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-errors</td><td>`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr></table> |

### Workload Network I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive</td><td>`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>transmit</td><td>`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr></table> |
| Summary | <table><tr><td>receive</td><td>`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit</td><td>`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr></table> |

### Workload Disk I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>read</td><td>`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr><tr><td>write</td><td>`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`</td></tr></table> |
| Summary | <table><tr><td>read</td><td>`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr><tr><td>write</td><td>`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`</td></tr></table> |
# Pod Metrics

### Pod CPU Utilization

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>cfs throttled seconds</td><td>`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`</td></tr><tr><td>usage seconds</td><td>`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`</td></tr><tr><td>system seconds</td><td>`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`</td></tr><tr><td>user seconds</td><td>`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`</td></tr></table> |
| Summary | <table><tr><td>cfs throttled seconds</td><td>`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`</td></tr><tr><td>usage seconds</td><td>`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`</td></tr><tr><td>system seconds</td><td>`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`</td></tr><tr><td>user seconds</td><td>`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`</td></tr></table> |

### Pod Memory Utilization

| Catalog | Expression |
| --- | --- |
| Detail | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)` |
| Summary | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})` |

### Pod Network Packets

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive-packets</td><td>`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-dropped</td><td>`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-errors</td><td>`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-packets</td><td>`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-dropped</td><td>`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-errors</td><td>`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr></table> |
| Summary | <table><tr><td>receive-packets</td><td>`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-dropped</td><td>`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>receive-errors</td><td>`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-packets</td><td>`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-dropped</td><td>`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit-errors</td><td>`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr></table> |

### Pod Network I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>receive</td><td>`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit</td><td>`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr></table> |
| Summary | <table><tr><td>receive</td><td>`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>transmit</td><td>`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr></table> |

### Pod Disk I/O

| Catalog | Expression |
| --- | --- |
| Detail | <table><tr><td>read</td><td>`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`</td></tr><tr><td>write</td><td>`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`</td></tr></table> |
| Summary | <table><tr><td>read</td><td>`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr><tr><td>write</td><td>`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`</td></tr></table> |
# Container Metrics

### Container CPU Utilization

| Catalog | Expression |
| --- | --- |
| cfs throttled seconds | `sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| usage seconds | `sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| system seconds | `sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| user seconds | `sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |

### Container Memory Utilization

`sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})`

### Container Disk I/O

| Catalog | Expression |
| --- | --- |
| read | `sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| write | `sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
---
title: PrometheusRules
weight: 2
---

The PrometheusRules CRD defines a group of Prometheus alerting and/or recording rules.

- [About PrometheusRule Custom Resources](#about-prometheusrule-custom-resources)
- [Creating PrometheusRules in the Rancher UI](#creating-prometheusrules-in-the-rancher-ui)
- [Configuration](#configuration)
  - [Rule Group](#rule-group)
  - [Alerting Rules](#alerting-rules)
  - [Recording Rules](#recording-rules)

# About PrometheusRule Custom Resources

Prometheus rule files are held in PrometheusRule custom resources.

A PrometheusRule custom resource defines a RuleGroup containing your desired rules. Each rule specifies the following:

- The name of the new alert or record
- A PromQL (Prometheus query language) expression for the new alert or record
- Labels that identify the alert or record (e.g., cluster name or severity)
- Annotations that encode any additional important pieces of information to display in the notification for an alert (e.g., summary, description, message, runbook URL). This field is not required for recording rules.

Alerting rules define alert conditions based on PromQL queries, and recording rules precompute frequently needed or computationally expensive queries at defined intervals.
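As a sketch of how these pieces fit together, a PrometheusRule containing one alerting rule and one recording rule could look like the following. The names, namespace, labels, and expressions are illustrative placeholders, not values required by Rancher:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules          # placeholder name
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: example.rules        # group name; must be unique within a rule file
    rules:
    # Alerting rule: fires after a target has been down for 5 minutes
    - alert: TargetDown
      expr: up{job="node-exporter"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Target {{ $labels.instance }} is unreachable"
    # Recording rule: precomputes a frequently needed aggregation
    - record: instance:node_cpu:rate5m
      expr: sum(rate(node_cpu_seconds_total[5m])) by (instance)
```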

For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec)

Use the label selector field `ruleSelector` in the Prometheus object to define the rule files that you want to be mounted into Prometheus.
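For instance, a Prometheus object could mount all PrometheusRules carrying a given label with a `ruleSelector` like the following sketch (the label key and value are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  # Mount all PrometheusRule resources whose labels match this selector
  ruleSelector:
    matchLabels:
      role: alert-rules
```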

For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)

# Creating PrometheusRules in the Rancher UI

_Available as of v2.5.4_

> **Prerequisite:** The monitoring application needs to be installed.

To create rule groups in the Rancher UI,

1. Click **Cluster Explorer > Monitoring** and click **Prometheus Rules.**
1. Click **Create.**
1. Enter a **Group Name.**
1. Configure the rules. A rule group may contain either alert rules or recording rules, but not both. For help filling out the forms, refer to the configuration options below.
1. Click **Create.**

**Result:** Alerts can be configured to send notifications to the receiver(s).
# Configuration

{{% tabs %}}
{{% tab "Rancher v2.5.4" %}}
Rancher v2.5.4 introduced the capability to configure rules by filling out forms in the Rancher UI.

### Rule Group

| Field | Description |
|-------|----------------|
| Group Name | The name of the group. Must be unique within a rules file. |
| Override Group Interval | Duration in seconds for how often rules in the group are evaluated. |

### Alerting Rules

[Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) allow you to define alert conditions based on PromQL (Prometheus expression language) expressions and to send notifications about firing alerts to an external service.

| Field | Description |
|-------|----------------|
| Alert Name | The name of the alert. Must be a valid label value. |
| Wait to fire for | Duration in seconds. Alerts are considered firing once they have been returned for this long. Alerts that have not yet fired for long enough are considered pending. |
| PromQL Expression | The PromQL expression to evaluate. Every evaluation cycle, this expression is evaluated at the current time, and all resultant time series become pending or firing alerts. For more information, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/) or our [example PromQL expressions.](../expression) |
| Labels | Labels to add or overwrite for each alert. |
| Severity | When enabled, labels are attached to the alert or record that identify it by the severity level. |
| Severity Label Value | Critical, warning, or none |
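Under the hood, these form fields correspond to the fields of a Prometheus Operator alerting rule. A sketch of the mapping, with placeholder metric names and values:

```yaml
# How the alerting-rule form fields map onto the underlying rule spec
- alert: HighErrorRate                   # Alert Name
  expr: rate(errors_total[5m]) > 0.5     # PromQL Expression (placeholder metric)
  for: 60s                               # Wait to fire for
  labels:
    severity: warning                    # Severity / Severity Label Value
  annotations:
    summary: "Error rate is above 0.5 per second"
```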

### Recording Rules

[Recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) allow you to precompute frequently needed or computationally expensive PromQL (Prometheus expression language) expressions and save their result as a new set of time series.

| Field | Description |
|-------|----------------|
| Time Series Name | The name of the time series to output to. Must be a valid metric name. |
| PromQL Expression | The PromQL expression to evaluate. Every evaluation cycle, this expression is evaluated at the current time, and the result is recorded as a new set of time series with the metric name given by the time series name. For more information about expressions, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/) or our [example PromQL expressions.](../expression) |
| Labels | Labels to add or overwrite before storing the result. |
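As with alerting rules, these form fields map onto a Prometheus Operator recording rule. A sketch with placeholder metric names:

```yaml
# How the recording-rule form fields map onto the underlying rule spec
- record: job:http_requests:rate5m                    # Time Series Name
  expr: sum(rate(http_requests_total[5m])) by (job)   # PromQL Expression
  labels:
    team: platform                                    # Labels (placeholder)
```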

{{% /tab %}}
{{% tab "Rancher v2.5.0-v2.5.3" %}}
For Rancher v2.5.0-v2.5.3, PrometheusRules must be configured in YAML. For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
{{% /tab %}}
{{% /tabs %}}