diff --git a/content/rancher/v2.x/en/cluster-admin/tools/logging/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/logging/_index.md
index b64b5e66501..60618c1b81e 100644
--- a/content/rancher/v2.x/en/cluster-admin/tools/logging/_index.md
+++ b/content/rancher/v2.x/en/cluster-admin/tools/logging/_index.md
@@ -38,23 +38,21 @@ Setting up a logging service to collect logs from your cluster/project has sever
 ## Logging Scope
 
-You can configure logging at either [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/logging/) or project level.
+You can configure logging at either cluster level or project level.
 
 - Cluster logging writes logs for every pod in the cluster, i.e. in all the projects. For [RKE clusters]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters), it also writes logs for all the Kubernetes system components.
-
-- Project logging writes logs for every pod in that particular project.
+- [Project logging]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/logging/) writes logs for every pod in that particular project.
 
 Logs that are sent to your logging service are from the following locations:
 
 - Pod logs stored at `/var/log/containers`.
-
 - Kubernetes system components logs stored at `/var/lib/rancher/rke/logs/`.
 
 ## Enabling Cluster Logging
 
 As an [administrator]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to send Kubernetes logs to a logging service.
 
-1. From the **Global** view, navigate to the cluster that you want to configure cluster logging for.
+1. From the **Global** view, navigate to the cluster for which you want to configure cluster logging.
 
 1. Select **Tools > Logging** in the navigation bar.
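As an editor's aside on the log locations named in the file above (not part of the upstream patch): the files under `/var/log/containers` are kubelet-created symlinks whose names follow the convention `<pod>_<namespace>_<container>-<container-id>.log`, so a log collector can recover pod metadata from the filename alone. A minimal Python sketch of that parsing, assuming the standard 64-character hexadecimal container ID:

```python
import re
from typing import Dict, Optional

# Kubelet symlinks pod logs into /var/log/containers using the
# convention <pod-name>_<namespace>_<container-name>-<container-id>.log,
# where the container ID is 64 hexadecimal characters.
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a /var/log/containers filename into pod, namespace,
    container name, and container ID; return None if it doesn't match."""
    match = LOG_NAME.match(filename)
    return match.groupdict() if match else None
```

Log collectors such as Fluentd derive pod metadata in essentially this way before forwarding logs to the configured logging service.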
diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/_index.md
index c1d414147a3..660ba5f4d07 100644
--- a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/_index.md
+++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/_index.md
@@ -11,62 +11,35 @@ Using Rancher, you can monitor the state and processes of your cluster nodes, Ku
 In other words, Prometheus lets you view metrics from your different Rancher and Kubernetes objects. Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or [Grafana](https://grafana.com/), which is an analytics viewing platform deployed along with Prometheus.
 
 By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, restore crashed servers, etc. Multi-tenancy, in the form of cluster-level and project-level Prometheus instances, is also supported.
-## In This Document - - - -- [Monitoring Scope](#monitoring-scope) - - + [Cluster Monitoring](#cluster-monitoring) - + [Project Monitoring](#project-monitoring) -- [Configuring Cluster Monitoring](#configuring-cluster-monitoring) -- [Configuring Project Monitoring](#configuring-project-monitoring) -- [Prometheus Configuration Options](#prometheus-configuration-options) - - + [Enable Node Exporter](#enable-node-exporter) - + [Persistent Storage](#persistent-storage) - + [Advanced Options](#advanced-options) -- [Viewing Metrics](#viewing-metrics) - - + [Rancher Dashboard](#rancher-dashboard) - + [Available Dashboard](#available-dashboard) - + [Grafana](#grafana) -- [Cluster Metrics](#cluster-metrics) -- [Etcd Metrics](#etcd-metrics) -- [Kubernetes Components Metrics](#kubernetes-components-metrics) -- [Rancher Logging Metrics](#rancher-logging-metrics) -- [Workload Metrics](#workload-metrics) -- [Custom Metrics](#custom-metrics) - - - ## Monitoring Scope -Using Prometheus, you can monitor Rancher at both the cluster and project level. Rancher deploys an individual Prometheus server per cluster, and an additional Prometheus server per Rancher project for multi-tenancy. +Using Prometheus, you can monitor Rancher at both the cluster level and [project level]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/). For each cluster and project that is enabled for monitoring, Rancher deploys a Prometheus server. -[Cluster monitoring](#cluster-monitoring) allows you to view the health of a cluster's Kubernetes control plane and individual nodes. System administrators will likely be more interested in cluster monitoring, as administrators are more invested in the health of the Rancher control plane and cluster nodes. +- Cluster monitoring allows you to view the health of your Kubernetes cluster. Prometheus collects metrics from the cluster components below, which you can view in graphs and charts. 
-[Project monitoring]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/) lets you view the state of pods running in a given project. Users responsible for maintaining a project will be most interested in project monitoring, as it helps them keep their applications up and running for their users. When you enable monitoring for a Rancher project, Prometheus collects metrics from its deployed HTTP and TCP/UDP workloads.
+  - [Kubernetes control plane]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#kubernetes-components-metrics)
+  - [etcd database]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#etcd-metrics)
+  - [All nodes (including workers)]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#cluster-metrics)
 
-For the rest of this document, everything is referred to monitoring for a cluster.
+- [Project monitoring]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/) allows you to view the state of pods running in a given project. Prometheus collects metrics from the project's deployed HTTP and TCP/UDP workloads.
 
-## Cluster Monitoring
+## Enabling Cluster Monitoring
 
-When you enable monitoring for one of your Rancher clusters, Prometheus collects metrics from the cluster components below, which you can view in graphs and charts. We'll have more about the specific metrics collected later in this document.
+As an [administrator]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to deploy Prometheus to monitor your Kubernetes cluster.
 
-- [Kubernetes control plane](#kubernetes-components-metrics)
-- [etcd database](#etcd-metrics)
-- [All nodes (including workers)](#cluster-metrics)
+1. From the **Global** view, navigate to the cluster for which you want to configure cluster monitoring.
-## Configuring Monitoring
+1. Select **Tools > Monitoring** in the navigation bar.
 
-You can deploy Prometheus monitoring for a cluster, navigate to **Tools > Monitoring** as shown in the GIF below, which displays a user enabling cluster monitoring for a cluster named `local`. The only required action for deployment is to select the **Enable** option and click **Save**, but you might want to [customize configuration options](#prometheus-configuration-options) for your environment.
+1. Select **Enable** to show the [Prometheus configuration options]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/prometheus/). Review the [resource consumption recommendations](#resource-consumption) to ensure that your worker nodes and the Prometheus pod have enough resources. Enter your desired configuration options.
 
-Following Prometheus deployment, two monitoring applications are added to the cluster's `system` project's **Apps** page: `cluster-monitoring` and `monitoring-operator`. You can use the `cluster-monitoring` catalog app to [access the Grafana instance](#grafana-accessing-for-clusters) for the cluster.
+1. Click **Save**.
+
+**Result:** The Prometheus server is deployed, along with two monitoring applications, `cluster-monitoring` and `monitoring-operator`, which are added as [applications]({{< baseurl >}}/rancher/v2.x/en/catalog/apps/) to the cluster's `system` project. After the applications are `active`, you can start viewing [cluster metrics]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/) through the [Rancher dashboard]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#rancher-dashboard) or directly from [Grafana]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#grafana).
 
 ### Resource Consumption
 
-When enabling cluster level monitoring, you will need to ensure your worker nodes and Prometheus pod have enough resources. The tables below provides a guide of how much resource consumption will be used.
+When enabling cluster monitoring, you need to ensure your worker nodes and the Prometheus pod have enough resources. The tables below provide a guide to how much of each resource will be consumed.
 
 #### Prometheus Pod Resource Consumption
 
@@ -88,207 +61,3 @@ Node Exporter (Per Node) | 100 | 30
 Kube State Cluster Monitor | 100 | 130
 Grafana | 100 | 150
 Prometheus Cluster Monitoring Nginx | 50 | 50
-
-## Prometheus Configuration Options
-
-While configuring monitoring at either the cluster or project level, you can choose options to customize your monitoring settings.
-
-Option | Description
--------|-------------
-Data Retention | Configures how long your Prometheus instance retains monitoring data scraped from Rancher objects before it's purged.
-Enable Node Exporter | Configures using [Node Exporter](https://github.com/prometheus/node_exporter/blob/master/README.md) or not, please take a look at the [notes](#enable-node-exporter).
-Node Exporter Host Port | Configures the host port on which [Node Exporter](https://github.com/prometheus/node_exporter/blob/master/README.md) data is exposed (i.e., data that Prometheus collects from your node hardware), if enabling Node Exporter.
-Enable Persistent Storage for Prometheus | Lets you configure storage for Prometheus so that you can retain your metric data if your Prometheus pod fails. See [Persistent Storage](#persistent-storage).
-Enable Persistent Storage for Grafana | Lets you configure storage so that you can retain your dashboards and configuration if your Grafana pod fails. See [Persistent Storage](#persistent-storage).
-Prometheus CPU Limit | Configures [the CPU resource limits](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) of the Promehtues pod.
-Prometheus CPU Reservation | Configures [the CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) of the Promehtues pod. -Prometheus Memory Limit | Configures [the Memory resource limits](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) of the Promehtues pod. -Prometheus Memory Reservation | Configures [the Memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) of the Promehtues pod. -Add Selector | If you want to deploy the Prometheus/Grafana pods to a specific node when enable monitoring, add selectors to the pods so that they're deployed to your selected node(s). To use this option, you must first apply labels to your nodes. - -### Enable Node Exporter - -Node Exporter is a popular open source exporter which can expose the metrics for hardware and \*NIX kernels OS, it is designed to monitor the host system. However, there are still issues with namespaces when running it in a container, mostly around filesystem mount spaces. So if we need to monitor the actual network stats for the container network, we must deploy it with `hostNetwork` mode. - -Firstly, you need to consider which host port should expose to avoid port conflicts and fill into `Node Exporter Host Port` field. Secondly, you must open that port to allow the internal traffic from `Prometheus`. - -### Persistent Storage - ->**Prerequisite:** Configure one or more [storage class]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#adding-storage-classes) to use as [persistent storage]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/) for your Prometheus/Grafana instance. - -By default, when you enable Prometheus for either a cluster or project, all monitoring data that Prometheus collects is stored on its own pod. 
This local storage means that if your Prometheus/Grafana pods fail, you'll lose all your monitoring data. Therefore, we recommend configuring persistent storage external to your cluster. This way, if your Prometheus/Grafana pods fail, the new pods that replace them can recover using your persistent storage. - -You can configure persistent storage for Prometheus and/or Grafana by using the radio buttons available when completing either [Configuring Cluster Monitoring](#configuring-cluster-monitoring) or [Configuring Project Monitoring](#configuring-project-monitoring). After enabling persistent storage, you'll then need to specify a [storage class]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#storage-classes) that's used to provision a [persistent volume]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#persistent-volumes), along with the size of the volume that's being provisioned. - -### Advanced Options - ->**Warning:** Monitoring app is [a specially designed app](https://github.com/rancher/system-charts/tree/dev/charts/rancher-monitoring). Any modification without familiarizing the entire app can lead to catastrophic errors. - -Monitoring is driven by [Rancher catalog application]({{< baseurl >}}/rancher/v2.x/en/catalog/apps/), so you can expand all options by clicking the **Show advanced options** and then configure it as you would configure any other app. - -## Viewing Metrics - -After you've deployed Prometheus to a cluster or project, you can view that data in one of two places: - -- [Rancher Dashboard](#cluster-dashboard) -- [Grafana](#grafana) - -### Rancher Dashboard - -After enabling cluster monitoring to one of your clusters, you can view the data it collects from the Rancher Dashboard. - ->**Note:** The Rancher Dashboard only displays Prometheus analytics for the cluster, not individual projects. 
If you want to view analytics for a project, you must [access the project's Grafana instance](#grafana-accessing-for-projects). - -#### Rancher Dashboard Use - -Prometheus metrics are displayed below the main dashboard display, and are denoted with the Grafana icon as displayed below. - ->**Tip:** Click the icon to open the metrics in [Grafana](#grafana). - -In each Prometheus metrics widget, you can toggle between a **Detail** view, which displays graphs and charts that let you view each event in a Prometheus time series, or a **Summary** view, which only lists events in a Prometheus time series out of the norm. - -You can also change the range of the time series that you're viewing to see a more refined or expansive data sample. - -Finally, you can customize the data sample to display data between chosen dates and times. - -### Available Dashboard - -After deploying Prometheus to a cluster, you can view the metrics from its Dashboard. - -When analyzing metrics, don't be concerned about any single standalone metric in the charts and graphs. Rather, you should establish a baseline for your metrics over the course of time (i.e., the range of values that your components usually operate within and are considered normal). After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate. - -### Grafana - -Your other option for viewing cluster data is Grafana, which is a leading open source platform for analytics and monitoring. - -Grafana allows you to query, visualize, alert, and ultimately, understand your cluster and workload data. - -For more information on Grafana and its capabilities, visit the [Grafana website](https://grafana.com/grafana). - -#### Accessing Grafana - -When enable monitoring, Rancher automatically creates a link to Grafana instance. Use this link to view monitoring data for the cluster or project. 
- -##### Grafana and Authentication - -When you deploy Prometheus to a cluster or project, Rancher automatically creates a Grafana instance for the object. Rancher determines which users can access the new Grafana instance, as well as the objects they can view within it, by validating them against cluster or project membership. Users that hold membership for the object will be able to access its Grafana instance. In other words, users' access in Grafana mirrors their access in Rancher. - -##### Grafana: Accessing for Clusters - -To access an instance of Grafana displays monitoring analytics for a cluster, browse to the cluster's `system` project and open **Apps**. From the `cluster-monitoring` catalog app, click the `/index.html` link. To view data for your cluster navigate to the cluster's _system_ project. - -##### Grafana: Accessing for Projects - -To access an instance of Grafana that's monitoring a project, browse to the applicable cluster and project. Then open **Apps**. From the `project-monitoring` catalog app, click the `/index.html` link. - -#### Manage Grafana - -To manage your cluster or project Grafana, you can sign into it by using `admin/admin`. For security, you should change the default password after first login. - -The preset Grafana dashboards are imported via [Grafana provisioning mechanism](http://docs.grafana.org/administration/provisioning/#dashboards), so you cannot modify them directly. A workaround, for now, is to clone the original and then modify the new copy. - -## Cluster Metrics - -These metrics display the hardware utilization for all nodes in your cluster, regardless of its Kubernetes Role. They give you a global monitoring insight into the cluster. - -Some of the biggest metrics to look out for: - -- **CPU Utilization** - - High load either indicates that your cluster is running efficiently (😄) or that you're running out of CPU resources (😞). 
- -- **Disk Utilization** - - Be on the lookout for increased read and write rates on nodes nearing their disk capacity. This advice is especially true for etcd nodes, as running out of storage on an etcd node leads to cluster failure. - -- **Memory Utilization** - - Deltas in memory utilization usually indicate a memory leak. - -- **Load Average** - - Generally, you want your load average to match your number of logical CPUs for the cluster. For example, if your cluster has 8 logical CPUs, the ideal load average would be 8 as well. If you load average is well under the number of logical CPUs for the cluster, you may want to reduce cluster resources. On the other hand, if your average is over 8, your cluster may need more resources. - -To view the data for one node, browse into the **Nodes** and go into a node view to look for the **Node Metrics**. - -[_Get expressions for Cluster Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#cluster-metrics) - -## Etcd Metrics - ->**Note:** Supported in [the cluster launched by Rancher]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters). - -These metrics display the operations of the etcd database on each of your cluster nodes. After establishing a baseline of normal etcd operational metrics, observe them for abnormal deltas between metric refreshes, which indicate potential issues with etcd. Always address etcd issues immediately! - -You should also pay attention to the text at the top of the etcd metrics, which displays leadership statistics. This text indicates if etcd currently has a leader, which is the etcd instance that coordinates the other etcd instances in your cluster. A large increase in leader changes implies etcd is unstable. If you notice a change in leadership statistics, you should investigate them for issues. 
- -Some of the biggest metrics to look out for: - -- **Etcd has a leader** - - etcd is usually deployed on multiple nodes and elects a leader to coordinate its operations. If etcd does not have a leader, its operations are not being coordinated. - -- **Number of leader changes** - - If this statistic suddenly grows, it usually indicates network communication issues that constantly force the cluster to elect a new leader. - -[_Get expressions for Etcd Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#etcd-metrics) - -## Kubernetes Components Metrics - -These metrics display data about the cluster's individual Kubernetes components. Primarily, it displays information about connections and latency for each component: the API server, controller manager, scheduler, and ingress controller. - ->**Note:** The metrics for the controller manager, scheduler and ingress controller are only supported in [the cluster launched by Rancher]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters). - -When analyzing Kubernetes component metrics, don't be concerned about any single standalone metric in the charts and graphs that display. Rather, you should establish a baseline for metrics considered normal following a period of observation (i.e., the range of values that your components usually operate within and are considered normal). After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate. - -Some of the more important component metrics to monitor are: - -- **API Server Request Latency** - - Increasing API response times indicate there's a generalized problem that requires investigation. - -- **API Server Request Rate** - - Rising API request rates usually coincide with increased API response times. Increased request rates also indicate a generalized problem requiring investigation. 
- -- **Scheduler Preemption Attempts** - - If you see a spike in scheduler preemptions, it's an indication that you're running out of hardware resources, as Kubernetes is recognizing it doesn't have enough resources to run all your pods and is prioritizing the more important ones. - -- **Scheduling Failed Pods** - - Failed pods can have a variety of causes, such as unbound persistent volume claims, exhausted hardware resources, non-responsive nodes, etc. - -- **Ingress Controller Request Process Time** - - How fast ingress is routing connections to your cluster services. - -[_Get expressions for Kubernetes Component Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#kubernetes-component-metrics) - -## Rancher Logging Metrics - -Although the Dashboard for a cluster primary displays data sourced from Prometheus, it also displays information for cluster logging, provided that you have configured Rancher to use a logging service. - -For more information about enabling logging for a cluster, see [logging]({{< baseurl >}}/rancher/v2.x/en/tools/logging). - -[_Get expressions for Rancher Logging Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#rancher-logging-metrics) - -## Workload Metrics - ->**Note:** Supported by [enabling cluster monitoring](#configuring-cluster-monitoring). - -These metrics display the hardware utilization for a Kubernetes workload. You can also view metrics for [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) and so on. - -To view the pod metrics, navigate into the pod view and click on **Pod Metrics**. 
You can also view the container metrics by navigating to **Container Metrics** option
-
-[_Get expressions for Workload Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#workload-metrics)
-
-## Custom Metrics
-
->**Note:** Supported by [enabling project monitoring](#configuring-project-monitoring).
-
-If you want to scrape the metrics from any [exporters](https://prometheus.io/docs/instrumenting/exporters/), you only need to set up some exposing endpoints on deploying but without configuring the project Prometheus directly.
-
-Imagine that you have deployed a [Redis](https://redis.io/) app/cluster in the namespace `redis-app` of the project `Datacenter`, and you are going to monitor it via [Redis exporter](https://github.com/oliver006/redis_exporter). By enabling project monitoring, you only need to configure **Custom Metrics** under **Advanced Options** as shown in the GIF below, and set the correct `Container Port`, `Path` and `Protocol`.
-
-![AddCustomMetrics]({{< baseurl >}}/img/rancher/add-custom-metrics.gif)
diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/_index.md
new file mode 100644
index 00000000000..9ccce603661
--- /dev/null
+++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/_index.md
@@ -0,0 +1,113 @@
+---
+title: Cluster Metrics
+weight: 3
+---
+
+_Available as of v2.2.0_
+
+Cluster metrics display the hardware utilization for all nodes in your cluster, regardless of their roles. They give you a global monitoring insight into the cluster.
+
+Some of the biggest metrics to look out for:
+
+- **CPU Utilization**
+
+  High load either indicates that your cluster is running efficiently or that you're running out of CPU resources.
+
+- **Disk Utilization**
+
+  Be on the lookout for increased read and write rates on nodes nearing their disk capacity. This advice is especially true for etcd nodes, as running out of storage on an etcd node leads to cluster failure.
+
+- **Memory Utilization**
+
+  Deltas in memory utilization usually indicate a memory leak.
+
+- **Load Average**
+
+  Generally, you want your load average to match your number of logical CPUs for the cluster. For example, if your cluster has 8 logical CPUs, the ideal load average would be 8 as well. If your load average is well under the number of logical CPUs for the cluster, you may want to reduce cluster resources. On the other hand, if your average is over 8, your cluster may need more resources.
+
+## Finding Node Metrics
+
+1. From the **Global** view, navigate to the cluster for which you want to view metrics.
+
+1. Select **Nodes** in the navigation bar.
+
+1. Select a specific node and click on its name.
+
+1. Click on **Node Metrics**.
+
+[_Get expressions for Cluster Metrics_]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/#cluster-metrics)
+
+### Etcd Metrics
+
+>**Note:** Only supported for [Rancher launched Kubernetes clusters]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/).
+
+Etcd metrics display the operations of the etcd database on each of your cluster nodes. After establishing a baseline of normal etcd operational metrics, observe them for abnormal deltas between metric refreshes, which indicate potential issues with etcd. Always address etcd issues immediately!
+
+You should also pay attention to the text at the top of the etcd metrics, which displays leader election statistics. This text indicates if etcd currently has a leader, which is the etcd instance that coordinates the other etcd instances in your cluster. A large increase in leader changes implies etcd is unstable. If you notice a change in leader election statistics, you should investigate them for issues.
+
+Some of the biggest metrics to look out for:
+
+- **Etcd has a leader**
+
+  etcd is usually deployed on multiple nodes and elects a leader to coordinate its operations. If etcd does not have a leader, its operations are not being coordinated.
+
+- **Number of leader changes**
+
+  If this statistic suddenly grows, it usually indicates network communication issues that constantly force the cluster to elect a new leader.
+
+[_Get expressions for Etcd Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#etcd-metrics)
+
+### Kubernetes Components Metrics
+
+Kubernetes components metrics display data about the cluster's individual Kubernetes components. Primarily, they display information about connections and latency for each component: the API server, controller manager, scheduler, and ingress controller.
+
+>**Note:** The metrics for the controller manager, scheduler and ingress controller are only supported for [Rancher launched Kubernetes clusters]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/).
+
+When analyzing Kubernetes component metrics, don't be concerned about any single standalone metric in the charts and graphs that display. Rather, you should establish a baseline for metrics considered normal following a period of observation, i.e. the range of values within which your components usually operate. After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate.
+
+Some of the more important component metrics to monitor are:
+
+- **API Server Request Latency**
+
+  Increasing API response times indicate there's a generalized problem that requires investigation.
+
+- **API Server Request Rate**
+
+  Rising API request rates usually coincide with increased API response times. Increased request rates also indicate a generalized problem requiring investigation.
+
+- **Scheduler Preemption Attempts**
+
+  If you see a spike in scheduler preemptions, it's an indication that you're running out of hardware resources, as Kubernetes is recognizing it doesn't have enough resources to run all your pods and is prioritizing the more important ones.
+
+- **Scheduling Failed Pods**
+
+  Failed pods can have a variety of causes, such as unbound persistent volume claims, exhausted hardware resources, non-responsive nodes, etc.
+
+- **Ingress Controller Request Process Time**
+
+  How fast ingress is routing connections to your cluster services.
+
+[_Get expressions for Kubernetes Component Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#kubernetes-component-metrics)
+
+## Rancher Logging Metrics
+
+Although the Dashboard for a cluster primarily displays data sourced from Prometheus, it also displays information for cluster logging, provided that you have [configured Rancher to use a logging service]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/logging/).
+
+[_Get expressions for Rancher Logging Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#rancher-logging-metrics)
+
+## Finding Workload Metrics
+
+Workload metrics display the hardware utilization for a Kubernetes workload. You can also view metrics for [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [stateful sets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) and so on.
+
+1. From the **Global** view, navigate to the project for which you want to view workload metrics.
+
+1. Select **Workloads > Workloads** in the navigation bar.
+
+1. Select a specific workload and click on its name.
+
+1. In the **Pods** section, select a specific pod and click on its name.
+
+   - **View the Pod Metrics:** Click on **Pod Metrics**.
+   - **View the Container Metrics:** In the **Containers** section, select a specific container and click on its name. Click on **Container Metrics**.
+ +[_Get expressions for Workload Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#workload-metrics) diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md index 1cccdf62887..4b52c207fae 100644 --- a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md +++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/expression/_index.md @@ -1,6 +1,6 @@ --- title: Expression -weight: 10000 +weight: 4 --- ## In This Document @@ -8,13 +8,11 @@ weight: 10000 - [Cluster Metrics](#cluster-metrics) - + [Node Metrics](#node-metrics) - [Etcd Metrics](#etcd-metrics) - [Kubernetes Components Metrics](#kubernetes-components-metrics) - [Rancher Logging Metrics](#rancher-logging-metrics) - [Workload Metrics](#workload-metrics) - + [Pod Metrics](#pod-metrics) + [Container Metrics](#container-metrics) @@ -37,7 +35,7 @@ weight: 10000 | Summary |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})`
| - **Memory Utilization** - + | Catalog | Expression | | --- | --- | | Detail | `1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)` | @@ -49,7 +47,7 @@ weight: 10000 | --- | --- | | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)` | | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})` | - + - **Disk I/O** | Catalog | Expression | @@ -65,7 +63,7 @@ weight: 10000 | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| - **Network I/O** - + | Catalog | Expression | | --- | --- | | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| @@ -88,7 +86,7 @@ weight: 10000 | Summary |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| - **Memory Utilization** - + | Catalog | Expression | | --- | --- | | Detail | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` | @@ -100,7 +98,7 @@ weight: 10000 | --- | --- | | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)` | | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})` | - + - **Disk I/O** | Catalog | Expression | @@ -116,7 +114,7 @@ weight: 10000 | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| - **Network I/O** - + | Catalog | Expression | | --- | --- | | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| @@ -151,7 +149,7 @@ weight: 10000 | Summary |
in`sum(rate(etcd_network_peer_received_bytes_total[5m]))`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m]))`
| - **DB Size** - + | Catalog | Expression | | --- | --- | | Detail | `sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)` | @@ -230,7 +228,7 @@ weight: 10000 | Summary | `sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)` | - **Scheduler Preemption Attempts** - + | Catalog | Expression | | --- | --- | | Detail | `sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)` | @@ -274,7 +272,7 @@ weight: 10000 | Summary | `sum(rate(fluentd_output_status_num_errors[5m]))` | - **Fluentd Output Rate** - + | Catalog | Expression | | --- | --- | | Detail | `sum(rate(fluentd_output_status_num_records_total[5m])) by (instance)` | @@ -297,14 +295,14 @@ weight: 10000 | Summary | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})` | - **Network Packets** - + | Catalog | Expression | | --- | --- | | Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| | Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
| - **Network I/O** - + | Catalog | Expression | | --- | --- | | Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
| @@ -334,7 +332,7 @@ weight: 10000 | Summary | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})` | - **Network Packets** - + | Catalog | Expression | | --- | --- | | Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
| @@ -370,10 +368,8 @@ weight: 10000 `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})` - **Disk IO** - + | Catalog | Expression | | --- | --- | | read | `sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | | write | `sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` | - - diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/prometheus/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/prometheus/_index.md new file mode 100644 index 00000000000..bdfe0c32a85 --- /dev/null +++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/prometheus/_index.md @@ -0,0 +1,37 @@ +--- +title: Prometheus Configuration +weight: 1 +--- + +_Available as of v2.2.0_ + + +While configuring monitoring at either the [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring) or [project level]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/#enabling-project-monitoring), there are multiple options that can be configured. + +Option | Description +-------|------------- +Data Retention | How long your Prometheus instance retains monitoring data scraped from Rancher objects before it's purged. +[Enable Node Exporter](#node-exporter) | Whether or not to deploy the node exporter. +Node Exporter Host Port | The host port on which data is exposed, i.e. data that Prometheus collects from your node hardware. Required if you have enabled the node exporter. +[Enable Persistent Storage](#persistent-storage) for Prometheus | Whether or not to configure storage for Prometheus so that metrics can be retained even if the Prometheus pod fails. 
+[Enable Persistent Storage](#persistent-storage) for Grafana | Whether or not to configure storage for Grafana so that the Grafana dashboards and configuration can be retained even if the Grafana pod fails.
+Prometheus [CPU Limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) | CPU resource limit for the Prometheus pod.
+Prometheus [CPU Reservation](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) | CPU reservation for the Prometheus pod.
+Prometheus [Memory Limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) | Memory resource limit for the Prometheus pod.
+Prometheus [Memory Reservation](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) | Memory resource requests for the Prometheus pod.
+Selector | The nodes on which the Prometheus and Grafana pods are deployed. To use this option, the nodes must have labels.
+Advanced Options | Since monitoring is an [application](https://github.com/rancher/system-charts/tree/dev/charts/rancher-monitoring) from the [Rancher catalog]({{< baseurl >}}/rancher/v2.x/en/catalog/), it can be [configured like any other catalog application]({{< baseurl >}}/rancher/v2.x/en/catalog/apps/#configuration-options). _Warning: Modifying the application without understanding it entirely can lead to catastrophic errors._
+
+## Node Exporter
+
+The [node exporter](https://github.com/prometheus/node_exporter/blob/master/README.md) is a popular open source exporter that exposes hardware and \*NIX kernel metrics. It is designed to monitor the host system. However, there are still namespace issues when running it in a container, mostly around filesystem mount spaces.
In order to monitor actual network metrics for the container network, the node exporter must be deployed in `hostNetwork` mode.
+
+When configuring Prometheus and enabling the node exporter, enter a host port in the **Node Exporter Host Port** field that does not conflict with ports used by existing applications. The chosen host port must be open to allow internal traffic between Prometheus and the node exporter.
+
+## Persistent Storage
+
+>**Prerequisite:** Configure one or more [storage classes]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/volumes-and-storage/#adding-storage-classes) to use as [persistent storage]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/volumes-and-storage/) for your Prometheus or Grafana pod.
+
+By default, when you enable Prometheus for either a cluster or project, all monitoring data that Prometheus collects is stored on its own pod. With local storage, if the Prometheus or Grafana pods fail, all the data is lost. Rancher recommends configuring external persistent storage for the cluster. With external persistent storage, if the Prometheus or Grafana pods fail, the new pods can recover using data from the persistent storage.
+
+When enabling persistent storage for Prometheus or Grafana, specify the size of the persistent volume and select the [storage class]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/volumes-and-storage/#storage-classes).
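To illustrate the `hostNetwork` requirement discussed above, here is a minimal sketch of a node exporter DaemonSet using the host network and a host port. This is illustrative only: Rancher's monitoring app deploys and manages its own node exporter, and the names, namespace, image, and port below are assumptions.

```yaml
# Sketch only -- Rancher's monitoring chart manages the real deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: exporter-node          # assumed name
  namespace: cattle-prometheus # assumed namespace
spec:
  selector:
    matchLabels:
      app: exporter-node
  template:
    metadata:
      labels:
        app: exporter-node
    spec:
      hostNetwork: true        # required to see the node's real network stats
      containers:
      - name: exporter-node
        image: prom/node-exporter
        ports:
        - containerPort: 9796  # the "Node Exporter Host Port"; must be free
          hostPort: 9796       # and open for internal Prometheus traffic
```

Because the pod shares the host's network namespace, the `hostPort` it binds is why the chosen port must not conflict with other applications on the node.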
diff --git a/content/rancher/v2.x/en/cluster-admin/tools/monitoring/viewing-metrics/_index.md b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/viewing-metrics/_index.md
new file mode 100644
index 00000000000..74cd5a28a31
--- /dev/null
+++ b/content/rancher/v2.x/en/cluster-admin/tools/monitoring/viewing-metrics/_index.md
@@ -0,0 +1,59 @@
+---
+title: Viewing Metrics
+weight: 2
+---
+
+_Available as of v2.2.0_
+
+After you've enabled monitoring at either the [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring) or [project level]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/#enabling-project-monitoring), you will want to start viewing the data being collected. There are multiple ways to view this data.
+
+## Rancher Dashboard
+
+>**Note:** This is only available if you've enabled monitoring at the [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring). Project-specific analytics must be viewed using the project's Grafana instance.
+
+Rancher's dashboards are available at multiple locations:
+
+- **Cluster Dashboard**: From the **Global** view, navigate to the cluster.
+- **`System` Project Dashboard**: From the **Global** view, navigate to the **System** project of the cluster.
+- **Node Metrics**: From the **Global** view, navigate to the cluster. Select **Nodes**. Find the individual node and click on its name. Click **Node Metrics.**
+- **Pod Metrics**: From the **Global** view, navigate to the project. Select **Workloads > Workloads**. Find the individual workload and click on its name. Find the individual pod and click on its name. Click **Pod Metrics.**
+- **Container Metrics**: From the **Global** view, navigate to the project. Select **Workloads > Workloads**. Find the individual workload and click on its name. Find the individual pod and click on its name. Find the individual container and click on its name.
Click **Container Metrics.**
+
+Prometheus metrics are displayed and denoted with the Grafana icon. If you click on the icon, the metrics open in a new tab in Grafana.
+
+Within each Prometheus metrics widget, there are several ways to customize your view:
+
+- Toggle between two views:
+  - **Detail**: Displays graphs and charts that let you view each event in a Prometheus time series.
+  - **Summary**: Displays events in a Prometheus time series that are outside the norm.
+- Change the range of the time series that you're viewing to see a more refined or expansive data sample.
+- Customize the data sample to display data between specific dates and times.
+
+When analyzing these metrics, don't be concerned about any single standalone metric in the charts and graphs. Rather, establish a baseline for your metrics over the course of time, i.e. the range of values that your components usually operate within and that are considered normal. After you establish the baseline, be on the lookout for any large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate.
+
+## Grafana
+
+If you've enabled monitoring at either the [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#enabling-cluster-monitoring) or [project level]({{< baseurl >}}/rancher/v2.x/en/project-admin/tools/monitoring/#enabling-project-monitoring), Rancher automatically creates a link to a Grafana instance. Use this link to view monitoring data.
+
+Grafana allows you to query, visualize, and alert on your cluster and workload data, and ultimately understand it. For more information on Grafana and its capabilities, visit the [Grafana website](https://grafana.com/grafana).
+
+### Authentication
+
+Rancher determines which users can access the new Grafana instance, as well as the objects they can view within it, by validating them against the user's [cluster or project roles]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/). In other words, a user's access in Grafana mirrors their access in Rancher.
+
+### Accessing Grafana from the Grafana Instance
+
+1. From the **Global** view, navigate to the cluster whose Grafana instance you want to access.
+
+1. From the main navigation bar, choose **Apps**. In versions prior to v2.2.0, choose **Catalog Apps** on the main navigation bar.
+
+1. Find the application based on what level of metrics you want to view:
+
+    - **Cluster Level**: Find the `cluster-monitoring` application.
+    - **Project Level**: Find the `project-monitoring` application.
+
+1. Click the `/index.html` link. You will be redirected to a new webpage for Grafana, which shows metrics for either the cluster or project, depending on which application you selected.
+
+1. Sign in to Grafana. The default username is `admin` and the default password is `admin`. For security, Rancher recommends changing the default password after logging in.
+
+**Result:** You are logged into Grafana. After logging in, you can view the preset Grafana dashboards, which are imported via the [Grafana provisioning mechanism](http://docs.grafana.org/administration/provisioning/#dashboards), so you cannot modify them directly. For now, if you want to configure your own dashboards, clone the original and modify the new copy.
diff --git a/content/rancher/v2.x/en/project-admin/tools/logging/_index.md b/content/rancher/v2.x/en/project-admin/tools/logging/_index.md index 224d3a8c5f7..c156b81c4b5 100644 --- a/content/rancher/v2.x/en/project-admin/tools/logging/_index.md +++ b/content/rancher/v2.x/en/project-admin/tools/logging/_index.md @@ -35,9 +35,9 @@ Setting up a logging service to collect logs from your cluster/project has sever ## Logging Scope -You can configure logging at either [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/logging/) or project level. +You can configure logging at either cluster level or project level. -- Cluster logging writes logs for every pod in the cluster, i.e. in all the projects. For [RKE clusters]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters), it also writes logs for all the Kubernetes system components. +- [Cluster logging]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/logging/) writes logs for every pod in the cluster, i.e. in all the projects. For [RKE clusters]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters), it also writes logs for all the Kubernetes system components. - Project logging writes logs for every pod in that particular project. @@ -51,7 +51,7 @@ Logs that are sent to your logging service are from the following locations: As an [administrator]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/global-permissions/), [cluster owner or member]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#cluster-roles) or [project owner]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#project-roles), you can configure Rancher to send Kubernetes logs to a logging service. -1. From the **Global** view, navigate to the project that you want to configure project logging for. +1. From the **Global** view, navigate to the project that you want to configure project logging. 1. Select **Tools > Logging** in the navigation bar. 
In versions prior to v2.2.0, you can choose **Resources > Logging**. diff --git a/content/rancher/v2.x/en/project-admin/tools/monitoring/_index.md b/content/rancher/v2.x/en/project-admin/tools/monitoring/_index.md index 67c80cdb818..f7b50898c3b 100644 --- a/content/rancher/v2.x/en/project-admin/tools/monitoring/_index.md +++ b/content/rancher/v2.x/en/project-admin/tools/monitoring/_index.md @@ -11,298 +11,40 @@ Using Rancher, you can monitor the state and processes of your cluster nodes, Ku In other words, Prometheus lets you view metrics from your different Rancher and Kubernetes objects. Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or [Grafana](https://grafana.com/), which is an analytics viewing platform deployed along with Prometheus. By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, restore crashed servers, etc. Multi-tenancy support in terms of cluster and project-only Prometheus instances are also supported. 
-## In This Document - - - -- [Monitoring Scope](#monitoring-scope) - - + [Cluster Monitoring](#cluster-monitoring) - + [Project Monitoring](#project-monitoring) -- [Configuring Cluster Monitoring](#configuring-cluster-monitoring) -- [Configuring Project Monitoring](#configuring-project-monitoring) -- [Prometheus Configuration Options](#prometheus-configuration-options) - - + [Enable Node Exporter](#enable-node-exporter) - + [Persistent Storage](#persistent-storage) - + [Advanced Options](#advanced-options) -- [Viewing Metrics](#viewing-metrics) - - + [Rancher Dashboard](#rancher-dashboard) - + [Available Dashboard](#available-dashboard) - + [Grafana](#grafana) -- [Cluster Metrics](#cluster-metrics) -- [Etcd Metrics](#etcd-metrics) -- [Kubernetes Components Metrics](#kubernetes-components-metrics) -- [Rancher Logging Metrics](#rancher-logging-metrics) -- [Workload Metrics](#workload-metrics) -- [Custom Metrics](#custom-metrics) - - - ## Monitoring Scope -Using Prometheus, you can monitor Rancher at both the cluster and project level. Rancher deploys an individual Prometheus server per cluster, and an additional Prometheus server per Rancher project for multi-tenancy. +Using Prometheus, you can monitor Rancher at both the [cluster level]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/) and project level. For each cluster and project that is enabled for monitoring, Rancher deploys a Prometheus server. -[Cluster monitoring](#cluster-monitoring) allows you to view the health of a cluster's Kubernetes control plane and individual nodes. System administrators will likely be more interested in cluster monitoring, as administrators are more invested in the health of the Rancher control plane and cluster nodes. +- [Cluster monitoring]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/) allows you to view the health of your Kubernetes cluster. Prometheus collects metrics from the cluster components below, which you can view in graphs and charts. 
-[Project monitoring](#project-monitoring) lets you view the state of pods running in a given project. Users responsible for maintaining a project will be most interested in project monitoring, as it helps them keep their applications up and running for their users. + - [Kubernetes control plane]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#kubernetes-components-metrics) + - [etcd database]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#etcd-metrics) + - [All nodes (including workers)]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#cluster-metrics) -### Cluster Monitoring - -When you enable monitoring for one of your Rancher clusters, Prometheus collects metrics from the cluster components below, which you can view in graphs and charts. We'll have more about the specific metrics collected later in this document. - -- [Kubernetes control plane](#kubernetes-components-metrics) -- [etcd database](#etcd-metrics) -- [All nodes (including workers)](#cluster-metrics) - -### Project Monitoring - -When you enable monitoring for a Rancher project, Prometheus collects metrics from its deployed HTTP and TCP/UDP workloads. We'll have more about the specific metrics collected [later in this document](#custom-metrics). - -## Configuring Cluster Monitoring - -You can deploy Prometheus monitoring for a cluster, navigate to **Tools > Monitoring** as shown in the GIF below, which displays a user enabling cluster monitoring for a cluster named `local`. The only required action for deployment is to select the **Enable** option and click **Save**, but you might want to [customize configuration options](#prometheus-configuration-options) for your environment. 
- -![EnableClusterMonitoring]({{< baseurl >}}/img/rancher/enable-cluster-monitoring.gif) - -Following Prometheus deployment, two monitoring applications are added to the cluster's `system` project's **Apps** page: `cluster-monitoring` and `monitoring-operator`. You can use the `cluster-monitoring` catalog app to [access the Grafana instance](#grafana-accessing-for-clusters) for the cluster. - -### Resource Consumption - -When enabling cluster level monitoring, you will need to ensure your worker nodes and Prometheus pod have enough resources. The tables below provides a guide of how much resource consumption will be used. - -#### Prometheus Pod Resource Consumption - -This table is the resource consumption of the Prometheus pod, which is based on the number of all the nodes in the cluster. The count of nodes includes the worker, control plane and etcd nodes. Total disk space allocation should be approximated by the `rate * retention` period set at the cluster level. When enabling cluster level monitoring, you should adjust the CPU and Memory limits and reservation. - -Number of Cluster Nodes | CPU (milli CPU) | Memory | Disk -------------------------|-----|--------|------ -5 | 500 | 650 MB | ~1 GB/Day -50| 2000 | 2 GB | ~5 GB/Day -256| 4000 | 6 GB | ~18 GB/Day - -#### Other Pods Resource Consumption - -Besides the Prometheus pod, there are components that are deployed that require additional resources on the worker nodes. - -Pod | CPU (milli CPU) | Memory (MB) -----|-----------------|------------ -Node Exporter (Per Node) | 100 | 30 -Kube State Cluster Monitor | 100 | 130 -Grafana | 100 | 150 -Prometheus Cluster Monitoring Nginx | 50 | 50 +- Project monitoring allows you to view the state of pods running in a given project. Prometheus collects metrics from the project's deployed HTTP and TCP/UDP workloads. 
 ## Configuring Project Monitoring
 
-You can enable project monitoring by opening the project and then selecting **Tools > Monitoring** as shown in the GIF below, which displays enabling the `default` project monitoring.
+As an [administrator]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/global-permissions/), [cluster owner or member]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#cluster-roles) or [project owner]({{< baseurl >}}/rancher/v2.x/en/admin-settings/rbac/cluster-project-roles/#project-roles), you can configure Rancher to deploy Prometheus to monitor your project.
 
-![EnableProjectMonitoring]({{< baseurl >}}/img/rancher/enable-project-monitoring.gif)
+1. From the **Global** view, navigate to the project for which you want to configure project monitoring.
 
-After you enable project monitoring, a single application is added to the project's **Apps** page: `project-monitoring`. Use this catalog app to [access the Grafana instance](#grafana-accessing-for-projects) for the project.
+1. Select **Tools > Monitoring** in the navigation bar.
 
-With enabling cluster monitoring, you can collect the [Workload metrics](#workload-metrics) for this project, otherwise, you can only collect the [Custom metrics](#custom-metrics) from this project.
+1. Select **Enable** to show the [Prometheus configuration options]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/prometheus/). Enter your desired configuration options.
 
-## Prometheus Configuration Options
+1. Click **Save**.
 
-While configuring monitoring at either the cluster or project level, you can choose options to customize your monitoring settings. You can enable the options below while completing either [Configuring Cluster Monitoring](#configuring-cluster-monitoring) or [Configuring Project Monitoring](#configuring-project-monitoring).
+**Result:** A single application, `project-monitoring`, is added as an [application]({{< baseurl >}}/rancher/v2.x/en/catalog/apps/) to the cluster's `system` project. After the application is `active`, you can start viewing [project metrics](#project-metrics) through the [Rancher dashboard]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#rancher-dashboard) or directly from [Grafana]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/#grafana).
-
-Option | Description
--------|-------------
-Data Retention | Configures how long your Prometheus instance retains monitoring data scraped from Rancher objects before it's purged.
-Enable Node Exporter | Configures using [Node Exporter](https://github.com/prometheus/node_exporter/blob/master/README.md) or not, please take a look at the [notes](#enable-node-exporter).
-Node Exporter Host Port | Configures the host port on which [Node Exporter](https://github.com/prometheus/node_exporter/blob/master/README.md) data is exposed (i.e., data that Prometheus collects from your node hardware), if enabling Node Exporter.
-Enable Persistent Storage for Prometheus | Lets you configure storage for Prometheus so that you can retain your metric data if your Prometheus pod fails. See [Persistent Storage](#persistent-storage).
-Enable Persistent Storage for Grafana | Lets you configure storage so that you can retain your dashboards and configuration if your Grafana pod fails. See [Persistent Storage](#persistent-storage).
-Prometheus CPU Limit | Configures [the CPU resource limits](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) of the Promehtues pod.
-Prometheus CPU Reservation | Configures [the CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) of the Promehtues pod.
-Prometheus Memory Limit | Configures [the Memory resource limits](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) of the Promehtues pod. -Prometheus Memory Reservation | Configures [the Memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) of the Promehtues pod. -Add Selector | If you want to deploy the Prometheus/Grafana pods to a specific node when enable monitoring, add selectors to the pods so that they're deployed to your selected node(s). To use this option, you must first apply labels to your nodes. +## Project Metrics -### Enable Node Exporter +If [cluster monitoring]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/) is also enabled for the project, [workload metrics]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/cluster-metrics/#workload-metrics) are available for the project. -Node Exporter is a popular open source exporter which can expose the metrics for hardware and \*NIX kernels OS, it is designed to monitor the host system. However, there are still namespacing issues with running it in a container, mostly around filesystem mount spaces. So if we need to monitor the actual network stats for the container network, we must deploy it with `hostNetwork` mode. +If only project monitoring is enabled, you can monitor custom metrics from any [exporters](https://prometheus.io/docs/instrumenting/exporters/). You can expose some endpoints on deployments without needing to configure Prometheus for your project. -Firstly, you need to consider which host port should expose to avoid port conflicts and fill into `Node Exporter Host Port` field. Secondly, you must open that port to allow the internal traffic from `Prometheus`. +### Example -### Persistent Storage +A [Redis](https://redis.io/) application is deployed in the namespace `redis-app` in the project `Datacenter`. 
It is monitored via [Redis exporter](https://github.com/oliver006/redis_exporter). ->**Prerequisite:** Configure one or more [storage class]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#adding-storage-classes) to use as [persistent storage]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/) for your Prometheus/Grafana instance. - -By default, when you enable Prometheus for either a cluster or project, all monitoring data that Prometheus collects is stored on its own pod. This local storage means that if your Prometheus/Grafana pods fail, you'll lose all your monitoring data. Therefore, we recommend configuring persistent storage external to your cluster. This way, if your Prometheus/Grafana pods fail, the new pods that replace them can recover using your persistent storage. - -You can configure persistent storage for Prometheus and/or Grafana by using the radio buttons available when completing either [Configuring Cluster Monitoring](#configuring-cluster-monitoring) or [Configuring Project Monitoring](#configuring-project-monitoring). After enabling persistent storage, you'll then need to specify a [storage class]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#storage-classes) that's used to provision a [persistent volume]({{< baseurl >}}/rancher/v2.x/en/k8s-in-rancher/volumes-and-storage/#persistent-volumes), along with the size of the volume that's being provisioned. - -### Advanced Options - ->**Warning:** Monitoring app is [a specially designed app](https://github.com/rancher/system-charts/tree/dev/charts/rancher-monitoring). Any modification without familiarizing the entire app can lead to catastrophic errors. - -Monitoring is driven by [Rancher Catalog App]({{< baseurl >}}/rancher/v2.x/en/catalog), so you can expand all options by clicking the **Show advanced options** and then configure it as you wolud configure any other app. 
- -## Viewing Metrics - -After you've deployed Prometheus to a cluster or project, you can view that data in one of two places: - -- [Rancher Dashboard](#cluster-dashboard) -- [Grafana](#grafana) - -### Rancher Dashboard - -After enabling cluster monitoring to one of your clusters, you can view the data it collects from the Rancher Dashboard. - ->**Note:** The Rancher Dashboard only displays Prometheus analytics for the cluster, not individual projects. If you want to view analytics for a project, you must [access the project's Grafana instance](#grafana-accessing-for-projects). - -#### Rancher Dashboard Use - -Prometheus metrics are displayed below the main dashboard display, and are denoted with the Grafana icon as displayed below. - ->**Tip:** Click the icon to open the metrics in [Grafana](#grafana). - -In each Prometheus metrics widget, you can toggle between a **Detail** view, which displays graphs and charts that let you view each event in a Prometheus time series, or a **Summary** view, which only lists events in a Prometheus time series out of the norm. - -You can also change the range of the time series that you're viewing to see a more refined or expansive data sample. - -Finally, you can customize the data sample to display data between chosen dates and times. - -### Available Dashboard - -After deploying Prometheus to a cluster, you can view the metrics from its Dashboard. - -When analyzing metrics, don't be concerned about any single standalone metric in the charts and graphs. Rather, you should establish a baseline for your metrics over the course of time (i.e., the range of values that your components usually operate within and are considered normal). After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate. 
- -### Grafana - -Your other option for viewing cluster data is Grafana, which is a leading open source platform for analytics and monitoring. - -Grafana allows you to query, visualize, alert, and ultimately, understand your cluster and workload data. - -For more information on Grafana and its capabilities, visit the [Grafana website](https://grafana.com/grafana). - -#### Accessing Grafana - -When enable monitoring, Rancher automatically creates a link to Grafana instance. Use this link to view monitoring data for the cluster or project. - -##### Grafana and Authentication - -When you deploy Prometheus to a cluster or project, Rancher automatically creates a Grafana instance for the object. Rancher determines which users can access the new Grafana instance, as well as the objects they can view within it, by validating them against cluster or project membership. Users that hold membership for the object will be able to access its Grafana instance. In other words, users' access in Grafana mirrors their access in Rancher. - -##### Grafana: Accessing for Clusters - -To access an instance of Grafana displays monitoring analytics for a cluster, browse to the cluster's `system` project and open **Apps**. From the `cluster-monitoring` catalog app, click the `/index.html` link. To view data for your cluster navigate to the cluster's _system_ project. - -##### Grafana: Accessing for Projects - -To access an instance of Grafana that's monitoring a project, browse to the applicable cluster and project. Then open **Apps**. From the `project-monitoring` catalog app, click the `/index.html` link. - -#### Manage Grafana - -To manage your cluster or project Grafana, you can sign into it by using `admin/admin`. For security, you should change the default password after first login. - -The preset Grafana dashboards are imported via [Grafana provisioning mechanism](http://docs.grafana.org/administration/provisioning/#dashboards), so you cannot modify them directly. 
A workaround, for now, is to clone the original and then modify the new copy. - -## Cluster Metrics - -These metrics display the hardware utilization for all nodes in your cluster, regardless of its Kubernetes Role. They give you a global monitoring insight into the cluster. - -Some of the biggest metrics to look out for: - -- **CPU Utilization** - - High load either indicates that your cluster is running efficiently (😄) or that you're running out of CPU resources (😞). - -- **Disk Utilization** - - Be on the lookout for increased read and write rates on nodes nearing their disk capacity. This advice is especially true for etcd nodes, as running out of storage on an etcd node leads to cluster failure. - -- **Memory Utilization** - - Deltas in memory utilization usually indicate a memory leak. - -- **Load Average** - - Generally, you want your load average to match your number of logical CPUs for the cluster. For example, if your cluster has 8 logical CPUs, the ideal load average would be 8 as well. If you load average is well under the number of logical CPUs for the cluster, you may want to reduce cluster resources. On the other hand, if your average is over 8, your cluster may need more resources. - -To view the data for one node, browse into the **Nodes** and go into a node view to look for the **Node Metrics**. - -[_Get expressions for Cluster Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#cluster-metrics) - -## Etcd Metrics - ->**Note:** Supported in [the cluster launched by Rancher]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters). - -These metrics display the operations of the etcd database on each of your cluster nodes. After establishing a baseline of normal etcd operational metrics, observe them for abnormal deltas between metric refreshes, which indicate potential issues with etcd. Always address etcd issues immediately! 
- -You should also pay attention to the text at the top of the etcd metrics, which displays leadership statistics. This text indicates if etcd currently has a leader, which is the etcd instance that coordinates the other etcd instances in your cluster. A large increase in leader changes implies etcd is unstable. If you notice a change in leadership statistics, you should investigate them for issues. - -Some of the biggest metrics to look out for: - -- **Etcd has a leader** - - etcd is usually deployed on multiple nodes and elects a leader to coordinate its operations. If etcd does not have a leader, its operations are not being coordinated. - -- **Number of leader changes** - - If this statistic suddenly grows, it usually indicates network communication issues that constantly force the cluster to elect a new leader. - -[_Get expressions for Etcd Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#etcd-metrics) - -## Kubernetes Components Metrics - -These metrics display data about the cluster's individual Kubernetes components. Primarily, it displays information about connections and latency for each component: the API server, controller manager, scheduler, and ingress controller. - ->**Note:** The metrics for the controller manager, scheduler and ingress controller are only supported in [the cluster launched by Rancher]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters). - -When analyzing Kubernetes component metrics, don't be concerned about any single standalone metric in the charts and graphs that display. Rather, you should establish a baseline for metrics considered normal following a period of observation (i.e., the range of values that your components usually operate within and are considered normal). After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate. 
- -Some of the more important component metrics to monitor are: - -- **API Server Request Latency** - - Increasing API response times indicate there's a generalized problem that requires investigation. - -- **API Server Request Rate** - - Rising API request rates usually coincide with increased API response times. Increased request rates also indicate a generalized problem requiring investigation. - -- **Scheduler Preemption Attempts** - - If you see a spike in scheduler preemptions, it's an indication that you're running out of hardware resources, as Kubernetes is recognizing it doesn't have enough resources to run all your pods and is prioritizing the more important ones. - -- **Scheduling Failed Pods** - - Failed pods can have a variety of causes, such as unbound persistent volume claims, exhausted hardware resources, non-responsive nodes, etc. - -- **Ingress Controller Request Process Time** - - How fast ingress is routing connections to your cluster services. - -[_Get expressions for Kubernetes Component Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#kubernetes-component-metrics) - -## Rancher Logging Metrics - -Although the Dashboard for a cluster primary displays data sourced from Prometheus, it also displays information for cluster logging, provided that you have configured Rancher to use a logging service. - -For more information about enabling logging for a cluster, see [logging]({{< baseurl >}}/rancher/v2.x/en/tools/logging). - -[_Get expressions for Rancher Logging Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#rancher-logging-metrics) - -## Workload Metrics - ->**Note:** Supported by [enabling cluster monitoring](#configuring-cluster-monitoring). - -These metrics display the hardware utilization for a Kubernetes workload. 
You can also view metrics for [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) and so on. - -To view the pod metrics, navigate into the pod view and click on **Pod Metrics**. You can also view the container metrics by navigating to **Container Metrics** option - -[_Get expressions for Workload Metrics_]({{< baseurl >}}/rancher/v2.x/en/tools/monitoring/expression/#workload-metrics) - -## Custom Metrics - ->**Note:** Supported by [enabling project monitoring](#configuring-project-monitoring). - -If you want to scrape the metrics from any [exporters](https://prometheus.io/docs/instrumenting/exporters/), you only need to set up some exposing endpoints on deploying but without configuring the project Prometheus directly. - -Imagine that you have deployed a [Redis](https://redis.io/) app/cluster in the namespace `redis-app` of the project `Datacenter`, and you are going to monitor it via [Redis exporter](https://github.com/oliver006/redis_exporter). By enabling project monitoring, you only need to configure **Custom Metrics** under **Advanced Options** as shown in the GIF below, and set the correct `Container Port`, `Path` and `Protocol`. - -![AddCustomMetrics]({{< baseurl >}}/img/rancher/add-custom-metrics.gif) +After enabling project monitoring, you can edit the application to configure the **Advanced Options -> Custom Metrics** section. Enter the `Container Port` and `Path` and select the `Protocol`. 
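+
+For instance, the Redis deployment in the example above might expose its metrics through a [Redis exporter](https://github.com/oliver006/redis_exporter) sidecar. A minimal sketch, not a definitive manifest (the image tags and names are illustrative; port `9121` and path `/metrics` are the exporter's documented defaults):
+
+```yaml
+# Hypothetical Deployment: Redis plus a metrics-exporter sidecar.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: redis
+  namespace: redis-app
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: redis
+  template:
+    metadata:
+      labels:
+        app: redis
+    spec:
+      containers:
+        - name: redis
+          image: redis:5
+          ports:
+            - containerPort: 6379
+        - name: redis-exporter
+          image: oliver006/redis_exporter:v1.0.3
+          ports:
+            - containerPort: 9121  # Prometheus metrics served at /metrics
+```
+
+With a manifest like this, you would enter `9121` as the `Container Port`, `/metrics` as the `Path`, and select `HTTP` as the `Protocol`.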
diff --git a/content/rancher/v2.x/en/project-admin/tools/monitoring/expression/_index.md b/content/rancher/v2.x/en/project-admin/tools/monitoring/expression/_index.md deleted file mode 100644 index 1cccdf62887..00000000000 --- a/content/rancher/v2.x/en/project-admin/tools/monitoring/expression/_index.md +++ /dev/null @@ -1,379 +0,0 @@ ---- -title: Expression -weight: 10000 ---- - -## In This Document - - - -- [Cluster Metrics](#cluster-metrics) - - + [Node Metrics](#node-metrics) -- [Etcd Metrics](#etcd-metrics) -- [Kubernetes Components Metrics](#kubernetes-components-metrics) -- [Rancher Logging Metrics](#rancher-logging-metrics) -- [Workload Metrics](#workload-metrics) - - + [Pod Metrics](#pod-metrics) - + [Container Metrics](#container-metrics) - - - -## Cluster Metrics - -- **CPU Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))` | - | Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])))` | - -- **Load Average** - - | Catalog | Expression | - | --- | --- | - | Detail |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance)`
| - | Summary |
load1`sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load5`sum(node_load5) by (instance) / count(node_cpu_seconds_total{mode="system"})`
load15`sum(node_load15) by (instance) / count(node_cpu_seconds_total{mode="system"})`
| - -- **Memory Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `1 - sum(node_memory_MemAvailable_bytes) by (instance) / sum(node_memory_MemTotal_bytes) by (instance)` | - | Summary | `1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)` | - -- **Disk Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance) - sum(node_filesystem_free_bytes{device!="rootfs"}) by (instance)) / sum(node_filesystem_size_bytes{device!="rootfs"}) by (instance)` | - | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs"}) - sum(node_filesystem_free_bytes{device!="rootfs"})) / sum(node_filesystem_size_bytes{device!="rootfs"})` | - -- **Disk I/O** - - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(node_disk_read_bytes_total[5m])) by (instance)`
written`sum(rate(node_disk_written_bytes_total[5m])) by (instance)`
| - | Summary |
read`sum(rate(node_disk_read_bytes_total[5m]))`
written`sum(rate(node_disk_written_bytes_total[5m]))`
| - -- **Network Packets** - - | Catalog | Expression | - | --- | --- | - | Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| - | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| - -- **Network I/O** - - | Catalog | Expression | - | --- | --- | - | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m])) by (instance)
| - | Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*"}[5m]))
| - -### Node Metrics - -- **CPU Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `avg(irate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m])) by (mode)` | - | Summary | `1 - (avg(irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))` | - -- **Load Average** - - | Catalog | Expression | - | --- | --- | - | Detail |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| - | Summary |
load1`sum(node_load1{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load5`sum(node_load5{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
load15`sum(node_load15{instance=~"$instance"}) / count(node_cpu_seconds_total{mode="system",instance=~"$instance"})`
| - -- **Memory Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"})` | - | Summary | `1 - sum(node_memory_MemAvailable_bytes{instance=~"$instance"}) / sum(node_memory_MemTotal_bytes{instance=~"$instance"}) ` | - -- **Disk Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"}) by (device)) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) by (device)` | - | Summary | `(sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"}) - sum(node_filesystem_free_bytes{device!="rootfs",instance=~"$instance"})) / sum(node_filesystem_size_bytes{device!="rootfs",instance=~"$instance"})` | - -- **Disk I/O** - - | Catalog | Expression | - | --- | --- | - | Detail |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| - | Summary |
read`sum(rate(node_disk_read_bytes_total{instance=~"$instance"}[5m]))`
written`sum(rate(node_disk_written_bytes_total{instance=~"$instance"}[5m]))`
| - -- **Network Packets** - - | Catalog | Expression | - | --- | --- | - | Detail |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| - | Summary |
receive-droppedsum(rate(node_network_receive_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-errssum(rate(node_network_receive_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
receive-packetssum(rate(node_network_receive_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-droppedsum(rate(node_network_transmit_drop_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-errssum(rate(node_network_transmit_errs_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmit-packetssum(rate(node_network_transmit_packets_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| - -- **Network I/O** - - | Catalog | Expression | - | --- | --- | - | Detail |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m])) by (device)
| - | Summary |
receivesum(rate(node_network_receive_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
transmitsum(rate(node_network_transmit_bytes_total{device!~"lo | veth.* | docker.* | flannel.* | cali.* | cbr.*",instance=~"$instance"}[5m]))
| - -## Etcd Metrics - -- **Etcd has a leader** - - `max(etcd_server_has_leader)` - -- **Number of leader changes** - - `max(etcd_server_leader_changes_seen_total)` - -- **Number of failed proposals** - - `sum(etcd_server_proposals_failed_total)` - -- **GRPC Client Traffic** - - | Catalog | Expression | - | --- | --- | - | Detail |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m])) by (instance)`
| - | Summary |
in`sum(rate(etcd_network_client_grpc_received_bytes_total[5m]))`
out`sum(rate(etcd_network_client_grpc_sent_bytes_total[5m]))`
| - -- **Peer Traffic** - - | Catalog | Expression | - | --- | --- | - | Detail |
in`sum(rate(etcd_network_peer_received_bytes_total[5m])) by (instance)`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m])) by (instance)`
| - | Summary |
in`sum(rate(etcd_network_peer_received_bytes_total[5m]))`
out`sum(rate(etcd_network_peer_sent_bytes_total[5m]))`
| - -- **DB Size** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(etcd_debugging_mvcc_db_total_size_in_bytes) by (instance)` | - | Summary | `sum(etcd_debugging_mvcc_db_total_size_in_bytes)` | - -- **Active Streams** - - | Catalog | Expression | - | --- | --- | - | Detail |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) by (instance)`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) by (instance)`
| - | Summary |
lease-watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})`
watch`sum(grpc_server_started_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})`
| - -- **Raft Proposals** - - | Catalog | Expression | - | --- | --- | - | Detail |
applied`sum(increase(etcd_server_proposals_applied_total[5m])) by (instance)`
committed`sum(increase(etcd_server_proposals_committed_total[5m])) by (instance)`
pending`sum(increase(etcd_server_proposals_pending[5m])) by (instance)`
failed`sum(increase(etcd_server_proposals_failed_total[5m])) by (instance)`
| - | Summary |
applied`sum(increase(etcd_server_proposals_applied_total[5m]))`
committed`sum(increase(etcd_server_proposals_committed_total[5m]))`
pending`sum(increase(etcd_server_proposals_pending[5m]))`
failed`sum(increase(etcd_server_proposals_failed_total[5m]))`
| - -- **RPC Rate** - - | Catalog | Expression | - | --- | --- | - | Detail |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m])) by (instance)`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m])) by (instance)`
| - | Summary |
total`sum(rate(grpc_server_started_total{grpc_type="unary"}[5m]))`
fail`sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK"}[5m]))`
| - -- **Disk Operations** - - | Catalog | Expression | - | --- | --- | - | Detail |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m])) by (instance)`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m])) by (instance)`
| - | Summary |
commit-called-by-backend`sum(rate(etcd_disk_backend_commit_duration_seconds_sum[1m]))`
fsync-called-by-wal`sum(rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]))`
| - -- **Disk Sync Duration** - - | Catalog | Expression | - | --- | --- | - | Detail |
wal`histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))`
db`histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))`
| - | Summary |
wal`sum(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)))`
db`sum(histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)))`
| - -## Kubernetes Components Metrics - -- **API Server Request Latency** - - | Catalog | Expression | - | --- | --- | - | Detail | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance, verb) /1e+06` | - | Summary | `avg(apiserver_request_latencies_sum / apiserver_request_latencies_count) by (instance) /1e+06` | - -- **API Server Request Rate** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(apiserver_request_count[5m])) by (instance, code)` | - | Summary | `sum(rate(apiserver_request_count[5m])) by (instance)` | - -- **Scheduling Failed Pods** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(kube_pod_status_scheduled{condition="false"})` | - | Summary | `sum(kube_pod_status_scheduled{condition="false"})` | - -- **Controller Manager Queue Depth** - - | Catalog | Expression | - | --- | --- | - | Detail |
volumes`sum(volumes_depth) by instance`
deployment`sum(deployment_depth) by instance`
replicaset`sum(replicaset_depth) by instance`
service`sum(service_depth) by instance`
serviceaccount`sum(serviceaccount_depth) by instance`
endpoint`sum(endpoint_depth) by instance`
daemonset`sum(daemonset_depth) by instance`
statefulset`sum(statefulset_depth) by instance`
replicationmanager`sum(replicationmanager_depth) by instance`
| - | Summary |
volumes`sum(volumes_depth)`
deployment`sum(deployment_depth)`
replicaset`sum(replicaset_depth)`
service`sum(service_depth)`
serviceaccount`sum(serviceaccount_depth)`
endpoint`sum(endpoint_depth)`
daemonset`sum(daemonset_depth)`
statefulset`sum(statefulset_depth)`
replicationmanager`sum(replicationmanager_depth)`
| - -- **Scheduler E2E Scheduling Latency** - - | Catalog | Expression | - | --- | --- | - | Detail | `histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06` | - | Summary | `sum(histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket) by (le, instance)) / 1e+06)` | - -- **Scheduler Preemption Attempts** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(scheduler_total_preemption_attempts[5m])) by (instance)` | - | Summary | `sum(rate(scheduler_total_preemption_attempts[5m]))` | - -- **Ingress Controller Connections** - - | Catalog | Expression | - | --- | --- | - | Detail |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"}) by (instance)`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"}) by (instance)`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"}) by (instance)`
accpeted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m]))) by (instance)`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m]))) by (instance)`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m]))) by (instance)`
| - | Summary |
reading`sum(nginx_ingress_controller_nginx_process_connections{state="reading"})`
waiting`sum(nginx_ingress_controller_nginx_process_connections{state="waiting"})`
writing`sum(nginx_ingress_controller_nginx_process_connections{state="writing"})`
accpeted`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="accepted"}[5m])))`
active`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="active"}[5m])))`
handled`sum(ceil(increase(nginx_ingress_controller_nginx_process_connections_total{state="handled"}[5m])))`
| - -- **Ingress Controller Request Process Time** - - | Catalog | Expression | - | --- | --- | - | Detail | `topk(10, histogram_quantile(0.95,sum by (le, host, path)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | - | Summary | `topk(10, histogram_quantile(0.95,sum by (le, host)(rate(nginx_ingress_controller_request_duration_seconds_bucket{host!="_"}[5m]))))` | - -## Rancher Logging Metrics - -- **Fluentd Buffer Queue Rate** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_buffer_queue_length[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_output_status_buffer_queue_length[5m]))` | - -- **Fluentd Input Rate** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_input_status_num_records_total[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_input_status_num_records_total[5m]))` | - -- **Fluentd Output Errors Rate** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_num_errors[5m])) by (type)` | - | Summary | `sum(rate(fluentd_output_status_num_errors[5m]))` | - -- **Fluentd Output Rate** - - | Catalog | Expression | - | --- | --- | - | Detail | `sum(rate(fluentd_output_status_num_records_total[5m])) by (instance)` | - | Summary | `sum(rate(fluentd_output_status_num_records_total[5m]))` | - -## Workload Metrics - -- **CPU Utilization** - - | Catalog | Expression | - | --- | --- | - | Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
|
| Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
|

- **Memory Utilization**

| Catalog | Expression |
| --- | --- |
| Detail | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""}) by (pod_name)` |
| Summary | `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name=~"$podName", container_name!=""})` |

- **Network Packets**

| Catalog | Expression |
| --- | --- |
| Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
|
| Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
|

- **Network I/O**

| Catalog | Expression |
| --- | --- |
| Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
|
| Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
|

- **Disk I/O**

| Catalog | Expression |
| --- | --- |
| Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) by (pod_name)`
|
| Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m]))`
|

### Pod Metrics

- **CPU Utilization**

| Catalog | Expression |
| --- | --- |
| Detail |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m])) by (container_name)`
|
| Summary |
cfs throttled seconds`sum(rate(container_cpu_cfs_throttled_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
usage seconds`sum(rate(container_cpu_usage_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
system seconds`sum(rate(container_cpu_system_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
user seconds`sum(rate(container_cpu_user_seconds_total{container_name!="POD",namespace="$namespace",pod_name="$podName", container_name!=""}[5m]))`
|

- **Memory Utilization**

| Catalog | Expression |
| --- | --- |
| Detail | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""}) by (container_name)` |
| Summary | `sum(container_memory_working_set_bytes{container_name!="POD",namespace="$namespace",pod_name="$podName",container_name!=""})` |

- **Network Packets**

| Catalog | Expression |
| --- | --- |
| Detail |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
|
| Summary |
receive-packets`sum(rate(container_network_receive_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-dropped`sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
receive-errors`sum(rate(container_network_receive_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-packets`sum(rate(container_network_transmit_packets_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-dropped`sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit-errors`sum(rate(container_network_transmit_errors_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
|

- **Network I/O**

| Catalog | Expression |
| --- | --- |
| Detail |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
|
| Summary |
receive`sum(rate(container_network_receive_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
transmit`sum(rate(container_network_transmit_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
|

- **Disk I/O**

| Catalog | Expression |
| --- | --- |
| Detail |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m])) by (container_name)`
|
| Summary |
read`sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
write`sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name!=""}[5m]))`
|

### Container Metrics

- **CPU Utilization**

| Catalog | Expression |
| --- | --- |
| cfs throttled seconds | `sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| usage seconds | `sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| system seconds | `sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| user seconds | `sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |

- **Memory Utilization**

  `sum(container_memory_working_set_bytes{namespace="$namespace",pod_name="$podName",container_name="$containerName"})`

- **Disk I/O**

| Catalog | Expression |
| --- | --- |
| read | `sum(rate(container_fs_reads_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
| write | `sum(rate(container_fs_writes_bytes_total{namespace="$namespace",pod_name="$podName",container_name="$containerName"}[5m]))` |
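The expressions in these tables are templates: `$namespace`, `$podName`, and `$containerName` are Grafana-style variables that the Rancher dashboards fill in before querying. If you want to run one of them yourself against the Prometheus HTTP API (the standard instant-query endpoint is `/api/v1/query`), you need to substitute concrete values first. A minimal sketch; the Prometheus host below and the helper name are placeholders, not part of Rancher:

```python
# Sketch: resolve the dashboard template variables in a workload CPU
# expression, then build a Prometheus instant-query URL for it.
from string import Template
from urllib.parse import urlencode

# Workload Metrics > CPU Utilization > Detail (usage seconds), verbatim
# from the table above. string.Template uses the same $var syntax as
# the dashboards, so it can fill the placeholders directly.
EXPR = Template(
    'sum(rate(container_cpu_usage_seconds_total'
    '{namespace="$namespace",pod_name=~"$podName",container_name!=""}[5m])) '
    'by (pod_name)'
)

def workload_cpu_query(namespace: str, pod_pattern: str) -> str:
    """Return the PromQL string with $namespace and $podName resolved."""
    return EXPR.substitute(namespace=namespace, podName=pod_pattern)

query = workload_cpu_query("default", "nginx-.*")
# An instant query is then a GET against /api/v1/query with the
# expression URL-encoded in the `query` parameter:
url = "http://prometheus.example:9090/api/v1/query?" + urlencode({"query": query})
```

Note that `pod_name=~"$podName"` is a regex match, so `pod_pattern` can select a whole deployment's pods (e.g. `nginx-.*`) rather than a single pod.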