Merge pull request #3440 from catherineluse/staging

Enhance monitoring docs
Catherine Luse
2021-08-12 11:53:54 -07:00
committed by GitHub
28 changed files with 993 additions and 360 deletions
@@ -9,125 +9,90 @@ aliases:
- /rancher/v2.5/en/cluster-admin/tools/monitoring/
---
Using Rancher, you can quickly deploy leading open-source monitoring alerting solutions onto your cluster.
Using the `rancher-monitoring` application, you can quickly deploy leading open-source monitoring and alerting solutions onto your cluster.
The `rancher-monitoring` operator, introduced in Rancher v2.5, is powered by [Prometheus](https://prometheus.io/), [Grafana](https://grafana.com/grafana/), [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), and the [Prometheus adapter.](https://github.com/DirectXMan12/k8s-prometheus-adapter) This page describes how to enable monitoring and alerting within a cluster using the new monitoring application.
Rancher's solution allows users to:
- Monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments via Prometheus, a leading open-source monitoring solution.
- Define alerts based on metrics collected via Prometheus
- Create custom dashboards to make it easy to visualize collected metrics via Grafana
- Configure alert-based notifications via Email, Slack, PagerDuty, etc. using Prometheus Alertmanager
- Define precomputed, frequently needed, or computationally expensive expressions as new time series based on metrics collected via Prometheus (only available in 2.5)
- Expose collected metrics from Prometheus to the Kubernetes Custom Metrics API via Prometheus Adapter for use in HPA (only available in 2.5)
More information about the resources that get deployed onto your cluster to support this solution can be found in the [`rancher-monitoring`](https://github.com/rancher/charts/tree/main/charts/rancher-monitoring) Helm chart, which closely tracks the upstream [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart maintained by the Prometheus community with certain changes tracked in the [CHANGELOG.md](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/CHANGELOG.md).
> If you previously enabled Monitoring, Alerting, or Notifiers in Rancher before v2.5, there is no upgrade path for switching to the new monitoring/alerting solution. You will need to disable monitoring/alerting/notifiers in Cluster Manager before deploying the new monitoring solution via Cluster Explorer.
For more information about upgrading the Monitoring app in Rancher 2.5, please refer to the [migration docs](./migrating).
- [About Prometheus](#about-prometheus)
- [Enable Monitoring](#enable-monitoring)
- [Default Alerts, Targets, and Grafana Dashboards](#default-alerts-targets-and-grafana-dashboards)
- [Features](#features)
- [How Monitoring Works](#how-monitoring-works)
- [Default Components and Deployments](#default-components-and-deployments)
- [Role-based Access Control](#role-based-access-control)
- [Guides](#guides)
- [Windows Cluster Support](#windows-cluster-support)
- [Using Monitoring](#using-monitoring)
- [Grafana UI](#grafana-ui)
- [Prometheus UI](#prometheus-ui)
- [Viewing the Prometheus Targets](#viewing-the-prometheus-targets)
- [Viewing the PrometheusRules](#viewing-the-prometheusrules)
- [Viewing Active Alerts in Alertmanager](#viewing-active-alerts-in-alertmanager)
- [Uninstall Monitoring](#uninstall-monitoring)
- [Setting Resource Limits and Requests](#setting-resource-limits-and-requests)
- [Known Issues](#known-issues)
# About Prometheus
### Features
Prometheus provides a time series of your data, which is, according to the [Prometheus documentation:](https://prometheus.io/docs/concepts/data_model/)
Prometheus lets you view metrics from your Rancher and Kubernetes objects. Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or Grafana, which is an analytics viewing platform deployed along with Prometheus.
> A stream of timestamped values belonging to the same metric and the same set of labeled dimensions, along with comprehensive statistics and metrics of the monitored cluster.
By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, or restore crashed servers.
In other words, Prometheus lets you view metrics from your different Rancher and Kubernetes objects. Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or Grafana, which is an analytics viewing platform deployed along with Prometheus.
The `rancher-monitoring` operator, introduced in Rancher v2.5, is powered by [Prometheus](https://prometheus.io/), [Grafana](https://grafana.com/grafana/), [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), and the [Prometheus adapter.](https://github.com/DirectXMan12/k8s-prometheus-adapter)
By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, restore crashed servers, etc.
The monitoring application allows you to:
# Enable Monitoring
- Monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments
- Define alerts based on metrics collected via Prometheus
- Create custom Grafana dashboards
- Configure alert-based notifications via Email, Slack, PagerDuty, etc. using Prometheus Alertmanager
- Define precomputed, frequently needed, or computationally expensive expressions as new time series based on metrics collected via Prometheus
- Expose collected metrics from Prometheus to the Kubernetes Custom Metrics API via Prometheus Adapter for use in HPA
As an [administrator]({{<baseurl>}}/rancher/v2.5/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{<baseurl>}}/rancher/v2.5/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to deploy Prometheus to monitor your Kubernetes cluster.
# How Monitoring Works
> **Requirements:**
>
> - Make sure that you are allowing traffic on port 9796 for each of your nodes because Prometheus will scrape metrics from here.
> - Make sure your cluster fulfills the resource requirements. The cluster should have at least 1950Mi memory available, 2700m CPU, and 50Gi storage. A breakdown of the resource limits and requests is [here.](#setting-resource-limits-and-requests)
> - When installing monitoring on an RKE cluster using RancherOS or Flatcar Linux nodes, change the etcd node certificate directory to `/opt/rke/etc/kubernetes/ssl`.
For an explanation of how the monitoring components work together, see [this page.](./how-monitoring-works)
{{% tabs %}}
{{% tab "Rancher v2.5.8+" %}}
# Default Components and Deployments
### Enable Monitoring for use without SSL
### Built-in Dashboards
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration)
1. Scroll to the bottom of the Helm chart README and click **Install.**
By default, the monitoring application deploys Grafana dashboards (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) onto a cluster.
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
It also deploys an Alertmanager UI and a Prometheus UI. For more information about these tools, see [Built-in Dashboards.](./dashboards)
### Default Metrics Exporters
### Enable Monitoring for use with SSL
By default, Rancher Monitoring deploys exporters (such as [node-exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)).
1. Follow the steps on [this page]({{<baseurl>}}/rancher/v2.5/en/k8s-in-rancher/secrets/) to create a secret in order for SSL to be used for alerts.
- The secret should be created in the `cattle-monitoring-system` namespace. If it doesn't exist, create it first.
- Add the `ca`, `cert`, and `key` files to the secret.
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Click **Alerting**.
1. Click **Additional Secrets** and add the secrets created earlier.
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
These default exporters automatically scrape metrics for CPU and memory from all components of your Kubernetes cluster, including your workloads.
When [creating a receiver,]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/configuration/alertmanager/#creating-receivers-in-the-rancher-ui) SSL-enabled receivers such as email or webhook will have an **SSL** section with fields for **CA File Path**, **Cert File Path**, and **Key File Path**. Fill in these fields with the paths to each of `ca`, `cert`, and `key`. The path will be of the form `/etc/alertmanager/secrets/name-of-file-in-secret`.
### Default Alerts
For example, if you created a secret with these key-value pairs:
The monitoring application deploys some alerts by default. To see the default alerts, go to the [Alertmanager UI](./dashboard/accessing-the-alertmanager-ui) and click **Expand all groups.**
```yaml
ca.crt=`base64-content`
cert.pem=`base64-content`
key.pfx=`base64-content`
```
### Components Exposed in the Rancher UI
Then **Cert File Path** would be set to `/etc/alertmanager/secrets/cert.pem`.
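As a rough sketch, a Secret holding those three files might look like the manifest below. The Secret name `alertmanager-tls` and the placeholder values are hypothetical; only the namespace and the key names come from the example above.

```yaml
# Hypothetical Secret for SSL-enabled receivers.
# The name and the placeholder values are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-tls                 # assumed name; choose your own
  namespace: cattle-monitoring-system    # namespace named in the steps above
type: Opaque
data:
  # Values under `data` must be base64-encoded file contents.
  ca.crt: <base64-encoded CA certificate>
  cert.pem: <base64-encoded certificate>
  key.pfx: <base64-encoded key>
```

With this Secret added under **Additional Secrets**, the path form described above would give `/etc/alertmanager/secrets/ca.crt` for **CA File Path** and `/etc/alertmanager/secrets/key.pfx` for **Key File Path**.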
For a list of monitoring components exposed in the Rancher UI, along with common use cases for editing them, see [this section.](./how-monitoring-works/#components-exposed-in-the-rancher-ui)
{{% /tab %}}
{{% tab "Rancher before v2.5.8" %}}
# Role-based Access Control
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration)
1. Scroll to the bottom of the Helm chart README and click **Install.**
For information on configuring access to monitoring, see [this page.](./rbac)
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
# Guides
{{% /tab %}}
- [Enable monitoring](./guides/enable-monitoring)
- [Uninstall monitoring](./guides/uninstall)
- [Monitoring Rancher apps](./guides/monitoring-rancher-apps)
- [Monitoring workloads](./guides/monitoring-workloads)
- [Customizing Grafana dashboards](./guides/customize-grafana)
- [Persistent Grafana dashboards](./guides/persist-grafana)
- [Setting up metrics for horizontal pod autoscaling](./guides/hpa)
- [Debugging high memory usage](./guides/memory-usage)
- [Migrating from Monitoring V1 to V2](./guides/migrating)
{{% /tabs %}}
# Configuration
### Default Alerts, Targets, and Grafana Dashboards
### Configuring Monitoring Resources in Rancher
By default, Rancher Monitoring deploys exporters (such as [node-exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)) as well as default Prometheus alerts and Grafana dashboards (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) onto a cluster.
> The configuration reference assumes familiarity with how monitoring components work together. For more information, see [How Monitoring Works.](./how-monitoring-works)
To see the default alerts, go to the [Alertmanager UI](#viewing-active-alerts-in-alertmanager) and click **Expand all groups.**
- [ServiceMonitor and PodMonitor](./configuration/servicemonitor-podmonitor)
- [Receiver](./configuration/receiver)
- [Route](./configuration/route)
- [PrometheusRule](./configuration/advanced/prometheusrule)
- [Prometheus](./configuration/advanced/prometheus)
- [Alertmanager](./configuration/advanced/alertmanager)
To see what services you are monitoring, you will need to see your targets. To view the default targets, refer to [Viewing the Prometheus Targets.](#viewing-the-prometheus-targets)
### Configuring Helm Chart Options
To see the default dashboards, go to the [Grafana UI.](#grafana-ui) In the left navigation bar, click the icon with four boxes and click **Manage.**
### Next Steps
To configure Prometheus resources from the Rancher UI, click **Apps & Marketplace > Monitoring** in the upper left corner.
For more information on `rancher-monitoring` chart options, including options to set resource limits and requests, see [this page.](./configuration/helm-chart-options)
# Windows Cluster Support
@@ -139,102 +104,10 @@ To be able to fully deploy Monitoring V2 for Windows, all of your Windows hosts
For more details on how to upgrade wins on existing Windows hosts, refer to the section on [Windows cluster support for Monitoring V2.](./windows-clusters)
# Using Monitoring
Installing `rancher-monitoring` makes the following dashboards available from the Rancher UI.
> **Note:** If you want to set up Alertmanager, Grafana or Ingress, it has to be done with the settings on the Helm chart deployment. It's problematic to create Ingress outside the deployment.
### Grafana UI
[Grafana](https://grafana.com/grafana/) allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture.
Rancher allows any users who are authenticated by Kubernetes and have access to the Grafana service deployed by the Rancher Monitoring chart to access Grafana via the Rancher Dashboard UI. By default, all users who are able to access Grafana are given the [Viewer](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#viewer-role) role, which allows them to view any of the default dashboards deployed by Rancher.
However, users can choose to log in to Grafana as an [Admin](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#admin-role) if necessary. The default Admin username and password for the Grafana instance will be `admin`/`prom-operator`, but alternative credentials can also be supplied on deploying or upgrading the chart.
> **Persistent Dashboards:** To allow the Grafana dashboard to persist after it restarts, add the dashboard configuration JSON into a ConfigMap. ConfigMaps also allow the dashboards to be deployed with a GitOps or CD based approach. This allows the dashboard to be put under version control. For details, refer to [this section.](./persist-grafana)
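A minimal sketch of such a ConfigMap is shown below. It assumes that the Grafana sidecar deployed by the chart picks up ConfigMaps labeled `grafana_dashboard: "1"` in the `cattle-dashboards` namespace; the ConfigMap name, the file name, and the dashboard JSON are hypothetical placeholders, so check the linked section for the exact conventions of your chart version.

```yaml
# Hypothetical ConfigMap that carries one Grafana dashboard as JSON.
# The label and namespace are assumptions about the chart's sidecar settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard            # assumed name
  namespace: cattle-dashboards       # assumed dashboard namespace
  labels:
    grafana_dashboard: "1"           # assumed label watched by the sidecar
data:
  example-dashboard.json: |-
    {
      "title": "Example Dashboard",
      "panels": []
    }
```

Because the dashboard lives in a manifest, it can be committed to Git and applied by the same pipeline that manages the rest of the cluster configuration.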
To see the Grafana UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Grafana.**
<figcaption>Cluster Compute Resources Dashboard in Grafana</figcaption>
![Cluster Compute Resources Dashboard in Grafana]({{<baseurl>}}/img/rancher/cluster-compute-resources-dashboard.png)
<figcaption>Default Dashboards in Grafana</figcaption>
![Default Dashboards in Grafana]({{<baseurl>}}/img/rancher/grafana-default-dashboard.png)
### Prometheus UI
To see the Prometheus UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Graph.**
<figcaption>Prometheus Graph UI</figcaption>
![Prometheus Graph UI]({{<baseurl>}}/img/rancher/prometheus-graph-ui.png)
### Viewing the Prometheus Targets
To see the Prometheus Targets, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Targets.**
<figcaption>Targets in the Prometheus UI</figcaption>
![Prometheus Targets UI]({{<baseurl>}}/img/rancher/prometheus-targets-ui.png)
### Viewing the PrometheusRules
To see the PrometheusRules, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Rules.**
<figcaption>Rules in the Prometheus UI</figcaption>
![PrometheusRules UI]({{<baseurl>}}/img/rancher/prometheus-rules-ui.png)
For more information on PrometheusRules in Rancher, see [this page.](./configuration/prometheusrules)
### Viewing Active Alerts in Alertmanager
When `rancher-monitoring` is installed, the Prometheus Alertmanager UI is deployed.
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
In the Alertmanager UI, you can view your alerts and the current Alertmanager configuration.
To see the Alertmanager UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Alertmanager.**
**Result:** The Alertmanager UI opens in a new tab. For help with configuration, refer to the [official Alertmanager documentation.](https://prometheus.io/docs/alerting/latest/alertmanager/)
For more information on configuring Alertmanager in Rancher, see [this page.](./configuration/alertmanager)
<figcaption>The Alertmanager UI</figcaption>
![Alertmanager UI]({{<baseurl>}}/img/rancher/alertmanager-ui.png)
# Uninstall Monitoring
1. From the **Cluster Explorer,** click **Apps & Marketplace.**
1. Click **Installed Apps.**
1. Go to the `cattle-monitoring-system` namespace and check the boxes for `rancher-monitoring-crd` and `rancher-monitoring`.
1. Click **Delete.**
1. Confirm **Delete.**
**Result:** `rancher-monitoring` is uninstalled.
> **Note on Persistent Grafana Dashboards:** For users who are using Monitoring V2 v9.4.203 or below, uninstalling the Monitoring chart will delete the cattle-dashboards namespace, which will delete all persisted dashboards, unless the namespace is marked with the annotation `helm.sh/resource-policy: "keep"`. This annotation is added by default in Monitoring V2 v14.5.100+ but can be manually applied on the cattle-dashboards namespace before an uninstall if an older version of the Monitoring chart is currently installed onto your cluster.
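If you are on an older chart version, the annotation could be applied with a manifest along these lines before uninstalling. This is a sketch that assumes the namespace already exists and that applying the manifest only adds the annotation.

```yaml
# Sketch: mark the cattle-dashboards namespace so that Helm keeps it on uninstall.
apiVersion: v1
kind: Namespace
metadata:
  name: cattle-dashboards
  annotations:
    helm.sh/resource-policy: "keep"   # annotation named in the note above
```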
# Setting Resource Limits and Requests
The resource requests and limits can be configured when installing `rancher-monitoring`.
The default values are in the [values.yaml](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/values.yaml) in the `rancher-monitoring` Helm chart.
The default values in the table below are the minimum required resource limits and requests.
| Resource Name | Memory Limit | CPU Limit | Memory Request | CPU Request |
| ------------- | ------------ | ----------- | ---------------- | ------------------ |
| alertmanager | 500Mi | 1000m | 100Mi | 100m |
| grafana | 200Mi | 200m | 100Mi | 100m |
| kube-state-metrics subchart | 200Mi | 100m | 130Mi | 100m |
| prometheus-node-exporter subchart | 50Mi | 200m | 30Mi | 100m |
| prometheusOperator | 500Mi | 200m | 100Mi | 100m |
| prometheus | 2500Mi | 1000m | 1750Mi | 750m |
| **Total** | **3950Mi** | **2700m** | **2210Mi** | **1250m** |
At least 50Gi storage is recommended.
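As an illustration, the values from the table could be supplied as chart options in YAML form. The key paths below are assumptions based on the layout of the upstream kube-prometheus-stack chart; confirm them against the `values.yaml` linked above before using them.

```yaml
# Sketch of Helm values that set the table's defaults for two components.
# Key paths are assumed from the upstream chart layout, not taken from this page.
alertmanager:
  alertmanagerSpec:
    resources:
      limits:
        memory: 500Mi
        cpu: 1000m
      requests:
        memory: 100Mi
        cpu: 100m
grafana:
  resources:
    limits:
      memory: 200Mi
      cpu: 200m
    requests:
      memory: 100Mi
      cpu: 100m
```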
# Known Issues
There is a [known issue](https://github.com/rancher/rancher/issues/28787#issuecomment-693611821) that K3s clusters require more default memory. If you are enabling monitoring on a K3s cluster, we recommend setting `prometheus.prometheusSpec.resources.limits.memory` to `2500Mi` and `prometheus.prometheusSpec.resources.requests.memory` to `1750Mi`.
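Expressed as Helm values, the recommendation might look like the sketch below. It assumes that the chart follows the standard Kubernetes `resources` layout (`limits` and `requests`, each with a `memory` key) under `prometheus.prometheusSpec`.

```yaml
# Sketch: memory settings recommended above for K3s clusters.
# Assumes the standard Kubernetes resources layout in the chart values.
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 2500Mi
      requests:
        memory: 1750Mi
```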
For tips on debugging high memory usage, see [this page.](./memory-usage)
@@ -1,96 +1,61 @@
---
title: Configuration
weight: 3
weight: 5
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/configuration
---
This page captures some of the most important options for configuring the custom resources for monitoring.
This page captures some of the most important options for configuring Monitoring V2 in the Rancher UI.
For information on configuring custom scrape targets and rules for Prometheus, please refer to the upstream documentation for the [Prometheus Operator.](https://github.com/prometheus-operator/prometheus-operator) Some of the most important custom resources are explained in the Prometheus Operator [design documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md) The Prometheus Operator documentation can also help you set up RBAC, Thanos, or custom configuration.
- [Configuring Prometheus](#configuring-prometheus)
- [Configuring Targets with ServiceMonitors and PodMonitors](#configuring-targets-with-servicemonitors-and-podmonitors)
- [ServiceMonitors](#servicemonitors)
- [PodMonitors](#podmonitors)
- [PrometheusRules](#prometheusrules)
- [Alertmanager Config](#alertmanager-config)
- [Trusted CA for Notifiers](#trusted-ca-for-notifiers)
- [Additional Scrape Configurations](#additional-scrape-configurations)
- [Examples](#examples)
This section assumes that you understand how the Prometheus Operator's custom resources work together. For more information, see [this section.]
# Configuring Prometheus
# Setting Resource Limits and Requests
The primary way that users will be able to customize this feature for specific Monitoring and Alerting use cases is by creating and/or modifying ConfigMaps, Secrets, and Custom Resources pertaining to this deployment.
The resource requests and limits for the monitoring application can be configured when installing `rancher-monitoring`. For more information about the default limits, see [this page.](./resource-limits)
Prometheus Operator introduces a set of [Custom Resource Definitions](https://github.com/prometheus-operator/prometheus-operator#customresourcedefinitions) that allow users to deploy and manage Prometheus and Alertmanager instances by creating and modifying those custom resources on a cluster.
# Prometheus Configuration
Prometheus Operator will automatically update your Prometheus configuration based on the live state of these custom resources.
It is usually not necessary to directly edit the Prometheus custom resource.
There are also certain special types of ConfigMaps/Secrets such as those corresponding to Grafana Dashboards, Grafana Datasources, and Alertmanager Configs that will automatically update your Prometheus configuration via sidecar proxies that observe the live state of those resources within your cluster.
Instead, to configure Prometheus to scrape custom metrics, you will only need to create a new ServiceMonitor or PodMonitor to configure Prometheus to scrape additional metrics.
By default, a set of these resources (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) is deployed onto your cluster as part of installing the Rancher Monitoring Application to set up a basic Monitoring / Alerting stack. For more information on how to configure custom targets, alerts, notifiers, and dashboards after deploying the chart, see below.
# Configuring Targets with ServiceMonitors and PodMonitors
### ServiceMonitor and PodMonitor Configuration
Customizing the scrape configuration used by Prometheus to determine which resources to scrape metrics from will primarily involve creating / modifying the following resources within your cluster:
For details, see [this page.](./)
### ServiceMonitors
### Advanced Prometheus Configuration
This CRD declaratively specifies how groups of Kubernetes services should be monitored. Any Services in your cluster that match the labels located within the ServiceMonitor `selector` field will be monitored based on the `endpoints` specified on the ServiceMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) provided by Prometheus Operator.
Link to how monitoring works for the section about the Prometheus CR.
For more information about how ServiceMonitors work, refer to the [Prometheus Operator documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md)
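The sketch below shows what a minimal ServiceMonitor might look like. The resource name, namespace, `app: example-app` label, and `metrics` port name are hypothetical; the `selector` and `endpoints` fields are the ones described above.

```yaml
# Hypothetical ServiceMonitor: scrape every Service labeled app=example-app.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # assumed name
  namespace: default           # assumed namespace
spec:
  selector:
    matchLabels:
      app: example-app         # Services with this label are monitored
  endpoints:
    - port: metrics            # name of the Service port that exposes metrics
      path: /metrics
      interval: 30s
```

Note that `port` refers to the name of a port on the Service, so the Service itself needs a named port for this to match.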
For more information about directly editing the Prometheus custom resource, which may be helpful in advanced use cases, see [this page.](./advanced/prometheus)
### PodMonitors
# Alertmanager Configuration
This CRD declaratively specifies how groups of pods should be monitored. Any Pods in your cluster that match the labels located within the PodMonitor `selector` field will be monitored based on the `podMetricsEndpoints` specified on the PodMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitorspec) provided by Prometheus Operator.
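A minimal PodMonitor could be sketched as follows. The names and labels are hypothetical; `selector` and `podMetricsEndpoints` are the fields described above.

```yaml
# Hypothetical PodMonitor: scrape every Pod labeled app=example-app.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app            # assumed name
  namespace: default           # assumed namespace
spec:
  selector:
    matchLabels:
      app: example-app         # Pods with this label are monitored
  podMetricsEndpoints:
    - port: metrics            # name of the container port that exposes metrics
      path: /metrics
      interval: 30s
```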
The Alertmanager custom resource usually doesn't need to be edited directly. For most common use cases, you can manage alerts by updating Routes and Receivers.
# PrometheusRules
Routes and Receivers are part of the configuration of the Alertmanager custom resource. In the Rancher UI, Routes and Receivers are not true custom resources, but pseudo-custom resources that are mapped to sections within the Alertmanager custom resource.
This CRD defines a group of Prometheus alerting and/or recording rules.
When routes and receivers are updated, the monitoring application will automatically update Alertmanager to reflect those changes.
For information on configuring PrometheusRules, refer to [this page.](./prometheusrules)
For some advanced use cases, you may want to configure Alertmanager directly. For more information, refer to [this page.](./advanced/alertmanager)
# Alertmanager Config
For information on configuring the Alertmanager, refer to [this page.](./alertmanager)
# Trusted CA for Notifiers
### Receivers
If you need to add a trusted CA to your notifier, follow these steps:
[link to section of how monitoring works that explains receivers]
1. Create the `cattle-monitoring-system` namespace.
1. Add your trusted CA secret to the `cattle-monitoring-system` namespace.
1. Deploy or upgrade the `rancher-monitoring` Helm chart. In the chart options, reference the secret in **Alerting > Additional Secrets.**
For details on how to configure receivers, see [this page.](./receiver)
### Routes
[link to section of how monitoring works that explains routes]
**Result:** The default Alertmanager custom resource will have access to your trusted CA.
The route needs to refer to a receiver that has already been configured.
# Additional Scrape Configurations
### Advanced
If the scrape configuration you want cannot be specified via a ServiceMonitor or PodMonitor at the moment, you can provide an `additionalScrapeConfigSecret` on deploying or upgrading `rancher-monitoring`.
Link to how monitoring works for the section about the alertmanager CR.
A [scrape_config section](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job.
An example of where this might be used is with Istio. For more information, see [this section.](https://rancher.com/docs/rancher/v2.5/en/istio/v2.5/configuration-reference/selectors-and-scrape)
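For orientation, the contents of such an additional scrape configuration might look like the sketch below: a list of `scrape_config` entries in the upstream Prometheus format. The job name and target address are hypothetical, and how the file is packaged into the Secret depends on the chart options.

```yaml
# Hypothetical additional scrape configuration with a single job.
- job_name: example-static-target          # assumed job name
  metrics_path: /metrics
  scrape_interval: 30s
  static_configs:
    - targets:
        - example-app.default.svc:8080     # assumed host:port to scrape
```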
# Examples
### ServiceMonitor
An example ServiceMonitor custom resource can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml)
### PodMonitor
An example PodMonitor can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/example-app-pod-monitor.yaml) An example Prometheus resource that refers to it can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/prometheus-pod-monitor.yaml)
### PrometheusRule
For users who are familiar with Prometheus, a PrometheusRule contains the alerting and recording rules that you would normally place in a [Prometheus rule file](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/).
For a more fine-grained application of PrometheusRules within your cluster, the ruleSelector field on a Prometheus resource allows you to select which PrometheusRules should be loaded onto Prometheus based on the labels attached to the PrometheusRules resources.
An example PrometheusRule is on [this page.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md)
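As a short illustration, the following PrometheusRule sketch contains one recording rule and one alerting rule. The rule names, the `team: example` label, the `http_requests_total` metric, and the threshold are all hypothetical.

```yaml
# Hypothetical PrometheusRule with one recording rule and one alerting rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules                      # assumed name
  namespace: cattle-monitoring-system
  labels:
    team: example                          # a ruleSelector could match on this label
spec:
  groups:
    - name: example.rules
      rules:
        # Recording rule: precompute a per-job request rate as a new time series.
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        # Alerting rule: fire when the recorded rate stays high for 10 minutes.
        - alert: ExampleHighRequestRate
          expr: job:http_requests:rate5m > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Request rate has been above 100 requests per second for 10 minutes."
```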
### Alertmanager Config
For an example configuration, refer to [this section.](./alertmanager/#example-alertmanager-config)
For more information about directly editing the Alertmanager custom resource, which may be helpful in advanced use cases, see [this page.](./advanced/alertmanager)
@@ -0,0 +1,4 @@
---
title: Advanced Configuration
weight: 5
---
@@ -0,0 +1,40 @@
---
title: Alertmanager Configuration
weight: 1
---
It is usually not necessary to directly edit the Alertmanager custom resource. For most use cases, you will only need to edit the Receivers and Routes to configure notifications.
When Receivers and Routes are updated, the monitoring application will automatically update the Alertmanager custom resource to be consistent with those changes.
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
# About the Alertmanager Custom Resource
By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret.
You may want to edit the Alertmanager custom resource if you would like to take advantage of advanced options that are not exposed in the Rancher UI forms, such as the ability to create a routing tree structure that is more than two levels deep.
It is also possible to create more than one Alertmanager in a cluster, which may be useful if you want to implement namespace-scoped monitoring. In this case, you should manage the Alertmanager custom resources using the same underlying Alertmanager Config Secret.
### Deeply Nested Routes
While the Rancher UI only supports a routing tree that is two levels deep, you can configure more deeply nested routing structures by editing the Alertmanager YAML.
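A deeper routing tree could be sketched in the Alertmanager configuration as shown below. The receiver names and label values are hypothetical, and the receivers are left without notification settings for brevity. The sketch uses the `match` field; newer Alertmanager releases also offer `matchers`.

```yaml
# Hypothetical routing tree that nests more than two levels deep.
route:
  receiver: default-receiver               # level 1: root route
  routes:
    - match:
        team: platform
      receiver: platform-team              # level 2
      routes:
        - match:
            severity: critical
          receiver: platform-pager         # level 3
          routes:
            - match:
                cluster: production
              receiver: platform-pager-prod   # level 4
receivers:
  - name: default-receiver
  - name: platform-team
  - name: platform-pager
  - name: platform-pager-prod
```

An alert is delivered to the deepest matching route, so a critical production alert from the platform team would reach `platform-pager-prod`, while other platform alerts stop at a shallower level.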
### Multiple Alertmanager Replicas
As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster. The replicas can all be managed using the same underlying Alertmanager Config Secret.
This Secret should be updated or modified any time you want to:
- Add in new notifiers or receivers
- Change the alerts that should be sent to specific notifiers or receivers
- Change the group of alerts that are sent out
By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster.
By default, the Alertmanager Config Secret created by Rancher will never be modified or deleted on an upgrade or uninstall of the `rancher-monitoring` chart. This restriction prevents users from losing or overwriting their alerting configuration when executing operations on the chart.
For more information on what fields can be specified in the Alertmanager Config Secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/)
The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file)
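For orientation, an Alertmanager Config Secret might look roughly like the sketch below. The Secret name is an assumption based on the Prometheus Operator convention of `alertmanager-<name of the Alertmanager resource>`; check the Secret that actually exists in your cluster before editing it. The configuration values are illustrative only.

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Assumed name; verify against the Secret deployed in your cluster.
  name: alertmanager-rancher-monitoring-alertmanager
  namespace: cattle-monitoring-system
type: Opaque
stringData:
  # The Prometheus Operator reads the configuration from this key.
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: "null"
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
    receivers:
      - name: "null"   # a receiver with no integrations discards alerts
```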
@@ -0,0 +1,21 @@
---
title: Prometheus Configuration
weight: 1
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/prometheusrules
- /rancher/v2.5/en/monitoring-alerting/configuration/prometheusrules
- /rancher/v2.5/en/monitoring-alerting/configuration/advanced/prometheusrules
---
It is usually not necessary to directly edit the Prometheus custom resource because the monitoring application automatically updates it based on changes to ServiceMonitors and PodMonitors.
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
# About the Prometheus Custom Resource
- When the Prometheus Operator observes the Prometheus custom resource (CR), it creates `prometheus-rancher-monitoring-prometheus`, which is the Prometheus deployment that is created based on the configuration in the Prometheus CR.
- The Prometheus CR is where details are configured such as which Alertmanagers are connected to Prometheus, what the external URLs are, and other settings that Prometheus needs. Rancher builds this CR for you. It has fields for PodMonitor and ServiceMonitor selectors, so technically you can filter them to include only the ones in a certain namespace.
- Monitoring V2 supports only one Prometheus per cluster because project-level monitoring is not supported. However, you might want to edit the Prometheus CR if you want to limit the namespaces that are monitored.
- The Prometheus CR also has the rules and routes in it.
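To make the points above more concrete, here is a hedged sketch of what part of a Prometheus custom resource could look like when the monitors are limited to labeled namespaces. The field names come from the Prometheus Operator API, but the external URL, the Alertmanager Service name, and all label values are assumptions for illustration, not the values Rancher generates.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: rancher-monitoring-prometheus
  namespace: cattle-monitoring-system
spec:
  externalUrl: https://prometheus.example.com   # hypothetical URL
  alerting:
    alertmanagers:
      - name: rancher-monitoring-alertmanager   # assumed Service name
        namespace: cattle-monitoring-system
        port: web                               # assumed port name
  # An empty selector matches every ServiceMonitor/PodMonitor...
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  # ...while the namespace selectors restrict where they are looked up.
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled                       # hypothetical namespace label
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
  # Only PrometheusRules carrying this label are loaded.
  ruleSelector:
    matchLabels:
      release: rancher-monitoring               # hypothetical label
```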
@@ -1,42 +1,12 @@
---
title: PrometheusRules
weight: 2
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/prometheusrules
title: Configuring PrometheusRules
weight: 3
---
A PrometheusRule defines a group of Prometheus alerting and/or recording rules.
- [About PrometheusRule Custom Resources](#about-prometheusrule-custom-resources)
- [Connecting Routes and PrometheusRules](#connecting-routes-and-prometheusrules)
- [Creating PrometheusRules in the Rancher UI](#creating-prometheusrules-in-the-rancher-ui)
- [Configuration](#configuration)
- [Rule Group](#rule-group)
- [Alerting Rules](#alerting-rules)
- [Recording Rules](#recording-rules)
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
### About PrometheusRule Custom Resources
Prometheus rule files are held in PrometheusRule custom resources.
A PrometheusRule allows you to define one or more RuleGroups. Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields:
- The name of the new alert or record
- A PromQL (Prometheus query language) expression for the new alert or record
- Labels that should be attached to the alert or record that identify it (e.g. cluster name or severity)
- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules.
Alerting rules define alert conditions based on PromQL queries. Recording rules precompute frequently needed or computationally expensive queries at defined intervals.
For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec)
Use the label selector field `ruleSelector` in the Prometheus object to define the rule files that you want to be mounted into Prometheus.
For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
### Connecting Routes and PrometheusRules
When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label.
### Creating PrometheusRules in the Rancher UI
@@ -54,6 +24,23 @@ To create rule groups in the Rancher UI,
**Result:** Alerts can be configured to send notifications to the receiver(s).
### About the PrometheusRule Custom Resource
When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label.
Prometheus rule files are held in PrometheusRule custom resources. A PrometheusRule allows you to define one or more RuleGroups. Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields:
- The name of the new alert or record
- A PromQL expression for the new alert or record
- Labels that should be attached to the alert or record that identify it (e.g. cluster name or severity)
- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules.
For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec)
Use the label selector field `ruleSelector` in the Prometheus object to define the rule files that you want to be mounted into Prometheus.
For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
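As a minimal sketch of the structure described above, the following PrometheusRule defines one RuleGroup containing a recording rule and an alerting rule. The metric name `http_requests_total`, the threshold, the resource name, and the `release` label are hypothetical; the metadata label only matters if the `ruleSelector` on your Prometheus object selects on it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules                  # hypothetical name
  namespace: cattle-monitoring-system
  labels:
    release: rancher-monitoring        # assumed to match the ruleSelector
spec:
  groups:
    - name: example.rules
      rules:
        # Recording rule: precomputes an expression as a new time series.
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        # Alerting rule: fires when the condition holds for 10 minutes.
        - alert: HighRequestRate
          expr: job:http_requests:rate5m > 100
          for: 10m
          labels:
            severity: warning
            team: front-end            # label a Route can match on
          annotations:
            summary: "High request rate for job {{ $labels.job }}"
            description: "The 5-minute request rate has been above 100 for 10 minutes."
```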
# Configuration
{{% tabs %}}
@@ -0,0 +1,25 @@
---
title: Examples
weight: 5
---
### ServiceMonitor
An example ServiceMonitor custom resource can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml)
### PodMonitor
An example PodMonitor can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/example-app-pod-monitor.yaml) An example Prometheus resource that refers to it can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/prometheus-pod-monitor.yaml)
### PrometheusRule
For users who are familiar with Prometheus, a PrometheusRule contains the alerting and recording rules that you would normally place in a [Prometheus rule file](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/).
For a more fine-grained application of PrometheusRules within your cluster, the ruleSelector field on a Prometheus resource allows you to select which PrometheusRules should be loaded onto Prometheus based on the labels attached to the PrometheusRules resources.
An example PrometheusRule is on [this page.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md)
### Alertmanager Config
For an example configuration, refer to [this section.](./alertmanager/#example-alertmanager-config)
@@ -0,0 +1,77 @@
---
title: Helm Chart Options
weight: 8
---
- [Configuring Resource Limits and Requests](#configuring-resource-limits-and-requests)
- [Trusted CA for Notifiers](#trusted-ca-for-notifiers)
- [Additional Scrape Configurations](#additional-scrape-configurations)
- [Configuring Applications Packaged within Monitoring V2](#configuring-applications-packaged-within-monitoring-v2)
- [Increase the Replicas of Alertmanager](#increase-the-replicas-of-alertmanager)
- [Configuring the Namespace for a Persistent Grafana Dashboard](#configuring-the-namespace-for-a-persistent-grafana-dashboard)
# Configuring Resource Limits and Requests
The resource requests and limits can be configured when installing `rancher-monitoring`.
The default values are in the [values.yaml](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/values.yaml) in the `rancher-monitoring` Helm chart.
The default values in the table below are the minimum required resource limits and requests.
| Resource Name | Memory Limit | CPU Limit | Memory Request | CPU Request |
| ------------- | ------------ | ----------- | ---------------- | ------------------ |
| alertmanager | 500Mi | 1000m | 100Mi | 100m |
| grafana | 200Mi | 200m | 100Mi | 100m |
| kube-state-metrics subchart | 200Mi | 100m | 130Mi | 100m |
| prometheus-node-exporter subchart | 50Mi | 200m | 30Mi | 100m |
| prometheusOperator | 500Mi | 200m | 100Mi | 100m |
| prometheus | 2500Mi | 1000m | 1750Mi | 750m |
| **Total** | **3950Mi** | **2700m** | **2210Mi** | **1250m** |
At least 50Gi storage is recommended.
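The sketch below shows how some of these values could be expressed as Helm values. The key paths are an assumption based on the layout of the upstream kube-prometheus-stack chart; confirm them against the `values.yaml` of the `rancher-monitoring` version you are installing.

```yaml
# Assumed key paths, following the upstream kube-prometheus-stack layout.
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 2500Mi
        cpu: 1000m
      requests:
        memory: 1750Mi
        cpu: 750m
alertmanager:
  alertmanagerSpec:
    resources:
      limits:
        memory: 500Mi
        cpu: 1000m
      requests:
        memory: 100Mi
        cpu: 100m
grafana:
  resources:
    limits:
      memory: 200Mi
      cpu: 200m
    requests:
      memory: 100Mi
      cpu: 100m
```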
# Trusted CA for Notifiers
If you need to add a trusted CA to your notifier, follow these steps:
1. Create the `cattle-monitoring-system` namespace.
1. Add your trusted CA secret to the `cattle-monitoring-system` namespace.
1. Deploy or upgrade the `rancher-monitoring` Helm chart. In the chart options, reference the secret in **Alerting > Additional Secrets.**
**Result:** The default Alertmanager custom resource will have access to your trusted CA.
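For the second step, a trusted CA Secret could look like the following sketch. The Secret name and the `ca.crt` key are hypothetical choices, and the certificate body is a placeholder that you would replace with your own PEM data.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-trusted-ca                  # hypothetical name
  namespace: cattle-monitoring-system
type: Opaque
stringData:
  # Hypothetical key name; replace the placeholder with your PEM-encoded CA.
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    <PEM-encoded CA certificate>
    -----END CERTIFICATE-----
```

The Prometheus Operator mounts Secrets listed on an Alertmanager resource under `/etc/alertmanager/secrets/<secret-name>/`, so with the names assumed here the certificate would be expected at `/etc/alertmanager/secrets/my-trusted-ca/ca.crt`.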
# Additional Scrape Configurations
If the scrape configuration you want cannot be specified via a ServiceMonitor or PodMonitor at the moment, you can provide an `additionalScrapeConfigSecret` on deploying or upgrading `rancher-monitoring`.
A [scrape_config section](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job.
An example of where this might be used is with Istio. For more information, see [this section.](https://rancher.com/docs/rancher/v2.5/en/istio/v2.5/configuration-reference/selectors-and-scrape)
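As a generic illustration that is not specific to Istio, a single scrape configuration in the Prometheus format might look like this. The job name, target address, and label are hypothetical, and how the configuration is packaged into the Secret is not shown here.

```yaml
# One entry in a list of Prometheus scrape_configs.
- job_name: example-external-app       # hypothetical job name
  metrics_path: /metrics
  scrape_interval: 30s
  static_configs:
    - targets:
        - app.example.com:9100         # hypothetical target
      labels:
        environment: staging           # label added to every scraped series
```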
# Configuring Applications Packaged within Monitoring v2
Rancher deploys kube-state-metrics and node-exporter with Monitoring V2. Node exporter is deployed as a DaemonSet. In the `values.yaml` of the Monitoring V2 Helm chart, each of these components is deployed as a subchart.
Grafana is also deployed, but it is not managed by Prometheus.
The subcharts, such as kube-state-metrics, accept many more values than are exposed in the top-level chart.
However, in the top-level chart you can add values that override the values in a subchart.
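With Helm, values for a subchart are set under a top-level key named after that subchart. The sketch below assumes the kube-state-metrics subchart accepts a `podAnnotations` value; treat both the option and the annotation as illustrative, and check the subchart's own `values.yaml` for the options that actually exist.

```yaml
# Values file for the top-level chart.
# Keys under "kube-state-metrics" are passed through to that subchart.
kube-state-metrics:
  podAnnotations:                      # assumed subchart option
    example.com/owner: platform-team   # hypothetical annotation
```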
### Increase the Replicas of Alertmanager
As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster. The replicas can all be managed using the same underlying Alertmanager Config Secret. For more information on the Alertmanager Config Secret, refer to [this section.](../configuration/advanced/alertmanager/#multiple-alertmanager-replicas)
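As a sketch, the replica count could be set with a value like the one below. The key path is an assumption based on the upstream kube-prometheus-stack chart, where it maps to the `replicas` field of the Alertmanager custom resource.

```yaml
# Assumed key path, following the upstream kube-prometheus-stack layout.
alertmanager:
  alertmanagerSpec:
    replicas: 3
```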
### Configuring the Namespace for a Persistent Grafana Dashboard
To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set this value in the `rancher-monitoring` Helm chart:
```
grafana.sidecar.dashboards.searchNamespace=ALL
```
Note that the RBAC roles exposed by the Monitoring chart to add Grafana Dashboards are still restricted to giving permissions for users to add dashboards in the namespace defined in `grafana.dashboards.namespace`, which defaults to `cattle-dashboards`.
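If you prefer to supply the setting in a values file rather than in dotted form, the same option would be written as nested YAML:

```yaml
# Equivalent of grafana.sidecar.dashboards.searchNamespace=ALL
grafana:
  sidecar:
    dashboards:
      searchNamespace: ALL
```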
@@ -1,17 +1,19 @@
---
title: Alertmanager
title: Receiver Configuration
shortTitle: Receivers
weight: 1
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/alertmanager
- rancher/v2.5/en/monitoring-alerting/legacy/notifiers/
- /rancher/v2.5/en/cluster-admin/tools/notifiers
- /rancher/v2.5/en/cluster-admin/tools/alerts
- /rancher/v2.5/en/monitoring-alerting/configuration/alertmanager
---
The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) Secret contains the configuration of an Alertmanager instance that sends out notifications based on alerts it receives from Prometheus.
- [Overview](#overview)
- [Connecting Routes and PrometheusRules](#connecting-routes-and-prometheusrules)
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
- [Creating Receivers in the Rancher UI](#creating-receivers-in-the-rancher-ui)
- [Receiver Configuration](#receiver-configuration)
- [Slack](#slack)
@@ -26,30 +28,10 @@ The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configurati
- [Receiver](#receiver)
- [Grouping](#grouping)
- [Matching](#matching)
- [Example Alertmanager Configs](#example-alertmanager-configs)
- [Configuring Multiple Receivers](#configuring-multiple-receivers)
- [Example Alertmanager Config](../examples/#example-alertmanager-config)
- [Example Route Config for CIS Scan Alerts](#example-route-config-for-cis-scan-alerts)
# Overview
By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret. As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster that can all be managed using the same underlying Alertmanager Config Secret.
This Secret should be updated or modified any time you want to:
- Add in new notifiers or receivers
- Change the alerts that should be sent to specific notifiers or receivers
- Change the group of alerts that are sent out
> By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster. By default, the Alertmanager Config Secret created by Rancher will never be modified / deleted on an upgrade / uninstall of the `rancher-monitoring` chart to prevent users from losing or overwriting their alerting configuration when executing operations on the chart.
For more information on what fields can be specified in this secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/)
The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file)
For more information, refer to the [official Prometheus documentation about configuring routes.](https://www.prometheus.io/docs/alerting/latest/configuration/#route)
### Connecting Routes and PrometheusRules
When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label.
- [Trusted CA for Notifiers](#trusted-ca-for-notifiers)
# Creating Receivers in the Rancher UI
_Available as of v2.5.4_
@@ -248,7 +230,6 @@ url http://rancher-alerting-drivers-sachet.ns-1.svc:9876/alert
<!-- https://github.com/messagebird/sachet -->
{{% /tab %}}
{{% tab "Rancher v2.5.4-2.5.7" %}}
The following types of receivers can be configured in the Rancher UI:
@@ -330,45 +311,14 @@ The Alertmanager must be configured in YAML, as shown in these [examples.](#exam
{{% /tab %}}
{{% /tabs %}}
# Configuring Multiple Receivers
# Route Configuration
By editing the forms in the Rancher UI, you can set up a Receiver resource with all the information Alertmanager needs to send alerts to your notification system.
{{% tabs %}}
{{% tab "Rancher v2.5.4+" %}}
It is also possible to send alerts to multiple notification systems. One way is to configure the Receiver using custom YAML, in which case you can add the configuration for multiple notification systems, as long as you are sure that both systems should receive the same messages.
### Receiver
The route needs to refer to a [receiver](#receiver-configuration) that has already been configured.
You can also set up multiple receivers by using the `continue` option for a route, so that the alerts sent to a receiver continue being evaluated in the next level of the routing tree, which could contain another receiver.
### Grouping
| Field | Default | Description |
|-------|--------------|---------|
| Group By | N/A | The labels by which incoming alerts are grouped together, for example, `[ group_by: '[' <labelname>, ... ']' ]`. Multiple alerts coming in for labels such as `cluster=A` and `alertname=LatencyHigh` can be batched into a single group. To aggregate by all possible labels, use the special value `'...'` as the sole label name, for example: `group_by: ['...']`. Grouping by `...` effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping. |
| Group Wait | 30s | How long to wait to buffer alerts of the same group before sending initially. |
| Group Interval | 5m | How long to wait before sending an alert that has been added to a group of alerts for which an initial notification has already been sent. |
| Repeat Interval | 4h | How long to wait before re-sending a given alert that has already been sent. |
### Matching
The **Match** field refers to a set of equality matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs to the Rancher UI, they correspond to the YAML in this format:
```yaml
match:
[ <labelname>: <labelvalue>, ... ]
```
The **Match Regex** field refers to a set of regex-matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs in the Rancher UI, they correspond to the YAML in this format:
```yaml
match_re:
[ <labelname>: <regex>, ... ]
```
{{% /tab %}}
{{% tab "Rancher v2.5.0-2.5.3" %}}
The Alertmanager must be configured in YAML, as shown in these [examples.](#example-alertmanager-configs)
{{% /tab %}}
{{% /tabs %}}
# Example Alertmanager Configs
@@ -440,3 +390,8 @@ spec:
```
For more information on enabling alerting for `rancher-cis-benchmark`, see [this section.]({{<baseurl>}}/rancher/v2.5/en/cis-scans/v2.5/#enabling-alerting-for-rancher-cis-benchmark)
# Trusted CA for Notifiers
If you need to add a trusted CA to your notifier, follow the steps in [this section.](../helm-chart-options/#trusted-ca-for-notifiers)
@@ -0,0 +1,74 @@
---
title: Route Configuration
shortTitle: Routes
weight: 5
---
The route configuration is the section of the Alertmanager custom resource that controls how the alerts fired by Prometheus are grouped and filtered before they reach the receiver.
When a Route is changed, the Prometheus Operator regenerates the Alertmanager custom resource to reflect the changes.
For more information about configuring routes, refer to the [official Alertmanager documentation.](https://www.prometheus.io/docs/alerting/latest/configuration/#route)
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
- [Route Restrictions](#route-restrictions)
- [Route Configuration](#route-configuration)
- [Receiver](#receiver)
- [Grouping](#grouping)
- [Matching](#matching)
# Route Restrictions
- Alertmanager proxies alerts for Prometheus based on a configuration. It has receivers and a routing tree.
- Receivers: One or more notification providers (Slack, PagerDuty, etc.) to send alerts to.
- Routing tree: A set of routes that filter alerts to certain receivers based on labels.
- Alerting drivers proxy alerts for Alertmanager to non-native receivers, such as Microsoft Teams and SMS.
- You can configure a routing tree so that an alert is sent to a receiver and then continues to be evaluated. The Rancher UI only supports routing trees with one root and one additional level, for a tree that is two levels deep. Technically, however, a `continue` route lets you make the tree deeper.
- A receiver can contain one or more notification providers. If you know that every alert sent to Slack should also go to PagerDuty, you can put both configurations in the same receiver.
- SMS is now broadly supported, not just Aliyun SMS.
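The points above about shared receivers and `continue` can be sketched in the Alertmanager configuration format as follows. In Alertmanager, `continue: true` means that an alert matching a route is also evaluated against the sibling routes that follow it. All names, URLs, and keys below are hypothetical placeholders.

```yaml
route:
  receiver: default-receiver
  routes:
    # Critical alerts go to a receiver with two notification providers...
    - receiver: slack-and-pagerduty
      match:
        severity: critical
      continue: true                   # ...and are then evaluated by the next route
    - receiver: audit-webhook
      match_re:
        severity: 'critical|warning'
receivers:
  - name: default-receiver
  - name: slack-and-pagerduty
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/TOKEN   # placeholder
        channel: '#alerts'
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>                        # placeholder
  - name: audit-webhook
    webhook_configs:
      - url: http://audit.example.com/alerts                            # hypothetical endpoint
```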
# Route Configuration
### Note on Labels and Annotations
Labels should be used for identifying information that can affect the routing of notifications. Identifying information about the alert could consist of a container name, or the name of the team that should be notified.
Annotations should be used for information that does not affect who receives the alert, such as a runbook url or error message.
{{% tabs %}}
{{% tab "Rancher v2.5.4+" %}}
### Receiver
The route needs to refer to a [receiver](#receiver-configuration) that has already been configured.
### Grouping
| Field | Default | Description |
|-------|--------------|---------|
| Group By | N/A | The labels by which incoming alerts are grouped together, for example, `[ group_by: '[' <labelname>, ... ']' ]`. Multiple alerts coming in for labels such as `cluster=A` and `alertname=LatencyHigh` can be batched into a single group. To aggregate by all possible labels, use the special value `'...'` as the sole label name, for example: `group_by: ['...']`. Grouping by `...` effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping. |
| Group Wait | 30s | How long to wait to buffer alerts of the same group before sending initially. |
| Group Interval | 5m | How long to wait before sending an alert that has been added to a group of alerts for which an initial notification has already been sent. |
| Repeat Interval | 4h | How long to wait before re-sending a given alert that has already been sent. |
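For reference, the grouping fields in the table correspond to the following keys on a route in the Alertmanager configuration. This sketch uses the default timings from the table; the receiver name and the grouping labels are hypothetical.

```yaml
route:
  receiver: default-receiver           # hypothetical receiver name
  group_by: ['cluster', 'alertname']   # Group By
  group_wait: 30s                      # Group Wait
  group_interval: 5m                   # Group Interval
  repeat_interval: 4h                  # Repeat Interval
```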
### Matching
The **Match** field refers to a set of equality matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs to the Rancher UI, they correspond to the YAML in this format:
```yaml
match:
[ <labelname>: <labelvalue>, ... ]
```
The **Match Regex** field refers to a set of regex-matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs in the Rancher UI, they correspond to the YAML in this format:
```yaml
match_re:
[ <labelname>: <regex>, ... ]
```
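As a concrete sketch of both formats, a route that combines equality matchers with a regex matcher might contain the following. The label names and values are hypothetical.

```yaml
# Equality matchers: every label must equal the given value.
match:
  team: front-end
  severity: critical
# Regex matchers: the label value must match the whole regular expression.
match_re:
  namespace: 'prod-.*|staging-.*'
```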
{{% /tab %}}
{{% tab "Rancher v2.5.0-2.5.3" %}}
The Alertmanager must be configured in YAML, as shown in this [example.](./examples/#alertmanager-config)
{{% /tab %}}
{{% /tabs %}}
@@ -0,0 +1,31 @@
---
title: ServiceMonitor and PodMonitor Configuration
shortTitle: ServiceMonitors and PodMonitors
weight: 7
---
ServiceMonitors and PodMonitors are both pseudo-CRDs that map to the scrape configuration of the Prometheus custom resource.
These configuration objects declaratively specify the endpoints that Prometheus will scrape metrics from.
ServiceMonitors are more commonly used than PodMonitors, and we recommend them for most use cases.
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
### ServiceMonitors
This pseudo-CRD maps to a section of the Prometheus custom resource configuration. It declaratively specifies how groups of Kubernetes services should be monitored.
When a ServiceMonitor is created, the Prometheus Operator updates the Prometheus scrape configuration to include the ServiceMonitor configuration. Then Prometheus begins scraping metrics from the endpoint defined in the ServiceMonitor.
Any Services in your cluster that match the labels located within the ServiceMonitor `selector` field will be monitored based on the `endpoints` specified on the ServiceMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) provided by Prometheus Operator.
For more information about how ServiceMonitors work, refer to the [Prometheus Operator documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md)
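A minimal ServiceMonitor could look like the sketch below. It assumes a Service labeled `app: example-app` in the `default` namespace that exposes a port named `web`; all of these names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                    # hypothetical name
  namespace: default
spec:
  # Services carrying these labels are monitored.
  selector:
    matchLabels:
      app: example-app
  # Namespaces in which to look for matching Services.
  namespaceSelector:
    matchNames:
      - default
  # How each matching Service is scraped.
  endpoints:
    - port: web                        # name of the Service port
      path: /metrics
      interval: 30s
```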
### PodMonitors
This pseudo-CRD maps to a section of the Prometheus custom resource configuration. It declaratively specifies how groups of pods should be monitored.
When a PodMonitor is created, the Prometheus Operator updates the Prometheus scrape configuration to include the PodMonitor configuration. Then Prometheus begins scraping metrics from the endpoint defined in the PodMonitor.
Any Pods in your cluster that match the labels located within the PodMonitor `selector` field will be monitored based on the `podMetricsEndpoints` specified on the PodMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitorspec) provided by Prometheus Operator.
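A minimal PodMonitor could look like the following sketch. It assumes Pods labeled `app: example-app` whose container exposes a port named `web`; these names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app                    # hypothetical name
  namespace: default
spec:
  # Pods carrying these labels are monitored.
  selector:
    matchLabels:
      app: example-app
  # How each matching Pod is scraped.
  podMetricsEndpoints:
    - port: web                        # name of the container port
      path: /metrics
      interval: 30s
```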
@@ -0,0 +1,86 @@
---
title: Built-in Dashboards
weight: 3
---
- [Grafana UI](#grafana-ui)
- [Alertmanager UI](#alertmanager-ui)
- [Prometheus UI](#prometheus-ui)
# Grafana UI
[Grafana](https://grafana.com/grafana/) allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture.
To see the default dashboards for time series data visualization, go to the Grafana UI.
### Customizing Grafana
To view and customize the PromQL queries powering the Grafana dashboard, see [this page.](./customize-grafana)
### Persistent Grafana Dashboards
To create a persistent Grafana dashboard, see [this page.](./persist-grafana)
### Access to Grafana
For information about role-based access control for Grafana, see [this section.](./rbac/#role-based-access-control-for-grafana)
# Alertmanager UI
When `rancher-monitoring` is installed, the Prometheus Alertmanager UI is deployed, allowing you to view your alerts and the current Alertmanager configuration.
> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works)
### Accessing the Alertmanager UI
The Alertmanager UI lets you see the most recently fired alerts.
> **Prerequisite:** The `rancher-monitoring` application must be installed.
To see the Alertmanager UI, go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Alertmanager.**
**Result:** The Alertmanager UI opens in a new tab. For help with configuration, refer to the [official Alertmanager documentation.](https://prometheus.io/docs/alerting/latest/alertmanager/)
For more information on configuring Alertmanager in Rancher, see [this page.](./configuration/alertmanager)
<figcaption>The Alertmanager UI</figcaption>
![Alertmanager UI]({{<baseurl>}}/img/rancher/alertmanager-ui.png)
### Viewing Default Alerts
To see alerts that are fired by default, go to the [Alertmanager UI](./alertmanager-ui) and click **Expand all groups.**
# Prometheus UI
By default, the [kube-state-metrics service](https://github.com/kubernetes/kube-state-metrics) provides a wealth of information about CPU and memory utilization to the monitoring application. These metrics cover Kubernetes resources across namespaces. This means that in order to see resource metrics for a service, you don't need to create a new ServiceMonitor for it. Because the data is already in the time series database, you can go to the Prometheus UI and run a PromQL query to get the information. The same query can be used to configure a Grafana dashboard to show a graph of those metrics over time.
To see the Prometheus UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Graph.**
<figcaption>Prometheus Graph UI</figcaption>
![Prometheus Graph UI]({{<baseurl>}}/img/rancher/prometheus-graph-ui.png)
### Viewing the Prometheus Targets
To see what services you are monitoring, you will need to see your targets. Targets are set up by ServiceMonitors and PodMonitors as sources to scrape metrics from. You won't need to directly edit targets, but the Prometheus UI can be useful for giving you an overview of all of the sources of metrics that are being scraped.
To see the Prometheus Targets, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Targets.**
<figcaption>Targets in the Prometheus UI</figcaption>
![Prometheus Targets UI]({{<baseurl>}}/img/rancher/prometheus-targets-ui.png)
### Viewing the PrometheusRules
When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive a certain Alert.
To see the PrometheusRules, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Rules.**
You can also see the rules in the Prometheus UI:
<figcaption>Rules in the Prometheus UI</figcaption>
![PrometheusRules UI]({{<baseurl>}}/img/rancher/prometheus-rules-ui.png)
For more information on configuring PrometheusRules in Rancher, see [this page.](./configuration/prometheusrules)
@@ -1,16 +1,17 @@
---
title: Prometheus Expressions
weight: 4
title: PromQL Expression Reference
weight: 6
aliases:
- /rancher/v2.5/en/project-admin/tools/monitoring/expression
- /rancher/v2.5/en/cluster-admin/tools/monitoring/expression
- /rancher/v2.5/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/expression
- /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/expression
- /rancher/v2.5/en/monitoring/alerting/configuration/expression
---
The PromQL expressions in this doc can be used to configure alerts.
For more information about querying Prometheus, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/)
For more information about querying the Prometheus time series database, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/)
<!-- TOC -->
@@ -0,0 +1,15 @@
---
title: Monitoring Guides
shortTitle: Guides
weight: 4
---
- [Enable monitoring](./enable-monitoring)
- [Uninstall monitoring](./uninstall)
- [Monitoring Rancher apps](./monitoring-rancher-apps)
- [Monitoring workloads](./monitoring-workloads)
- [Customizing Grafana dashboards](./customize-grafana)
- [Persistent Grafana dashboards](./persist-grafana)
- [Setting up metrics for horizontal pod autoscaling](./hpa)
- [Debugging high memory usage](./memory-usage)
- [Migrating from Monitoring V1 to V2](./migrating)
@@ -0,0 +1,52 @@
---
title: Customizing Grafana Dashboards
weight: 5
---
In this section, you'll learn how to customize the Grafana dashboard to show metrics that apply to a certain container.
### Prerequisites
Before you can customize a Grafana dashboard, the `rancher-monitoring` application must be installed.
To see the links to the external monitoring UIs, including Grafana dashboards, you will need at least a [project-member role.]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/rbac/#users-with-rancher-cluster-manager-based-permissions)
### Signing in to Grafana
1. In the Rancher UI, go to the cluster that has the dashboard you want to customize.
1. In the left navigation menu, click **Monitoring.**
1. Click **Grafana.** The Grafana dashboard should open in a new tab.
1. Go to the log in icon in the lower left corner and click **Sign In.**
1. Log in to Grafana. The default Admin username and password for the Grafana instance are `admin/prom-operator`. (Regardless of who has the password, cluster administrator permission in Rancher is still required to access the Grafana instance.) Alternative credentials can also be supplied when deploying or upgrading the chart.
### Getting the PromQL Query Powering a Grafana Panel
For any panel, you can click the title and click **Explore** to get the PromQL queries powering the graphic.
For this example, we would like to get the CPU usage for the Alertmanager container, so we click **CPU Utilization > Inspect.**
1. The **Data** tab shows the underlying data as a time series, with the time in the first column and the PromQL query result in the second column. Copy the PromQL query.
```
(1 - (avg(irate({__name__=~"node_cpu_seconds_total|windows_cpu_time_total",mode="idle"}[5m])))) * 100
```
### Modifying an Existing Grafana Panel
1. Open the Grafana dashboard.
### Creating a New Grafana Panel in a Dashboard
- Let's say you want metrics that apply only to the Alertmanager container.
- Link to the PromQL queries used to make Grafana dashboards. To get those queries:
- Go to Grafana.
- Right-click a graphic and click **Explore.**
- Grafana shows you the PromQL queries that are embedded in the panel.
- You can modify the query.
- Grafana updates the graphic based on your modifications to the query.
- Also link to the persisting Grafana dashboards section.
@@ -0,0 +1,79 @@
---
title: Enable Monitoring
weight: 1
---
As an [administrator]({{<baseurl>}}/rancher/v2.5/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{<baseurl>}}/rancher/v2.5/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to deploy Prometheus to monitor your Kubernetes cluster.
This page describes how to enable monitoring and alerting within a cluster using the new monitoring application.
You can enable monitoring with or without SSL.
# Requirements
- Make sure that you are allowing traffic on port 9796 for each of your nodes because Prometheus will scrape metrics from here.
- Make sure your cluster fulfills the resource requirements. The cluster should have at least 1950Mi memory available, 2700m CPU, and 50Gi storage. A breakdown of the resource limits and requests is [here.](./configuration/helm-chart-options/#setting-resource-limits-and-requests)
- When installing monitoring on an RKE cluster using RancherOS or Flatcar Linux nodes, change the etcd node certificate directory to `/opt/rke/etc/kubernetes/ssl`.
> **Note:** If you want to set up Alertmanager, Grafana or Ingress, it has to be done with the settings on the Helm chart deployment. It's problematic to create Ingress outside the deployment.
# Setting Resource Limits and Requests
The resource requests and limits can be configured when installing `rancher-monitoring`. To configure Prometheus resources from the Rancher UI, click **Apps & Marketplace > Monitoring** in the upper left corner.
For more information about the default limits, see [this page.](./configuration/helm-chart-options/#setting-resource-limits-and-requests)
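As a rough sketch, resource requests and limits for Prometheus can also be expressed as Helm values. The key path below follows the upstream kube-prometheus-stack chart, which `rancher-monitoring` closely tracks. The numbers are illustrative assumptions, not the chart defaults.

```yaml
# Illustrative values only: the numbers are assumptions, not chart defaults.
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 750m
        memory: 750Mi
      limits:
        cpu: 1000m
        memory: 2500Mi
```

Requests should stay at or below the corresponding limits, and the totals across all monitoring components should fit within the cluster capacity listed in the requirements.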
# Install the Monitoring Application
{{% tabs %}}
{{% tab "Rancher v2.5.8" %}}
### Enable Monitoring for use without SSL
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration)
1. Scroll to the bottom of the Helm chart README and click **Install.**
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
### Enable Monitoring for use with SSL
1. Follow the steps on [this page]({{<baseurl>}}/rancher/v2.5/en/k8s-in-rancher/secrets/) to create a secret in order for SSL to be used for alerts.
- The secret should be created in the `cattle-monitoring-system` namespace. If it doesn't exist, create it first.
- Add the `ca`, `cert`, and `key` files to the secret.
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Click **Alerting**.
1. Click **Additional Secrets** and add the secrets created earlier.
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
When [creating a receiver,]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/configuration/alertmanager/#creating-receivers-in-the-rancher-ui) SSL-enabled receivers such as email or webhook will have an **SSL** section with fields for **CA File Path**, **Cert File Path**, and **Key File Path**. Fill in these fields with the paths to each of `ca`, `cert`, and `key`. The path will be of the form `/etc/alertmanager/secrets/name-of-file-in-secret`.
For example, if you created a secret with these key-value pairs:
```yaml
ca.crt: <base64-content>
cert.pem: <base64-content>
key.pfx: <base64-content>
```
Then **Cert File Path** would be set to `/etc/alertmanager/secrets/cert.pem`.
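For reference, a Secret holding those three files might look like the sketch below. The Secret name `alertmanager-tls` is a hypothetical placeholder, and the `<base64-content>` values stand in for your own base64-encoded files.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-tls            # hypothetical name
  namespace: cattle-monitoring-system
type: Opaque
data:
  # Each key becomes a file name under /etc/alertmanager/secrets/
  ca.crt: <base64-content>
  cert.pem: <base64-content>
  key.pfx: <base64-content>
```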
{{% /tab %}}
{{% tab "Rancher v2.5.0-2.5.7" %}}
1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.**
1. Click **Apps.**
1. Click the `rancher-monitoring` app.
1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration)
1. Scroll to the bottom of the Helm chart README and click **Install.**
**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace.
{{% /tab %}}
{{% /tabs %}}
@@ -0,0 +1,31 @@
---
title: Setting up Metrics for HPA
weight: 7
---
The monitoring app installs a Prometheus adapter that can be used for making the metrics from monitoring available from the Kubernetes API. This is useful for horizontal pod autoscaling based on custom metrics.
- kube-state-metrics: monitors internal Kubernetes components
For HPA, it's important to understand the Kubernetes metrics APIs. In every RKE cluster, Metrics Server is deployed as an add-on. HPA can query it to scale up or down based on pod or node usage.
We package Prometheus Adapter. It implements a Kubernetes metrics API, declaring which Prometheus metrics should be exposed in the Kubernetes API so that they can be used for HPA.
- Kubernetes metrics APIs are implemented as adapters.
- The adapter that has been available the longest implements the resource metrics API. This is why, when you deploy RKE, the metrics API implementation that is added by default is Metrics Server.
- Metrics Server is a Kubernetes project that implements the resource metrics API. It collects node metrics and stores them in a way that is accessible to HPA.
- If you want Prometheus metrics to be available from the Kubernetes API so that HPA can act on them, the relevant way to configure that is by using Prometheus Adapter. It is packaged by default in Monitoring V2, but not in V1.
- If you want to use the custom metrics API, there is a secret for Prometheus Adapter that you can modify to start exposing selected metrics from Prometheus on those APIs, where they can be consumed by HPA.
- Resource metrics API: implemented by Metrics Server, deployed as an RKE add-on
- Custom metrics API: implemented by Prometheus Adapter, exposed for use within the cluster (e.g. HPA)
- External metrics API: implemented by Prometheus Adapter, exposed for use outside the cluster.
Kubernetes metrics API
- For HPA, how do I query Prometheus to use that?
- Prometheus stores data within its own time series database.
- There are times when you also want to expose that data within Kubernetes itself, so that things like HPA can use it.
- Kubernetes has metrics APIs that are implemented as adapters.
- The most prominent one is the metrics API.
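To make the consumer side concrete, here is a minimal sketch of a HorizontalPodAutoscaler that scales on a metric served through the custom metrics API. The Deployment name `example-app` and the metric name `http_requests_per_second` are hypothetical, and the sketch assumes Prometheus Adapter has already been configured to expose that metric.

```yaml
apiVersion: autoscaling/v2          # use autoscaling/v2beta2 on Kubernetes versions before 1.23
kind: HorizontalPodAutoscaler
metadata:
  name: example-app                 # hypothetical
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app               # hypothetical workload to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"
```

With a `Pods` metric, HPA averages the metric across the pods of the target workload and adds or removes replicas to keep that average near the target value.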
@@ -0,0 +1,20 @@
---
title: Debugging High Memory Usage
weight: 8
---
Every time series in Prometheus is uniquely identified by its [metric name](https://prometheus.io/docs/practices/naming/#metric-names) and optional key-value pairs called [labels.](https://prometheus.io/docs/practices/naming/#labels)
Labels make it possible to filter and aggregate the time series data, but they also multiply the amount of data that Prometheus collects.
Each time series has a defined set of labels, and Prometheus generates a new time series for every unique combination of label values. For example, if a metric has one label that takes two distinct values, two time series are generated for that metric. Changing any label value, or adding or removing a label, creates a new time series.
Prometheus is optimized to store data that is indexed by series. It is designed for a relatively consistent number of time series and a relatively large number of samples that need to be collected from the exporters over time.
Conversely, Prometheus is not optimized to accommodate a rapidly changing number of time series. For that reason, large bursts of memory usage can occur when monitoring is installed on clusters where many resources are being created and destroyed, especially on multi-tenant clusters.
### Reducing Memory Bursts
To reduce memory consumption, Prometheus can be configured to store fewer time series, by scraping fewer metrics or by attaching fewer labels to the time series. To see which series use the most memory, you can check the TSDB (time series database) status page in the Prometheus UI.
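One way to scrape fewer metrics or attach fewer labels is to add `metricRelabelings` to a ServiceMonitor endpoint. The sketch below is an assumption-laden example: the resource name, the `app` label, the `example_debug_` metric prefix, and the `request_id` label are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # hypothetical
  namespace: default
spec:
  selector:
    matchLabels:
      app: example-app              # hypothetical Service label
  endpoints:
    - port: metrics                 # name of the Service port
      metricRelabelings:
        # Drop whole metrics whose names match the regex.
        - sourceLabels: [__name__]
          regex: example_debug_.*
          action: drop
        # Remove a high-cardinality label from every remaining series.
        - regex: request_id
          action: labeldrop
```

When removing a label with `labeldrop`, check that the remaining labels still identify each series uniquely, otherwise samples from different series will collide.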
Distributed Prometheus solutions such as [Thanos](https://thanos.io/) and [Cortex](https://cortexmetrics.io/) use an alternate architecture in which multiple small Prometheus instances are deployed. In the case of Thanos, the metrics from each Prometheus are aggregated into the common Thanos deployment, and then those metrics are exported to a persistent store, such as S3. This more robust architecture avoids burdening any single Prometheus instance with too many time series, while also preserving the ability to query metrics on a global level.
@@ -1,13 +1,22 @@
---
title: Migrating to Rancher v2.5 Monitoring
weight: 5
weight: 9
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/migrating
---
If you previously enabled Monitoring, Alerting, or Notifiers in Rancher before v2.5, there is no automatic upgrade path for switching to the new monitoring/alerting solution. Before deploying the new monitoring solution via Cluster Explorer, you will need to disable and remove all existing custom alerts, notifiers and monitoring installations for the whole cluster and in all projects.
### Monitoring Before Rancher v2.5
- [Monitoring Before Rancher v2.5](#monitoring-before-rancher-v2-5)
- [Monitoring and Alerting via Cluster Explorer in Rancher v2.5](#monitoring-and-alerting-via-cluster-explorer-in-rancher-v2-5)
- [Changes to Role-based Access Control](#changes-to-role-based-access-control)
- [Migrating from Monitoring V1 to Monitoring V2](#migrating-from-monitoring-v1-to-monitoring-v2)
- [Migrating Grafana Dashboards](#migrating-grafana-dashboards)
- [Migrating Alerts](#migrating-alerts)
- [Migrating Notifiers](#migrating-notifiers)
- [Migrating for RKE Template Users](#migrating-for-rke-template-users)
# Monitoring Before Rancher v2.5
As of v2.2.0, Rancher's Cluster Manager allowed users to enable Monitoring & Alerting V1 (both powered by [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)) independently within a cluster.
@@ -17,7 +26,7 @@ Monitoring V1 could be configured on both a cluster-level and on a project-level
When Alerts or Notifiers are enabled, Alerting V1 deploys [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) and a set of Rancher controllers onto a cluster that allows users to define alerts and configure alert-based notifications via Email, Slack, PagerDuty, etc. Users can choose to create different types of alerts depending on what needs to be monitored (e.g. System Services, Resources, CIS Scans, etc.); however, PromQL Expression-based alerts can only be created if Monitoring V1 is enabled.
### Monitoring/Alerting via Cluster Explorer in Rancher 2.5
# Monitoring and Alerting via Cluster Explorer in Rancher 2.5
As of v2.5.0, Rancher's Cluster Explorer now allows users to enable Monitoring & Alerting V2 (both powered by [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)) together within a cluster.
@@ -27,28 +36,28 @@ Monitoring V2 can only be configured on the cluster level. Project-level monitor
For more information on how to configure Monitoring & Alerting V2, see [this page.]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration)
### Changes to Role-based Access Control
# Changes to Role-based Access Control
Project owners and members no longer get access to Grafana or Prometheus by default. If view-only users had access to Grafana, they would be able to see data from any namespace. For Kiali, any user can edit things they don't own in any namespace.
For more information about role-based access control in `rancher-monitoring`, refer to [this page.](../rbac)
### Migrating from Monitoring V1 to Monitoring V2
# Migrating from Monitoring V1 to Monitoring V2
While there is no automatic migration available, it is possible to manually migrate custom Grafana dashboards and alerts that were created in Monitoring V1 to Monitoring V2.
Before you can install Monitoring V2, Monitoring V1 needs to be uninstalled completely. In order to uninstall Monitoring V1:
* Remove all cluster and project specific alerts and alerts groups
* Remove all notifiers
* Disable all project monitoring installations under Cluster -> Project -> Tools -> Monitoring
* Remove all cluster and project specific alerts and alerts groups.
* Remove all notifiers.
* Disable all project monitoring installations under Cluster -> Project -> Tools -> Monitoring.
* Ensure that all project-monitoring apps in all projects have been removed and are not recreated after a few minutes
* Disable the cluster monitoring installation under Cluster -> Tools -> Monitoring
* Ensure that the cluster-monitoring app and the monitoring-operator app in the System project have been removed and are not recreated after a few minutes
* Disable the cluster monitoring installation under Cluster -> Tools -> Monitoring.
* Ensure that the cluster-monitoring app and the monitoring-operator app in the System project have been removed and are not recreated after a few minutes.
#### RKE Template Clusters
To prevent v1 monitoring from being re-enabled, disable monitoring and in future RKE template revisions via modification of the RKE template yaml:
To prevent V1 monitoring from being re-enabled, disable monitoring and alerting in future RKE template revisions by modifying the RKE template YAML:
```yaml
enable_cluster_alerting: false
@@ -86,7 +95,7 @@ data:
Once this ConfigMap is created, the dashboard will automatically be added to Grafana.
#### Migrating Alerts
### Migrating Alerts
It is only possible to directly migrate expression-based alerts to Monitoring V2. Fortunately, the event-based alerts that could be set up to alert on system component, node or workload events, are already covered out-of-the-box by the alerts that are part of Monitoring V2. So it is not necessary to migrate them.
@@ -121,6 +130,11 @@ or add the Prometheus Rule through the Cluster Explorer
For more details on how to configure PrometheusRules in Monitoring V2 see [Monitoring Configuration]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration#prometheusrules).
#### Migrating notifiers
### Migrating Notifiers
There is no direct equivalent for how notifiers work in Monitoring V1. Instead you have to replicate the desired setup with [Routes and Receivers]({{<baseurl>}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration#alertmanager-config) in Monitoring V2.
### Migrating for RKE Template Users
If the cluster is managed using an RKE template, you will need to disable monitoring in future RKE template revisions to prevent legacy monitoring from being re-enabled.
@@ -0,0 +1,20 @@
---
title: Monitoring Rancher Apps
weight: 3
---
A common pattern for Rancher apps is to package a ServiceMonitor in the Helm chart for the application. The ServiceMonitor contains a preconfigured Prometheus target for monitoring.
When the ServiceMonitor is enabled and monitoring is also enabled, Prometheus will be able to scrape metrics from the Rancher application.
The CIS application has a flag that lets you deploy a ServiceMonitor with it. As a general practice, charts that expose Prometheus metrics include a ServiceMonitor definition. The moment it's deployed into the cluster, the Prometheus scrape configuration is automatically updated to reflect the ServiceMonitors that it has access to.
Logging V2 will deploy a ServiceMonitor, and monitoring will pick it up automatically.
Some Rancher Helm charts already have a ServiceMonitor defined that you might have to turn on. If you do, those metrics are prepackaged for Prometheus in the right format.
It's a common pattern to have a ServiceMonitor packaged inside the chart. That's how it is done for CIS scans.
@@ -0,0 +1,56 @@
---
title: Setting up Monitoring for a Workload
weight: 4
---
- [Display CPU and Memory Metrics for a Workload](#display-cpu-and-memory-metrics-for-a-workload)
- [Setting up Metrics Beyond CPU and Memory](#setting-up-metrics-beyond-cpu-and-memory)
The steps for setting up monitoring for workloads depend on whether you want basic metrics such as CPU and memory for the workload, or whether you want to scrape custom metrics from the workload.
If you only need CPU and memory time series for the workload, you don't need to deploy a ServiceMonitor or PodMonitor because the monitoring application already collects metrics data on resource usage by default. The resource usage time series data is in Prometheus's local time series database. Grafana shows the data in aggregate, but you can see the data for the individual workload by using a PromQL query that extracts the data for that workload. Once you have the PromQL query, you can execute the query individually in the Prometheus UI and see the time series visualized there, or you can use the query to customize a Grafana dashboard to display the workload metrics. For examples of PromQL queries for workload metrics, see [this section.](https://rancher.com/docs/rancher/v2.5/en/monitoring-alerting/configuration/expression/#workload-metrics)
To set up custom metrics for your workload, you will need to set up an exporter and create a new ServiceMonitor custom resource to configure Prometheus to scrape metrics from your exporter.
For more information, see [this section.](./monitoring-workloads)
Some applications come with a ServiceMonitor packaged within them. For example, some Rancher applications come with ServiceMonitors.
### Display CPU and Memory Metrics for a Workload
By default, the monitoring application already scrapes CPU and memory.
To get some fine-grained detail for a particular workload, you can customize a Grafana dashboard to display the metrics for a particular workload.
- There's already a wealth of information provided by kube-state-metrics, such as CPU utilization and memory utilization for different resources across namespaces. If you just want resource metrics for prod, you don't need to create a new ServiceMonitor for it. All you need to do is go to the Prometheus UI and run a PromQL query to get the information.
For more information on customizing Grafana to show the workload metrics, see this section. (Link)
### Setting up Metrics Beyond CPU and Memory
For custom metrics, you will need to expose the metrics on your application in a format supported by Prometheus.
We then recommend creating a new ServiceMonitor custom resource. When this resource is created, the Prometheus custom resource will be automatically updated so that its scrape configuration includes the new custom metrics endpoint. Then Prometheus will begin scraping metrics from the endpoint.
You can also create a PodMonitor to expose the custom metrics endpoint, but ServiceMonitors are more appropriate for the majority of use cases.
- Let's say we expose metrics at a particular endpoint. Take rancher-monitoring-kube-state-metrics as an example: it has a container port from which it exposes metrics.
- Although there isn't a clean UI for it, the recommended approach is to create the ServiceMonitor from YAML.
- For something like Grafana (for example, rancher-monitoring-grafana), the basic details we need to provide are:
- The actual endpoint that you want to hit (`spec.endpoints`, path and port): the HTTP path and the port.
- `namespaceSelector`: the namespaces in which that particular deployment exists within Kubernetes. Use `matchNames` to select them.
- You can also use `selector.matchLabels`.
- That's what it takes to add monitoring if a ServiceMonitor is not already defined.
- Example: use the rancher-monitoring-grafana YAML.
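The details listed above can be sketched as a ServiceMonitor manifest. This is a minimal illustration rather than the YAML shipped for any Rancher component: the name `example-app`, the `default` namespace, the `app` label, and the `metrics` port name are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # hypothetical
  namespace: default
spec:
  # Namespaces in which the target Service lives.
  namespaceSelector:
    matchNames:
      - default
  # Labels that the target Service must carry.
  selector:
    matchLabels:
      app: example-app
  # The HTTP path and the named Service port to scrape.
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```

The `port` field refers to the name of a port on the Service, so the Service selected by `matchLabels` needs a port with that name.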
@@ -1,6 +1,6 @@
---
title: Persistent Grafana Dashboards
weight: 4
weight: 6
aliases:
- /rancher/v2.5/en/monitoring-alerting/v2.5/persist-grafana
---
@@ -75,7 +75,7 @@ If you attempt to delete the dashboard in the Grafana UI, you will see the error
### Configuring Namespaces for the Grafana Dashboard ConfigMap
To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set:
To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set this value in the `rancher-monitoring` Helm chart:
```
grafana.sidecar.dashboards.searchNamespace=ALL
@@ -0,0 +1,14 @@
---
title: Uninstall Monitoring
weight: 2
---
1. From the **Cluster Explorer,** click Apps & Marketplace.
1. Click **Installed Apps.**
1. Go to the `cattle-monitoring-system` namespace and check the boxes for `rancher-monitoring-crd` and `rancher-monitoring`.
1. Click **Delete.**
1. Confirm **Delete.**
**Result:** `rancher-monitoring` is uninstalled.
> **Note on Persistent Grafana Dashboards:** For users who are using Monitoring V2 v9.4.203 or below, uninstalling the Monitoring chart will delete the cattle-dashboards namespace, which will delete all persisted dashboards, unless the namespace is marked with the annotation `helm.sh/resource-policy: "keep"`. This annotation is added by default in Monitoring V2 v14.5.100+ but can be manually applied on the cattle-dashboards namespace before an uninstall if an older version of the Monitoring chart is currently installed onto your cluster.
@@ -0,0 +1,168 @@
---
title: How Monitoring Works
weight: 1
---
- [1. How Data Flows through the Monitoring Application](#1-how-data-flows-through-the-monitoring-application)
- [2. How Prometheus Works](#2-how-prometheus-works)
- [2.1. Defining what Metrics are Scraped](#2-1-defining-what-metrics-are-scraped)
- [2.2. Scraping Metrics from Exporters](#2-2-scraping-metrics-from-exporters)
- [2.3. Storing Time Series Data](#2-3-storing-time-series-data)
- [2.4. Querying the Time Series Database](#2-4-querying-the-time-series-database)
- [2.5. Defining Rules for when Alerts Should be Fired](#2-5-defining-rules-for-when-alerts-should-be)
- [2.6. Firing Alerts](#2-6-firing-alerts)
- [3. How Alertmanager Works](#3-how-alertmanager-works)
- [3.1. Routing Alerts to Receivers](#3-1-routing-alerts-to-receivers)
- [3.2. Configuring Multiple Receivers](#3-2-configuring-multiple-receivers)
- [4. How the Monitoring Application Works](#4-how-the-monitoring-application-works)
- [4.1. Resources Deployed by Default](#4-1-resources-deployed-by-default)
- [4.2. PushProx](#4-2-pushprox)
- [4.3. Default Exporters](#4-3-default-exporters)
- [5. Components Exposed in the Rancher UI](#5-components-exposed-in-the-rancher-ui)
# 1. How Data Flows through the Monitoring Application
The diagram below shows the linear flow of data through the monitoring application in chronological order:
![Data Flow Through Monitoring Components]({{<baseurl>}}/img/rancher/monitoring-components.svg)
# 2. How Prometheus Works
### 2.1. Defining what Metrics are Scraped
ServiceMonitors define targets that are intended for Prometheus to scrape.
The [Prometheus custom resource](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#prometheus) tells Prometheus which ServiceMonitors it should use to find out where to scrape metrics from.
The Prometheus Operator observes the ServiceMonitors, continuously using them to auto-generate the scrape configuration in the Prometheus custom resource and keeping it in sync. This scrape configuration tells Prometheus which endpoints to scrape metrics from and how it will label the metrics from those endpoints.
Prometheus scrapes all of the metrics defined in its scrape configuration at every `scrape_interval`, which is one minute by default.
The scrape configuration can be viewed as part of the Prometheus custom resource that is exposed in the Rancher UI.
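As an illustration of how a Prometheus custom resource selects ServiceMonitors, consider the sketch below. The resource name and the `release: rancher-monitoring` label are assumptions made for the example and may not match what the chart deploys.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus          # hypothetical
  namespace: cattle-monitoring-system
spec:
  scrapeInterval: 1m                # how often targets are scraped
  # An empty selector matches ServiceMonitors in all namespaces.
  serviceMonitorNamespaceSelector: {}
  # Only ServiceMonitors carrying this label are used.
  serviceMonitorSelector:
    matchLabels:
      release: rancher-monitoring   # assumed label
```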
### 2.2. Scraping Metrics from Exporters
Prometheus scrapes metrics from deployments known as [exporters,](https://prometheus.io/docs/instrumenting/exporters/) which export the time series data in a format that Prometheus can ingest.
In Prometheus, time series consist of streams of timestamped values belonging to the same metric and the same set of labeled dimensions.
To allow monitoring to be installed on hardened Kubernetes clusters, the `rancher-monitoring` application proxies the communication between Prometheus and the exporter through PushProx. For more information about PushProx, see [this section.](#4-2-pushprox)
### 2.3. Storing Time Series Data
After collecting metrics from exporters, Prometheus stores the time series in a local on-disk time series database. Prometheus optionally integrates with remote systems, but `rancher-monitoring` uses local storage for the time series database.
The database can then be queried using PromQL, the query language for Prometheus. Grafana dashboards use PromQL queries to generate data visualizations.
### 2.4. Querying the Time Series Database
The PromQL query language is the primary tool to query Prometheus for time series data.
In Grafana, you can right-click a CPU utilization panel and click **Inspect.** This opens a panel that shows the [raw query results.](https://grafana.com/docs/grafana/latest/panels/inspect-panel/#inspect-raw-query-results) The raw results demonstrate how each dashboard is powered by PromQL queries.
### 2.5. Defining Rules for when Alerts Should be Fired
Rules define the conditions for Prometheus to fire alerts.
When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert.
For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label.
Prometheus rule files are held in PrometheusRule custom resources. A PrometheusRule allows you to define one or more RuleGroups. Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields:
- The name of the new alert or record
- A PromQL expression for the new alert or record
- Labels that should be attached to the alert or record that identify it (e.g. cluster name or severity)
- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules.
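The fields above map onto a PrometheusRule roughly as follows. This is a sketch under stated assumptions: the resource name, group name, rule names, threshold, and label values are all made up for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules               # hypothetical
  namespace: cattle-monitoring-system
spec:
  groups:
    - name: example.rules
      rules:
        # Recording rule: stores the result of the expression as a new time series.
        - record: instance:node_cpu_utilization:rate5m
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        # Alerting rule: fires once the condition has held for 10 minutes.
        - alert: HighNodeCpuUtilization
          expr: instance:node_cpu_utilization:rate5m > 0.9
          for: 10m
          labels:
            severity: warning
            team: front-end         # used by Alertmanager for routing
          annotations:
            summary: Node CPU utilization has been above 90% for 10 minutes.
```

The recording rule precomputes per-node CPU utilization, and the alerting rule then reuses that series by name instead of repeating the full expression.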
### 2.6. Firing Alerts
Prometheus doesn't maintain the state of whether alerts are active. It fires alerts repetitively at every evaluation interval, relying on Alertmanager to group and filter the alerts into meaningful notifications.
The `evaluation_interval` setting defines how often Prometheus evaluates its alerting rules against the time series database. Like the `scrape_interval`, the `evaluation_interval` defaults to one minute.
The rules are contained in a set of rule files. Rule files include both alerting rules and recording rules, but only alerting rules result in alerts being fired after their evaluation.
For recording rules, Prometheus runs a query, then stores it as a time series. This synthetic time series is useful for storing the results of an expensive or time-consuming query so that it can be queried more quickly in the future.
Alerting rules are more commonly used. Whenever the expression of an alerting rule returns one or more results, Prometheus fires an alert for each resulting time series.
The Rule file adds labels and annotations to alerts before firing them, depending on the use case:
- Labels indicate information that identifies the alert and could affect the routing of the alert. For example, when sending an alert about a certain container, the container ID could be used as a label.
- Annotations denote information that doesn't affect where an alert is routed, for example, a runbook or an error message.
# 3. How Alertmanager Works
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of the following tasks:
- Deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or OpsGenie
- Silencing and inhibition of alerts
- Tracking alerts that fire over time
- Sending out the status of whether an alert is currently firing, or if it is resolved
### 3.1. Routing Alerts to Receivers
Alertmanager coordinates where alerts are sent. It allows you to group alerts based on labels and fire them based on whether certain labels are matched. One top-level route accepts all alerts.
From there, Alertmanager continues routing alerts to receivers based on whether they match the conditions of the next route.
While the Rancher UI forms only allow editing a routing tree that is two levels deep, you can configure more deeply nested routing structures by editing the Alertmanager custom resource YAML.
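A routing tree nested three levels deep might look like the following Alertmanager configuration fragment. The receiver names and label values are hypothetical, and the `match` syntax is used here on the assumption that it is supported by the deployed Alertmanager version (newer releases also offer `matchers`).

```yaml
route:
  # Top-level route: accepts all alerts and acts as the fallback.
  receiver: default-receiver
  group_by: [alertname, namespace]
  routes:
    # Second level: alerts labeled for the front-end team.
    - match:
        team: front-end
      receiver: front-end-receiver
      routes:
        # Third level: only the critical front-end alerts.
        - match:
            severity: critical
          receiver: front-end-pager
receivers:
  - name: default-receiver
  - name: front-end-receiver
  - name: front-end-pager
```

An alert labeled `team: front-end` and `severity: critical` goes to `front-end-pager`, other front-end alerts go to `front-end-receiver`, and everything else falls back to `default-receiver`.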
### 3.2. Configuring Multiple Receivers
By editing the forms in the Rancher UI, you can set up a Receiver resource with all the information Alertmanager needs to send alerts to your notification system.
By editing custom YAML in the Alertmanager or Receiver configuration, you can also send alerts to multiple notification systems. For more information, see the section on configuring [Receivers.](./configuration/receiver/#configuring-multiple-receivers)
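As a minimal sketch of one receiver that notifies two systems, the Alertmanager configuration fragment below combines a Slack integration and a generic webhook. The receiver name, Slack webhook URL, channel, and webhook endpoint are placeholders.

```yaml
receivers:
  - name: team-notifications        # hypothetical
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/YOURS   # placeholder
        channel: "#alerts"
    webhook_configs:
      - url: http://example.internal:8080/alerts                        # placeholder
```

Every alert routed to `team-notifications` is sent to both integrations.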
# 4. How the Monitoring Application Works
Prometheus Operator introduces a set of [Custom Resource Definitions](https://github.com/prometheus-operator/prometheus-operator#customresourcedefinitions) that allow users to deploy and manage Prometheus and Alertmanager instances by creating and modifying those custom resources on a cluster.
Prometheus Operator will automatically update your Prometheus configuration based on the live state of the resources and configuration options that are edited in the Rancher UI.
### 4.1. Resources Deployed by Default
By default, a set of resources curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project is deployed onto your cluster when you install the Rancher Monitoring Application. These resources set up a basic monitoring and alerting stack.
The resources that get deployed onto your cluster to support this solution can be found in the [`rancher-monitoring`](https://github.com/rancher/charts/tree/main/charts/rancher-monitoring) Helm chart, which closely tracks the upstream [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart maintained by the Prometheus community with certain changes tracked in the [CHANGELOG.md](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/CHANGELOG.md).
There are also certain special types of ConfigMaps and Secrets such as those corresponding to Grafana Dashboards, Grafana Datasources, and Alertmanager Configs that will automatically update your Prometheus configuration via sidecar proxies that observe the live state of those resources within your cluster.
### 4.2. PushProx
PushProx enhances the security of the monitoring application, allowing it to be installed on hardened Kubernetes clusters.
To expose Kubernetes metrics, PushProxes use a client proxy model to expose specific ports within default Kubernetes components. Node exporters expose metrics to PushProx through an outbound connection.
The proxy allows `rancher-monitoring` to scrape metrics from processes on the hostNetwork, such as the `kube-api-server`, without opening up node ports to inbound connections.
PushProx is a DaemonSet that listens for clients that seek to register. Once registered, it proxies scrape requests through the established connection. Then the client executes the request to etcd.
All of the default ServiceMonitors, such as `rancher-monitoring-kube-controller-manager`, are configured to hit the metrics endpoint of the client using this proxy.
### 4.3. Default Exporters
`rancher-monitoring` deploys two exporters to expose metrics to Prometheus: `node-exporter` and `windows-exporter`. Both are deployed as DaemonSets.
`node-exporter` exports container, pod and node metrics for CPU and memory from each Linux node. `windows-exporter` does the same, but for Windows nodes.
For more information on `node-exporter`, refer to the [upstream documentation.](https://prometheus.io/docs/guides/node-exporter/)
[kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) is also useful because it exports metrics for Kubernetes components.
# 5. Components Exposed in the Rancher UI
When the monitoring application is installed, you will be able to edit the following components in the Rancher UI:
| Component | Type of Component | Purpose and Common Use Cases for Editing |
|--------------|------------------------|---------------------------|
| ServiceMonitor | Custom resource | Set up targets to scrape custom metrics from. Automatically updates the scrape configuration in the Prometheus custom resource. |
| PodMonitor | Custom resource | Set up targets to scrape custom metrics from. Automatically updates the scrape configuration in the Prometheus custom resource. |
| Receiver | Configuration block (part of Alertmanager) | Set up a notification system to receive alerts. Automatically updates the Alertmanager custom resource. |
| Route | Configuration block (part of Alertmanager) | Add identifying information to make alerts more meaningful and direct them to individual teams. Automatically updates the Alertmanager custom resource. |
| PrometheusRule | Custom resource | For more advanced use cases, you may want to define what Prometheus metrics or time series database queries should result in alerts being fired. Automatically updates the Prometheus custom resource. |
| Alertmanager | Custom resource | Edit this custom resource only if you need more advanced configuration options beyond what the Rancher UI exposes in the Routes and Receivers sections. For example, you might want to edit this resource to add a routing tree with more than two levels. |
| Prometheus | Custom resource | Edit this custom resource only if you need more advanced configuration beyond what can be configured using ServiceMonitors, PodMonitors, or [Rancher monitoring Helm chart options.](./configuration/helm-chart-options) |
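
To illustrate the ServiceMonitor row in the table above, here is a minimal sketch of a ServiceMonitor that scrapes a custom workload. The resource name, namespace, `app` label, and port name are hypothetical and assume a Service that carries the label `app: example-app` and exposes a named port `metrics`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                        # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: example-app                     # must match the Service's labels
  endpoints:
    - port: metrics                        # name of a port on the Service
      path: /metrics
      interval: 30s
```

A PodMonitor follows the same pattern but selects Pods directly, using `podMetricsEndpoints` in place of `endpoints`.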
@@ -1,9 +1,11 @@
---
title: RBAC
weight: 3
title: Role-based Access Control
shortTitle: RBAC
weight: 2
aliases:
- /rancher/v2.5/en/cluster-admin/tools/monitoring/rbac
- /rancher/v2.5/en/monitoring-alerting/v2.5/rbac
- /rancher/v2.5/en/monitoring-alerting/grafana
---
This section describes the expectations for RBAC for Rancher Monitoring.
@@ -17,6 +19,7 @@ This section describes the expectations for RBAC for Rancher Monitoring.
- [Users with Rancher Cluster Manager Based Permissions](#users-with-rancher-cluster-manager-based-permissions)
- [Differences in 2.5.x](#differences-in-2-5-x)
- [Assigning Additional Access](#assigning-additional-access)
- [Role-based Access Control for Grafana](#role-based-access-control-for-grafana)
# Cluster Admins
@@ -131,3 +134,19 @@ If cluster-admins would like to provide additional admin/edit access to users ou
|----------------------------| ------| ------| ----------------------------|
| <ul><li>`secrets`</li><li>`configmaps`</li></ul>| `cattle-monitoring-system` | Yes, Configs and Secrets in this namespace can impact the entire monitoring / alerting pipeline. | User will be able to create or edit Secrets / ConfigMaps such as the Alertmanager Config, Prometheus Adapter Config, TLS secrets, additional Grafana datasources, etc. This can have broad impact on all cluster monitoring / alerting. |
| <ul><li>`secrets`</li><li>`configmaps`</li></ul>| `cattle-dashboards` | Yes, Configs and Secrets in this namespace can create dashboards that make queries on all metrics collected at a cluster-level. | User will be able to create Secrets / ConfigMaps that persist new Grafana Dashboards only. |
# Role-based Access Control for Grafana
Rancher allows any users who are authenticated by Kubernetes and have access to the Grafana service deployed by the Rancher Monitoring chart to access Grafana via the Rancher Dashboard UI. By default, all users who are able to access Grafana are given the [Viewer](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#viewer-role) role, which allows them to view any of the default dashboards deployed by Rancher.
However, users can choose to log in to Grafana as an [Admin](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#admin-role) if necessary. The default Admin username and password for the Grafana instance will be `admin`/`prom-operator`, but alternative credentials can also be supplied on deploying or upgrading the chart.
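
As a sketch of supplying alternative credentials, the fragment below assumes that `rancher-monitoring` passes values under the `grafana` key to the upstream Grafana subchart, as the upstream kube-prometheus-stack chart does. Verify the keys against the values file of the chart version you deploy; the password shown is a placeholder.

```yaml
grafana:
  adminUser: admin
  adminPassword: replace-with-a-strong-password   # placeholder value
```

These values would be set when installing or upgrading the chart.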
To see the Grafana UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Grafana.**
<figcaption>Cluster Compute Resources Dashboard in Grafana</figcaption>
![Cluster Compute Resources Dashboard in Grafana]({{<baseurl>}}/img/rancher/cluster-compute-resources-dashboard.png)
<figcaption>Default Dashboards in Grafana</figcaption>
![Default Dashboards in Grafana]({{<baseurl>}}/img/rancher/grafana-default-dashboard.png)
@@ -1,6 +1,6 @@
---
title: Windows Cluster Support for Monitoring V2
shortTitle: Windows Clusters
shortTitle: Windows Support
weight: 5
---