diff --git a/content/rancher/v2.6/en/monitoring-alerting/_index.md b/content/rancher/v2.6/en/monitoring-alerting/_index.md index beb043edef4..15b85a6347b 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/_index.md @@ -4,217 +4,110 @@ shortTitle: Monitoring/Alerting description: Prometheus lets you view metrics from your different Rancher and Kubernetes objects. Learn about the scope of monitoring and how to enable cluster monitoring weight: 13 aliases: - - /rancher/v2.6/en/dashboard/monitoring-alerting - - /rancher/v2.6/en/dashboard/notifiers - - /rancher/v2.6/en/cluster-admin/tools/monitoring/ + - /rancher/v2.5/en/dashboard/monitoring-alerting + - /rancher/v2.5/en/dashboard/notifiers + - /rancher/v2.5/en/cluster-admin/tools/monitoring/ --- -Using Rancher, you can quickly deploy leading open-source monitoring alerting solutions onto your cluster. +Using the `rancher-monitoring` application, you can quickly deploy leading open-source monitoring and alerting solutions onto your cluster. -The `rancher-monitoring` operator is powered by [Prometheus](https://prometheus.io/), [Grafana](https://grafana.com/grafana/), [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), and the [Prometheus adapter.](https://github.com/DirectXMan12/k8s-prometheus-adapter) This page describes how to enable monitoring and alerting within a cluster using the new monitoring application. 
+- [Features](#features) +- [How Monitoring Works](#how-monitoring-works) +- [Default Components and Deployments](#default-components-and-deployments) +- [Role-based Access Control](#role-based-access-control) +- [Guides](#guides) +- [Windows Cluster Support](#windows-cluster-support) +- [Known Issues](#known-issues) -Rancher's solution allows users to: +### Features -- Monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments via Prometheus, a leading open-source monitoring solution. +Prometheus lets you view metrics from your Rancher and Kubernetes objects. Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or Grafana, which is an analytics viewing platform deployed along with Prometheus. + +By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, or restore crashed servers. + +The `rancher-monitoring` operator, introduced in Rancher v2.5, is powered by [Prometheus](https://prometheus.io/), [Grafana](https://grafana.com/grafana/), [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), and the [Prometheus adapter.](https://github.com/DirectXMan12/k8s-prometheus-adapter) + +The monitoring application allows you to: + +- Monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments - Define alerts based on metrics collected via Prometheus -- Create custom dashboards to make it easy to visualize collected metrics via Grafana +- Create custom Grafana dashboards - Configure alert-based notifications via Email, Slack, PagerDuty, etc. 
using Prometheus Alertmanager - Defines precomputed, frequently needed or computationally expensive expressions as new time series based on metrics collected via Prometheus - Expose collected metrics from Prometheus to the Kubernetes Custom Metrics API via Prometheus Adapter for use in HPA -More information about the resources that get deployed onto your cluster to support this solution can be found in the [`rancher-monitoring`](https://github.com/rancher/charts/tree/main/charts/rancher-monitoring) Helm chart, which closely tracks the upstream [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart maintained by the Prometheus community with certain changes tracked in the [CHANGELOG.md](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/CHANGELOG.md). +# How Monitoring Works -> If you previously enabled Monitoring, Alerting, or Notifiers in Rancher before v2.5, there is no upgrade path for switching to the new monitoring/ alerting solution. You will need to disable monitoring/ alerting/notifiers in Cluster Manager before deploying the new monitoring solution via Cluster Explorer. +For an explanation of how the monitoring components work together, see [this page.](./how-monitoring-works) -For more information about upgrading the Monitoring app in Rancher 2.5, please refer to the [migration docs](./migrating). 
+# Default Components and Deployments -- [About Prometheus](#about-prometheus) -- [Enable Monitoring](#enable-monitoring) - - [Default Alerts, Targets, and Grafana Dashboards](#default-alerts-targets-and-grafana-dashboards) -- [Windows Cluster Support](#windows-cluster-support) -- [Using Monitoring](#using-monitoring) - - [Grafana UI](#grafana-ui) - - [Prometheus UI](#prometheus-ui) - - [Viewing the Prometheus Targets](#viewing-the-prometheus-targets) - - [Viewing the PrometheusRules](#viewing-the-prometheusrules) - - [Viewing Active Alerts in Alertmanager](#viewing-active-alerts-in-alertmanager) -- [Uninstall Monitoring](#uninstall-monitoring) -- [Setting Resource Limits and Requests](#setting-resource-limits-and-requests) -- [Known Issues](#known-issues) +### Built-in Dashboards -# About Prometheus +By default, the monitoring application deploys Grafana dashboards (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) onto a cluster. -Prometheus provides a time series of your data, which is, according to the [Prometheus documentation:](https://prometheus.io/docs/concepts/data_model/) +It also deploys an Alertmanager UI and a Prometheus UI. For more information about these tools, see [Built-in Dashboards.](./dashboards) +### Default Metrics Exporters -> A stream of timestamped values belonging to the same metric and the same set of labeled dimensions, along with comprehensive statistics and metrics of the monitored cluster. +By default, Rancher Monitoring deploys exporters (such as [node-exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)). -In other words, Prometheus lets you view metrics from your different Rancher and Kubernetes objects. 
Using timestamps, Prometheus lets you query and view these metrics in easy-to-read graphs and visuals, either through the Rancher UI or Grafana, which is an analytics viewing platform deployed along with Prometheus. +These default exporters automatically scrape metrics for CPU and memory from all components of your Kubernetes cluster, including your workloads. -By viewing data that Prometheus scrapes from your cluster control plane, nodes, and deployments, you can stay on top of everything happening in your cluster. You can then use these analytics to better run your organization: stop system emergencies before they start, develop maintenance strategies, restore crashed servers, etc. +### Default Alerts -# Enable Monitoring +The monitoring application deploys some alerts by default. To see the default alerts, go to the [Alertmanager UI](./dashboard/accessing-the-alertmanager-ui) and click **Expand all groups.** -As an [administrator]({{}}/rancher/v2.6/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{}}/rancher/v2.6/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to deploy Prometheus to monitor your Kubernetes cluster. +### Components Exposed in the Rancher UI -> **Requirements:** -> -> - Make sure that you are allowing traffic on port 9796 for each of your nodes because Prometheus will scrape metrics from here. -> - Make sure your cluster fulfills the resource requirements. The cluster should have at least 1950Mi memory available, 2700m CPU, and 50Gi storage. A breakdown of the resource limits and requests is [here.](#setting-resource-limits-and-requests) -> - When installing monitoring on an RKE cluster using RancherOS or Flatcar Linux nodes, change the etcd node certificate directory to `/opt/rke/etc/kubernetes/ssl`. 
+For a list of monitoring components exposed in the Rancher UI, along with common use cases for editing them, see [this section.](./how-monitoring-works/#components-exposed-in-the-rancher-ui) -### Enable Monitoring for use without SSL +# Role-based Access Control -1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.** -1. Click **Apps.** -1. Click the `rancher-monitoring` app. -1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration) -1. Scroll to the bottom of the Helm chart README and click **Install.** +For information on configuring access to monitoring, see [this page.](./rbac) -**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace. +# Guides -### Enable Monitoring for use with SSL +- [Enable monitoring](./guides/enable-monitoring) +- [Uninstall monitoring](./guides/uninstall) +- [Monitoring Rancher apps](./guides/monitoring-rancher-apps) +- [Monitoring workloads](./guides/monitoring-workloads) +- [Customizing Grafana dashboards](./guides/customize-grafana) +- [Persistent Grafana dashboards](./guides/persist-grafana) +- [Setting up metrics for horizontal pod autoscaling](./guides/hpa) +- [Debugging high memory usage](./guides/memory-usage) +- [Migrating from Monitoring V1 to V2](./guides/migrating) -1. Follow the steps on [this page]({{}}/rancher/v2.6/en/k8s-in-rancher/secrets/) to create a secret in order for SSL to be used for alerts. - - The secret should be created in the `cattle-monitoring-system` namespace. If it doesn't exist, create it first. - - Add the `ca`, `cert`, and `key` files to the secret. -1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.** -1. Click **Apps.** -1. Click the `rancher-monitoring` app. -1. Click **Alerting**. -1. Click **Additional Secrets** and add the secrets created earlier. 
- -**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace. +# Configuration -When [creating a receiver,]({{}}/rancher/v2.6/en/monitoring-alerting/configuration/alertmanager/#creating-receivers-in-the-rancher-ui) SSL-enabled receivers such as email or webhook will have a **SSL** section with fields for **CA File Path**, **Cert File Path**, and **Key File Path**. Fill in these fields with the paths to each of `ca`, `cert`, and `key`. The path will be of the form `/etc/alertmanager/secrets/name-of-file-in-secret`. +### Configuring Monitoring Resources in Rancher -For example, if you created a secret with these key-value pairs: +> The configuration reference assumes familiarity with how monitoring components work together. For more information, see [How Monitoring Works.](./how-monitoring-works) -```yaml -ca.crt=`base64-content` -cert.pem=`base64-content` -key.pfx=`base64-content` -``` +- [ServiceMonitor and PodMonitor](./configuration/servicemonitor-podmonitor) +- [Receiver](./configuration/receiver) +- [Route](./configuration/route) +- [PrometheusRule](./configuration/advanced/prometheusrule) +- [Prometheus](./configuration/advanced/prometheus) +- [Alertmanager](./configuration/advanced/alertmanager) -Then **Cert File Path** would be set to `/etc/alertmanager/secrets/cert.pem`. +### Configuring Helm Chart Options -### Default Alerts, Targets, and Grafana Dashboards - -By default, Rancher Monitoring deploys exporters (such as [node-exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)) as well as default Prometheus alerts and Grafana dashboards (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) onto a cluster. 
- -To see the default alerts, go to the [Alertmanager UI](#viewing-active-alerts-in-alertmanager) and click **Expand all groups.** - -To see what services you are monitoring, you will need to see your targets. To view the default targets, refer to [Viewing the Prometheus Targets.](#viewing-the-prometheus-targets) - -To see the default dashboards, go to the [Grafana UI.](#grafana-ui) In the left navigation bar, click the icon with four boxes and click **Manage.** - -### Next Steps - -To configure Prometheus resources from the Rancher UI, click **Apps & Marketplace > Monitoring** in the upper left corner. +For more information on `rancher-monitoring` chart options, including options to set resource limits and requests, see [this page.](./configuration/helm-chart-options) # Windows Cluster Support +_Available as of v2.5.8_ + When deployed onto an RKE1 Windows cluster, Monitoring V2 will now automatically deploy a [windows-exporter](https://github.com/prometheus-community/windows_exporter) DaemonSet and set up a ServiceMonitor to collect metrics from each of the deployed Pods. This will populate Prometheus with `windows_` metrics that are akin to the `node_` metrics exported by [node_exporter](https://github.com/prometheus/node_exporter) for Linux hosts. To be able to fully deploy Monitoring V2 for Windows, all of your Windows hosts must have a minimum [wins](https://github.com/rancher/wins) version of v0.1.0. For more details on how to upgrade wins on existing Windows hosts, refer to the section on [Windows cluster support for Monitoring V2.](./windows-clusters) -# Using Monitoring -Installing `rancher-monitoring` makes the following dashboards available from the Rancher UI. - -> **Note:** If you want to set up Alertmanager, Grafana or Ingress, it has to be done with the settings on the Helm chart deployment. It's problematic to create Ingress outside the deployment. 
- -### Grafana UI - -[Grafana](https://grafana.com/grafana/) allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture. - -Rancher allows any users who are authenticated by Kubernetes and have access the Grafana service deployed by the Rancher Monitoring chart to access Grafana via the Rancher Dashboard UI. By default, all users who are able to access Grafana are given the [Viewer](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#viewer-role) role, which allows them to view any of the default dashboards deployed by Rancher. - -However, users can choose to log in to Grafana as an [Admin](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#admin-role) if necessary. The default Admin username and password for the Grafana instance will be `admin`/`prom-operator`, but alternative credentials can also be supplied on deploying or upgrading the chart. - -> **Persistent Dashboards:** To allow the Grafana dashboard to persist after it restarts, add the dashboard configuration JSON into a ConfigMap. ConfigMaps also allow the dashboards to be deployed with a GitOps or CD based approach. This allows the dashboard to be put under version control. For details, refer to [this section.](./persist-grafana) - -To see the Grafana UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Grafana. - -
Cluster Compute Resources Dashboard in Grafana
-![Cluster Compute Resources Dashboard in Grafana]({{}}/img/rancher/cluster-compute-resources-dashboard.png) - -
Default Dashboards in Grafana
-![Default Dashboards in Grafana]({{}}/img/rancher/grafana-default-dashboard.png) - -### Prometheus UI - -To see the Prometheus UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Graph.** - -
Prometheus Graph UI
-![Prometheus Graph UI]({{}}/img/rancher/prometheus-graph-ui.png) - -### Viewing the Prometheus Targets - -To see the Prometheus Targets, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Targets.** - -
Targets in the Prometheus UI
-![Prometheus Targets UI]({{}}/img/rancher/prometheus-targets-ui.png) - -### Viewing the PrometheusRules - -To see the PrometheusRules, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Rules.** - -
Rules in the Prometheus UI
-![PrometheusRules UI]({{}}/img/rancher/prometheus-rules-ui.png) - -For more information on PrometheusRules in Rancher, see [this page.](./configuration/prometheusrules) - -### Viewing Active Alerts in Alertmanager - -When `rancher-monitoring` is installed, the Prometheus Alertmanager UI is deployed. - -The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts. - -In the Alertmanager UI, you can view your alerts and the current Alertmanager configuration. - -To see the PrometheusRules, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Alertmanager.** - -**Result:** The Alertmanager UI opens in a new tab. For help with configuration, refer to the [official Alertmanager documentation.](https://prometheus.io/docs/alerting/latest/alertmanager/) - -For more information on configuring Alertmanager in Rancher, see [this page.](./configuration/alertmanager) - -
The Alertmanager UI
-![Alertmanager UI]({{}}/img/rancher/alertmanager-ui.png) - -# Uninstall Monitoring - -1. From the **Cluster Explorer,** click Apps & Marketplace. -1. Click **Installed Apps.** -1. Go to the `cattle-monitoring-system` namespace and check the boxes for `rancher-monitoring-crd` and `rancher-monitoring`. -1. Click **Delete.** -1. Confirm **Delete.** - -**Result:** `rancher-monitoring` is uninstalled. - -> **Note on Persistent Grafana Dashboards:** For users who are using Monitoring V2 v9.4.203 or below, uninstalling the Monitoring chart will delete the cattle-dashboards namespace, which will delete all persisted dashboards, unless the namespace is marked with the annotation `helm.sh/resource-policy: "keep"`. This annotation is added by default in Monitoring V2 v14.5.100+ but can be manually applied on the cattle-dashboards namespace before an uninstall if an older version of the Monitoring chart is currently installed onto your cluster. - -# Setting Resource Limits and Requests - -The resource requests and limits can be configured when installing `rancher-monitoring`. - -The default values are in the [values.yaml](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/values.yaml) in the `rancher-monitoring` Helm chart. - -The default values in the table below are the minimum required resource limits and requests. - -| Resource Name | Memory Limit | CPU Limit | Memory Request | CPU Request | -| ------------- | ------------ | ----------- | ---------------- | ------------------ | -| alertmanager | 500Mi | 1000m | 100Mi | 100m | -| grafana | 200Mi | 200m | 100Mi | 100m | -| kube-state-metrics subchart | 200Mi | 100m | 130Mi | 100m | -| prometheus-node-exporter subchart | 50Mi | 200m | 30Mi | 100m | -| prometheusOperator | 500Mi | 200m | 100Mi | 100m | -| prometheus | 2500Mi | 1000m | 1750Mi | 750m | -| **Total** | **3950Mi** | **2700m** | **2210Mi** | **1250m** | - -At least 50Gi storage is recommended. 
# Known Issues There is a [known issue](https://github.com/rancher/rancher/issues/28787#issuecomment-693611821) that K3s clusters require more default memory. If you are enabling monitoring on a K3s cluster, we recommend setting `prometheus.prometheusSpec.resources.memory.limit` to 2500Mi and `prometheus.prometheusSpec.resources.memory.request` to 1750Mi. + +For tips on debugging high memory usage, see [this page.](./memory-usage) diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/_index.md index 3571737590d..39adeaf073a 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/configuration/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/_index.md @@ -1,96 +1,61 @@ --- title: Configuration -weight: 3 +weight: 5 aliases: - - /rancher/v2.6/en/monitoring-alerting/v2.5/configuration + - /rancher/v2.5/en/monitoring-alerting/v2.5/configuration --- -This page captures some of the most important options for configuring the custom resources for monitoring. +This page captures some of the most important options for configuring Monitoring V2 in the Rancher UI. For information on configuring custom scrape targets and rules for Prometheus, please refer to the upstream documentation for the [Prometheus Operator.](https://github.com/prometheus-operator/prometheus-operator) Some of the most important custom resources are explained in the Prometheus Operator [design documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md) The Prometheus Operator documentation can also help you set up RBAC, Thanos, or custom configuration. 
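As a rough illustration of the K3s memory recommendation under Known Issues, a Helm values override might look like the sketch below. It assumes the chart accepts the standard Kubernetes `resources` layout (`limits` and `requests`) under `prometheus.prometheusSpec`, which is how the upstream kube-prometheus-stack chart structures this field; the dotted option names quoted in the Known Issues text are written differently, so verify the exact keys against the chart's `values.yaml` before relying on this.

```yaml
# Sketch of a values override for the rancher-monitoring chart on K3s.
# Assumption: resources follow the standard Kubernetes limits/requests layout.
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 2500Mi   # memory limit recommended for K3s clusters
      requests:
        memory: 1750Mi   # memory request recommended for K3s clusters
```

Such a file would typically be supplied when installing or upgrading the chart, for example through the chart's YAML editor in the Rancher UI.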
-- [Configuring Prometheus](#configuring-prometheus) -- [Configuring Targets with ServiceMonitors and PodMonitors](#configuring-targets-with-servicemonitors-and-podmonitors) - - [ServiceMonitors](#servicemonitors) - - [PodMonitors](#podmonitors) -- [PrometheusRules](#prometheusrules) -- [Alertmanager Config](#alertmanager-config) -- [Trusted CA for Notifiers](#trusted-ca-for-notifiers) -- [Additional Scrape Configurations](#additional-scrape-configurations) -- [Examples](#examples) +This section assumes that you understand how the Prometheus Operator’s custom resources work together. For more information, see [this section.] -# Configuring Prometheus +# Setting Resource Limits and Requests -The primary way that users will be able to customize this feature for specific Monitoring and Alerting use cases is by creating and/or modifying ConfigMaps, Secrets, and Custom Resources pertaining to this deployment. +The resource requests and limits for the monitoring application can be configured when installing `rancher-monitoring`. For more information about the default limits, see [this page.](./resource-limits) -Prometheus Operator introduces a set of [Custom Resource Definitions](https://github.com/prometheus-operator/prometheus-operator#customresourcedefinitions) that allow users to deploy and manage Prometheus and Alertmanager instances by creating and modifying those custom resources on a cluster. +# Prometheus Configuration -Prometheus Operator will automatically update your Prometheus configuration based on the live state of these custom resources. +It is usually not necessary to directly edit the Prometheus custom resource. -There are also certain special types of ConfigMaps/Secrets such as those corresponding to Grafana Dashboards, Grafana Datasources, and Alertmanager Configs that will automatically update your Prometheus configuration via sidecar proxies that observe the live state of those resources within your cluster. 
+Instead, to configure Prometheus to scrape custom metrics, you will only need to create a new ServiceMonitor or PodMonitor to configure Prometheus to scrape additional metrics. -By default, a set of these resources (curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project) are deployed onto your cluster as part of installing the Rancher Monitoring Application to set up a basic Monitoring / Alerting stack. For more information how to configure custom targets, alerts, notifiers, and dashboards after deploying the chart, see below. -# Configuring Targets with ServiceMonitors and PodMonitors +### ServiceMonitor and PodMonitor Configuration -Customizing the scrape configuration used by Prometheus to determine which resources to scrape metrics from will primarily involve creating / modifying the following resources within your cluster: +For details, see [this page.](./) -### ServiceMonitors +### Advanced Prometheus Configuration -This CRD declaratively specifies how groups of Kubernetes services should be monitored. Any Services in your cluster that match the labels located within the ServiceMonitor `selector` field will be monitored based on the `endpoints` specified on the ServiceMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) provided by Prometheus Operator. +Link to ‘how monitoring works’ for the section about the Prometheus CR. 
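To make the ServiceMonitor approach described above more concrete, here is a minimal sketch of a ServiceMonitor. The names `example-app` and `example-namespace`, the label `app: example-app`, and the port name `metrics` are hypothetical placeholders, not values taken from the Rancher chart.

```yaml
# Minimal ServiceMonitor sketch; all names and labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example-namespace
spec:
  # Services whose labels match this selector are monitored.
  selector:
    matchLabels:
      app: example-app
  # Each entry describes how to scrape a matching Service.
  endpoints:
    - port: metrics      # name of the Service port that exposes metrics
      path: /metrics     # HTTP path to scrape
      interval: 30s      # how often to scrape
```

A PodMonitor has the same overall shape, except that it selects Pods directly and lists its scrape settings under `podMetricsEndpoints` instead of `endpoints`.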
-For more information about how ServiceMonitors work, refer to the [Prometheus Operator documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md) +For more information about directly editing the Prometheus custom resource, which may be helpful in advanced use cases, see [this page.](./advanced/prometheus) -### PodMonitors +# Alertmanager Configuration -This CRD declaratively specifies how group of pods should be monitored. Any Pods in your cluster that match the labels located within the PodMonitor `selector` field will be monitored based on the `podMetricsEndpoints` specified on the PodMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitorspec) provided by Prometheus Operator. +The Alertmanager custom resource usually doesn't need to be edited directly. For most common use cases, you can manage alerts by updating Routes and Receivers. -# PrometheusRules +Routes and receivers are part of the configuration of the alertmanager custom resource. In the Rancher UI, Routes and Receivers are not true custom resources, but pseudo-custom resources that are mapped to sections within the Alertmanager custom resource. -This CRD defines a group of Prometheus alerting and/or recording rules. +When routes and receivers are updated, the monitoring application will automatically update Alertmanager to reflect those changes. -For information on configuring PrometheusRules, refer to [this page.](./prometheusrules) +For some advanced use cases, you may want to configure alertmanager directly. 
For more information, refer to [this page.](./advanced/alertmanager) -# Alertmanager Config -For information on configuring the Alertmanager, refer to [this page.](./alertmanager) -# Trusted CA for Notifiers +### Receivers -If you need to add a trusted CA to your notifier, follow these steps: +[link to section of how monitoring works that explains receivers] -1. Create the `cattle-monitoring-system` namespace. -1. Add your trusted CA secret to the `cattle-monitoring-system` namespace. -1. Deploy or upgrade the `rancher-monitoring` Helm chart. In the chart options, reference the secret in **Alerting > Additional Secrets.** +For details on how to configure receivers, see [this page.](./receiver) +### Routes +[link to section of how monitoring works that explains routes] -**Result:** The default Alertmanager custom resource will have access to your trusted CA. +The route needs to refer to a receiver that has already been configured. -# Additional Scrape Configurations +### Advanced -If the scrape configuration you want cannot be specified via a ServiceMonitor or PodMonitor at the moment, you can provide an `additionalScrapeConfigSecret` on deploying or upgrading `rancher-monitoring`. +Link to ‘how monitoring works’ for the section about the alertmanager CR. -A [scrape_config section](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job. - -An example of where this might be used is with Istio. 
For more information, see [this section.](https://rancher.com/docs/rancher/v2.6/en/istio/v2.5/configuration-reference/selectors-and-scrape) - -# Examples - -### ServiceMonitor - -An example ServiceMonitor custom resource can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml) - -### PodMonitor - -An example PodMonitor can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/example-app-pod-monitor.yaml) An example Prometheus resource that refers to it can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/prometheus-pod-monitor.yaml) - -### PrometheusRule - -For users who are familiar with Prometheus, a PrometheusRule contains the alerting and recording rules that you would normally place in a [Prometheus rule file](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/). - -For a more fine-grained application of PrometheusRules within your cluster, the ruleSelector field on a Prometheus resource allows you to select which PrometheusRules should be loaded onto Prometheus based on the labels attached to the PrometheusRules resources. 
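As an illustration of the PrometheusRule description above, the following sketch defines one recording rule and one alerting rule in a single rule group. The resource name, the `team: example` label, the rule names, and the thresholds are hypothetical; whether a given Prometheus loads this resource depends on its `ruleSelector`, which may or may not match the labels shown here.

```yaml
# PrometheusRule sketch with one recording rule and one alerting rule.
# Names, labels, and thresholds are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: cattle-monitoring-system
  labels:
    team: example        # a ruleSelector on the Prometheus resource could match this label
spec:
  groups:
    - name: example.rules
      rules:
        # Recording rule: precompute the number of healthy targets per job.
        - record: job:up:sum
          expr: sum by (job) (up)
        # Alerting rule: fire when a target has been unreachable for 5 minutes.
        - alert: ExampleTargetDown
          expr: up == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: A scrape target has been unreachable for 5 minutes.
```

Keeping recording and alerting rules in the same group is only a convenience for this sketch; they can equally live in separate groups or separate PrometheusRule resources.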
- -An example PrometheusRule is on [this page.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md) - -### Alertmanager Config - -For an example configuration, refer to [this section.](./alertmanager/#example-alertmanager-config) \ No newline at end of file +For more information about directly editing the Alertmanager custom resource, which may be helpful in advanced use cases, see [this page.](./advanced/alertmanager) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/_index.md new file mode 100644 index 00000000000..45452f0ab33 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/_index.md @@ -0,0 +1,4 @@ +--- +title: Advanced Configuration +weight: 5 +--- \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/alertmanager/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/alertmanager/_index.md new file mode 100644 index 00000000000..14d5f305df7 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/alertmanager/_index.md @@ -0,0 +1,40 @@ +--- +title: Alertmanager Configuration +weight: 1 +--- + +It is usually not necessary to directly edit the Alertmanager custom resource. For most use cases, you will only need to edit the Receivers and Routes to configure notifications. + +When Receivers and Routes are updated, the monitoring application will automatically update the Alertmanager custom resource to be consistent with those changes. + +> This section assumes familiarity with how monitoring components work together. 
For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + +# About the Alertmanager Custom Resource + +By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret. + +You may want to edit the Alertmanager custom resource if you would like to take advantage of advanced options that are not exposed in the Rancher UI forms, such as the ability to create a routing tree structure that is more than two levels deep. + +It is also possible to create more than one Alertmanager in a cluster, which may be useful if you want to implement namespace-scoped monitoring. In this case, you should manage the Alertmanager custom resources using the same underlying Alertmanager Config Secret. + +### Deeply Nested Routes + +While the Rancher UI only supports a routing tree that is two levels deep, you can configure more deeply nested routing structures by editing the Alertmanager YAML. + +### Multiple Alertmanager Replicas + +As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster. The replicas can all be managed using the same underlying Alertmanager Config Secret. + +This Secret should be updated or modified any time you want to: + +- Add in new notifiers or receivers +- Change the alerts that should be sent to specific notifiers or receivers +- Change the group of alerts that are sent out + +By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster. + +By default, the Alertmanager Config Secret created by Rancher will never be modified or deleted on an upgrade or uninstall of the `rancher-monitoring` chart. 
This restriction prevents users from losing or overwriting their alerting configuration when executing operations on the chart. + +For more information on what fields can be specified in the Alertmanager Config Secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/) + +The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheus/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheus/_index.md new file mode 100644 index 00000000000..5a0c898f710 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheus/_index.md @@ -0,0 +1,21 @@ +--- +title: Prometheus Configuration +weight: 1 +aliases: + - /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/prometheusrules + - /rancher/v2.5/en/monitoring-alerting/configuration/prometheusrules + - /rancher/v2.5/en/monitoring-alerting/configuration/advanced/prometheusrules +--- + +It is usually not necessary to directly edit the Prometheus custom resource because the monitoring application automatically updates it based on changes to ServiceMonitors and PodMonitors. + +> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + + + + +# About the Prometheus Custom Resource +- When the Prometheus Operator observes the Prometheus custom resource (CR), it creates `prometheus-rancher-monitoring-prometheus`, the Prometheus deployment that is built from the configuration in that CR. +- The Prometheus CR is where details such as which Alertmanagers are connected to Prometheus, the external URLs, and other settings that Prometheus needs are configured.
Rancher builds this CR for you. It has fields for PodMonitor and ServiceMonitor selectors, so you can technically filter them to include only the monitors in a certain namespace. +- Monitoring V2 supports only one Prometheus per cluster because project-level monitoring is not supported. However, you may want to edit the Prometheus CR if you want to limit the namespaces. +- The Prometheus CR also has the rules and routes in it. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/prometheusrules/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheusrules/_index.md similarity index 79% rename from content/rancher/v2.6/en/monitoring-alerting/configuration/prometheusrules/_index.md rename to content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheusrules/_index.md index 587f2aadcfe..cd94a0532a2 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/configuration/prometheusrules/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/advanced/prometheusrules/_index.md @@ -1,45 +1,17 @@ --- -title: PrometheusRules -weight: 2 -aliases: - - /rancher/v2.6/en/monitoring-alerting/v2.5/configuration/prometheusrules +title: Configuring PrometheusRules +weight: 3 --- A PrometheusRule defines a group of Prometheus alerting and/or recording rules. -- [About PrometheusRule Custom Resources](#about-prometheusrule-custom-resources) -- [Connecting Routes and PrometheusRules](#connecting-routes-and-prometheusrules) -- [Creating PrometheusRules in the Rancher UI](#creating-prometheusrules-in-the-rancher-ui) -- [Configuration](#configuration) - - [Rule Group](#rule-group) - - [Alerting Rules](#alerting-rules) - - [Recording Rules](#recording-rules) +> This section assumes familiarity with how monitoring components work together.
For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) -### About PrometheusRule Custom Resources - -Prometheus rule files are held in PrometheusRule custom resources. - -A PrometheusRule allows you to define one or more RuleGroups. Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields: - -- The name of the new alert or record -- A PromQL (Prometheus query language) expression for the new alert or record -- Labels that should be attached to the alert or record that identify it (e.g. cluster name or severity) -- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules. - -Alerting rules define alert conditions based on PromQL queries. Recording rules precompute frequently needed or computationally expensive queries at defined intervals. - -For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec) - -Use the label selector field `ruleSelector` in the Prometheus object to define the rule files that you want to be mounted into Prometheus. 
- -For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) - -### Connecting Routes and PrometheusRules - -When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Prometheus to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label. ### Creating PrometheusRules in the Rancher UI +_Available as of v2.5.4_ + > **Prerequisite:** The monitoring application needs to be installed. To create rule groups in the Rancher UI, @@ -52,8 +24,27 @@ To create rule groups in the Rancher UI, **Result:** Alerts can be configured to send notifications to the receiver(s). +### About the PrometheusRule Custom Resource + +When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label. + +Prometheus rule files are held in PrometheusRule custom resources. A PrometheusRule allows you to define one or more RuleGroups. Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields: + +- The name of the new alert or record +- A PromQL expression for the new alert or record +- Labels that should be attached to the alert or record that identify it (e.g. 
cluster name or severity) +- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, runbook URL, etc.). This field is not required for recording rules. + +For more information on what fields can be specified, please look at the [Prometheus Operator spec.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#prometheusrulespec) + +Use the label selector field `ruleSelector` in the Prometheus object to define the rule files that you want to be mounted into Prometheus. + +For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) + # Configuration +{{% tabs %}} +{{% tab "Rancher v2.5.4" %}} Rancher v2.5.4 introduced the capability to configure PrometheusRules by filling out forms in the Rancher UI. @@ -89,3 +80,8 @@ Rancher v2.5.4 introduced the capability to configure PrometheusRules by filling | PromQL Expression | The PromQL expression to evaluate. Prometheus will evaluate the current value of this PromQL expression on every evaluation cycle and the result will be recorded as a new set of time series with the metric name as given by 'record'. For more information about expressions, refer to the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/) or our [example PromQL expressions.](../expression) | | Labels | Labels to add or overwrite before storing the result. | +{{% /tab %}} +{{% tab "Rancher v2.5.0-v2.5.3" %}} +For Rancher v2.5.0-v2.5.3, PrometheusRules must be configured in YAML. 
For examples, refer to the Prometheus documentation on [recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [alerting rules.](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) +{{% /tab %}} +{{% /tabs %}} \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/examples/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/examples/_index.md new file mode 100644 index 00000000000..9302b40848e --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/examples/_index.md @@ -0,0 +1,25 @@ +--- +title: Examples +weight: 5 +--- + + +### ServiceMonitor + +An example ServiceMonitor custom resource can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml) + +### PodMonitor + +An example PodMonitor can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/example-app-pod-monitor.yaml) An example Prometheus resource that refers to it can be found [here.](https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/getting-started/prometheus-pod-monitor.yaml) + +### PrometheusRule + +For users who are familiar with Prometheus, a PrometheusRule contains the alerting and recording rules that you would normally place in a [Prometheus rule file](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/). + +For a more fine-grained application of PrometheusRules within your cluster, the ruleSelector field on a Prometheus resource allows you to select which PrometheusRules should be loaded onto Prometheus based on the labels attached to the PrometheusRules resources. 
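As a minimal sketch of how the `ruleSelector` relates to labels on a PrometheusRule, the two resources below show a rule that carries a label and a Prometheus resource that selects on it. All names, label values, and the alert expression are hypothetical and chosen only for illustration; the namespace is assumed to be `cattle-monitoring-system`. In Rancher Monitoring the Prometheus resource is generated for you, so the second document is there only to show the shape of the selector.

```yaml
# Hypothetical PrometheusRule: names, labels, and thresholds are illustrative only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: cattle-monitoring-system
  labels:
    role: example-alert-rules   # the ruleSelector below matches on this label
spec:
  groups:
    - name: example.rules
      rules:
        - alert: ExampleTargetDown
          expr: up == 0          # fires when a scrape target is unreachable
          for: 5m
          labels:
            severity: warning
            team: front-end      # labels like this one drive routing in Alertmanager
          annotations:
            summary: "A scrape target has been down for 5 minutes."
---
# Fragment of a Prometheus resource, not a complete spec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus
  namespace: cattle-monitoring-system
spec:
  ruleSelector:
    matchLabels:
      role: example-alert-rules  # only PrometheusRules with this label are loaded
```

A PrometheusRule without the `role: example-alert-rules` label would not be picked up by this particular selector.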
+ +An example PrometheusRule is on [this page.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/alerting.md) + +### Alertmanager Config + +For an example configuration, refer to [this section.](./alertmanager/#example-alertmanager-config) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/helm-chart-options/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/helm-chart-options/_index.md new file mode 100644 index 00000000000..c1549407210 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/helm-chart-options/_index.md @@ -0,0 +1,77 @@ +--- +title: Helm Chart Options +weight: 8 +--- + +- [Configuring Resource Limits and Requests](#configuring-resource-limits-and-requests) +- [Trusted CA for Notifiers](#trusted-ca-for-notifiers) +- [Additional Scrape Configurations](#additional-scrape-configurations) +- [Configuring Applications Packaged within Monitoring V2](#configuring-applications-packaged-within-monitoring-v2) +- [Increase the Replicas of Alertmanager](#increase-the-replicas-of-alertmanager) +- [Configuring the Namespace for a Persistent Grafana Dashboard](#configuring-the-namespace-for-a-persistent-grafana-dashboard) + + +# Configuring Resource Limits and Requests + +The resource requests and limits can be configured when installing `rancher-monitoring`. + +The default values are in the [values.yaml](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/values.yaml) in the `rancher-monitoring` Helm chart. + +The default values in the table below are the minimum required resource limits and requests. 
+ +| Resource Name | Memory Limit | CPU Limit | Memory Request | CPU Request | +| ------------- | ------------ | ----------- | ---------------- | ------------------ | +| alertmanager | 500Mi | 1000m | 100Mi | 100m | +| grafana | 200Mi | 200m | 100Mi | 100m | +| kube-state-metrics subchart | 200Mi | 100m | 130Mi | 100m | +| prometheus-node-exporter subchart | 50Mi | 200m | 30Mi | 100m | +| prometheusOperator | 500Mi | 200m | 100Mi | 100m | +| prometheus | 2500Mi | 1000m | 1750Mi | 750m | +| **Total** | **3950Mi** | **2700m** | **2210Mi** | **1250m** | + +At least 50Gi of storage is recommended. + + +# Trusted CA for Notifiers + +If you need to add a trusted CA to your notifier, follow these steps: + +1. Create the `cattle-monitoring-system` namespace. +1. Add your trusted CA secret to the `cattle-monitoring-system` namespace. +1. Deploy or upgrade the `rancher-monitoring` Helm chart. In the chart options, reference the secret in **Alerting > Additional Secrets.** + +**Result:** The default Alertmanager custom resource will have access to your trusted CA. + + +# Additional Scrape Configurations + +If the scrape configuration you want cannot be specified via a ServiceMonitor or PodMonitor at the moment, you can provide an `additionalScrapeConfigSecret` when deploying or upgrading `rancher-monitoring`. + +A [scrape_config section](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job. + +An example of where this might be used is with Istio. For more information, see [this section.](https://rancher.com/docs/rancher/v2.5/en/istio/v2.5/configuration-reference/selectors-and-scrape) + + +# Configuring Applications Packaged within Monitoring V2 + +Monitoring V2 deploys kube-state-metrics and node-exporter. The node-exporter is deployed as a DaemonSet.
In the `values.yaml` of the Monitoring V2 Helm chart, each of these components is deployed as a subchart. + +Grafana, which is not managed by Prometheus, is also deployed. + +Subcharts such as kube-state-metrics accept many more values than the ones exposed in the top-level chart. + +In the top-level chart, you can add values that override the values that exist in a subchart. + +### Increase the Replicas of Alertmanager + +As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster. The replicas can all be managed using the same underlying Alertmanager Config Secret. For more information on the Alertmanager Config Secret, refer to [this section.](../configuration/advanced/alertmanager/#multiple-alertmanager-replicas) + +### Configuring the Namespace for a Persistent Grafana Dashboard + +To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set this value in the `rancher-monitoring` Helm chart: + +``` +grafana.sidecar.dashboards.searchNamespace=ALL +``` + +Note that the RBAC roles exposed by the Monitoring chart to add Grafana Dashboards are still restricted to giving permissions for users to add dashboards in the namespace defined in `grafana.dashboards.namespace`, which defaults to `cattle-dashboards`.
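For reference, the same setting can be written in nested form when chart values are supplied as YAML rather than as a dotted key. This is only the standard Helm mapping of the key shown above, not an additional option:

```yaml
# Nested equivalent of grafana.sidecar.dashboards.searchNamespace=ALL
grafana:
  sidecar:
    dashboards:
      searchNamespace: ALL   # watch ConfigMaps in every namespace
```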
\ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/alertmanager/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/receiver/_index.md similarity index 67% rename from content/rancher/v2.6/en/monitoring-alerting/configuration/alertmanager/_index.md rename to content/rancher/v2.6/en/monitoring-alerting/configuration/receiver/_index.md index 5acd3f7cce2..807388dc984 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/configuration/alertmanager/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/receiver/_index.md @@ -1,17 +1,19 @@ --- -title: Alertmanager +title: Receiver Configuration +shortTitle: Receivers weight: 1 aliases: - - /rancher/v2.6/en/monitoring-alerting/v2.5/configuration/alertmanager - - rancher/v2.6/en/monitoring-alerting/legacy/notifiers/ - - /rancher/v2.6/en/cluster-admin/tools/notifiers - - /rancher/v2.6/en/cluster-admin/tools/alerts + - /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/alertmanager + - rancher/v2.5/en/monitoring-alerting/legacy/notifiers/ + - /rancher/v2.5/en/cluster-admin/tools/notifiers + - /rancher/v2.5/en/cluster-admin/tools/alerts + - /rancher/v2.5/en/monitoring-alerting/configuration/alertmanager --- The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) Secret contains the configuration of an Alertmanager instance that sends out notifications based on alerts it receives from Prometheus. -- [Overview](#overview) - - [Connecting Routes and PrometheusRules](#connecting-routes-and-prometheusrules) +> This section assumes familiarity with how monitoring components work together. 
For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + - [Creating Receivers in the Rancher UI](#creating-receivers-in-the-rancher-ui) - [Receiver Configuration](#receiver-configuration) - [Slack](#slack) @@ -26,32 +28,13 @@ The [Alertmanager Config](https://prometheus.io/docs/alerting/latest/configurati - [Receiver](#receiver) - [Grouping](#grouping) - [Matching](#matching) -- [Example Alertmanager Configs](#example-alertmanager-configs) +- [Configuring Multiple Receivers](#configuring-multiple-receivers) +- [Example Alertmanager Config](../examples/#example-alertmanager-config) - [Example Route Config for CIS Scan Alerts](#example-route-config-for-cis-scan-alerts) - -# Overview - -By default, Rancher Monitoring deploys a single Alertmanager onto a cluster that uses a default Alertmanager Config Secret. As part of the chart deployment options, you can opt to increase the number of replicas of the Alertmanager deployed onto your cluster that can all be managed using the same underlying Alertmanager Config Secret. - -This Secret should be updated or modified any time you want to: - -- Add in new notifiers or receivers -- Change the alerts that should be sent to specific notifiers or receivers -- Change the group of alerts that are sent out - -> By default, you can either choose to supply an existing Alertmanager Config Secret (i.e. any Secret in the `cattle-monitoring-system` namespace) or allow Rancher Monitoring to deploy a default Alertmanager Config Secret onto your cluster. By default, the Alertmanager Config Secret created by Rancher will never be modified / deleted on an upgrade / uninstall of the `rancher-monitoring` chart to prevent users from losing or overwriting their alerting configuration when executing operations on the chart. 
- -For more information on what fields can be specified in this secret, please look at the [Prometheus Alertmanager docs.](https://prometheus.io/docs/alerting/latest/alertmanager/) - -The full spec for the Alertmanager configuration file and what it takes in can be found [here.](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file) - -For more information, refer to the [official Prometheus documentation about configuring routes.](https://www.prometheus.io/docs/alerting/latest/configuration/#route) - -### Connecting Routes and PrometheusRules - -When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Prometheus to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label. +- [Trusted CA for Notifiers](#trusted-ca-for-notifiers) # Creating Receivers in the Rancher UI +_Available as of v2.5.4_ > **Prerequisites:** > @@ -81,6 +64,17 @@ Currently the Rancher Alerting Drivers app provides access to the following inte - Microsoft Teams, based on the [prom2teams](https://github.com/idealista/prom2teams) driver - SMS, based on the [Sachet](https://github.com/messagebird/sachet) driver +### Changes in Rancher v2.5.8 + +Rancher v2.5.8 added Microsoft Teams and SMS as configurable receivers in the Rancher UI. + +### Changes in Rancher v2.5.4 + +Rancher v2.5.4 introduced the capability to configure receivers by filling out forms in the Rancher UI. 
+ +{{% tabs %}} +{{% tab "Rancher v2.5.8+" %}} + The following types of receivers can be configured in the Rancher UI: - Slack @@ -233,35 +227,98 @@ name: telegram-receiver-1 url http://rancher-alerting-drivers-sachet.ns-1.svc:9876/alert ``` -# Route Configuration + -### Receiver -The route needs to refer to a [receiver](#receiver-configuration) that has already been configured. +{{% /tab %}} +{{% tab "Rancher v2.5.4-2.5.7" %}} -### Grouping +The following types of receivers can be configured in the Rancher UI: -| Field | Default | Description | -|-------|--------------|---------| -| Group By | N/a | The labels by which incoming alerts are grouped together. For example, `[ group_by: '[' , ... ']' ]` Multiple alerts coming in for labels such as `cluster=A` and `alertname=LatencyHigh` can be batched into a single group. To aggregate by all possible labels, use the special value `'...'` as the sole label name, for example: `group_by: ['...']` Grouping by `...` effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping. | -| Group Wait | 30s | How long to wait to buffer alerts of the same group before sending initially. | -| Group Interval | 5m | How long to wait before sending an alert that has been added to a group of alerts for which an initial notification has already been sent. | -| Repeat Interval | 4h | How long to wait before re-sending a given alert that has already been sent. | +- Slack +- Email +- PagerDuty +- Opsgenie +- Webhook +- Custom -### Matching +The custom receiver option can be used to configure any receiver in YAML that cannot be configured by filling out the other forms in the Rancher UI. -The **Match** field refers to a set of equality matchers used to identify which alerts to send to a given Route based on labels defined on that alert. 
When you add key-value pairs to the Rancher UI, they correspond to the YAML in this format: +### Slack {#slack-254-257} -```yaml -match: - [ : , ... ] -``` +| Field | Type | Description | +|------|--------------|------| +| URL | String | Enter your Slack webhook URL. For instructions to create a Slack webhook, see the [Slack documentation.](https://get.slack.help/hc/en-us/articles/115005265063-Incoming-WebHooks-for-Slack) | +| Default Channel | String | Enter the name of the channel that you want to send alert notifications in the following format: `#`. | +| Proxy URL | String | Proxy for the webhook notifications. | +| Enable Send Resolved Alerts | Bool | Whether to send a follow-up notification if an alert has been resolved (e.g. [Resolved] High CPU Usage). | -The **Match Regex** field refers to a set of regex-matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs in the Rancher UI, they correspond to the YAML in this format: +### Email {#email-254-257} + +| Field | Type | Description | +|------|--------------|------| +| Default Recipient Address | String | The email address that will receive notifications. | +| Enable Send Resolved Alerts | Bool | Whether to send a follow-up notification if an alert has been resolved (e.g. [Resolved] High CPU Usage). | + +SMTP options: + +| Field | Type | Description | +|------|--------------|------| +| Sender | String | Enter an email address available on your SMTP mail server that you want to send the notification from. | +| Host | String | Enter the IP address or hostname for your SMTP server. Example: `smtp.email.com`. | +| Use TLS | Bool | Use TLS for encryption. | +| Username | String | Enter a username to authenticate with the SMTP server. | +| Password | String | Enter a password to authenticate with the SMTP server. 
| + +### PagerDuty {#pagerduty-254-257} + +| Field | Type | Description | +|------|------|-------| +| Integration Type | String | `Events API v2` or `Prometheus`. | +| Default Integration Key | String | For instructions to get an integration key, see the [PagerDuty documentation.](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) | +| Proxy URL | String | Proxy for the PagerDuty notifications. | +| Enable Send Resolved Alerts | Bool | Whether to send a follow-up notification if an alert has been resolved (e.g. [Resolved] High CPU Usage). | + +### Opsgenie {#opsgenie-254-257} + +| Field | Description | +|------|-------------| +| API Key | For instructions to get an API key, refer to the [Opsgenie documentation.](https://docs.opsgenie.com/docs/api-key-management) | +| Proxy URL | Proxy for the Opsgenie notifications. | +| Enable Send Resolved Alerts | Whether to send a follow-up notification if an alert has been resolved (e.g. [Resolved] High CPU Usage). | + +Opsgenie Responders: + +| Field | Type | Description | +|-------|------|--------| +| Type | String | Schedule, Team, User, or Escalation. For more information on alert responders, refer to the [Opsgenie documentation.](https://docs.opsgenie.com/docs/alert-recipients-and-teams) | +| Send To | String | Id, Name, or Username of the Opsgenie recipient. | + +### Webhook {#webhook-1} + +| Field | Description | +|-------|--------------| +| URL | Webhook URL for the app of your choice. | +| Proxy URL | Proxy for the webhook notification. | +| Enable Send Resolved Alerts | Whether to send a follow-up notification if an alert has been resolved (e.g. [Resolved] High CPU Usage). | + +### Custom {#custom-254-257} + +The YAML provided here will be directly appended to your receiver within the Alertmanager Config Secret. 
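As a rough sketch of what such custom YAML might look like, the fragment below sends the same alerts to both Slack and PagerDuty from a single receiver. The webhook URL, channel name, and routing key are placeholders, and the assumption that the form accepts exactly this fragment has not been verified against the Rancher UI; the field names themselves come from the upstream Alertmanager configuration reference.

```yaml
# Placeholder values throughout; replace them with your own before use.
slack_configs:
  - api_url: https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE
    channel: '#example-alerts'
    send_resolved: true        # also notify when the alert is resolved
pagerduty_configs:
  - routing_key: example-integration-key   # Events API v2 integration key
    send_resolved: true
```

Putting both entries in one receiver only makes sense when every alert routed to it should reach both systems.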
+ +{{% /tab %}} +{{% tab "Rancher v2.5.0-2.5.3" %}} +The Alertmanager must be configured in YAML, as shown in these [examples.](#example-alertmanager-configs) +{{% /tab %}} +{{% /tabs %}} + +# Configuring Multiple Receivers + +By editing the forms in the Rancher UI, you can set up a Receiver resource with all the information Alertmanager needs to send alerts to your notification system. + +It is also possible to send alerts to multiple notification systems. One way is to configure the Receiver using custom YAML, in which case you can add the configuration for multiple notification systems, as long as you are sure that both systems should receive the same messages. + +You can also set up multiple receivers by using the `continue` option for a route, so that the alerts sent to a receiver continue being evaluated in the next level of the routing tree, which could contain another receiver. -```yaml -match_re: - [ : , ... ] -``` # Example Alertmanager Configs @@ -332,4 +389,9 @@ spec: # key: string ``` -For more information on enabling alerting for `rancher-cis-benchmark`, see [this section.]({{}}/rancher/v2.6/en/cis-scans/v2.5/#enabling-alerting-for-rancher-cis-benchmark) +For more information on enabling alerting for `rancher-cis-benchmark`, see [this section.]({{}}/rancher/v2.5/en/cis-scans/v2.5/#enabling-alerting-for-rancher-cis-benchmark) + + +# Trusted CA for Notifiers + +If you need to add a trusted CA to your notifier, follow the steps in [this section.](../helm-chart-options/#trusted-ca-for-notifiers) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/route/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/route/_index.md new file mode 100644 index 00000000000..94ded98878b --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/route/_index.md @@ -0,0 +1,74 @@ +--- +title: Route Configuration +shortTitle: Routes +weight: 5 +--- + +The route configuration is the section of 
the Alertmanager custom resource that controls how the alerts fired by Prometheus are grouped and filtered before they reach the receiver. + +When a Route is changed, the Prometheus Operator regenerates the Alertmanager custom resource to reflect the changes. + +For more information about configuring routes, refer to the [official Alertmanager documentation.](https://www.prometheus.io/docs/alerting/latest/configuration/#route) + +> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + +- [Route Restrictions](#route-restrictions) +- [Route Configuration](#route-configuration) + - [Receiver](#receiver) + - [Grouping](#grouping) + - [Matching](#matching) + +# Route Restrictions + + - Alertmanager proxies alerts for Prometheus based on a configuration. It has receivers and a routing tree. + - Receivers: One or more notification providers (Slack, PagerDuty, etc.) to send alerts to. + - Routing tree: A set of routes that filter alerts to certain receivers based on labels. + - Alerting drivers proxy alerts for Alertmanager to non-native receivers, such as Microsoft Teams and SMS. + - You can configure a routing tree to send an alert and then continue. Only routing trees with one root and one further level, for a tree of depth two, are supported, but a `continue` route technically lets you make the tree deeper. + - A receiver can cover one or more notification providers. If you know that every alert sent to Slack should also go to PagerDuty, you can put both configurations in the same receiver. + - SMS is now supported broadly, not just through Aliyun. + +# Route Configuration + +### Note on Labels and Annotations + +Labels should be used for identifying information that can affect the routing of notifications. Identifying information about the alert could consist of a container name, or the name of the team that should be notified.
+ +Annotations should be used for information that does not affect who receives the alert, such as a runbook url or error message. + +{{% tabs %}} +{{% tab "Rancher v2.5.4+" %}} + +### Receiver +The route needs to refer to a [receiver](#receiver-configuration) that has already been configured. + +### Grouping + +| Field | Default | Description | +|-------|--------------|---------| +| Group By | N/a | The labels by which incoming alerts are grouped together. For example, `[ group_by: '[' , ... ']' ]` Multiple alerts coming in for labels such as `cluster=A` and `alertname=LatencyHigh` can be batched into a single group. To aggregate by all possible labels, use the special value `'...'` as the sole label name, for example: `group_by: ['...']` Grouping by `...` effectively disables aggregation entirely, passing through all alerts as-is. This is unlikely to be what you want, unless you have a very low alert volume or your upstream notification system performs its own grouping. | +| Group Wait | 30s | How long to wait to buffer alerts of the same group before sending initially. | +| Group Interval | 5m | How long to wait before sending an alert that has been added to a group of alerts for which an initial notification has already been sent. | +| Repeat Interval | 4h | How long to wait before re-sending a given alert that has already been sent. | + +### Matching + +The **Match** field refers to a set of equality matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs to the Rancher UI, they correspond to the YAML in this format: + +```yaml +match: + [ : , ... ] +``` + +The **Match Regex** field refers to a set of regex-matchers used to identify which alerts to send to a given Route based on labels defined on that alert. When you add key-value pairs in the Rancher UI, they correspond to the YAML in this format: + +```yaml +match_re: + [ : , ... 
] +``` + +{{% /tab %}} +{{% tab "Rancher v2.5.0-2.5.3" %}} +The Alertmanager must be configured in YAML, as shown in this [example.](./examples/#alertmanager-config) +{{% /tab %}} +{{% /tabs %}} \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/servicemonitor-podmonitor/_index.md b/content/rancher/v2.6/en/monitoring-alerting/configuration/servicemonitor-podmonitor/_index.md new file mode 100644 index 00000000000..632e5c2a0af --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/configuration/servicemonitor-podmonitor/_index.md @@ -0,0 +1,31 @@ +--- +title: ServiceMonitor and PodMonitor Configuration +shortTitle: ServiceMonitors and PodMonitors +weight: 7 +--- + +ServiceMonitors and PodMonitors are both pseudo-CRDs that map the scrape configuration of the Prometheus custom resource. + +These configuration objects declaratively specify the endpoints that Prometheus will scrape metrics from. + +ServiceMonitors are more commonly used than PodMonitors, and we recommend them for most use cases. + +> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + +### ServiceMonitors + +This pseudo-CRD maps to a section of the Prometheus custom resource configuration. It declaratively specifies how groups of Kubernetes services should be monitored. + +When a ServiceMonitor is created, the Prometheus Operator updates the Prometheus scrape configuration to include the ServiceMonitor configuration. Then Prometheus begins scraping metrics from the endpoint defined in the ServiceMonitor. + +Any Services in your cluster that match the labels located within the ServiceMonitor `selector` field will be monitored based on the `endpoints` specified on the ServiceMonitor. 
For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#servicemonitor) provided by Prometheus Operator. + +For more information about how ServiceMonitors work, refer to the [Prometheus Operator documentation.](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/running-exporters.md) + +### PodMonitors + +This pseudo-CRD maps to a section of the Prometheus custom resource configuration. It declaratively specifies how groups of pods should be monitored. + +When a PodMonitor is created, the Prometheus Operator updates the Prometheus scrape configuration to include the PodMonitor configuration. Then Prometheus begins scraping metrics from the endpoint defined in the PodMonitor. + +Any Pods in your cluster that match the labels located within the PodMonitor `selector` field will be monitored based on the `podMetricsEndpoints` specified on the PodMonitor. For more information on what fields can be specified, please look at the [spec](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#podmonitorspec) provided by Prometheus Operator. diff --git a/content/rancher/v2.6/en/monitoring-alerting/dashboards/_index.md b/content/rancher/v2.6/en/monitoring-alerting/dashboards/_index.md new file mode 100644 index 00000000000..6c6b65bb883 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/dashboards/_index.md @@ -0,0 +1,86 @@ +--- +title: Built-in Dashboards +weight: 3 +--- + +- [Grafana UI](#grafana-ui) +- [Alertmanager UI](#alertmanager-ui) +- [Prometheus UI](#prometheus-ui) + +# Grafana UI + +[Grafana](https://grafana.com/grafana/) allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture. 
+ +To see the default dashboards for time series data visualization, go to the Grafana UI. + +### Customizing Grafana + +To view and customize the PromQL queries powering the Grafana dashboard, see [this page.](./customize-grafana) + +### Persistent Grafana Dashboards + +To create a persistent Grafana dashboard, see [this page.](./persist-grafana) + +### Access to Grafana + +For information about role-based access control for Grafana, see [this section.](./rbac/#role-based-access-control-for-grafana) + + +# Alertmanager UI + +When `rancher-monitoring` is installed, the Prometheus Alertmanager UI is deployed, allowing you to view your alerts and the current Alertmanager configuration. + +> This section assumes familiarity with how monitoring components work together. For more information about Alertmanager, see [this section.](../how-monitoring-works/#how-alertmanager-works) + + +### Accessing the Alertmanager UI + +The Alertmanager UI lets you see the most recently fired alerts. + +> **Prerequisite:** The `rancher-monitoring` application must be installed. + +To see the Alertmanager UI, go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Alertmanager.** + +**Result:** The Alertmanager UI opens in a new tab. For help with configuration, refer to the [official Alertmanager documentation.](https://prometheus.io/docs/alerting/latest/alertmanager/) + +For more information on configuring Alertmanager in Rancher, see [this page.](./configuration/alertmanager) + +
The Alertmanager UI
+![Alertmanager UI]({{}}/img/rancher/alertmanager-ui.png) + + +### Viewing Default Alerts + +To see alerts that are fired by default, go to the [Alertmanager UI](./alertmanager-ui) and click **Expand all groups.** + + +# Prometheus UI + +By default, the [kube-state-metrics service](https://github.com/kubernetes/kube-state-metrics) provides a wealth of information about CPU and memory utilization to the monitoring application. These metrics cover Kubernetes resources across namespaces. This means that in order to see resource metrics for a service, you don't need to create a new ServiceMonitor for it. Because the data is already in the time series database, you can go to the Prometheus UI and run a PromQL query to get the information. The same query can be used to configure a Grafana dashboard to show a graph of those metrics over time. + +To see the Prometheus UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Graph.** + +
Prometheus Graph UI
+![Prometheus Graph UI]({{}}/img/rancher/prometheus-graph-ui.png) + +### Viewing the Prometheus Targets + +To see what services you are monitoring, you will need to see your targets. Targets are set up by ServiceMonitors and PodMonitors as sources to scrape metrics from. You won't need to directly edit targets, but the Prometheus UI can be useful for giving you an overview of all of the sources of metrics that are being scraped. + +To see the Prometheus Targets, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Targets.** + +
Targets in the Prometheus UI
+![Prometheus Targets UI]({{}}/img/rancher/prometheus-targets-ui.png) + +### Viewing the PrometheusRules + +When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive a certain Alert. + +To see the PrometheusRules, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Prometheus Rules.** + +You can also see the rules in the Prometheus UI: + +
Rules in the Prometheus UI
+![PrometheusRules UI]({{}}/img/rancher/prometheus-rules-ui.png) + +For more information on configuring PrometheusRules in Rancher, see [this page.](./configuration/prometheusrules) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/configuration/expression/_index.md b/content/rancher/v2.6/en/monitoring-alerting/expression/_index.md similarity index 98% rename from content/rancher/v2.6/en/monitoring-alerting/configuration/expression/_index.md rename to content/rancher/v2.6/en/monitoring-alerting/expression/_index.md index 2b2698dfef4..b6557e61642 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/configuration/expression/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/expression/_index.md @@ -1,16 +1,17 @@ --- -title: Prometheus Expressions -weight: 4 +title: PromQL Expression Reference +weight: 6 aliases: - - /rancher/v2.6/en/project-admin/tools/monitoring/expression - - /rancher/v2.6/en/cluster-admin/tools/monitoring/expression - - /rancher/v2.6/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/expression - - /rancher/v2.6/en/monitoring-alerting/v2.5/configuration/expression + - /rancher/v2.5/en/project-admin/tools/monitoring/expression + - /rancher/v2.5/en/cluster-admin/tools/monitoring/expression + - /rancher/v2.5/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/expression + - /rancher/v2.5/en/monitoring-alerting/v2.5/configuration/expression + - /rancher/v2.5/en/monitoring/alerting/configuration/expression --- The PromQL expressions in this doc can be used to configure alerts. 
-For more information about querying Prometheus, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/) +For more information about querying the Prometheus time series database, refer to the official [Prometheus documentation.](https://prometheus.io/docs/prometheus/latest/querying/basics/) diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/_index.md new file mode 100644 index 00000000000..5ed9b6bc27a --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/_index.md @@ -0,0 +1,15 @@ +--- +title: Monitoring Guides +shortTitle: Guides +weight: 4 +--- + +- [Enable monitoring](./enable-monitoring) +- [Uninstall monitoring](./uninstall) +- [Monitoring Rancher apps](./monitoring-rancher-apps) +- [Monitoring workloads](./monitoring-workloads) +- [Customizing Grafana dashboards](./customize-grafana) +- [Persistent Grafana dashboards](./persist-grafana) +- [Setting up metrics for horizontal pod autoscaling](./hpa) +- [Debugging high memory usage](./memory-usage) +- [Migrating from Monitoring V1 to V2](./migrating) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/customize-grafana/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/customize-grafana/_index.md new file mode 100644 index 00000000000..8421c9b76c4 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/customize-grafana/_index.md @@ -0,0 +1,52 @@ +--- +title: Customizing Grafana Dashboards +weight: 5 +--- + +In this section, you'll learn how to customize the Grafana dashboard to show metrics that apply to a certain container. + +### Prerequisites + +Before you can customize a Grafana dashboard, the `rancher-monitoring` application must be installed. 
+ +To see the links to the external monitoring UIs, including Grafana dashboards, you will need at least a [project-member role.]({{}}/rancher/v2.5/en/monitoring-alerting/rbac/#users-with-rancher-cluster-manager-based-permissions) + +### Signing in to Grafana + +1. In the Rancher UI, go to the cluster that has the dashboard you want to customize. +1. In the left navigation menu, click **Monitoring.** +1. Click **Grafana.** The Grafana dashboard should open in a new tab. +1. Go to the login icon in the lower left corner and click **Sign In.** +1. Log in to Grafana. The default Admin username and password for the Grafana instance are `admin/prom-operator`. (Regardless of who has the password, cluster administrator permission in Rancher is still required to access the Grafana instance.) Alternative credentials can also be supplied on deploying or upgrading the chart. + + +### Getting the PromQL Query Powering a Grafana Panel + +For any panel, you can click the title and click **Explore** to get the PromQL queries powering the graphic. + +For this example, we would like to get the CPU usage for the Alertmanager container, so we click **CPU Utilization > Inspect.** +1. The **Data** tab shows the underlying data as a time series, with the time in the first column and the PromQL query result in the second column. Copy the PromQL query. + + ``` + (1 - (avg(irate({__name__=~"node_cpu_seconds_total|windows_cpu_time_total",mode="idle"}[5m])))) * 100 + + ``` + +### Modifying an Existing Grafana Panel + +1. Open the Grafana dashboard. + + +### Creating a New Grafana Panel in a Dashboard + + +- let’s say you want metrics that apply only for the container alertmanager. +- link to the promql queries used to make grafana dashboards. 
To get those queries, +- go to grafana + - right click on a graphic and click explore + - it shows you the PromQL queries that are embedded in it + + - can modify it + - grafana shows you updated based on your modifications to the query + + - also link to persisting grafana dashboards section diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/enable-monitoring/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/enable-monitoring/_index.md new file mode 100644 index 00000000000..b26af3c7b68 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/enable-monitoring/_index.md @@ -0,0 +1,79 @@ +--- +title: Enable Monitoring +weight: 1 +--- + +As an [administrator]({{}}/rancher/v2.5/en/admin-settings/rbac/global-permissions/) or [cluster owner]({{}}/rancher/v2.5/en/admin-settings/rbac/cluster-project-roles/#cluster-roles), you can configure Rancher to deploy Prometheus to monitor your Kubernetes cluster. + +This page describes how to enable monitoring and alerting within a cluster using the new monitoring application. + +You can enable monitoring with or without SSL. + +# Requirements + +- Make sure that you are allowing traffic on port 9796 for each of your nodes because Prometheus will scrape metrics from here. +- Make sure your cluster fulfills the resource requirements. The cluster should have at least 1950Mi memory available, 2700m CPU, and 50Gi storage. A breakdown of the resource limits and requests is [here.](./configuration/helm-chart-options/#setting-resource-limits-and-requests) +- When installing monitoring on an RKE cluster using RancherOS or Flatcar Linux nodes, change the etcd node certificate directory to `/opt/rke/etc/kubernetes/ssl`. + +> **Note:** If you want to set up Alertmanager, Grafana or Ingress, it has to be done with the settings on the Helm chart deployment. It's problematic to create Ingress outside the deployment. 
+ +# Setting Resource Limits and Requests + +The resource requests and limits can be configured when installing `rancher-monitoring`. To configure Prometheus resources from the Rancher UI, click **Apps & Marketplace > Monitoring** in the upper left corner. + +For more information about the default limits, see [this page.](./configuration/helm-chart-options/#setting-resource-limits-and-requests) + +# Install the Monitoring Application + +{{% tabs %}} +{{% tab "Rancher v2.5.8" %}} + +### Enable Monitoring for use without SSL + +1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.** +1. Click **Apps.** +1. Click the `rancher-monitoring` app. +1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration) +1. Scroll to the bottom of the Helm chart README and click **Install.** + +**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace. + +### Enable Monitoring for use with SSL + +1. Follow the steps on [this page]({{}}/rancher/v2.5/en/k8s-in-rancher/secrets/) to create a secret in order for SSL to be used for alerts. + - The secret should be created in the `cattle-monitoring-system` namespace. If it doesn't exist, create it first. + - Add the `ca`, `cert`, and `key` files to the secret. +1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.** +1. Click **Apps.** +1. Click the `rancher-monitoring` app. +1. Click **Alerting**. +1. Click **Additional Secrets** and add the secrets created earlier. + +**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace. 
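As a rough illustration of the first SSL step above, the secret could be declared as a manifest along the lines of the sketch below. The secret name `alertmanager-tls` and the key names `ca.crt`, `cert.pem`, and `key.pfx` are placeholders chosen for this example, not names required by the chart, and each value must be replaced with the base64-encoded contents of your own file.

```yaml
# Sketch only: the name and keys below are hypothetical placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-tls                 # hypothetical secret name
  namespace: cattle-monitoring-system    # namespace named in the steps above
type: Opaque
data:
  # Replace each value with the base64-encoded contents of your file.
  ca.crt: REPLACE_WITH_BASE64_CA
  cert.pem: REPLACE_WITH_BASE64_CERT
  key.pfx: REPLACE_WITH_BASE64_KEY
```

The secret can equally be created through the Rancher UI, as the linked page describes; a manifest is simply easier to keep under version control.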
+ +When [creating a receiver,]({{}}/rancher/v2.5/en/monitoring-alerting/configuration/alertmanager/#creating-receivers-in-the-rancher-ui) SSL-enabled receivers such as email or webhook will have a **SSL** section with fields for **CA File Path**, **Cert File Path**, and **Key File Path**. Fill in these fields with the paths to each of `ca`, `cert`, and `key`. The path will be of the form `/etc/alertmanager/secrets/name-of-file-in-secret`. + +For example, if you created a secret with these key-value pairs: + +```yaml +ca.crt=`base64-content` +cert.pem=`base64-content` +key.pfx=`base64-content` +``` + +Then **Cert File Path** would be set to `/etc/alertmanager/secrets/cert.pem`. + +{{% /tab %}} +{{% tab "Rancher v2.5.0-2.5.7" %}} + +1. In the Rancher UI, go to the cluster where you want to install monitoring and click **Cluster Explorer.** +1. Click **Apps.** +1. Click the `rancher-monitoring` app. +1. Optional: Click **Chart Options** and configure alerting, Prometheus and Grafana. For help, refer to the [configuration reference.](./configuration) +1. Scroll to the bottom of the Helm chart README and click **Install.** + +**Result:** The monitoring app is deployed in the `cattle-monitoring-system` namespace. + +{{% /tab %}} + +{{% /tabs %}} diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/hpa/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/hpa/_index.md new file mode 100644 index 00000000000..e083712c6ac --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/hpa/_index.md @@ -0,0 +1,31 @@ +--- +title: Setting up Metrics for HPA +weight: 7 +--- + +The monitoring app installs a Prometheus adapter that can be used for making the metrics from monitoring available from the Kubernetes API. This is useful for horizontal pod autoscaling based on custom metrics. + + +- kube-state-metrics: monitors internal K8s components +- +For HPA it’s important to talk about kubernetes metrics APIs. 
For every rke cluster, metrics server is added on. HPA can hit that, can scale up or down based on pod or node usage. + +We package Prometheus Adapter. It implements a k8s metrics api, says I want to expose these metrics in the k8s api so it can be used for HPA. + + +- kubernetes metrics APIs are implemented as adapters. +- the default adapter that has been implemented for a long time is the resource metrics API. This is why when you deploy RKE, the default API that is added on is metrics server. +- Metrics server is a kubernetes project that is an adapter that implements the resource metrics API. It collects different node metrics and stores it in a way that is accessible by HPA. +- If you want prometheus metrics to be stored on the Kubernetes API for you to be able to do HPA on, then the relevant way to configure that is by using Prometheus Adapter. It is packaged by default in monitoring v2, but not v1. + - if you want to do the custom metrics API, there is a secret for Prometheus Adapter that you can modify that will start exposing selected metrics from Prometheus onto those APIs, which can then be consumed by HPA. +- resource metrics: implemented by metrics-server, deployed as an RKE add-on +- custom metrics Api: implemented by Prometheus Adapter, exposed for use within the cluster (e.g. HPA) +- External Metrics API: implemented by Prometheus Adapter, exposed for use outside the cluster. + + +Kubernetes metrics API +- for HPA, how do I query prometheus to use that? +- prometheus stores data within its own time series database +- there are times when you also want to expose that within kubernetes itself, so that things like HPA can use it. 
+- k8s has metrics apis that are implemented as adapters +- big one is metrics API diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/memory-usage/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/memory-usage/_index.md new file mode 100644 index 00000000000..9583570c444 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/memory-usage/_index.md @@ -0,0 +1,20 @@ +--- +title: Debugging High Memory Usage +weight: 8 +--- + +Every time series in Prometheus is uniquely identified by its [metric name](https://prometheus.io/docs/practices/naming/#metric-names) and optional key-value pairs called [labels.](https://prometheus.io/docs/practices/naming/#labels) + +The labels allow the ability to filter and aggregate the time series data, but they also multiply the amount of data that Prometheus collects. + +Each time series has a defined set of labels, and Prometheus generates a new time series for all unique combinations of labels. If a metric has two labels attached, two time series are generated for that metric. Changing any label value, including adding or removing a label, will create a new time series. + +Prometheus is optimized to store data that is index-based on series. It is designed for a relatively consistent number of time series and a relatively large number of samples that need to be collected from the exporters over time. + +Inversely, Prometheus is not optimized to accommodate a rapidly changing number of time series. For that reason, large bursts of memory usage can occur when monitoring is installed on clusters where many resources are being created and destroyed, especially on multi-tenant clusters. + +### Reducing Memory Bursts + +To reduce memory consumption, Prometheus can be configured to store fewer time series, by scraping fewer metrics or by attaching fewer labels to the time series. To see which series use the most memory, you can check the TSDB (time series database) status page in the Prometheus UI. 
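One way to scrape fewer metrics and attach fewer labels is to add `metricRelabelings` to the ServiceMonitor endpoint that produces the unwanted series. The following is a minimal sketch under stated assumptions: the resource name, namespace, port name, metric prefix `example_debug_`, and label `request_id` are all hypothetical and need to be replaced with values from your own workload.

```yaml
# Sketch only: names, namespace, metric prefix, and label are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example-namespace
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics                      # named port on the Service
      metricRelabelings:
        # Drop entire metrics that are not needed before they are stored.
        - sourceLabels: [__name__]
          regex: example_debug_.*
          action: drop
        # Remove a high-cardinality label from the remaining series.
        - regex: request_id
          action: labeldrop
```

Use `labeldrop` with care: if the removed label was the only thing distinguishing two series, the scrape produces duplicates, so check the result in the Prometheus UI before relying on it.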
+ +Distributed Prometheus solutions such as [Thanos](https://thanos.io/) and [Cortex](https://cortexmetrics.io/) use an alternate architecture in which multiple small Prometheus instances are deployed. In the case of Thanos, the metrics from each Prometheus are aggregated into the common Thanos deployment, and then those metrics are exported to a persistent store, such as S3. This more robust architecture avoids burdening any single Prometheus instance with too many time series, while also preserving the ability to query metrics on a global level. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/migrating/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/migrating/_index.md similarity index 79% rename from content/rancher/v2.6/en/monitoring-alerting/migrating/_index.md rename to content/rancher/v2.6/en/monitoring-alerting/guides/migrating/_index.md index 630361d0f7f..3a14cd9b46c 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/migrating/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/migrating/_index.md @@ -1,13 +1,22 @@ --- title: Migrating to Rancher v2.5 Monitoring -weight: 5 +weight: 9 aliases: - - /rancher/v2.6/en/monitoring-alerting/v2.5/migrating + - /rancher/v2.5/en/monitoring-alerting/v2.5/migrating --- If you previously enabled Monitoring, Alerting, or Notifiers in Rancher before v2.5, there is no automatic upgrade path for switching to the new monitoring/alerting solution. Before deploying the new monitoring solution via Cluster Explore, you will need to disable and remove all existing custom alerts, notifiers and monitoring installations for the whole cluster and in all projects. 
-### Monitoring Before Rancher v2.5 +- [Monitoring Before Rancher v2.5](#monitoring-before-rancher-v2-5) +- [Monitoring and Alerting via Cluster Explorer in Rancher v2.5](#monitoring-and-alerting-via-cluster-explorer-in-rancher-v2-5) +- [Changes to Role-based Access Control](#changes-to-role-based-access-control) +- [Migrating from Monitoring V1 to Monitoring V2](#migrating-from-monitoring-v1-to-monitoring-v2) + - [Migrating Grafana Dashboards](#migrating-grafana-dashboards) + - [Migrating Alerts](#migrating-alerts) + - [Migrating Notifiers](#migrating-notifiers) + - [Migrating for RKE Template Users](#migrating-for-rke-template-users) + +# Monitoring Before Rancher v2.5 As of v2.2.0, Rancher's Cluster Manager allowed users to enable Monitoring & Alerting V1 (both powered by [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)) independently within a cluster. @@ -17,7 +26,7 @@ Monitoring V1 could be configured on both a cluster-level and on a project-level When Alerts or Notifiers are enabled, Alerting V1 deploys [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) and a set of Rancher controllers onto a cluster that allows users to define alerts and configure alert-based notifications via Email, Slack, PagerDuty, etc. Users can choose to create different types of alerts depending on what needs to be monitored (e.g. System Services, Resources, CIS Scans, etc.); however, PromQL Expression-based alerts can only be created if Monitoring V1 is enabled. -### Monitoring/Alerting via Cluster Explorer in Rancher 2.5 +# Monitoring and Alerting via Cluster Explorer in Rancher 2.5 As of v2.5.0, Rancher's Cluster Explorer now allows users to enable Monitoring & Alerting V2 (both powered by [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator)) together within a cluster. 
@@ -25,26 +34,35 @@ Unlike in Monitoring & Alerting V1, both features are packaged in a single Helm Monitoring V2 can only be configured on the cluster level. Project-level monitoring and alerting is no longer supported. -For more information on how to configure Monitoring & Alerting V2, see [this page.]({{}}/rancher/v2.6/en/monitoring-alerting/v2.5/configuration) +For more information on how to configure Monitoring & Alerting V2, see [this page.]({{}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration) -### Changes to Role-based Access Control +# Changes to Role-based Access Control Project owners and members no longer get access to Grafana or Prometheus by default. If view-only users had access to Grafana, they would be able to see data from any namespace. For Kiali, any user can edit things they don’t own in any namespace. For more information about role-based access control in `rancher-monitoring`, refer to [this page.](../rbac) -### Migrating from Monitoring V1 to Monitoring V2 +# Migrating from Monitoring V1 to Monitoring V2 While there is no automatic migration available, it is possible to manually migrate custom Grafana dashboards and alerts that were created in Monitoring V1 to Monitoring V2. Before you can install Monitoring V2, Monitoring V1 needs to be uninstalled completely. In order to uninstall Monitoring V1: -* Remove all cluster and project specific alerts and alerts groups -* Remove all notifiers -* Disable all project monitoring installations under Cluster -> Project -> Tools -> Monitoring +* Remove all cluster and project specific alerts and alerts groups. +* Remove all notifiers. +* Disable all project monitoring installations under Cluster -> Project -> Tools -> Monitoring. 
* Ensure that all project-monitoring apps in all projects have been removed and are not recreated after a few minutes -* Disable the cluster monitoring installation under Cluster -> Tools -> Monitoring -* Ensure that the cluster-monitoring app and the monitoring-operator app in the System project have been removed and are not recreated after a few minutes +* Disable the cluster monitoring installation under Cluster -> Tools -> Monitoring. +* Ensure that the cluster-monitoring app and the monitoring-operator app in the System project have been removed and are not recreated after a few minutes. + +#### RKE Template Clusters + +To prevent V1 monitoring from being re-enabled, disable monitoring and alerting in future RKE template revisions by modifying the RKE template YAML: + +```yaml +enable_cluster_alerting: false +enable_cluster_monitoring: false +``` #### Migrating Grafana Dashboards @@ -77,7 +95,7 @@ data: Once this ConfigMap is created, the dashboard will automatically be added to Grafana. -#### Migrating Alerts +### Migrating Alerts It is only possible to directly migrate expression-based alerts to Monitoring V2. Fortunately, the event-based alerts that could be set up to alert on system component, node or workload events, are already covered out-of-the-box by the alerts that are part of Monitoring V2. So it is not necessary to migrate them. @@ -110,8 +128,13 @@ or add the Prometheus Rule through the Cluster Explorer {{< img "/img/rancher/monitoring/migration/alert_2.4_to_2.5_target.png" "">}} -For more details on how to configure PrometheusRules in Monitoring V2 see [Monitoring Configuration]({{}}/rancher/v2.6/en/monitoring-alerting/v2.5/configuration#prometheusrules). +For more details on how to configure PrometheusRules in Monitoring V2 see [Monitoring Configuration]({{}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration#prometheusrules). 
-#### Migrating notifiers +### Migrating Notifiers -There is no direct equivalent for how notifiers work in Monitoring V1. Instead you have to replicate the desired setup with [Routes and Receivers]({{}}/rancher/v2.6/en/monitoring-alerting/v2.5/configuration#alertmanager-config) in Monitoring V2. +There is no direct equivalent for how notifiers work in Monitoring V1. Instead you have to replicate the desired setup with [Routes and Receivers]({{}}/rancher/v2.5/en/monitoring-alerting/v2.5/configuration#alertmanager-config) in Monitoring V2. + + +### Migrating for RKE Template Users + +If the cluster is managed using an RKE template, you will need to disable monitoring in future RKE template revisions to prevent legacy monitoring from being re-enabled. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-rancher-apps/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-rancher-apps/_index.md new file mode 100644 index 00000000000..b78777fea01 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-rancher-apps/_index.md @@ -0,0 +1,20 @@ +--- +title: Monitoring Rancher Apps +weight: 3 +--- + + +A common pattern for Rancher apps is to package a ServiceMonitor in the Helm chart for the application. The ServiceMonitor contains a preconfigured Prometheus target for monitoring. + +When the ServiceMonitor is enabled and monitoring is also enabled, Prometheus will be able to scrape metrics from the Rancher application. + + + +CIS application has a flag that lets you deploy a service monitor in it. As a general practice we expose charts for prometheus metrics to have that service monitor definition. The moment it’s deployed into the cluster, the prometheus scrape configuration will automatically be updated to reflect the service monitors that it has access to. + +In logging v2 they will deploy a service monitor and we will just absorb it. 
+ + +question: someone found out from looking through rancher helm charts that some of them already have a service monitor defined that you might have to turn on, and if you do, those metrics are prepackaged for Prometheus in the right format. + +It's a common pattern to have service monitor packaged inside. That’s how we do it for cis scans. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-workloads/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-workloads/_index.md new file mode 100644 index 00000000000..d51973fd40a --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/monitoring-workloads/_index.md @@ -0,0 +1,56 @@ +--- +title: Setting up Monitoring for a Workload +weight: 4 +--- + +- [Display CPU and Memory Metrics for a Workload](#display-cpu-and-memory-metrics-for-a-workload) +- [Setting up Metrics Beyond CPU and Memory](#setting-up-metrics-beyond-cpu-and-memory) + +If you only need CPU and memory time series for the workload, you don't need to deploy a ServiceMonitor or PodMonitor because the monitoring application already collects metrics data on resource usage by default. + + + + +The steps for setting up monitoring for workloads depends on whether you want basic metrics such as CPU and memory for the workload, or whether you want to scrape custom metrics from the workload. + +If you only need CPU and memory time series for the workload, you don't need to deploy a ServiceMonitor or PodMonitor because the monitoring application already collects metrics data on resource usage by default. The resource usage time series data is in Prometheus's local time series database. Grafana shows the data in aggregate, but you can see the data for the individual workload by using a PromQL query that extracts the data for that workload. 
Once you have the PromQL query, you can execute the query individually in the Prometheus UI and see the time series visualized there, or you can use the query to customize a Grafana dashboard to display the workload metrics. For examples of PromQL queries for workload metrics, see [this section.](https://rancher.com/docs/rancher/v2.5/en/monitoring-alerting/configuration/expression/#workload-metrics) + +To set up custom metrics for your workload, you will need to set up an exporter and create a new ServiceMonitor custom resource to configure Prometheus to scrape metrics from your exporter. + +For more information, see [this section.](./monitoring-workloads) + + + + + +explain how some applications come with a ServiceMonitor packaged within them + +for example, some Rancher applications come with ServiceMonitors (link to section) + +### Display CPU and Memory Metrics for a Workload + +By default, the monitoring application already scrapes CPU and memory metrics. + +To get fine-grained detail for a particular workload, you can customize a Grafana dashboard to display the metrics for that workload. + +- There’s already a wealth of information provided by kube-state-metrics, such as CPU utilization and memory utilization for different resources across namespaces. If you just want resource metrics for prod, you don’t need to create a new ServiceMonitor for it. All you need to do is go to the Prometheus UI and run a PromQL query to get the information. + +For more information on customizing Grafana to show the workload metrics, see this section. (Link) + + +### Setting up Metrics Beyond CPU and Memory + +For custom metrics, you will need to expose the metrics on your application in a format supported by Prometheus. + +Then we recommend that you create a new ServiceMonitor custom resource. When this resource is created, the Prometheus custom resource will be automatically updated so that its scrape configuration includes the new custom metrics endpoint.
Then Prometheus will begin scraping metrics from the endpoint. + +You can also create a PodMonitor to expose the custom metrics endpoint, but ServiceMonitors are more appropriate for the majority of use cases. + +- let’s say we expose metrics at a particular endpoint. Let’s take rancher-monitoring-kube-state-metrics. For example they have a container port where they expose metrics from. + - the approach I would take - although we don’t have a clean UI from it - is to create it from YAML. + - for something like for grafana we’d create it like this - like for rancher-monitoring-grafana - where the basic details we need to provide are: + - what is the actual endpoint that you want to hit (spec.endpoints, path and port) - what’s the HTTP path that you want to hit and what’s the port. + - namespaceSelector: what namespaces does that particular deployment exist in within Kubernetes, and use matchNames to select them. + - you can also use selector.matchLabels. + - That’s what it takes to add monitoring if a serviceMonitor is not already defined. 
+ - example: use the rancher-monitoring-grafana YAML \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/persist-grafana/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/persist-grafana/_index.md similarity index 62% rename from content/rancher/v2.6/en/monitoring-alerting/persist-grafana/_index.md rename to content/rancher/v2.6/en/monitoring-alerting/guides/persist-grafana/_index.md index 2b09bfa9fc3..e9d9e1afe6d 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/persist-grafana/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/persist-grafana/_index.md @@ -1,8 +1,8 @@ --- title: Persistent Grafana Dashboards -weight: 4 +weight: 6 aliases: - - /rancher/v2.6/en/monitoring-alerting/v2.5/persist-grafana + - /rancher/v2.5/en/monitoring-alerting/v2.5/persist-grafana --- To allow the Grafana dashboard to persist after the Grafana instance restarts, add the dashboard configuration JSON into a ConfigMap. ConfigMaps also allow the dashboards to be deployed with a GitOps or CD based approach. This allows the dashboard to be put under version control. @@ -12,11 +12,14 @@ To allow the Grafana dashboard to persist after the Grafana instance restarts, a # Creating a Persistent Grafana Dashboard +{{% tabs %}} +{{% tab "Rancher v2.5.8+" %}} + > **Prerequisites:** > > - The monitoring application needs to be installed. > - To create the persistent dashboard, you must have at least the **Manage Config Maps** Rancher RBAC permissions assigned to you in the project or namespace that contains the Grafana Dashboards. This correlates to the `monitoring-dashboard-edit` or `monitoring-dashboard-admin` Kubernetes native RBAC Roles exposed by the Monitoring chart. 
-> - To see the links to the external monitoring UIs, including Grafana dashboards, you will need at least a [project-member role.]({{}}/rancher/v2.6/en/monitoring-alerting/rbac/#users-with-rancher-cluster-manager-based-permissions) +> - To see the links to the external monitoring UIs, including Grafana dashboards, you will need at least a [project-member role.]({{}}/rancher/v2.5/en/monitoring-alerting/rbac/#users-with-rancher-cluster-manager-based-permissions) ### 1. Get the JSON model of the dashboard that you want to persist @@ -72,7 +75,7 @@ If you attempt to delete the dashboard in the Grafana UI, you will see the error ### Configuring Namespaces for the Grafana Dashboard ConfigMap -To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set: +To specify that you would like Grafana to watch for ConfigMaps across all namespaces, set this value in the `rancher-monitoring` Helm chart: ``` grafana.sidecar.dashboards.searchNamespace=ALL @@ -80,3 +83,50 @@ grafana.sidecar.dashboards.searchNamespace=ALL Note that the RBAC roles exposed by the Monitoring chart to add Grafana Dashboards are still restricted to giving permissions for users to add dashboards in the namespace defined in `grafana.dashboards.namespace`, which defaults to `cattle-dashboards`. +{{% /tab %}} +{{% tab "Rancher before v2.5.8" %}} +> **Prerequisites:** +> +> - The monitoring application needs to be installed. +> - You must have the cluster-admin ClusterRole permission. + +1. Open the Grafana dashboard. From the **Cluster Explorer,** click **Cluster Explorer > Monitoring.** +1. Log in to Grafana. Note: The default Admin username and password for the Grafana instance is `admin/prom-operator`. Alternative credentials can also be supplied on deploying or upgrading the chart. + + > **Note:** Regardless of who has the password, cluster administrator permission in Rancher is still required to access the Grafana instance. +1. Go to the dashboard that you want to persist. 
In the top navigation menu, go to the dashboard settings by clicking the gear icon. +1. In the left navigation menu, click **JSON Model.** +1. Copy the JSON data structure that appears. +1. Create a ConfigMap in the `cattle-dashboards` namespace. The ConfigMap needs to have the label `grafana_dashboard: "1"`. Paste the JSON into the ConfigMap in the format shown in the example below: + + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + labels: + grafana_dashboard: "1" + name: + namespace: cattle-dashboards + data: + .json: |- + + ``` + +**Result:** After the ConfigMap is created, it should show up on the Grafana UI and be persisted even if the Grafana pod is restarted. + +Dashboards that are persisted using ConfigMaps cannot be deleted from the Grafana UI. If you attempt to delete the dashboard in the Grafana UI, you will see the error message "Dashboard cannot be deleted because it was provisioned." To delete the dashboard, you will need to delete the ConfigMap. + +To prevent the persistent dashboard from being deleted when Monitoring v2 is uninstalled, add the following annotation to the `cattle-dashboards` namespace: + +``` +helm.sh/resource-policy: "keep" +``` + +{{% /tab %}} +{{% /tabs %}} + +# Known Issues + +For users who are using Monitoring V2 v9.4.203 or below, uninstalling the Monitoring chart will delete the `cattle-dashboards` namespace, which will delete all persisted dashboards, unless the namespace is marked with the annotation `helm.sh/resource-policy: "keep"`. + +This annotation will be added by default in the new monitoring chart released by Rancher v2.5.8, but it still needs to be manually applied for users of earlier Rancher versions. 
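As a minimal sketch of where the annotation mentioned above lives, the fragment below shows it in the metadata of the `cattle-dashboards` namespace. The namespace name and the annotation come from this page; presenting it as a standalone Namespace manifest, rather than as an edit to the existing namespace, is an assumption made for illustration.

```yaml
# Sketch only: the cattle-dashboards namespace carrying the "keep" annotation.
# In practice the namespace already exists, so you would add the annotation
# to it rather than create a new namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: cattle-dashboards
  annotations:
    # Asks Helm to leave this resource in place when the chart is uninstalled.
    helm.sh/resource-policy: "keep"
```

With the annotation present before the uninstall, the namespace and the dashboard ConfigMaps inside it should be left in place.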
diff --git a/content/rancher/v2.6/en/monitoring-alerting/guides/uninstall/_index.md b/content/rancher/v2.6/en/monitoring-alerting/guides/uninstall/_index.md new file mode 100644 index 00000000000..0c9d681cc16 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/guides/uninstall/_index.md @@ -0,0 +1,14 @@ +--- +title: Uninstall Monitoring +weight: 2 +--- + +1. From the **Cluster Explorer,** click Apps & Marketplace. +1. Click **Installed Apps.** +1. Go to the `cattle-monitoring-system` namespace and check the boxes for `rancher-monitoring-crd` and `rancher-monitoring`. +1. Click **Delete.** +1. Confirm **Delete.** + +**Result:** `rancher-monitoring` is uninstalled. + +> **Note on Persistent Grafana Dashboards:** For users who are using Monitoring V2 v9.4.203 or below, uninstalling the Monitoring chart will delete the cattle-dashboards namespace, which will delete all persisted dashboards, unless the namespace is marked with the annotation `helm.sh/resource-policy: "keep"`. This annotation is added by default in Monitoring V2 v14.5.100+ but can be manually applied on the cattle-dashboards namespace before an uninstall if an older version of the Monitoring chart is currently installed onto your cluster. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/how-monitoring-works/_index.md b/content/rancher/v2.6/en/monitoring-alerting/how-monitoring-works/_index.md new file mode 100644 index 00000000000..a9c205ba2c6 --- /dev/null +++ b/content/rancher/v2.6/en/monitoring-alerting/how-monitoring-works/_index.md @@ -0,0 +1,251 @@ +--- +title: How Monitoring Works +weight: 1 +--- + +1. [Architecture Overview](#1-architecture-overview) +2. [How Prometheus Works](#2-how-prometheus-works) +3. [How Alertmanager Works](#3-how-alertmanager-works) +4. [Monitoring V2 Specific Components](#4-monitoring-v2-specific-components) +5. [Scraping and Exposing Metrics](#5-scraping-and-exposing-metrics) + +# 1. 
Architecture Overview + +This diagram shows how data flows through the Monitoring V2 application: + +{{% row %}} +{{% column %}} + +![How data flows through the monitoring application]({{}}/img/rancher/monitoring-v2-architecture-overview.svg) + +{{% /column %}} +{{% column %}} + + +1. Rules define which Prometheus metrics or time series database queries should result in alerts being fired. +2. ServiceMonitors and PodMonitors declaratively specify how services and pods should be monitored. They use labels to scrape metrics from pods. +3. Prometheus Operator observes ServiceMonitors, PodMonitors and PrometheusRules being created. +4. When the Prometheus configuration resources are created, Prometheus Operator calls the Prometheus API to sync the new configuration. +5. Recording Rules are not directly used for alerting. They create new time series of precomputed queries. This new time series data can then be queried to generate alerts. +6. Prometheus scrapes all targets in the scrape configuration on a recurring schedule based on the scrape interval, storing the results in its time series database. Depending on the Kubernetes master component and Kubernetes distribution, the metrics from a certain Kubernetes component could be directly exposed to Prometheus, proxied through PushProx, or not available. For details, see [Scraping and Exposing Metrics.](#5-scraping-and-exposing-metrics) +7. Prometheus evaluates the alerting rules against the time series database. It fires alerts to Alertmanager whenever an alerting rule evaluates to a positive number. +8. Alertmanager uses routes to group, label and filter the fired alerts to translate them into useful notifications. +9. Alertmanager uses the Receiver configuration to send notifications to Slack, PagerDuty, SMS, or other types of receivers. + +{{% /column %}} +{{% /row %}} + + + + +# 2. How Prometheus Works + +### 2.1.
Storing Time Series Data + +After collecting metrics from exporters, Prometheus stores the time series in a local on-disk time series database. Prometheus optionally integrates with remote systems, but `rancher-monitoring` uses local storage for the time series database. + +The database can then be queried using PromQL, the query language for Prometheus. Grafana dashboards use PromQL queries to generate data visualizations. + +### 2.2. Querying the Time Series Database + +PromQL is the primary tool for querying Prometheus for time series data. + +In Grafana, you can right-click a CPU utilization panel and click Inspect. This opens a panel that shows the [raw query results.](https://grafana.com/docs/grafana/latest/panels/inspect-panel/#inspect-raw-query-results) The raw results demonstrate how each dashboard is powered by PromQL queries. + +### 2.3. Defining Rules for when Alerts Should be Fired + +Rules define the conditions for Prometheus to fire alerts. When PrometheusRule custom resources are created or updated, the Prometheus Operator observes the change and calls the Prometheus API to synchronize the rule configuration with the Alerting Rules and Recording Rules in Prometheus. + +When you define a Rule (which is declared within a RuleGroup in a PrometheusRule resource), the [spec of the Rule itself](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#rule) contains labels that are used by Alertmanager to figure out which Route should receive this Alert. For example, an Alert with the label `team: front-end` will be sent to all Routes that match on that label. + +A PrometheusRule allows you to define one or more RuleGroups.
Each RuleGroup consists of a set of Rule objects that can each represent either an alerting or a recording rule with the following fields: + +- The name of the new alert or record +- A PromQL expression for the new alert or record +- Labels that should be attached to the alert or record that identify it (e.g. cluster name or severity) +- Annotations that encode any additional important pieces of information that need to be displayed on the notification for an alert (e.g. summary, description, message, or runbook URL). This field is not required for recording rules. + +### 2.4. Firing Alerts + +Prometheus doesn't maintain the state of whether alerts are active. It fires alerts repeatedly at every evaluation interval, relying on Alertmanager to group and filter the alerts into meaningful notifications. + +The `evaluation_interval` setting defines how often Prometheus evaluates its alerting rules against the time series database. Like the `scrape_interval`, the `evaluation_interval` defaults to one minute. + +The rules are contained in a set of rule files. Rule files include both alerting rules and recording rules, but only alerting rules result in alerts being fired after their evaluation. + +For recording rules, Prometheus runs a query, then stores the result as a time series. This synthetic time series is useful for storing the results of an expensive or time-consuming query so that it can be queried more quickly in the future. + +Alerting rules are more commonly used. Whenever an alerting rule evaluates to a positive number, Prometheus fires an alert. + +The Rule file adds labels and annotations to alerts before firing them, depending on the use case: + +- Labels indicate information that identifies the alert and could affect the routing of the alert. For example, when sending an alert about a certain container, the container ID could be used as a label.
+- Annotations denote information that doesn't affect where an alert is routed, for example, a runbook or an error message. + +# 3. How Alertmanager Works + +The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of the following tasks: + +- Deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or OpsGenie +- Silencing and inhibition of alerts +- Tracking alerts that fire over time +- Sending out the status of whether an alert is currently firing, or if it is resolved + +### 3.1. Routing Alerts to Receivers + +Alertmanager coordinates where alerts are sent. It allows you to group alerts based on labels and fire them based on whether certain labels are matched. One top-level route accepts all alerts. From there, Alertmanager continues routing alerts to receivers based on whether they match the conditions of the next route. + +While the Rancher UI forms only allow editing a routing tree that is two levels deep, you can configure more deeply nested routing structures by editing the Alertmanager custom resource YAML. + +### 3.2. Configuring Multiple Receivers + +By editing the forms in the Rancher UI, you can set up a Receiver resource with all the information Alertmanager needs to send alerts to your notification system. + +By editing custom YAML in the Alertmanager or Receiver configuration, you can also send alerts to multiple notification systems. For more information, see the section on configuring [Receivers.](./configuration/receiver/#configuring-multiple-receivers) + +# 4. Monitoring V2 Specific Components + +Prometheus Operator introduces a set of [Custom Resource Definitions](https://github.com/prometheus-operator/prometheus-operator#customresourcedefinitions) that allow users to deploy and manage Prometheus and Alertmanager instances by creating and modifying those custom resources on a cluster. 
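As an illustration of one such custom resource, the sketch below shows a PrometheusRule with a single RuleGroup that contains one alerting rule and one recording rule, using the fields described in section 2.3 (name, PromQL expression, labels, annotations). The resource name, group name, thresholds, and label values are hypothetical, and the namespace is an assumption; whether Prometheus picks the rule up depends on how rule selection is configured in your installation.

```yaml
# Sketch only: names, thresholds and label values are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-workload-rules          # hypothetical name
  namespace: cattle-monitoring-system   # assumed namespace
spec:
  groups:
    - name: example.rules               # hypothetical RuleGroup name
      rules:
        # Alerting rule: fires when the expression returns a result.
        - alert: ExamplePodRestartingOften
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            # Labels identify the alert and can drive routing in Alertmanager.
            severity: warning
            team: front-end
          annotations:
            # Annotations carry extra context and do not affect routing.
            summary: "A container has restarted more than 3 times in 15 minutes."
        # Recording rule: stores the result of a query as a new time series.
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```

The `team: front-end` label mirrors the routing example in section 2.3: a Route that matches on that label would receive this alert.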
+ +Prometheus Operator will automatically update your Prometheus configuration based on the live state of the resources and configuration options that are edited in the Rancher UI. + +### 4.1. Resources Deployed by Default + +By default, a set of resources curated by the [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project are deployed onto your cluster as part of installing the Rancher Monitoring Application to set up a basic Monitoring/Alerting stack. + +The resources that get deployed onto your cluster to support this solution can be found in the [`rancher-monitoring`](https://github.com/rancher/charts/tree/main/charts/rancher-monitoring) Helm chart, which closely tracks the upstream [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart maintained by the Prometheus community with certain changes tracked in the [CHANGELOG.md](https://github.com/rancher/charts/blob/main/charts/rancher-monitoring/CHANGELOG.md). + +There are also certain special types of ConfigMaps and Secrets such as those corresponding to Grafana Dashboards, Grafana Datasources, and Alertmanager Configs that will automatically update your Prometheus configuration via sidecar proxies that observe the live state of those resources within your cluster. + +### 4.2. PushProx + +PushProx enhances the security of the monitoring application, allowing it to be installed on hardened Kubernetes clusters. + +To expose Kubernetes metrics, PushProxes use a client proxy model to expose specific ports within default Kubernetes components. Node exporters expose metrics to PushProx through an outbound connection. + +The proxy allows `rancher-monitoring` to scrape metrics from processes on the hostNetwork, such as the `kube-api-server`, without opening up node ports to inbound connections. + +PushProx is a DaemonSet that listens for clients that seek to register. 
Once registered, it proxies scrape requests through the established connection. Then the client executes the request to etcd. + +All of the default ServiceMonitors, such as `rancher-monitoring-kube-controller-manager`, are configured to hit the metrics endpoint of the client using this proxy. + +For more details about how PushProx works, refer to [Scraping Metrics with PushProx.](#5-5-scraping-metrics-with-pushprox) + + +### 4.3. Default Exporters + +`rancher-monitoring` deploys two exporters to expose metrics to Prometheus: `node-exporter` and `windows-exporter`. Both are deployed as DaemonSets. + +`node-exporter` exports container, pod and node metrics for CPU and memory from each Linux node. `windows-exporter` does the same, but for Windows nodes. + +For more information on `node-exporter`, refer to the [upstream documentation.](https://prometheus.io/docs/guides/node-exporter/) + +[kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) is also useful because it exports metrics for Kubernetes components. + +### 4.4. Components Exposed in the Rancher UI + +When the monitoring application is installed, you will be able to edit the following components in the Rancher UI: + +| Component | Type of Component | Purpose and Common Use Cases for Editing | +|--------------|------------------------|---------------------------| +| ServiceMonitor | Custom resource | Set up targets to scrape custom metrics from. Automatically updates the scrape configuration in the Prometheus custom resource. | +| PodMonitor | Custom resource | Set up targets to scrape custom metrics from. Automatically updates the scrape configuration in the Prometheus custom resource. | +| Receiver | Configuration block (part of Alertmanager) | Set up a notification system to receive alerts. Automatically updates the Alertmanager custom resource.
| +| Route | Configuration block (part of Alertmanager) | Add identifying information to make alerts more meaningful and direct them to individual teams. Automatically updates the Alertmanager custom resource. | +| PrometheusRule | Custom resource | For more advanced use cases, you may want to define what Prometheus metrics or time series database queries should result in alerts being fired. Automatically updates the Prometheus custom resource. | +| Alertmanager | Custom resource | Edit this custom resource only if you need more advanced configuration options beyond what the Rancher UI exposes in the Routes and Receivers sections. For example, you might want to edit this resource to add a routing tree with more than two levels. | +| Prometheus | Custom resource | Edit this custom resource only if you need more advanced configuration beyond what can be configured using ServiceMonitors, PodMonitors, or [Rancher monitoring Helm chart options.](./configuration/helm-chart-options) | + +# 5. Scraping and Exposing Metrics + +### 5.1. Defining what Metrics are Scraped + +ServiceMonitors define targets that are intended for Prometheus to scrape. The [Prometheus custom resource tells](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#prometheus) Prometheus which ServiceMonitors it should use to find out where to scrape metrics from. + +The Prometheus Operator observes the ServiceMonitors. When it observes that ServiceMonitors are created or updated, it calls the Prometheus API to update the scrape configuration in the Prometheus custom resource and keep it in sync with the scrape configuration in the ServiceMonitors. This scrape configuration tells Prometheus which endpoints to scrape metrics from and how it will label the metrics from those endpoints. + +Prometheus scrapes all of the metrics defined in its scrape configuration at every `scrape_interval`, which is one minute by default. 
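A minimal sketch of a ServiceMonitor is shown below, assuming a hypothetical application whose Service carries the label `app: example-app`, lives in a namespace called `example-namespace`, and exposes metrics on a named port `metrics` at `/metrics`. All of those names are placeholders for illustration, not values taken from the `rancher-monitoring` chart.

```yaml
# Sketch only: every name, label and port below is a placeholder.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                # hypothetical name
  namespace: example-namespace     # hypothetical namespace
spec:
  # Select the Services to scrape by label.
  selector:
    matchLabels:
      app: example-app
  # Limit the search to the namespaces where the Services exist.
  namespaceSelector:
    matchNames:
      - example-namespace
  # The endpoint Prometheus should hit on each selected Service.
  endpoints:
    - port: metrics                # name of the Service port, not a number
      path: /metrics
      interval: 60s                # mirrors the one-minute default scrape interval
```

Once the Prometheus Operator observes a resource like this, the endpoint should appear in the scrape configuration described above.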
+ +The scrape configuration can be viewed as part of the Prometheus custom resource that is exposed in the Rancher UI. + +### 5.2. How the Prometheus Operator Sets up Metrics Scraping + +The Prometheus Deployment or StatefulSet scrapes metrics, and the configuration of Prometheus is controlled by the Prometheus custom resources. The Prometheus Operator watches for Prometheus and Alertmanager resources, and when they are created, the Prometheus Operator creates a Deployment or StatefulSet for Prometheus or Alertmanager with the user-defined configuration. + +
How the Prometheus Operator Sets up Metrics Scraping
+ +![How the Prometheus Operator sets up metrics scraping]({{}}/img/rancher/set-up-scraping.svg) + +When the Prometheus Operator observes ServiceMonitors, PodMonitors and PrometheusRules being created, it knows that the scrape configuration needs to be updated in Prometheus. It updates Prometheus by first updating the configuration and rules files in the volumes of Prometheus's Deployment or StatefulSet. Then it calls the Prometheus API to sync the new configuration, which causes the Prometheus Deployment or StatefulSet to be modified in place. + +![How the Prometheus Operator Updates Scrape Configuration]({{}}/img/rancher/update-scrape-config.svg) + +### 5.3. How Kubernetes Component Metrics are Exposed + +Prometheus scrapes metrics from deployments known as [exporters,](https://prometheus.io/docs/instrumenting/exporters/) which export the time series data in a format that Prometheus can ingest. In Prometheus, time series consist of streams of timestamped values belonging to the same metric and the same set of labeled dimensions. + +To allow monitoring to be installed on hardened Kubernetes clusters, the `rancher-monitoring` application proxies the communication between Prometheus and the exporter through PushProx for some Kubernetes master components. + +### 5.4. Scraping Metrics without PushProx + +The Kubernetes components that directly expose metrics to Prometheus are the following: + +- kubelet +- ingress-nginx* +- coreDns/kubeDns +- kube-api-server + +\* For RKE and RKE2 clusters, ingress-nginx is deployed by default and treated as an internal Kubernetes component. + +### 5.5. Scraping Metrics with PushProx + +The purpose of this architecture is to allow Prometheus to scrape internal Kubernetes components without exposing their ports to inbound requests. As a result, Prometheus can scrape metrics across a network boundary.
+ +The Kubernetes components that expose metrics to Prometheus through PushProx are the following: + +- kube-controller-manager +- kube-scheduler +- etcd +- kube-proxy + +For each PushProx exporter, we deploy one PushProx client onto all target nodes. For example, a PushProx client is deployed onto all controlplane nodes for kube-controller-manager, all etcd nodes for kube-etcd, and all nodes for kubelet. We deploy exactly one PushProx proxy per exporter. + +The process for exporting metrics is as follows: + +1. The PushProx Client establishes an outbound connection with the PushProx Proxy. +2. The client then polls the proxy for scrape requests that have come into the proxy. +3. When the proxy receives a scrape request from Prometheus, the client sees it as a result of the poll. +4. The client scrapes the internal component. +5. The internal component responds by pushing metrics back to the proxy. + +
Process for Exporting Metrics with PushProx
+ +![Process for Exporting Metrics with PushProx]({{}}/img/rancher/pushprox-process.svg) + +Metrics are scraped differently based on the Kubernetes distribution. For help with terminology, see [Terminology.](#5-6-terminology) For details, see the table below: + +
How Metrics are Exposed to Prometheus
+ +| Kubernetes Component | RKE | RKE2 | KubeADM | K3s | +|-----|-----|-----|-----|-----| +| kube-controller-manager | rkeControllerManager.enabled | rke2ControllerManager.enabled | kubeAdmControllerManager.enabled | k3sServer.enabled | +| kube-scheduler | rkeScheduler.enabled | rke2Scheduler.enabled | kubeAdmScheduler.enabled | k3sServer.enabled | +| etcd | rkeEtcd.enabled | rke2Etcd.enabled | kubeAdmEtcd.enabled | Not available | +| kube-proxy | rkeProxy.enabled | rke2Proxy.enabled | kubeAdmProxy.enabled | k3sServer.enabled | +| kubelet | Collects metrics directly exposed by kubelet | Collects metrics directly exposed by kubelet | Collects metrics directly exposed by kubelet | Collects metrics directly exposed by kubelet | +| ingress-nginx* | Collects metrics directly exposed by kubelet, exposed by rkeIngressNginx.enabled | Collects metrics directly exposed by kubelet, exposed by rke2IngressNginx.enabled | Not available | Not available | +| coreDns/kubeDns | Collects metrics directly exposed by coreDns/kubeDns | Collects metrics directly exposed by coreDns/kubeDns | Collects metrics directly exposed by coreDns/kubeDns | Collects metrics directly exposed by coreDns/kubeDns | +| kube-api-server | Collects metrics directly exposed by kube-api-server | Collects metrics directly exposed by kube-api-server | Collects metrics directly exposed by kube-api-server | Collects metrics directly exposed by kube-api-server | + +\* For RKE and RKE2 clusters, ingress-nginx is deployed by default and treated as an internal Kubernetes component. + +### 5.6. Terminology + +- **kube-scheduler:** The internal Kubernetes component that uses information in the pod spec to decide on which node to run a pod. +- **kube-controller-manager:** The internal Kubernetes component that is responsible for node management (detecting if a node fails), pod replication and endpoint creation.
+- **etcd:** The internal Kubernetes component that is the distributed key/value store which Kubernetes uses for persistent storage of all cluster information. +- **kube-proxy:** The internal Kubernetes component that watches the API server for pod and service changes to keep the network up to date. +- **kubelet:** The internal Kubernetes component that watches the API server for pods on a node and makes sure they are running. +- **ingress-nginx:** An Ingress controller for Kubernetes using NGINX as a reverse proxy and load balancer. +- **coreDns/kubeDns:** The internal Kubernetes component responsible for DNS. +- **kube-api-server:** The main internal Kubernetes component that is responsible for exposing APIs for the other master components. \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/rbac/_index.md b/content/rancher/v2.6/en/monitoring-alerting/rbac/_index.md index bab36068771..04ecdfb0d7a 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/rbac/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/rbac/_index.md @@ -1,9 +1,11 @@ --- -title: RBAC -weight: 3 +title: Role-based Access Control +shortTitle: RBAC +weight: 2 aliases: - - /rancher/v2.6/en/cluster-admin/tools/monitoring/rbac - - /rancher/v2.6/en/monitoring-alerting/v2.5/rbac + - /rancher/v2.5/en/cluster-admin/tools/monitoring/rbac + - /rancher/v2.5/en/monitoring-alerting/v2.5/rbac + - /rancher/v2.5/en/monitoring-alerting/grafana --- This section describes the expectations for RBAC for Rancher Monitoring. @@ -17,6 +19,7 @@ This section describes the expectations for RBAC for Rancher Monitoring.
- [Users with Rancher Cluster Manager Based Permissions](#users-with-rancher-cluster-manager-based-permissions) - [Differences in 2.5.x](#differences-in-2-5-x) - [Assigning Additional Access](#assigning-additional-access) +- [Role-based Access Control for Grafana](#role-based-access-control-for-grafana) # Cluster Admins @@ -131,3 +134,19 @@ If cluster-admins would like to provide additional admin/edit access to users ou |----------------------------| ------| ------| ----------------------------| | `secrets`, `configmaps` | `cattle-monitoring-system` | Yes, Configs and Secrets in this namespace can impact the entire monitoring / alerting pipeline. | User will be able to create or edit Secrets / ConfigMaps such as the Alertmanager Config, Prometheus Adapter Config, TLS secrets, additional Grafana datasources, etc. This can have broad impact on all cluster monitoring / alerting. | | `secrets`, `configmaps` | `cattle-dashboards` | Yes, Configs and Secrets in this namespace can create dashboards that make queries on all metrics collected at a cluster-level. | User will be able to create Secrets / ConfigMaps that persist new Grafana Dashboards only. | + + + +# Role-based Access Control for Grafana + +Rancher allows any users who are authenticated by Kubernetes and have access to the Grafana service deployed by the Rancher Monitoring chart to access Grafana via the Rancher Dashboard UI. By default, all users who are able to access Grafana are given the [Viewer](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#viewer-role) role, which allows them to view any of the default dashboards deployed by Rancher. + +However, users can choose to log in to Grafana as an [Admin](https://grafana.com/docs/grafana/latest/permissions/organization_roles/#admin-role) if necessary. The default Admin username and password for the Grafana instance are `admin`/`prom-operator`, but alternative credentials can also be supplied when deploying or upgrading the chart. + +To see the Grafana UI, install `rancher-monitoring`. Then go to the **Cluster Explorer.** In the top left corner, click **Cluster Explorer > Monitoring.** Then click **Grafana.** + +
Cluster Compute Resources Dashboard in Grafana
+![Cluster Compute Resources Dashboard in Grafana]({{}}/img/rancher/cluster-compute-resources-dashboard.png) + +
Default Dashboards in Grafana
+![Default Dashboards in Grafana]({{}}/img/rancher/grafana-default-dashboard.png) \ No newline at end of file diff --git a/content/rancher/v2.6/en/monitoring-alerting/windows-clusters/_index.md b/content/rancher/v2.6/en/monitoring-alerting/windows-clusters/_index.md index 3e0928f28a6..3a2b0e3deda 100644 --- a/content/rancher/v2.6/en/monitoring-alerting/windows-clusters/_index.md +++ b/content/rancher/v2.6/en/monitoring-alerting/windows-clusters/_index.md @@ -1,10 +1,12 @@ --- title: Windows Cluster Support for Monitoring V2 -shortTitle: Windows Clusters +shortTitle: Windows Support weight: 5 --- -Monitoring V2 can be deployed on a Windows cluster and will scrape metrics from Windows nodes using [prometheus-community/windows_exporter](https://github.com/prometheus-community/windows_exporter) (previously named `wmi_exporter`). +_Available as of v2.5.8_ + +Starting at Monitoring V2 14.5.100 (used by default in Rancher 2.5.8), Monitoring V2 can now be deployed on a Windows cluster and will scrape metrics from Windows nodes using [prometheus-community/windows_exporter](https://github.com/prometheus-community/windows_exporter) (previously named `wmi_exporter`). - [Comparison to Monitoring V1](#comparison-to-monitoring-v1) - [Cluster Requirements](#cluster-requirements) @@ -20,7 +22,7 @@ In addition, Monitoring V2 for Windows will no longer require users to keep port Monitoring V2 for Windows can only scrape metrics from Windows hosts that have a minimum `wins` version of v0.1.0. To be able to fully deploy Monitoring V2 for Windows, all of your hosts must meet this requirement. -If you provision a fresh RKE1 cluster in Rancher, your cluster should already meet this requirement. +If you provision a fresh RKE1 cluster in Rancher 2.5.8, your cluster should already meet this requirement. ### Upgrading Existing Clusters to wins v0.1.0