mirror of
https://github.com/rancher/rancher-docs.git
synced 2026-05-13 00:23:23 +00:00
Merge pull request #2809 from rancher/staging
Merge staging into master
@@ -3,20 +3,8 @@ title: Best Practices Guide
weight: 4
---

> The Best Practices Guide will be updated for Rancher v2.5.

The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.

If you are using Rancher v2.0-v2.4, refer to the Best Practices Guide [here.](./v2.0-v2.4)

If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.

Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.

For more guidance on best practices, you can consult these resources:

- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)

If you are using Rancher v2.5, refer to the Best Practices Guide [here.](./v2.5)
@@ -0,0 +1,21 @@
---
title: Best Practices Guide for Rancher v2.0-v2.4
shortTitle: v2.0-v2.4
weight: 2
---

The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.

If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.

Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.

For more guidance on best practices, you can consult these resources:

- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -1,6 +1,8 @@
---
title: Tips for Setting Up Containers
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/containers
---

Running well-built containers can greatly impact the overall performance and security of your environment.
@@ -1,6 +1,8 @@
---
title: Rancher Deployment Strategies
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-strategies
---

There are two recommended deployment strategies. Each one has its own pros and cons. Read more about which one would fit best for your use case:
@@ -1,6 +1,8 @@
---
title: Tips for Running Rancher
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-types
---

A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as in any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single-node environment.
@@ -1,6 +1,8 @@
---
title: Tips for Scaling, Security and Reliability
weight: 101
aliases:
- /v2.x/en/best-practices/management
---

Rancher allows you to set up numerous combinations of configurations. Some configurations are more appropriate for development and testing, while others are best practices for production environments that need maximum availability and fault tolerance. The following best practices should be followed for production.
@@ -0,0 +1,21 @@
---
title: Best Practices Guide for Rancher v2.5
shortTitle: v2.5
weight: 1
---

The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.

If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.

Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.

For more guidance on best practices, you can consult these resources:

- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -0,0 +1,21 @@
---
title: Best Practices for Rancher Managed Clusters
shortTitle: Rancher Managed Clusters
weight: 2
---

### Logging

Refer to [this guide](./logging) for our recommendations for cluster-level logging and application logging.

### Monitoring

Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. Refer to this [guide](./monitoring) for our recommendations.

### Tips for Setting Up Containers

Running well-built containers can greatly impact the overall performance and security of your environment. Refer to this [guide](./containers) for tips.

### Best Practices for Rancher Managed vSphere Clusters

This [guide](./managed-vsphere) outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,51 @@
---
title: Tips for Setting Up Containers
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/containers
---

Running well-built containers can greatly impact the overall performance and security of your environment.

Below are a few tips for setting up your containers.

For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security.](https://rancher.com/complete-guide-container-security)

### Use a Common Container OS

When possible, you should try to standardize on a common container base OS.

Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface.

Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality.

### Start with a FROM scratch Container

If your microservice is a standalone static binary, you should use a FROM scratch container.

The FROM scratch container is an [official Docker image](https://hub.docker.com/_/scratch) that is empty, so you can use it to design minimal images.

This will have the smallest attack surface and the smallest image size.
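As a sketch, a multi-stage Dockerfile for a statically linked Go binary might look like the following (the binary name and paths are illustrative):

```dockerfile
# Build stage: compile a statically linked binary (illustrative module)
FROM golang:1.16 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /myapp .

# Final stage: FROM scratch is empty, so the image contains only the binary
FROM scratch
COPY --from=build /myapp /myapp
ENTRYPOINT ["/myapp"]
```

The resulting image carries no shell, package manager, or OS libraries, which is what keeps both the size and the attack surface minimal.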
### Run Container Processes as Unprivileged

When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. Inadvertent or accidental host mounts can also be impacted if the container is running as root. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
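A minimal sketch of a pod-level security context that enforces a non-root user (the name, image, and UID are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-example    # illustrative name
spec:
  securityContext:
    runAsNonRoot: true    # kubelet refuses to start the pod if the image would run as root
    runAsUser: 1000       # run container processes as this non-root UID
  containers:
  - name: app
    image: myorg/myapp:1.0    # illustrative image
```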
### Define Resource Limits

Apply CPU and memory limits to your pods. This can help manage the resources on your worker nodes and prevent a malfunctioning microservice from impacting other microservices.

In standard Kubernetes, you can set resource limits at the namespace level. In Rancher, you can set resource limits at the project level, and they will propagate to all the namespaces within the project. For details, refer to the Rancher docs.

When setting resource quotas, if you set anything related to CPU or memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or memory field to be set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace.
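In plain Kubernetes, such a namespace-wide default is expressed as a LimitRange; a sketch, with illustrative values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: my-namespace    # illustrative namespace
spec:
  limits:
  - type: Container
    default:           # limits applied when a container specifies none
      cpu: 500m
      memory: 256Mi
    defaultRequest:    # requests applied when a container specifies none
      cpu: 100m
      memory: 128Mi
```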
The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the namespace level.

### Define Resource Requirements

You should apply CPU and memory requirements to your pods. This is crucial for informing the scheduler which type of compute node your pod needs to be placed on and for ensuring that it does not over-provision that node. In Kubernetes, you can set a resource requirement by defining `resources.requests` in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
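Together, requests and limits on a container look like this (the name, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-example    # illustrative name
spec:
  containers:
  - name: app
    image: myorg/myapp:1.0    # illustrative image
    resources:
      requests:       # what the scheduler reserves for this container
        cpu: 100m
        memory: 128Mi
      limits:         # hard caps enforced at runtime
        cpu: 500m
        memory: 256Mi
```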
> **Note:** If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.

It is recommended to define resource requirements at the container level because otherwise the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.

### Liveness and Readiness Probes

Set up liveness and readiness probes for your containers. Unless your container completely crashes, Kubernetes will not know it's unhealthy unless you create an endpoint or mechanism that can report container status. Alternatively, make sure your container halts and crashes if unhealthy.

The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
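A sketch of HTTP-based probes on a container (the paths, port, and timings are illustrative and depend on your application):

```yaml
containers:
- name: app
  image: myorg/myapp:1.0    # illustrative image
  readinessProbe:           # gates traffic until the app reports ready
    httpGet:
      path: /readyz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:            # restarts the container if it stops responding
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
```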
@@ -0,0 +1,91 @@
---
title: Logging Best Practices
weight: 1
---
In this guide, we recommend best practices for cluster-level logging and application logging.

- [Changes in Logging in Rancher v2.5](#changes-in-logging-in-rancher-v2-5)
- [Cluster-level Logging](#cluster-level-logging)
- [Application Logging](#application-logging)
- [General Best Practices](#general-best-practices)

# Changes in Logging in Rancher v2.5

Prior to Rancher v2.5, logging in Rancher was a fairly static integration. There was a fixed list of aggregators to choose from (Elasticsearch, Splunk, Kafka, Fluentd, and Syslog), and only two configuration points to choose (cluster-level and project-level).

Logging in v2.5 has been completely overhauled to provide a more flexible experience for log aggregation. With the new logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.

Under the hood, Rancher logging uses the Banzai Cloud logging operator. We provide manageability of this operator (and its resources) and tie that experience in with managing your Rancher clusters.

# Cluster-level Logging

### Cluster-wide Scraping

For some users, it is desirable to scrape logs from every container running in the cluster. This usually coincides with your security team's request (or requirement) to collect all logs from all points of execution.

In this scenario, it is recommended to create at least two _ClusterOutput_ objects: one for your security team (if you have that requirement) and one for yourselves, the cluster administrators. When creating these objects, take care to choose an output endpoint that can handle the significant log traffic coming from the entire cluster. Also make sure to choose an appropriate index to receive all these logs.

Once you have created these _ClusterOutput_ objects, create a _ClusterFlow_ to collect all the logs. Do not define any _Include_ or _Exclude_ rules on this flow. This will ensure that all logs from across the cluster are collected. If you have two _ClusterOutputs_, make sure to send logs to both of them.
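A sketch of this pattern using the Banzai Cloud logging operator CRDs and a hypothetical Elasticsearch endpoint (the names, host, index, and namespace are illustrative; verify the output fields against the operator version you are running):

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: all-logs-admin               # illustrative name
  namespace: cattle-logging-system   # illustrative namespace
spec:
  elasticsearch:
    host: elasticsearch.example.com  # illustrative endpoint
    port: 9200
    index_name: cluster-all-logs     # illustrative index
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: cattle-logging-system
spec:
  # no match rules: collect logs from every container in the cluster
  globalOutputRefs:
    - all-logs-admin
```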
### Kubernetes Components

_ClusterFlows_ have the ability to collect logs from all containers on all hosts in the Kubernetes cluster. This works well in cases where those containers are part of a Kubernetes pod; however, RKE containers exist outside of the scope of Kubernetes.

Currently (as of v2.5.1), the logs from RKE containers are collected but cannot easily be filtered. This is because those logs do not contain information about the source container (e.g. `etcd` or `kube-apiserver`).

A future release of Rancher will include the source container name, which will enable filtering of these component logs. Once that change is made, you will be able to customize a _ClusterFlow_ to retrieve **only** the Kubernetes component logs and direct them to an appropriate output.

# Application Logging

A best practice, not only in Kubernetes but in all container-based applications, is to direct application logs to `stdout`/`stderr`. The container runtime will then trap these logs and do **something** with them - typically writing them to a file. Depending on the container runtime (and its configuration), these logs can end up in any number of locations.

In the case of writing the logs to a file, Kubernetes helps by creating a `/var/log/containers` directory on each host. This directory symlinks the log files to their actual destination (which can differ based on configuration or container runtime).

Rancher logging will read all log entries in `/var/log/containers`, ensuring that all log entries from all containers (assuming a default configuration) will have the opportunity to be collected and processed.

### Specific Log Files

Log collection only retrieves `stdout`/`stderr` logs from pods in Kubernetes. But what if we want to collect logs from other files that are generated by applications? Here, a log-streaming sidecar (or two) may come in handy.

The goal of setting up a streaming sidecar is to take log files that are written to disk and have their contents streamed to `stdout`. This way, the Banzai Cloud logging operator can pick up those logs and send them to your desired output.

To set this up, edit your workload resource (e.g. Deployment) and add the following sidecar definition:

```
...
containers:
- args:
  - -F
  - /path/to/your/log/file.log
  command:
  - tail
  image: busybox
  name: stream-log-file-[name]
  volumeMounts:
  - mountPath: /path/to/your/log
    name: mounted-log
...
```
This will add a container to your workload definition that will stream the contents of (in this example) `/path/to/your/log/file.log` to `stdout`.

This log stream is then automatically collected according to any _Flows_ or _ClusterFlows_ you have set up. You may also wish to consider creating a _Flow_ specifically for this log file by targeting the name of the container. For example:

```
...
spec:
  match:
  - select:
      container_names:
      - stream-log-file-name
...
```

# General Best Practices

- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier, as there are already parsers written for these formats.
- Try to provide the name of the application that is creating the log entry in the entry itself. This can make troubleshooting easier, as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98`, which does not provide much information about the application running inside the container.
- Except in the case of collecting all logs cluster-wide, try to scope your _Flow_ and _ClusterFlow_ objects tightly. This makes it easier to troubleshoot when problems arise, and it also helps ensure unrelated log entries do not show up in your aggregator. An example of tight scoping would be to constrain a _Flow_ to a single _Deployment_ in a namespace, or perhaps even a single container within a _Pod_.
- Keep the log verbosity down except when troubleshooting. High log verbosity poses a number of issues, chief among them being **noise**: significant events can be drowned out in a sea of `DEBUG` messages. This is somewhat mitigated with automated alerting and scripting, but highly verbose logging still places an inordinate amount of stress on the logging infrastructure.
- Where possible, try to provide a transaction or request ID with the log entry. This can make tracing application activity across multiple log sources easier, especially when dealing with distributed applications.
@@ -0,0 +1,59 @@
---
title: Best Practices for Rancher Managed vSphere Clusters
shortTitle: Rancher Managed Clusters in vSphere
---

This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.

- [1. VM Considerations](#1-vm-considerations)
- [2. Network Considerations](#2-network-considerations)
- [3. Storage Considerations](#3-storage-considerations)
- [4. Backups and Disaster Recovery](#4-backups-and-disaster-recovery)

<figcaption>Solution Overview</figcaption>



# 1. VM Considerations

### Leverage VM Templates to Construct the Environment

To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.

### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts

Doing so will ensure node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.

### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores

Doing so will ensure node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.

### Configure VMs as Appropriate for Kubernetes

It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, double-checking that you have full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.

# 2. Network Considerations

### Leverage Low Latency, High Bandwidth Connectivity Between etcd Nodes

Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.

### Consistent IP Addressing for VMs

Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.

# 3. Storage Considerations

### Leverage SSD Drives for etcd Nodes

etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.

# 4. Backups and Disaster Recovery

### Perform Regular Downstream Cluster Backups

Kubernetes uses etcd to store all of its data, including configuration, state, and metadata. Backing this up is crucial in the event of disaster recovery.

### Back up Downstream Node VMs

Incorporate the Rancher downstream node VMs within a standard VM backup policy.
@@ -0,0 +1,120 @@
---
title: Monitoring Best Practices
weight: 2
---

Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. This is no different when using Kubernetes and Rancher. Fortunately, the integrated monitoring and alerting functionality makes this whole process a lot easier.

The [Rancher Documentation]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/v2.5/) describes in detail how you can set up a complete Prometheus and Grafana stack. Out of the box, this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your specific use cases and cluster sizes. This document aims to give you best practices for this.

- [What to Monitor](#what-to-monitor)
- [Configuring Prometheus Resource Usage](#configuring-prometheus-resource-usage)
- [Scraping Custom Workloads](#scraping-custom-workloads)
- [Monitoring in a (Micro)Service Architecture](#monitoring-in-a-micro-service-architecture)
- [Real User Monitoring](#real-user-monitoring)
- [Security Monitoring](#security-monitoring)
- [Setting up Alerts](#setting-up-alerts)

# What to Monitor

Kubernetes itself, as well as the applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability, and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/), especially the chapter about [Monitoring distributed systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/).

# Configuring Prometheus Resource Usage

When installing the integrated monitoring stack, Rancher allows you to configure several settings that depend on the size of your cluster and the workloads running in it. This chapter covers these in more detail.

### Storage and Data Retention

The amount of storage needed for Prometheus directly correlates with the number of time series and labels that you store and the data retention you have configured. It is important to note that Prometheus is not meant to be used as long-term metrics storage. Data retention time is usually only a couple of days, not weeks or months. The reason for this is that Prometheus does not perform any aggregation on its stored metrics. This is great because aggregation can dilute data, but it also means that the needed storage grows linearly over time without retention.

One way to calculate the necessary storage is to look at the average size of a storage chunk in Prometheus with this query:

```
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
```

Next, find out your data ingestion rate per second:

```
rate(prometheus_tsdb_head_samples_appended_total[1h])
```

Then multiply the two values together with the retention time, adding a few percentage points as a buffer:

```
average chunk size in bytes * ingestion rate per second * retention time in seconds * 1.1 = necessary storage in bytes
```
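As a worked example with hypothetical numbers (say an average chunk size of 2 bytes per sample, 100,000 samples ingested per second, and 10 days of retention):

```python
# Worked example of the storage formula above, using hypothetical numbers.
avg_chunk_size_bytes = 2.0          # average bytes per sample (from the first query)
ingestion_rate_per_s = 100_000      # samples appended per second (from the second query)
retention_seconds = 10 * 24 * 3600  # 10 days of retention
buffer = 1.1                        # ~10% headroom

necessary_bytes = avg_chunk_size_bytes * ingestion_rate_per_s * retention_seconds * buffer
print(f"{necessary_bytes / 1024**3:.0f} GiB")  # roughly 177 GiB
```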
You can find more information about how to calculate the necessary storage in this [blog post](https://www.robustperception.io/how-much-disk-space-do-prometheus-blocks-use).

You can read more about the Prometheus storage concept in the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/storage).

### CPU and Memory Requests and Limits

In larger Kubernetes clusters, Prometheus can consume quite a bit of memory. The amount of memory Prometheus needs directly correlates with the number of time series and labels it stores and the scrape interval at which these are filled.

You can find more information about how to calculate the necessary memory in this [blog post](https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion).

The number of necessary CPUs correlates with the number of queries you are performing.

### Federation and Long-term Storage

Prometheus is not meant to store metrics for a long amount of time and should only be used for short-term storage.

In order to store some or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).

# Scraping Custom Workloads

While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. For that, you can configure Prometheus to perform an HTTP request to an endpoint of your applications at a certain interval. These endpoints should then return their metrics in a Prometheus format.

In general, you want to scrape data from all the workloads running in your cluster so that you can use them for alerts or for debugging issues. Often, you only recognize that you need some data when you actually need the metrics during an incident. It is good if it is already scraped and stored. Since Prometheus is only meant to be short-term metrics storage, scraping and keeping lots of data is usually not that expensive. If you are using a long-term storage solution with Prometheus, you can then still decide which data you actually persist and keep there.

### About Prometheus Exporters

A lot of third-party workloads like databases, queues, or web servers either already support exposing metrics in a Prometheus format, or there are so-called exporters available that translate between the tool's metrics and a format that Prometheus understands. Usually you can add these exporters as additional sidecar containers to the workload's pods. A lot of Helm charts already include options to deploy the correct exporter. Additionally, you can find a curated list of exporters by Sysdig on [promcat.io](https://promcat.io/) and on [ExporterHub](https://exporterhub.io/).

### Prometheus Support in Programming Languages and Frameworks

To get your own custom application metrics into Prometheus, you have to collect and expose these metrics directly from your application's code. Fortunately, there are already libraries and integrations available to help with this for most popular programming languages and frameworks. One example is the Prometheus support in the [Spring Framework](https://docs.spring.io/spring-metrics/docs/current/public/prometheus).

### ServiceMonitors and PodMonitors

Once all your workloads expose metrics in a Prometheus format, you have to configure Prometheus to scrape them. Under the hood, Rancher uses the [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator). This makes it easy to add additional scraping targets with ServiceMonitors and PodMonitors. A lot of Helm charts already include an option to create these monitors directly. You can also find more information in the [Rancher Documentation](TODO).
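A sketch of a ServiceMonitor targeting a hypothetical service labelled `app: myapp` (the names, namespace, port, and interval are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics    # illustrative name
  namespace: myapp       # illustrative namespace
spec:
  selector:
    matchLabels:
      app: myapp         # must match the labels on the Service
  endpoints:
  - port: metrics        # named port on the Service
    path: /metrics
    interval: 30s
```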
### Prometheus Push Gateway

There are some workloads that are traditionally hard for Prometheus to scrape. Examples are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between individually handled incoming requests, like PHP applications.

To still get metrics for these use cases, you can set up a [prometheus-pushgateway](https://github.com/prometheus/pushgateway). The CronJob or PHP application pushes metric updates to the pushgateway. The pushgateway aggregates the metrics and exposes them through an HTTP endpoint, which can then be scraped by Prometheus.
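As a sketch, a CronJob could push a completion timestamp to a pushgateway like this (the gateway address, metric name, schedule, and image are illustrative; `/metrics/job/<job-name>` is the pushgateway's push API path):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-job    # illustrative name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: curlimages/curl:7.72.0    # illustrative image
            command:
            - sh
            - -c
            # do the actual work here, then push a metric to the gateway
            - echo "nightly_job_last_success $(date +%s)" | curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-job
```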
|
||||
|
||||
### Prometheus Blackbox Monitor
|
||||
|
||||
Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.
|
||||
|
||||
# Monitoring in a (Micro)Service Architecture
|
||||
|
||||
If you have a (micro)service architecture where multiple individual workloads within your cluster communicate with each other, it is very important to have detailed metrics and traces about this traffic to understand how all these workloads communicate and where a problem or bottleneck may be.

You could instrument all your workloads to monitor this internal traffic and expose the resulting metrics to Prometheus, but this can quickly become quite labor-intensive. Service meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, do this automatically and provide rich telemetry about the traffic between all services.

# Real User Monitoring
Monitoring the availability and performance of all your internal workloads is vitally important for running stable, reliable, and fast applications. But these metrics only show part of the picture. To get a complete view, you also need to know how your end users actually perceive your applications. For this, you can look into various [real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring).

# Security Monitoring
In addition to monitoring workloads for performance, availability, or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans]({{<baseurl>}}/rancher/v2.x/en/cis-scans/v2.5/), which check whether the cluster is configured according to security best practices.

For the workloads, you can have a look at Kubernetes and container security solutions like [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/), and [Sysdig](https://sysdig.com/).

# Setting up Alerts
Getting all your metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong.

The integrated Rancher monitoring already configures a sensible set of alerts that are useful in any Kubernetes cluster. You should extend these to cover your specific workloads and use cases.

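Because the prometheus-operator runs underneath, custom alerts can be added declaratively as `PrometheusRule` resources. The following is a hedged sketch; the rule name, namespace, expression, and labels are assumptions that need to be adapted to your setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts                  # hypothetical name
  namespace: cattle-monitoring-system  # assumption: a namespace watched by the operator
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0  # no scrape target for my-app is reachable
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "my-app has been unreachable for 5 minutes"
```
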
When setting up alerts, configure them for all the workloads that are critical to the availability of your applications, but also make sure they are not too noisy. Ideally, every alert you receive should indicate a problem that needs your attention and has to be fixed. If alerts fire all the time but are not actually critical, there is a danger that you start ignoring your alerts altogether and then miss the really important ones. Less may be more here: focus on the most important metrics first, for example, alerting when your application is offline. Fix the problems that pop up, and then start creating more detailed alerts.

If an alert starts firing but there is nothing you can do about it at the moment, it's also fine to silence it for a certain amount of time so that you can look at it later.

You can find more information on how to set up alerts and notification channels in the [Rancher Documentation]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/v2.5).
@@ -0,0 +1,19 @@
---
title: Best Practices for the Rancher Server
shortTitle: Rancher Server
weight: 1
---

This guide contains our recommendations for running the Rancher server, and is intended to be used in situations in which Rancher manages downstream Kubernetes clusters.

### Recommended Architecture and Infrastructure

Refer to this [guide](./deployment-types) for our general advice for setting up the Rancher server on a high-availability Kubernetes cluster.

### Deployment Strategies

This [guide](./deployment-strategies) is designed to help you choose whether a regional deployment strategy or a hub-and-spoke deployment strategy is better for a Rancher server that manages downstream Kubernetes clusters.

### Installing Rancher in a vSphere Environment

This [guide](./rancher-in-vsphere) outlines a reference architecture for installing Rancher in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,45 @@
---
title: Rancher Deployment Strategy
weight: 100
---

There are two recommended deployment strategies for a Rancher server that manages downstream Kubernetes clusters. Each one has its own pros and cons. Read more about which one fits your use case best:

* [Hub and Spoke](#hub-spoke-strategy)
* [Regional](#regional-strategy)

# Hub & Spoke Strategy
---

In this deployment scenario, a single Rancher control plane manages Kubernetes clusters across the globe. The control plane runs on a high-availability Kubernetes cluster, and managing clusters in distant regions is subject to network latency.

{{< img "/img/rancher/bpg/hub-and-spoke.png" "Hub and Spoke Deployment">}}

### Pros

* Environments can have nodes and network connectivity across regions.
* A single control plane interface provides visibility into all regions and environments.
* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher control plane.

### Cons

* Subject to network latencies.
* If the control plane goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually.

# Regional Strategy
---

In the regional deployment model, a control plane is deployed in close proximity to the compute nodes.

{{< img "/img/rancher/bpg/regional.png" "Regional Deployment">}}

### Pros

* Rancher functionality in each region stays operational if a control plane in another region goes down.
* Network latency is greatly reduced, improving the performance of functionality in Rancher.
* Upgrades of the Rancher control plane can be done independently per region.

### Cons

* Overhead of managing multiple Rancher installations.
* Visibility across global Kubernetes clusters requires multiple interfaces/panes of glass.
* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server.
@@ -0,0 +1,39 @@
---
title: Tips for Running Rancher
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-types
---

This guide is geared toward use cases where Rancher is used to manage downstream Kubernetes clusters. The high-availability setup is intended to prevent losing access to downstream clusters if the Rancher server becomes unavailable.

A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Running multiple Rancher instances on multiple nodes provides a level of availability that cannot be achieved in a single-node environment.

If you are installing Rancher in a vSphere environment, refer to the best practices documented [here.](../rancher-in-vsphere)

When you set up your high-availability Rancher installation, consider the following:

### Run Rancher on a Separate Cluster

Don't run other workloads or microservices in the Kubernetes cluster that Rancher is installed on.

### Make Sure Nodes Are Configured Correctly for Kubernetes

It's important to follow Kubernetes and etcd best practices when deploying your nodes: disable swap, double-check that you have full network connectivity between all machines in the cluster, use unique hostnames, MAC addresses, and product_uuids for every node, make sure all required ports are open, and back etcd with SSD storage. More details can be found in the [Kubernetes docs](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin) and [etcd's performance op guide](https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/performance.md).

### When Using RKE: Back Up the Statefile

RKE keeps a record of the cluster state in a file called `cluster.rkestate`. This file is important for recovering a cluster and for continued maintenance of the cluster through RKE. Because this file contains certificate material, we strongly recommend encrypting it before backing it up. Back up the state file after each run of `rke up`.

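One way to do this is to encrypt the file symmetrically before copying it to backup storage. The sketch below uses `openssl` with a throwaway passphrase and creates a placeholder state file so the commands are runnable as shown; in practice you would encrypt the real `cluster.rkestate` and keep the passphrase in your secret-management tooling:

```shell
# Placeholder state file so this example is self-contained; `rke up`
# creates the real cluster.rkestate next to your cluster.yml.
echo '{"desiredState": {}}' > cluster.rkestate

# Encrypt before shipping to backup storage. The passphrase handling here
# is illustrative only; never hard-code a real passphrase.
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in cluster.rkestate -out cluster.rkestate.enc \
  -pass pass:example-passphrase

# Decrypting during recovery:
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in cluster.rkestate.enc -out cluster.rkestate.restored \
  -pass pass:example-passphrase
```
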
### Run All Nodes in the Cluster in the Same Datacenter

For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c.

### Development and Production Environments Should Be Similar

It's strongly recommended to have a "staging" or "pre-production" environment of the Kubernetes cluster that Rancher runs on. This environment should mirror your production environment as closely as possible in terms of software and hardware configuration.

### Monitor Your Clusters to Plan Capacity

The Rancher server's Kubernetes cluster should run within the [system and hardware requirements]({{<baseurl>}}/rancher/v2.x/en/installation/requirements/) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take.

However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types.

Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus.

After you [enable monitoring]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/) in the cluster, you can set up [a notification channel]({{<baseurl>}}/rancher/v2.x/en/cluster-admin/tools/notifiers/) and [cluster alerts]({{<baseurl>}}/rancher/v2.x/en/cluster-admin/tools/alerts/) to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
@@ -0,0 +1,91 @@
---
title: Installing Rancher in a vSphere Environment
shortTitle: On-Premises Rancher in vSphere
weight: 3
---

This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.

- [1. Load Balancer Considerations](#1-load-balancer-considerations)
- [2. VM Considerations](#2-vm-considerations)
- [3. Network Considerations](#3-network-considerations)
- [4. Storage Considerations](#4-storage-considerations)
- [5. Backups and Disaster Recovery](#5-backups-and-disaster-recovery)

<figcaption>Solution Overview</figcaption>

![Solution Overview](/img/rancher/rancher-on-prem-vsphere.svg)

# 1. Load Balancer Considerations

A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes.

### Leverage Fault Tolerance and High Availability

Use an external (hardware or software) load balancer that provides inherent high-availability functionality (F5, NSX-T, Keepalived, etc.).

### Back Up Load Balancer Configuration

In the event of a disaster recovery activity, having the load balancer configuration available will expedite the recovery process.

### Configure Health Checks

Configure the load balancer to automatically mark nodes as unavailable if they fail a health check. For example, NGINX can facilitate this with:

`max_fails=3 fail_timeout=5s`
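In context, these parameters sit on the `server` entries of an NGINX upstream block; the node addresses below are placeholders for your RKE nodes:

```nginx
upstream rancher_nodes {
    # Take a node out of rotation after 3 failed attempts,
    # and retry it again after 5 seconds.
    server 10.0.0.11:443 max_fails=3 fail_timeout=5s;
    server 10.0.0.12:443 max_fails=3 fail_timeout=5s;
    server 10.0.0.13:443 max_fails=3 fail_timeout=5s;
}
```
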
### Leverage an External Load Balancer

Avoid implementing a software load balancer within the management cluster.

### Secure Access to Rancher

Configure appropriate firewall/ACL rules to only expose access to Rancher.

# 2. VM Considerations

### Size the VMs According to the Rancher Documentation

See the [Rancher installation requirements](https://rancher.com/docs/rancher/v2.x/en/installation/requirements/).

### Leverage VM Templates to Construct the Environment

To ensure consistency across the virtual machines deployed in the environment, consider using "golden images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.

### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts

Doing so ensures node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.

### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores

Doing so ensures node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.

### Configure VMs as Appropriate for Kubernetes

It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, double-checking you have full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.

# 3. Network Considerations

### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes

Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.

### Consistent IP Addressing for VMs

Each node should have a static IP configured. If DHCP is used, each node should have a DHCP reservation to ensure that it is always allocated the same IP.

# 4. Storage Considerations

### Leverage SSD Drives for ETCD Nodes

etcd is very sensitive to write latency, so leverage SSD disks where possible.

# 5. Backups and Disaster Recovery

### Perform Regular Management Cluster Backups

Rancher stores its data in the etcd datastore of the Kubernetes cluster it resides on. As with any Kubernetes cluster, perform frequent, tested backups of this cluster.

### Back Up Rancher Cluster Node VMs

Incorporate the Rancher management node VMs into a standard VM backup policy.

@@ -10,13 +10,16 @@ This section describes the permissions required to use the rancher-cis-benchmark
The rancher-cis-benchmark is a cluster-admin only feature by default.

However, the `rancher-cis-benchmark` chart installs three default `ClusterRoles`:
However, the `rancher-cis-benchmark` chart installs these two default `ClusterRoles`:
- cis-admin
- cis-edit
- cis-view

In Rancher, only cluster owners and global administrators have `cis-admin` access by default.

Note: The `cis-edit` role added in the Rancher v2.5 setup has been removed as of Rancher v2.5.2 because it is essentially the same as `cis-admin`. If you created any ClusterRoleBindings for `cis-edit`, please update them to use the `cis-admin` ClusterRole instead.

# Cluster-Admin Access

Rancher CIS Scans is a cluster-admin only feature by default.

@@ -37,11 +40,12 @@ The rancher-cis-benchmark creates three `ClusterRoles` and adds the CIS Benchmar
| ClusterRole created by chart | Default K8s ClusterRole | Permissions given with Role |
| ------------------------------| ---------------------------| ---------------------------|
| `cis-admin` | `admin` | Ability to CRUD clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CRs |
| `cis-edit` | `edit` | Ability to CRUD clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CRs |
| `cis-view` | `view` | Ability to list (read) clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CRs |

By default, only the cluster-owner role has the ability to manage and use the `rancher-cis-benchmark` feature.

The other Rancher roles (cluster-member, project-owner, project-member) do not have default permissions to manage and use rancher-cis-benchmark resources.
The other Rancher roles (cluster-member, project-owner, project-member) do not have any default permissions to manage and use rancher-cis-benchmark resources.

But if a cluster-owner wants to delegate access to other users, they can do so by creating ClusterRoleBindings between these users and the CIS ClusterRoles manually.
But if a cluster-owner wants to delegate access to other users, they can do so by creating ClusterRoleBindings between these users and the above CIS ClusterRoles manually.
There is no automatic role aggregation supported for the `rancher-cis-benchmark` ClusterRoles.

File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 41 KiB |
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 22 KiB |