Add v2.14 preview docs (#2212)

2026-05-30 08:35:32 +00:00 · 2026-03-05 12:30:57 -08:00
parent 4a0d71b3f3
commit 2dcfa6f6b8
874 changed files with 92618 additions and 0 deletions
@@ -0,0 +1,21 @@
+---
+title: Best Practice Guides
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices"/>
+</head>
+
+The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.
+
+If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.
+
+Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.
+
+For more guidance on best practices, you can consult these resources:
+
+- [Security](../rancher-security/rancher-security.md)
+- [Rancher Blog](https://www.suse.com/c/rancherblog/)
+- [Rancher Forum](https://forums.rancher.com/)
+- [Rancher Users Slack](https://slack.rancher.io/)
+- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -0,0 +1,19 @@
+---
+title: Best Practices for Disconnected Clusters
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/disconnected-clusters"/>
+</head>
+
+Rancher supports managing clusters that may not always be online due to network disruptions, control plane availability, or because all cluster nodes are down. At the moment there are no known issues with disconnected clusters in the latest released Rancher version.
+
+While a managed cluster is disconnected from Rancher, management operations will be unavailable, and the Rancher UI will not allow navigation to the cluster. However, once the connection is reestablished, functionality is fully restored.
+
+### Best Practices for Managing Disconnected Clusters
+
+- **Cluster Availability During Rancher Upgrades**: It is recommended to have all, or at least most, managed clusters online during a Rancher upgrade. The reason is that upgrading Rancher automatically upgrades the Rancher agent software running on managed clusters. Keeping the agent and Rancher versions aligned ensures consistent functionality. Any clusters that are disconnected during the upgrade will have their agents updated as soon as they reconnect.
+
+- **Cleaning Up Disconnected Clusters**: Regularly remove clusters that will no longer reconnect to Rancher (e.g., clusters that have been decommissioned or destroyed). Keeping such clusters in the Rancher management system consumes unnecessary resources, which could impact Rancher's performance over time.
+
+- **Certificate Rotation Considerations**: When designing processes that involve regularly shutting down clusters, whether connected to Rancher or not, take certificate rotation policies into account. For example, RKE2/K3s clusters may rotate certificates on startup if the certificates have exceeded their lifetime or are close to expiring.
@@ -0,0 +1,90 @@
+---
+title: Logging Best Practices
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/logging-best-practices"/>
+</head>
+
+In this guide, we recommend best practices for cluster-level logging and application logging.
+
+- [Cluster-level Logging](#cluster-level-logging)
+- [Application Logging](#application-logging)
+- [General Best Practices](#general-best-practices)
+
+Before Rancher v2.5, logging in Rancher was a fairly static integration. There was a fixed list of aggregators to choose from (Elasticsearch, Splunk, Kafka, Fluentd, and Syslog), and only two configuration points (cluster-level and project-level).
+
+Rancher provides a flexible experience for log aggregation. With the logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.
+
+"Under the hood", Rancher logging uses the [Logging operator](https://github.com/kube-logging/logging-operator). We provide manageability of this operator (and its resources), and tie that experience in with managing your Rancher clusters.
+
+## Cluster-level Logging
+
+### Cluster-wide Scraping
+
+For some users, it is desirable to scrape logs from every container running in the cluster. This usually coincides with your security team's request (or requirement) to collect all logs from all points of execution.
+
+In this scenario, it is recommended to create at least two _ClusterOutput_ objects - one for your security team (if you have that requirement), and one for yourselves, the cluster administrators. When creating these objects take care to choose an output endpoint that can handle the significant log traffic coming from the entire cluster. Also make sure to choose an appropriate index to receive all these logs.
+
+Once you have created these _ClusterOutput_ objects, create a _ClusterFlow_ to collect all the logs. Do not define any _Include_ or _Exclude_ rules on this flow. This will ensure that all logs from across the cluster are collected. If you have two _ClusterOutputs_, make sure to send logs to both of them.
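+
+The following is a minimal sketch of such a setup, not a definitive configuration. It assumes that Rancher logging runs in the `cattle-logging-system` namespace and that the logs go to an Elasticsearch endpoint. The object names, host, and index name are hypothetical placeholders.
+
+```yaml
+# Sketch only: names, host, and index are hypothetical placeholders.
+apiVersion: logging.banzaicloud.io/v1beta1
+kind: ClusterOutput
+metadata:
+  name: all-logs-admins
+  namespace: cattle-logging-system
+spec:
+  elasticsearch:
+    host: elasticsearch.example.com
+    port: 9200
+    scheme: https
+    index_name: cluster-all-logs
+---
+apiVersion: logging.banzaicloud.io/v1beta1
+kind: ClusterFlow
+metadata:
+  name: all-logs
+  namespace: cattle-logging-system
+spec:
+  # No match rules are defined, so no logs are filtered out.
+  globalOutputRefs:
+    - all-logs-admins
+    - all-logs-security   # a second ClusterOutput, defined in the same way
+```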
+
+### Kubernetes Components
+
+_ClusterFlows_ have the ability to collect logs from all containers on all hosts in the Kubernetes cluster. This works well in cases where those containers are part of a Kubernetes pod.
+
+A future release of Rancher will include the source container name which will enable filtering of these component logs. Once that change is made, you will be able to customize a _ClusterFlow_ to retrieve **only** the Kubernetes component logs, and direct them to an appropriate output.
+
+## Application Logging
+
+Best practice not only in Kubernetes but in all container-based applications is to direct application logs to `stdout`/`stderr`. The container runtime will then trap these logs and do **something** with them - typically writing them to a file. Depending on the container runtime (and its configuration), these logs can end up in any number of locations.
+
+In the case of writing the logs to a file, Kubernetes helps by creating a `/var/log/containers` directory on each host. This directory symlinks the log files to their actual destination (which can differ based on configuration or container runtime).
+
+Rancher logging will read all log entries in `/var/log/containers`, ensuring that all log entries from all containers (assuming a default configuration) will have the opportunity to be collected and processed.
+
+### Specific Log Files
+
+Log collection only retrieves `stdout`/`stderr` logs from pods in Kubernetes. But what if we want to collect logs from other files that are generated by applications? Here, a log streaming sidecar (or two) may come in handy.
+
+The goal of setting up a streaming sidecar is to take log files that are written to disk, and have their contents streamed to `stdout`. This way, the Logging Operator can pick up those logs and send them to your desired output.
+
+To set this up, edit your workload resource (e.g. Deployment) and add the following sidecar definition:
+
+```yaml
+...
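+containers:
+# Sidecar container that streams a log file to stdout
+- args:
+  - -F   # keep following the file if it is rotated or re-created
+  - /path/to/your/log/file.log
+  command:
+  - tail
+  image: busybox
+  name: stream-log-file-[name]   # replace [name] with an identifier for the log file
+  volumeMounts:
+  # The same volume must also be mounted in the container that writes the log file
+  - mountPath: /path/to/your/log
+    name: mounted-log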
+...
+```
+
+This will add a container to your workload definition that will now stream the contents of (in this example) `/path/to/your/log/file.log` to `stdout`.
+
+This log stream is then automatically collected according to any _Flows_ or _ClusterFlows_ you have set up. You may also wish to consider creating a _Flow_ specifically for this log file by targeting the name of the container. See example:
+
+```yaml
+...
+spec:
+  match:
+  - select:
+      container_names:
+      - stream-log-file-name
+...
+```
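+
+For the sidecar to read the file, the application container and the sidecar need to mount the same volume. The sketch below shows one way to wire this up with an `emptyDir` volume. The Deployment name, labels, and application image are hypothetical, and the log path is the same placeholder used above.
+
+```yaml
+# Sketch only: the application image and all names are hypothetical.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: myapp
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: myapp
+  template:
+    metadata:
+      labels:
+        app: myapp
+    spec:
+      containers:
+      # Application container that writes its log file to the shared volume
+      - name: myapp
+        image: registry.example.com/myapp:1.0
+        volumeMounts:
+        - name: mounted-log
+          mountPath: /path/to/your/log
+      # Sidecar that streams the log file to stdout
+      - name: stream-log-file-name
+        image: busybox
+        command: ["tail"]
+        args: ["-F", "/path/to/your/log/file.log"]
+        volumeMounts:
+        - name: mounted-log
+          mountPath: /path/to/your/log
+          readOnly: true
+      volumes:
+      # Shared between both containers; removed when the pod is deleted
+      - name: mounted-log
+        emptyDir: {}
+```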
+
+
+## General Best Practices
+
+- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier as there are already parsers written for these formats.
+- Try to provide the name of the application that is creating the log entry, in the entry itself. This can make troubleshooting easier as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98` which does not provide much information about the application running inside the container.
+- Except in the case of collecting all logs cluster-wide, try to scope your _Flow_ and _ClusterFlow_ objects tightly. This makes it easier to troubleshoot when problems arise, and also helps ensure unrelated log entries do not show up in your aggregator. An example of tight scoping would be to constrain a _Flow_ to a single _Deployment_ in a namespace, or perhaps even a single container within a _Pod_.
+- Keep the log verbosity down except when troubleshooting. High log verbosity poses a number of issues, chief among them being **noise**: significant events can be drowned out in a sea of `DEBUG` messages. This is somewhat mitigated with automated alerting and scripting, but highly verbose logging still places an inordinate amount of stress on the logging infrastructure.
+- Where possible, try to provide a transaction or request ID with the log entry. This can make tracing application activity across multiple log sources easier, especially when dealing with distributed applications.
@@ -0,0 +1,115 @@
+---
+title: Monitoring Best Practices
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/monitoring-best-practices"/>
+</head>
+
+Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. This is no different when using Kubernetes and Rancher. Fortunately, the integrated monitoring and alerting functionality makes this whole process a lot easier.
+
+The [Rancher monitoring documentation](../../../integrations-in-rancher/monitoring-and-alerting/monitoring-and-alerting.md) describes how you can set up a complete Prometheus and Grafana stack. Out of the box this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your own specific use cases and cluster sizes. This document aims to give you best practices for this.
+
+## What to Monitor
+
+Kubernetes itself, as well as applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://sre.google/sre-book/table-of-contents/), especially the chapter about [Monitoring distributed systems](https://sre.google/sre-book/monitoring-distributed-systems/).
+
+## Configuring Prometheus Resource Usage
+
+When installing the integrated monitoring stack, Rancher lets you configure several settings that depend on the size of your cluster and the workloads running in it. This section covers these in more detail.
+
+### Storage and Data Retention
+
+The amount of storage needed for Prometheus directly correlates to the number of time series and labels that you store and the data retention you have configured. Note that Prometheus is not meant to be used as long-term metrics storage. Data retention time is usually only a couple of days, not weeks or months. The reason for this is that Prometheus does not perform any aggregation on its stored metrics. This is great because aggregation can dilute data, but it also means that the needed storage grows linearly over time if no retention limit is set.
+
+One way to calculate the necessary storage is to look at the average number of bytes per sample stored in Prometheus chunks with this query:
+
+```
+rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
+```
+
+Next, find out your data ingestion rate in samples per second:
+
+```
+rate(prometheus_tsdb_head_samples_appended_total[1h])
+```
+
+Then multiply these two values with the retention time in seconds, adding about 10 percent as a buffer:
+
+```
+average bytes per sample * ingestion rate in samples per second * retention time in seconds * 1.1 = necessary storage in bytes
+```
+
+You can find more information about how to calculate the necessary storage in this [blog post](https://www.robustperception.io/how-much-disk-space-do-prometheus-blocks-use).
+
+You can read more about the Prometheus storage concept in the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/storage).
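+
+As a worked example with hypothetical numbers: if the first query returns 1.5 bytes per sample, the second returns 100,000 samples per second, and you want to keep 10 days (864,000 seconds) of data, the estimate is 1.5 * 100,000 * 864,000 * 1.1 = 142,560,000,000 bytes, or roughly 143 GB. The sketch below shows how such an estimate could translate into Helm values. It assumes that the monitoring chart exposes the `prometheus.prometheusSpec` values of the upstream kube-prometheus-stack chart, so check the values of your installed chart version before relying on it.
+
+```yaml
+# Sketch only: retention and volume size follow the hypothetical estimate above.
+prometheus:
+  prometheusSpec:
+    retention: 10d
+    storageSpec:
+      volumeClaimTemplate:
+        spec:
+          accessModes: ["ReadWriteOnce"]
+          resources:
+            requests:
+              storage: 150Gi   # estimate of ~143 GB, rounded up
+```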
+
+### CPU and Memory Requests and Limits
+
+In larger Kubernetes clusters, Prometheus can consume quite a bit of memory. The amount of memory Prometheus needs directly correlates to the number of time series and labels it stores and the scrape interval at which these are filled.
+
+You can find more information about how to calculate the necessary memory in this [blog post](https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion).
+
+The amount of CPU needed correlates with the number of queries you are performing.
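+
+Requests and limits for Prometheus can then be set through the chart values. The snippet below is a sketch under the same assumption that the chart exposes the upstream `prometheus.prometheusSpec` values. The numbers are hypothetical and should be replaced with figures derived from your own cardinality and query load.
+
+```yaml
+# Sketch only: all resource values are hypothetical.
+prometheus:
+  prometheusSpec:
+    resources:
+      requests:
+        cpu: 750m
+        memory: 3Gi
+      limits:
+        cpu: "1"
+        memory: 4Gi
+```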
+
+### Federation and Long-term Storage
+
+Prometheus is not meant to store metrics for a long amount of time, but should only be used for short term storage.
+
+To store some or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).
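+
+A remote write target could be configured roughly as follows. This is a sketch that assumes the chart exposes the upstream `prometheus.prometheusSpec.remoteWrite` value. The endpoint URL and the metric name pattern are hypothetical.
+
+```yaml
+# Sketch only: URL and metric name pattern are hypothetical.
+prometheus:
+  prometheusSpec:
+    remoteWrite:
+    - url: https://metrics-store.example.com/api/v1/receive
+      writeRelabelConfigs:
+      # Forward only selected series to long-term storage
+      - sourceLabels: [__name__]
+        regex: "myapp_.*|up"
+        action: keep
+```
+
+Filtering with `writeRelabelConfigs` keeps the long-term store small while Prometheus itself still scrapes and retains everything for its short retention window.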
+
+## Scraping Custom Workloads
+
+While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. To do that, configure Prometheus to send an HTTP request to an endpoint of your application at a regular interval. The endpoint should return its metrics in the Prometheus format.
+
+In general, you want to scrape data from all the workloads running in your cluster so that you can use them for alerts or debugging issues. Often, you only realize that you need certain data during an incident, and it helps if that data has already been scraped and stored. Since Prometheus is only meant to be a short-term metrics storage, scraping and keeping lots of data is usually not that expensive. If you are using a long-term storage solution with Prometheus, you can still decide which data you actually persist and keep there.
+
+### About Prometheus Exporters
+
+Many third-party workloads, such as databases, queues, and web servers, already support exposing metrics in a Prometheus format, or offer exporters that translate between the tool's metrics and a format that Prometheus understands. You can usually add these exporters as additional sidecar containers to the workload's Pods. Many Helm charts already include options to deploy the correct exporter. You can find a curated list of exporters on [ExporterHub](https://exporterhub.io/).
+
+### Prometheus support in Programming Languages and Frameworks
+
+To get your own custom application metrics into Prometheus, you have to collect and expose these metrics directly from your application's code. Fortunately, there are already libraries and integrations available to help with this for most popular programming languages and frameworks. One example for this is the Prometheus support in the [Spring Framework](https://docs.spring.io/spring-metrics/docs/current/public/prometheus).
+
+### ServiceMonitors and PodMonitors
+
+Once all of your workloads expose metrics in a Prometheus format, you must configure Prometheus to scrape them. Under the hood, Rancher uses the [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator). This makes it easy to add scraping targets with ServiceMonitors and PodMonitors. Many Helm charts already include an option to create these monitors directly. You can also find more information in the Rancher documentation.
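+
+As an illustration, a ServiceMonitor for a workload might look like the sketch below. It assumes a Service labeled `app: myapp` with a named port `metrics` that serves `/metrics`. All of these names are hypothetical.
+
+```yaml
+# Sketch only: namespace, labels, and port name are hypothetical.
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: myapp
+  namespace: myapp
+spec:
+  # Selects Services (not Pods) carrying this label
+  selector:
+    matchLabels:
+      app: myapp
+  endpoints:
+  - port: metrics     # name of the Service port, not its number
+    path: /metrics
+    interval: 30s
+```
+
+Depending on how Prometheus is configured, the ServiceMonitor may need specific labels or may need to live in a namespace that Prometheus is set up to watch.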
+
+### Prometheus Push Gateway
+
+Some workloads are traditionally hard for Prometheus to scrape. Examples are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between individually handled incoming requests, like PHP applications.
+
+To still get metrics for these use cases, you can set up [prometheus-pushgateways](https://github.com/prometheus/pushgateway). The CronJob or PHP application pushes metric updates to the pushgateway. The pushgateway caches the pushed metrics and exposes them through an HTTP endpoint, which can then be scraped by Prometheus.
+
+### Prometheus Blackbox Monitor
+
+Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.
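+
+If the blackbox exporter is already deployed, the prometheus-operator `Probe` resource is one way to define such checks. The sketch below assumes a blackbox exporter reachable at `blackbox-exporter.monitoring.svc:9115` with an `http_2xx` module configured. The exporter address, namespace, and target URL are hypothetical.
+
+```yaml
+# Sketch only: prober address, namespace, and target URL are hypothetical.
+apiVersion: monitoring.coreos.com/v1
+kind: Probe
+metadata:
+  name: myapp-external
+  namespace: myapp
+spec:
+  interval: 60s
+  module: http_2xx            # must exist in the blackbox exporter configuration
+  prober:
+    url: blackbox-exporter.monitoring.svc:9115
+  targets:
+    staticConfig:
+      static:
+      - https://myapp.example.com/healthz
+```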
+
+## Monitoring in a (Micro)Service Architecture
+
+If you have a (micro)service architecture where multiple workloads within your cluster communicate with each other, detailed metrics and traces about this traffic are important for understanding how the workloads interact and where a problem or bottleneck may be.
+
+Of course you can monitor all this internal traffic in all your workloads and expose these metrics to Prometheus. But this can quickly become quite work intensive. Service Meshes like Istio, which can be installed with [a click](../../../integrations-in-rancher/istio/istio.md) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services.
+
+## Real User Monitoring
+
+Monitoring the availability and performance of all your internal workloads is vitally important to run stable, reliable and fast applications. But these metrics only show you part of the picture. To get a complete view, it is also necessary to know how your end users actually perceive your applications. For this you can look into various [Real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring).
+
+## Security Monitoring
+
+In addition to monitoring workloads to detect performance, availability or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [Compliance Scans](../../../how-to-guides/advanced-user-guides/compliance-scan-guides/compliance-scan-guides.md) which check if the cluster is configured according to security best practices.
+
+For the workloads, you can have a look at Kubernetes and container security solutions like [NeuVector](https://www.suse.com/products/neuvector/), [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/products/kubernetes-security/), and [Sysdig](https://sysdig.com/).
+
+## Setting up Alerts
+
+Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong.
+
+The integrated Rancher monitoring already configures a sensible set of alerts that apply to any Kubernetes cluster. You should extend these to cover your specific workloads and use cases.
+
+When setting up alerts, configure them for all the workloads that are critical to the availability of your applications. But also make sure that they are not too noisy. Ideally, every alert you receive should point to a problem that needs your attention and needs to be fixed. If you have alerts that fire all the time but are not that critical, there is a danger that you start ignoring your alerts altogether and then miss the really important ones. Less may be more here. Focus on the most important metrics first, for example by alerting if your application is offline. Fix all the problems that start to pop up and then create more detailed alerts.
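+
+A first "application is offline" alert could be sketched as a PrometheusRule like the one below. It assumes the application is scraped under a job named `myapp`. The job name, namespace, threshold, and severity label are hypothetical.
+
+```yaml
+# Sketch only: job name, namespace, and severity are hypothetical.
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: myapp-availability
+  namespace: myapp
+spec:
+  groups:
+  - name: myapp.rules
+    rules:
+    - alert: MyAppDown
+      # up is 0 when the most recent scrape of a target failed
+      expr: up{job="myapp"} == 0
+      # Require the condition to hold for 5 minutes to reduce noise
+      for: 5m
+      labels:
+        severity: critical
+      annotations:
+        summary: "myapp target {{ $labels.instance }} has been down for 5 minutes"
+```
+
+The `for` clause is a simple way to keep an alert from firing on a single failed scrape. Note that this expression only fires while the target is still known to Prometheus.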
+
+If an alert starts firing, but there is nothing you can do about it at the moment, it's also fine to silence the alert for a certain amount of time, so that you can look at it later.
+
+You can find more information on how to set up alerts and notification channels in the [Rancher Documentation](../../../integrations-in-rancher/monitoring-and-alerting/monitoring-and-alerting.md).
@@ -0,0 +1,62 @@
+---
+title: Best Practices for Rancher Managed VMware vSphere Clusters
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/rancher-managed-clusters-in-vsphere"/>
+</head>
+
+This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
+
+- [1. VM Considerations](#1-vm-considerations)
+- [2. Network Considerations](#2-network-considerations)
+- [3. Storage Considerations](#3-storage-considerations)
+- [4. Backups and Disaster Recovery](#4-backups-and-disaster-recovery)
+
+<figcaption>Solution Overview</figcaption>
+
+![Solution Overview](/img/solution_overview.drawio.svg)
+
+## 1. VM Considerations
+
+### Leverage VM Templates to Construct the Environment
+
+To keep the Virtual Machines deployed across the environment consistent, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customization options.
+
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts
+
+Doing so will ensure node VMs are spread across multiple ESXi hosts - preventing a single point of failure at the host level.
+
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores
+
+Doing so will ensure node VMs are spread across multiple datastores - preventing a single point of failure at the datastore level.
+
+### Configure VMs as Appropriate for Kubernetes
+
+It’s important to follow Kubernetes and etcd best practices when deploying your nodes. These include disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
+
+## 2. Network Considerations
+
+### Leverage Low Latency, High Bandwidth Connectivity Between etcd Nodes
+
+Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
+
+### Consistent IP Addressing for VMs
+
+Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
+
+## 3. Storage Considerations
+
+### Leverage SSD Drives for etcd Nodes
+
+etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.
+
+## 4. Backups and Disaster Recovery
+
+### Perform Regular Downstream Cluster Backups
+
+Kubernetes uses etcd to store all its data, including configuration, state, and metadata. Backing this up is crucial for disaster recovery.
+
+### Back up Downstream Node VMs
+
+Incorporate the Rancher downstream node VMs within a standard VM backup policy.
@@ -0,0 +1,27 @@
+---
+title: Best Practices for Rancher Managed Clusters
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters"/>
+</head>
+
+### Logging
+
+Refer to [this guide](logging-best-practices.md) for our recommendations for cluster-level logging and application logging.
+
+### Monitoring
+
+Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. Refer to this [guide](monitoring-best-practices.md) for our recommendations.
+
+### Disconnected clusters
+
+Rancher supports managing clusters that may not always be online due to network disruptions, control plane availability, or because all cluster nodes are down. Refer to this [guide](disconnected-clusters.md) for our recommendations.
+
+### Tips for Setting Up Containers
+
+Running well-built containers can greatly impact the overall performance and security of your environment. Refer to this [guide](tips-to-set-up-containers.md) for tips.
+
+### Best Practices for Rancher Managed VMware vSphere Clusters
+
+This [guide](rancher-managed-clusters-in-vsphere.md) outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,56 @@
+---
+title: Tips for Setting Up Containers
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/tips-to-set-up-containers"/>
+</head>
+
+Running well-built containers can greatly impact the overall performance and security of your environment.
+
+Below are a few tips for setting up your containers.
+
+For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security.](https://rancher.com/complete-guide-container-security)
+
+### Use a Common Container OS
+
+When possible, you should try to standardize on a common container base OS.
+
+Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface.
+
+Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality.
+
+### Start with a FROM scratch container
+If your microservice is a standalone static binary, you should use a FROM scratch container.
+
+The FROM scratch container is an [official Docker image](https://hub.docker.com/_/scratch) that is empty so that you can use it to design minimal images.
+
+This will have the smallest attack surface and smallest image size.
+
+### Run Container Processes as Unprivileged
+When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. Inadvertent or accidental host mounts can also be impacted if the container is running as root. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
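+
+A pod that follows this advice could look like the sketch below. The pod name, image, and user and group IDs are hypothetical, and the image must be able to run as a non-root user for this to work.
+
+```yaml
+# Sketch only: name, image, and IDs are hypothetical.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: myapp
+spec:
+  securityContext:
+    runAsNonRoot: true      # refuse to start containers that would run as UID 0
+    runAsUser: 10001
+    runAsGroup: 10001
+  containers:
+  - name: myapp
+    image: registry.example.com/myapp:1.0
+    securityContext:
+      allowPrivilegeEscalation: false
+      readOnlyRootFilesystem: true
+      capabilities:
+        drop: ["ALL"]
+```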
+
+### Define Resource Limits
+Apply CPU and memory limits to your pods. This can help manage the resources on your worker nodes and prevent a malfunctioning microservice from impacting other microservices.
+
+In standard Kubernetes, you can set resource limits on the namespace level. In Rancher, you can set resource limits on the project level and they will propagate to all the namespaces within the project. For details, refer to the Rancher docs.
+
+When setting resource quotas, if you set anything related to CPU or Memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or Memory field set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace.
+
+The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the namespace level.
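+
+In plain Kubernetes, namespace-level defaults of this kind are expressed with a LimitRange. The following sketch uses a hypothetical namespace and hypothetical values. Containers created in the namespace without their own `resources` settings would receive these defaults.
+
+```yaml
+# Sketch only: namespace and values are hypothetical.
+apiVersion: v1
+kind: LimitRange
+metadata:
+  name: container-defaults
+  namespace: myapp
+spec:
+  limits:
+  - type: Container
+    # Applied as limits when a container does not set its own
+    default:
+      cpu: 500m
+      memory: 256Mi
+    # Applied as requests when a container does not set its own
+    defaultRequest:
+      cpu: 100m
+      memory: 128Mi
+```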
+
+### Define Resource Requirements
+You should apply CPU and memory requirements to your pods. This is crucial for informing the scheduler which type of compute node your pod needs to be placed on, and ensuring it does not over-provision that node. In Kubernetes, you can set a resource requirement by defining `resources.requests` in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
+
+:::note
+
+If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.
+
+:::
+
+It is recommended to define resource requirements on the container level because otherwise, the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.
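+
+At the container level, requests and limits sit side by side in the container spec. The sketch below uses a hypothetical pod name, image, and values.
+
+```yaml
+# Sketch only: name, image, and values are hypothetical.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: myapp
+spec:
+  containers:
+  - name: myapp
+    image: registry.example.com/myapp:1.0
+    resources:
+      # Used by the scheduler to pick a node with enough free capacity
+      requests:
+        cpu: 250m
+        memory: 128Mi
+      # Enforced at runtime: CPU is throttled, exceeding memory gets the container killed
+      limits:
+        cpu: 500m
+        memory: 256Mi
+```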
+
+### Liveness and Readiness Probes
+Set up liveness and readiness probes for your container. Unless your container crashes completely, Kubernetes cannot tell that it is unhealthy without an endpoint or mechanism that reports container status. Alternatively, make sure your container halts and crashes if unhealthy.
+
+The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
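+
+As a sketch, HTTP-based probes could be declared as follows. This assumes the application serves health endpoints at `/healthz` and `/ready` on port 8080. The paths, port, image, and timings are hypothetical.
+
+```yaml
+# Sketch only: paths, port, image, and timings are hypothetical.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: myapp
+spec:
+  containers:
+  - name: myapp
+    image: registry.example.com/myapp:1.0
+    ports:
+    - containerPort: 8080
+    # Restart the container if this check keeps failing
+    livenessProbe:
+      httpGet:
+        path: /healthz
+        port: 8080
+      initialDelaySeconds: 10
+      periodSeconds: 10
+      failureThreshold: 3
+    # Remove the pod from Service endpoints while this check fails
+    readinessProbe:
+      httpGet:
+        path: /ready
+        port: 8080
+      periodSeconds: 5
+```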
@@ -0,0 +1,88 @@
+---
+title: Installing Rancher in a VMware vSphere Environment
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/on-premises-rancher-in-vsphere"/>
+</head>
+
+This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a VMware vSphere environment. It also describes standard vSphere best practices as documented by VMware.
+
+
+<figcaption>Solution Overview</figcaption>
+
+![Solution Overview](/img/rancher-on-prem-vsphere.svg)
+
+## 1. Load Balancer Considerations
+
+A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes.
+
+### Leverage Fault Tolerance and High Availability
+
+Use an external (hardware or software) load balancer that has inherent high-availability functionality (for example, F5, NSX-T, or Keepalived).
+
+### Back Up Load Balancer Configuration
+
+In the event of a disaster recovery activity, having the load balancer configuration available will expedite the recovery process.
+
+### Configure Health Checks
+
+Configure the load balancer to automatically mark nodes as unavailable if a health check fails. For example, NGINX supports this through the `max_fails` and `fail_timeout` parameters of the `server` directive in an `upstream` block:
+
+`max_fails=3 fail_timeout=5s`
+
+### Leverage an External Load Balancer
+
+Avoid implementing a software load balancer within the management cluster.
+
+### Secure Access to Rancher
+
+Configure appropriate firewall / ACL rules so that only access to Rancher is exposed.
+
+## 2. VM Considerations
+
+### Size the VMs According to Rancher Documentation
+
+See [Installation Requirements](../../../getting-started/installation-and-upgrade/installation-requirements/installation-requirements.md).
+
+### Leverage VM Templates to Construct the Environment
+
+To keep virtual machines consistent across the environment, consider using "golden images" in the form of VM templates. Packer can be used to build these templates, adding greater customization options.
+
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts
+
+Doing so ensures that node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.
+
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores
+
+Doing so ensures that node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.
+
+### Configure VMs as Appropriate for Kubernetes
+
+It’s important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
+
+## 3. Network Considerations
+
+### Leverage Low Latency, High Bandwidth Connectivity Between etcd Nodes
+
+Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1 Gbps connections will suffice. For large clusters, 10 Gbps connections can reduce the time taken to restore from backup.
+
+### Consistent IP Addressing for VMs
+
+Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
+
+## 4. Storage Considerations
+
+### Leverage SSD Drives for etcd Nodes
+
+Because etcd is very sensitive to write latency, use SSD-backed disks where possible.
+
+## 5. Backups and Disaster Recovery
+
+### Perform Regular Management Cluster Backups
+
+Rancher stores its data in the etcd datastore of the Kubernetes cluster it resides on. As with any Kubernetes cluster, perform frequent, tested backups of this cluster.
+
+### Back up Rancher Cluster Node VMs
+
+Incorporate the Rancher management node VMs in a standard VM backup policy.
@@ -0,0 +1,40 @@
+---
+title: Rancher Deployment Strategy
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/rancher-deployment-strategy"/>
+</head>
+
+There are two recommended deployment strategies for a Rancher instance that manages downstream Kubernetes clusters. Each has its own pros and cons. Review both to decide which one fits your use case best.
+
+## Hub & Spoke Strategy
+---
+
+In this deployment scenario, a single Rancher instance manages Kubernetes clusters across the globe. The Rancher instance runs on a high-availability Kubernetes cluster, and operations are affected by network latency to distant clusters.
+
+### Pros
+
+* Single control plane interface to view all regions and environments.
+* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher instance.
+
+### Cons
+
+* Subject to network latencies.
+* If Rancher goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually.
+
+## Regional Strategy
+---
+In the regional deployment model, a Rancher instance is deployed in close proximity to the downstream Kubernetes clusters it manages.
+
+### Pros
+
+* Rancher functionality in a region stays operational if a Rancher instance in another region goes down.
+* Network latency between Rancher and downstream clusters is greatly reduced, improving the performance of functionality in Rancher.
+* Upgrades of Rancher can be done independently per region.
+
+### Cons
+
+* Overhead of managing multiple Rancher installations.
+* Visibility into Kubernetes clusters in different regions requires multiple interfaces/panes of glass.
+* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server.
@@ -0,0 +1,21 @@
+---
+title: Best Practices for the Rancher Server
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server"/>
+</head>
+
+This guide contains our recommendations for running the Rancher server, and is intended to be used in situations in which Rancher manages downstream Kubernetes clusters.
+
+### Recommended Architecture and Infrastructure
+
+Refer to this [guide](tips-for-running-rancher.md) for our general advice for setting up the Rancher server for a production installation.
+
+### Deployment Strategies
+
+This [guide](rancher-deployment-strategy.md) is designed to help you choose whether a regional deployment strategy or a hub-and-spoke deployment strategy is better for a Rancher server that manages downstream Kubernetes clusters.
+
+### Installing Rancher in a VMware vSphere Environment
+
+This [guide](on-premises-rancher-in-vsphere.md) outlines a reference architecture for installing Rancher in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,70 @@
+---
+title: Tips for Running Rancher
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tips-for-running-rancher"/>
+</head>
+
+This guide is geared toward use cases where Rancher is used to manage downstream Kubernetes clusters. The high-availability setup is intended to prevent losing access to downstream clusters if the Rancher server is not available.
+
+A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single node environment.
+
+If you are installing Rancher in a vSphere environment, refer to the best practices documented [here.](on-premises-rancher-in-vsphere.md)
+
+When you set up your high-availability Rancher installation, consider the following:
+
+### Minimize Third-Party Software on the Upstream Cluster
+
+We generally recommend running Rancher on a dedicated cluster, free of other workloads, to avoid potential performance and compatibility issues.
+
+Rancher, especially when managing a growing number of clusters, nodes, and workloads, places a significant load on core Kubernetes components like `etcd` and `kube-apiserver` on the upstream cluster. Third-party software can interfere with the performance of these components and Rancher, potentially leading to instability.
+
+Furthermore, third-party software can functionally interfere with Rancher. To minimize compatibility risks, deploy only essential Kubernetes system components and Rancher on the upstream cluster.
+
+The following applications and components generally do not interfere with Rancher or the Kubernetes system:
+ * Rancher internal components, such as Fleet
+ * Rancher extensions
+ * Cluster API components
+ * CNIs, CPIs, CSIs
+ * Cloud controller managers
+ * Observability and monitoring tools (except prometheus-rancher-exporter)
+
+Note that each of these components has its own minimum resource requirements, which must be met in addition to Rancher's. For high-scale deployments, also consider dedicating separate nodes to non-Rancher software using [taints and tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) to minimize interference.
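+
+As a rough illustration of the taints-and-tolerations approach, the following pod spec fragment steers a non-Rancher workload onto dedicated nodes. It assumes those nodes carry both a label and a `NoSchedule` taint of `dedicated=non-rancher`; that key and value are hypothetical names chosen for this sketch, not Rancher conventions.
+
+```yaml
+# Pod spec fragment for a non-Rancher workload.
+# Assumes the dedicated nodes are labeled and tainted with the hypothetical
+# pair dedicated=non-rancher (taint effect: NoSchedule).
+spec:
+  nodeSelector:
+    dedicated: non-rancher    # schedule only onto the dedicated nodes
+  tolerations:
+    - key: dedicated
+      operator: Equal
+      value: non-rancher
+      effect: NoSchedule      # allow scheduling despite the taint
+```
+
+Pods without the toleration, including Rancher's own, are kept off the tainted nodes, while the node selector keeps this workload from landing on the nodes that run Rancher.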
+
+The following software can interfere with Rancher performance and is therefore discouraged on the upstream cluster:
+ * [Crossplane](https://www.crossplane.io/)
+ * [Argo CD](https://argoproj.github.io/cd/)
+ * [Flux](https://fluxcd.io/)
+ * [prometheus-rancher-exporter](https://github.com/David-VTUK/prometheus-rancher-exporter) (see [issue 33](https://github.com/David-VTUK/prometheus-rancher-exporter/issues/33))
+ * Container registries such as [Harbor](https://goharbor.io/), which can require significant bandwidth for serving images
+
+### Guidance for Container Registries
+
+Container registries, such as [Harbor](https://goharbor.io/), can consume significant network bandwidth when serving images. This demand increases with the number of images, the frequency of image pulls, and the quantity of clusters and container runtimes they serve. Due to this potential for interference with Rancher UI and API traffic, we recommend against running container registries on the same cluster as the Rancher management server.
+
+Regardless of your deployment strategy for a container registry, ensure sufficient bandwidth is available, ideally reserved using Quality of Service (QoS) mechanisms.
+
+Consider the following recommendations based on your needs:
+
+* **Simple Setups (HA Not a Primary Concern):** A container registry deployed as a single Virtual Machine (VM) can be a viable solution.
+* **High Availability (HA) Requirements:** We recommend running the registry in a dedicated Kubernetes cluster. All other clusters should then be configured to pull images from this centralized, HA registry.
+* **Very Large-Scale or Complex Network Topologies:** Multiple registry clusters might be necessary. These can be deployed in a hierarchical or federated model to efficiently distribute images and manage traffic.
+
+### Make sure nodes are configured correctly for Kubernetes
+It's important to follow Kubernetes and etcd best practices when deploying your nodes. These include disabling swap, verifying full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node, checking that all required ports are open, and deploying etcd on SSD-backed storage. More details can be found in the [Kubernetes docs](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin) and [etcd's performance op guide](https://etcd.io/docs/v3.5/op-guide/performance/).
+
+### Run All Nodes in the Cluster in the Same Datacenter
+For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c.
+
+### Development and Production Environments Should be Similar
+It's strongly recommended to have a "staging" or "pre-production" environment of the Kubernetes cluster that Rancher runs on. This environment should mirror your production environment as closely as possible in terms of software and hardware configuration.
+
+### Monitor Your Clusters to Plan Capacity
+The Rancher server's Kubernetes cluster should run within the [system and hardware requirements](../../../getting-started/installation-and-upgrade/installation-requirements/installation-requirements.md) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take.
+
+However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types.
+
+Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus.
+
+After you [enable monitoring](../../../integrations-in-rancher/monitoring-and-alerting/monitoring-and-alerting.md) in the cluster, you can set up alerts to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
@@ -0,0 +1,125 @@
+---
+title: Tuning and Best Practices for Rancher at Scale
+---
+
+<head>
+  <link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tuning-and-best-practices-for-rancher-at-scale"/>
+</head>
+
+
+This guide describes best practices and tuning approaches for scaling Rancher setups, along with the challenges that come with doing so. As systems grow, performance naturally degrades, but there are steps that can minimize the load put on Rancher and optimize Rancher's ability to manage larger infrastructures.
+
+## Optimizing Rancher Performance
+
+* Keep Rancher up to date with patch releases. We are continuously improving Rancher with performance enhancements and bug fixes. The latest Rancher release contains all accumulated improvements to performance and stability, plus updates based on developer experience and user feedback.
+
+* Always scale up gradually, and monitor and observe any changes in behavior while doing so. It is usually easier to resolve performance problems as soon as they surface, before other problems obscure the root cause.
+
+* Reduce network latency between the upstream Rancher cluster and downstream clusters to the extent possible. Note that latency is, among other factors, a function of geographic distance. If you require clusters or nodes spread across the world, consider multiple Rancher installations.
+
+## Minimizing Load on the Upstream Cluster
+
+When scaling up Rancher, one typical bottleneck is resource growth in the upstream (local) Kubernetes cluster. The upstream cluster contains information for all downstream clusters. Many operations that apply to downstream clusters create new objects in the upstream cluster and require computation from handlers running in the upstream cluster.
+
+### Minimizing Third-Party Software on the Upstream Cluster
+
+The recommendations outlined in the [general Rancher recommendations](./tips-for-running-rancher.md#minimize-third-party-software-on-the-upstream-cluster) are particularly important in a high-scale context.
+
+### Managing Your Object Counts
+
+Etcd is the backing database for Kubernetes and for Rancher. The database may eventually encounter limitations to the number of a single Kubernetes resource type it can store. Exact limits vary and depend on a number of factors. However, experience indicates that performance issues frequently arise once a single resource type's object count exceeds 60,000. Often that type is `RoleBinding`.
+
+This is typical in Rancher, as many operations create new `RoleBinding` objects in the upstream cluster as a side effect.
+
+You can reduce the number of `RoleBindings` in the upstream cluster in the following ways:
+* Only add users to clusters and projects when necessary.
+* Remove clusters and projects when they are no longer needed.
+* Only use custom roles if necessary.
+* Use as few rules as possible in custom roles.
+* Consider whether adding a role to a user is redundant.
+* Consider using fewer, but more powerful, clusters.
+* Kubernetes permissions are always "additive" (allow-list) rather than "subtractive" (deny-list). Try to minimize configurations that give access to all but one aspect of a cluster, project, or namespace, as that will result in the creation of a high number of `RoleBinding` objects.
+* Experiment to see if creating new projects or clusters results in fewer `RoleBindings` for your specific use case.
+
+### Using External Authentication
+
+If you have fifty or more users, you should configure an [external authentication provider](../../../how-to-guides/new-user-guides/authentication-permissions-and-global-configuration/authentication-config/authentication-config.md). This is necessary for better performance.
+
+After you configure external authentication, make sure to assign permissions to groups instead of to individual users. This helps reduce the `RoleBinding` object count.
+
+### RoleBinding Count Estimation
+
+Predicting how many `RoleBinding` objects a given configuration will create is complicated. However, the following considerations can offer a rough estimate:
+* For a minimum estimate, use the formula `32C + U + 2UaC + 8P + 5Pa`. 
+  * `C` is the total number of clusters.
+  * `U` is the total number of users.
+  * `Ua` is the average number of users with a membership on a cluster.
+  * `P` is the total number of projects.
+  * `Pa` is the average number of users with a membership on a project.
+* The number of `RoleBindings` increases linearly with the number of clusters, projects, and users.
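+
+To make the minimum estimate concrete, here is a worked example with purely illustrative numbers: 10 clusters (`C`), 100 users (`U`), an average of 20 users per cluster (`Ua`), 50 projects (`P`), and an average of 5 users per project (`Pa`). The formula is applied exactly as written above.
+
+```latex
+32C + U + 2 U_a C + 8P + 5 P_a
+  = 32(10) + 100 + 2(20)(10) + 8(50) + 5(5)
+  = 320 + 100 + 400 + 400 + 25
+  = 1245
+```
+
+A setup of this size should therefore expect at least roughly 1,245 `RoleBinding` objects in the upstream cluster, well below the 60,000 mark at which performance issues frequently arise.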
+
+### Using New Apps Over Legacy Apps
+
+Rancher uses two Kubernetes app resources: `apps.projects.cattle.io` and `apps.catalog.cattle.io`. Legacy apps, represented by `apps.projects.cattle.io`, were introduced with the former Cluster Manager UI and are now outdated. Current apps, represented by `apps.catalog.cattle.io`, are found in the Cluster Explorer UI for their respective cluster. `apps.catalog.cattle.io` apps are preferable because their data resides in downstream clusters, which frees up resources in the upstream cluster.
+
+You should remove any remaining legacy apps that appear in the Cluster Manager UI, and replace them with apps in the Cluster Explorer UI. Create any new apps only in the Cluster Explorer UI.
+
+### Using the Authorized Cluster Endpoint (ACE)
+
+An [Authorized Cluster Endpoint](../../../reference-guides/rancher-manager-architecture/communicating-with-downstream-user-clusters.md#4-authorized-cluster-endpoint) (ACE) provides access to the Kubernetes API of Rancher-provisioned RKE2 and K3s clusters. When enabled, the ACE adds a context to kubeconfig files generated for the cluster. The context uses a direct endpoint to the cluster, thereby bypassing Rancher. This reduces load on Rancher for cases where unmediated API access is acceptable or preferable. See [Authorized Cluster Endpoint](../../../reference-guides/rancher-manager-architecture/communicating-with-downstream-user-clusters.md#4-authorized-cluster-endpoint) for more information and configuration instructions.
+
+### Reducing Event Handler Executions
+
+The bulk of Rancher's logic runs in event handlers. These event handlers run on an object whenever the object is updated, and when Rancher is started. Additionally, they run every 10 hours when Rancher syncs caches. In scaled setups, these scheduled runs carry a high performance cost because every handler runs on every applicable object. However, the scheduled handler execution can be disabled with the `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` environment variable. If you see resource usage spikes every 10 hours, this setting can help.
+
+The value for `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` can be a comma-separated list of the following options. The values refer to types of handlers and controllers (the structures that contain and run handlers). Adding a controller type to the variable stops that set of controllers from running its handlers as part of cache resyncing.
+
+* `mgmt` refers to management controllers which only run on one Rancher node.
+* `user` refers to user controllers which run for every cluster. Some of these run on the same node as management controllers, while others run in the downstream cluster. This option targets the former.
+* `scaled` refers to scaled controllers which run on every Rancher node. You should avoid setting this value, as the scaled handlers are responsible for critical functions and changes may disrupt cluster stability.
+
+In short, if you notice CPU usage peaks every 10 hours, add the `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` environment variable to your Rancher deployment (in the container's `env` list, under `spec.template.spec.containers`) with the value `mgmt,user`.
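+
+The setting above can be sketched as the following Deployment fragment. The container name `rancher` assumes a default Helm-based installation and may differ in your environment; only the variable name and value come from this guide.
+
+```yaml
+# Fragment of the Rancher Deployment manifest.
+# The container name assumes a default Helm install.
+spec:
+  template:
+    spec:
+      containers:
+        - name: rancher
+          env:
+            # Skip the scheduled resync for management and user controllers.
+            # "scaled" is deliberately left out, as advised above.
+            - name: CATTLE_SYNC_ONLY_CHANGED_OBJECTS
+              value: "mgmt,user"
+```
+
+If Rancher was installed with Helm, prefer setting the variable through the chart's values so that it persists across upgrades. Edits made directly to the Deployment can be overwritten on the next upgrade.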
+
+## Optimizations Outside of Rancher
+
+The underlying cluster's own performance and configuration are important factors. A misconfigured upstream cluster can introduce a bottleneck that Rancher itself cannot resolve.
+
+### Manage Upstream Cluster Nodes Directly with RKE2
+
+As Rancher can be very demanding on the upstream cluster, especially at scale, you should have full administrative control of the cluster's configuration and nodes. To identify the root cause of excess resource consumption, use standard Linux troubleshooting techniques and tools. This helps distinguish whether Rancher, Kubernetes, or operating system components are causing issues.
+
+Although managed Kubernetes services make it easier to deploy and run Kubernetes clusters, they are discouraged for the upstream cluster in high-scale scenarios. Managed Kubernetes services typically limit access to configuration and insights on individual nodes and services.
+
+Use RKE2 for large-scale use cases.
+
+### Keep All Upstream Cluster Nodes Co-located
+
+To provide high availability, Kubernetes is designed to run nodes and control components in different zones. However, if nodes and control plane components are located in different zones, network traffic may be slower.
+
+Traffic between Rancher components and the Kubernetes API is especially sensitive to network latency, as is etcd traffic between nodes.
+
+To improve performance, run all upstream cluster nodes in the same location. In particular, make sure that latency between etcd nodes and Rancher is as low as possible.
+
+### Keeping Kubernetes Versions Up to Date
+
+You should keep the local Kubernetes cluster up to date. This will ensure that your cluster has all available performance enhancements and bug fixes.
+
+### Optimizing etcd
+
+Etcd is the backend database for Kubernetes and for Rancher. It plays a very important role in Rancher performance.
+
+The two main bottlenecks to [etcd performance](https://etcd.io/docs/v3.5/op-guide/performance/) are disk and network speed. Etcd should run on dedicated nodes with a fast network setup and with SSDs that have high input/output operations per second (IOPS). For more information regarding etcd performance, see [Slow etcd performance (performance testing and optimization)](https://www.suse.com/support/kb/doc/?id=000020100) and [Tuning etcd for Large Installations](../../../how-to-guides/advanced-user-guides/tune-etcd-for-large-installs.md). Information on disks can also be found in the [Installation Requirements](../../../getting-started/installation-and-upgrade/installation-requirements/installation-requirements.md#disks).
+
+It's best to run etcd on exactly three nodes, as adding more nodes will reduce operation speed. This may be counter-intuitive to common scaling approaches, but it's due to etcd's [replication mechanisms](https://etcd.io/docs/v3.5/faq/#what-is-maximum-cluster-size).
+
+Network latency between nodes also degrades etcd performance, because it slows down communication between etcd members. Etcd nodes should be located together with Rancher nodes.
+
+### Browser Requirements
+
+At high scale, Rancher transfers more data from the upstream cluster to UI components running in the browser, and those components also need to perform more processing.
+
+For best performance, ensure that the host running the browser meets these requirements:
+ - Intel Core i5, 10th generation (2020, 4 cores), or equivalent
+ - 8 GB RAM
+ - Total network bandwidth to the upstream cluster: 72 Mb/s (equivalent to a single 802.11n Wi-Fi 4 link stream, ~8 MB/s HTTP download throughput)
+ - Round-trip time (ping time) from browser to upstream cluster: 150 ms or less