Add version 2.8 preview

This commit is contained in:
Billy Tat
2023-10-06 14:28:07 -07:00
parent bf90006277
commit 1b6d950651
1748 changed files with 291361 additions and 4 deletions
@@ -0,0 +1,92 @@
---
title: Logging Best Practices
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/logging-best-practices"/>
</head>
In this guide, we recommend best practices for cluster-level logging and application logging.
- [Cluster-level Logging](#cluster-level-logging)
- [Application Logging](#application-logging)
- [General Best Practices](#general-best-practices)
Before Rancher v2.5, logging in Rancher was a fairly static integration. There was a fixed list of aggregators to choose from (Elasticsearch, Splunk, Kafka, Fluentd, and Syslog), and only two configuration points (cluster-level and project-level).
Rancher provides a flexible experience for log aggregation. With the logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.
"Under the hood", Rancher logging uses the [Logging operator](https://github.com/kube-logging/logging-operator). We provide manageability of this operator (and its resources), and tie that experience in with managing your Rancher clusters.
## Cluster-level Logging
### Cluster-wide Scraping
For some users, it is desirable to scrape logs from every container running in the cluster. This usually coincides with your security team's request (or requirement) to collect all logs from all points of execution.
In this scenario, it is recommended to create at least two _ClusterOutput_ objects - one for your security team (if you have that requirement), and one for yourselves, the cluster administrators. When creating these objects, take care to choose an output endpoint that can handle the significant log traffic coming from the entire cluster. Also make sure to choose an appropriate index to receive all these logs.
Once you have created these _ClusterOutput_ objects, create a _ClusterFlow_ to collect all the logs. Do not define any _Include_ or _Exclude_ rules on this flow. This will ensure that all logs from across the cluster are collected. If you have two _ClusterOutputs_, make sure to send logs to both of them.
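As a sketch, a catch-all _ClusterFlow_ referencing two _ClusterOutputs_ might look like the following. The names, the Elasticsearch endpoint, and the index are hypothetical, and the output configuration depends entirely on your chosen aggregator:
```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: security-team-output        # hypothetical name
  namespace: cattle-logging-system  # ClusterOutputs live in the logging control namespace
spec:
  elasticsearch:                    # any supported output plugin can be used here
    host: elasticsearch.example.com # hypothetical endpoint sized for cluster-wide traffic
    port: 9200
    index_name: cluster-all-logs    # dedicated index for all cluster logs
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: cattle-logging-system
spec:                               # no match or filter rules, so all cluster logs are collected
  globalOutputRefs:
    - security-team-output
    - admin-output                  # a second, similarly defined ClusterOutput
```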
### Kubernetes Components
_ClusterFlows_ have the ability to collect logs from all containers on all hosts in the Kubernetes cluster. This works well in cases where those containers are part of a Kubernetes pod; however, RKE containers exist outside of the scope of Kubernetes.
Currently, the logs from RKE containers are collected, but they cannot easily be filtered. This is because those logs do not contain information about the source container (e.g. `etcd` or `kube-apiserver`).
A future release of Rancher will include the source container name which will enable filtering of these component logs. Once that change is made, you will be able to customize a _ClusterFlow_ to retrieve **only** the Kubernetes component logs, and direct them to an appropriate output.
## Application Logging
Best practice not only in Kubernetes but in all container-based applications is to direct application logs to `stdout`/`stderr`. The container runtime will then trap these logs and do **something** with them - typically writing them to a file. Depending on the container runtime (and its configuration), these logs can end up in any number of locations.
In the case of writing the logs to a file, Kubernetes helps by creating a `/var/log/containers` directory on each host. This directory contains symlinks pointing to the actual log files (whose location can differ based on configuration or container runtime).
Rancher logging will read all log entries in `/var/log/containers`, ensuring that all log entries from all containers (assuming a default configuration) will have the opportunity to be collected and processed.
### Specific Log Files
Log collection only retrieves `stdout`/`stderr` logs from pods in Kubernetes. But what if we want to collect logs from other files that are generated by applications? Here, a log streaming sidecar (or two) may come in handy.
The goal of setting up a streaming sidecar is to take log files that are written to disk, and have their contents streamed to `stdout`. This way, the Logging Operator can pick up those logs and send them to your desired output.
To set this up, edit your workload resource (e.g. Deployment) and add the following sidecar definition:
```yaml
...
  containers:
  - args:
    - -F
    - /path/to/your/log/file.log
    command:
    - tail
    image: busybox
    name: stream-log-file-[name]
    volumeMounts:
    - mountPath: /path/to/your/log
      name: mounted-log
...
```
This will add a container to your workload definition that will now stream the contents of (in this example) `/path/to/your/log/file.log` to `stdout`.
This log stream is then automatically collected according to any _Flows_ or _ClusterFlows_ you have set up. You may also wish to consider creating a _Flow_ specifically for this log file by targeting the name of the container, as in the following example:
```yaml
...
spec:
  match:
  - select:
      container_names:
      - stream-log-file-name
...
```
## General Best Practices
- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier as there are already parsers written for these formats.
- Try to provide the name of the application that is creating the log entry, in the entry itself. This can make troubleshooting easier as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98` which does not provide much information about the application running inside the container.
- Except in the case of collecting all logs cluster-wide, try to scope your _Flow_ and _ClusterFlow_ objects tightly. This makes it easier to troubleshoot when problems arise, and also helps ensure unrelated log entries do not show up in your aggregator. An example of tight scoping would be to constrain a _Flow_ to a single _Deployment_ in a namespace, or perhaps even a single container within a _Pod_.
- Keep the log verbosity down except when troubleshooting. High log verbosity poses a number of issues, chief among them being **noise**: significant events can be drowned out in a sea of `DEBUG` messages. This is somewhat mitigated with automated alerting and scripting, but highly verbose logging still places an inordinate amount of stress on the logging infrastructure.
- Where possible, try to provide a transaction or request ID with the log entry. This can make tracing application activity across multiple log sources easier, especially when dealing with distributed applications.
@@ -0,0 +1,115 @@
---
title: Monitoring Best Practices
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/monitoring-best-practices"/>
</head>
Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. This is no different when using Kubernetes and Rancher. Fortunately, the integrated monitoring and alerting functionality makes this whole process a lot easier.
The [Rancher monitoring documentation](../../../pages-for-subheaders/monitoring-and-alerting.md) describes how you can set up a complete Prometheus and Grafana stack. Out of the box this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your own specific use cases and cluster sizes. This document aims to give you best practices for this.
## What to Monitor
Kubernetes itself, as well as applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://sre.google/sre-book/table-of-contents/), especially the chapter about [Monitoring distributed systems](https://sre.google/sre-book/monitoring-distributed-systems/).
## Configuring Prometheus Resource Usage
When installing the integrated monitoring stack, Rancher allows you to configure several settings that depend on the size of your cluster and the workloads running in it. This section covers these in more detail.
### Storage and Data Retention
The amount of storage needed for Prometheus directly correlates to the number of time series and labels that you store and the data retention you have configured. It is important to note that Prometheus is not meant to be used as long-term metrics storage. Data retention time is usually only a couple of days, not weeks or months. The reason for this is that Prometheus does not perform any aggregation on its stored metrics. This is great because aggregation can dilute data, but it also means that, without retention limits, the needed storage grows linearly over time.
One way to calculate the necessary storage is to look at the average size of a storage chunk in Prometheus with this query:
```
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
```
Next, find out your data ingestion rate per second:
```
rate(prometheus_tsdb_head_samples_appended_total[1h])
```
and then multiply this by the retention time, adding a few percentage points as a buffer:
```
average chunk size in bytes * ingestion rate per second * retention time in seconds * 1.1 = necessary storage in bytes
```
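For example, with purely illustrative numbers (an average chunk size of 2 bytes per sample, an ingestion rate of 100,000 samples per second, and a retention time of 10 days, i.e. 864,000 seconds), the estimate would be:
```
2 * 100,000 * 864,000 * 1.1 ≈ 190 GB
```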
You can find more information about how to calculate the necessary storage in this [blog post](https://www.robustperception.io/how-much-disk-space-do-prometheus-blocks-use).
You can read more about the Prometheus storage concept in the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/storage).
### CPU and Memory Requests and Limits
In larger Kubernetes clusters, Prometheus can consume quite a bit of memory. The amount of memory Prometheus needs directly correlates to the number of time series and labels it stores, and the scrape interval at which these are collected.
You can find more information about how to calculate the necessary memory in this [blog post](https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion).
The number of CPUs needed correlates with the number of queries you are performing.
### Federation and Long-term Storage
Prometheus is not meant to store metrics for a long period of time; it should only be used for short-term storage.
In order to store some, or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).
## Scraping Custom Workloads
While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. To do this, you can configure Prometheus to make an HTTP request to an endpoint of your application at a regular interval. These endpoints should return their metrics in the Prometheus format.
In general, you want to scrape data from all the workloads running in your cluster so that you can use them for alerts or when debugging issues. Often, you only realize that you need certain data during an incident, and it is good if it has already been scraped and stored. Since Prometheus is only meant to be short-term metrics storage, scraping and keeping lots of data is usually not that expensive. If you are using a long-term storage solution with Prometheus, you can then still decide which data you actually persist and keep there.
### About Prometheus Exporters
A lot of third-party workloads like databases, queues, or web servers either already support exposing metrics in a Prometheus format, or there are so-called exporters available that translate between the tool's metrics and the format that Prometheus understands. Usually, you can add these exporters as additional sidecar containers to the workload's pods. A lot of Helm charts already include options to deploy the correct exporter. Additionally, you can find a curated list of exporters by Sysdig on [promcat.io](https://promcat.io/) and on [ExporterHub](https://exporterhub.io/).
### Prometheus support in Programming Languages and Frameworks
To get your own custom application metrics into Prometheus, you have to collect and expose these metrics directly from your application's code. Fortunately, there are already libraries and integrations available to help with this for most popular programming languages and frameworks. One example of this is the Prometheus support in the [Spring Framework](https://docs.spring.io/spring-metrics/docs/current/public/prometheus).
### ServiceMonitors and PodMonitors
Once all your workloads expose metrics in a Prometheus format, you have to configure Prometheus to scrape them. Under the hood, Rancher uses the [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator). This makes it easy to add additional scraping targets with ServiceMonitors and PodMonitors. A lot of Helm charts already include an option to create these monitors directly. You can also find more information in the Rancher documentation.
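A minimal ServiceMonitor sketch is shown below. All names and labels are hypothetical, and it assumes the workload is exposed through a Service labeled `app: my-app` with a named `metrics` port:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                      # hypothetical name
  namespace: my-app-namespace       # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: my-app                   # must match the labels on the Service
  endpoints:
    - port: metrics                 # named port on the Service that exposes the metrics endpoint
      path: /metrics
      interval: 30s
```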
### Prometheus Push Gateway
There are some workloads that are traditionally hard for Prometheus to scrape. Examples of these are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between individually handled incoming requests, like PHP applications.
To still get metrics for these use cases, you can set up [prometheus-pushgateways](https://github.com/prometheus/pushgateway). The CronJob or PHP application would push metric updates to the pushgateway. The pushgateway aggregates and exposes them through an HTTP endpoint, which then can be scraped by Prometheus.
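As a sketch, a CronJob could push a metric to the pushgateway at the end of its run. The pushgateway service address and the metric are hypothetical:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # hypothetical job name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: curlimages/curl
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # ...run the actual batch work here, then push a success metric:
                  echo "nightly_report_last_success_timestamp $(date +%s)" | \
                    curl --data-binary @- http://pushgateway:9091/metrics/job/nightly-report
```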
### Prometheus Blackbox Monitor
Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.
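With the prometheus-operator used by Rancher Monitoring, such checks can be declared as a Probe resource. A sketch, where the blackbox exporter service address, module, and target URL are assumptions:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: website-probe               # hypothetical name
  namespace: cattle-monitoring-system
spec:
  jobName: website-probe
  interval: 60s
  module: http_2xx                  # module defined in the blackbox exporter configuration
  prober:
    url: prometheus-blackbox-exporter:9115   # address of the blackbox exporter service
  targets:
    staticConfig:
      static:
        - https://example.com       # endpoint probed from the outside
```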
## Monitoring in a (Micro)Service Architecture
If you have a (micro)service architecture where multiple individual workloads within your cluster communicate with each other, it is very important to have detailed metrics and traces about this traffic, so you can understand how all these workloads communicate with each other and where a problem or bottleneck may be.
Of course, you can monitor all this internal traffic in all your workloads and expose these metrics to Prometheus, but this can quickly become quite work-intensive. Service meshes like Istio, which can be installed with [a click](../../../pages-for-subheaders/istio.md) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services.
## Real User Monitoring
Monitoring the availability and performance of all your internal workloads is vitally important to run stable, reliable and fast applications. But these metrics only show you parts of the picture. To get a complete view it is also necessary to know how your end users are actually perceiving it. For this you can look into various [Real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring).
## Security Monitoring
In addition to monitoring workloads to detect performance, availability, or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans](../../../pages-for-subheaders/cis-scan-guides.md), which check whether the cluster is configured according to security best practices.
For the workloads, you can have a look at Kubernetes and Container security solutions like [NeuVector](https://www.suse.com/products/neuvector/), [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/), [SysDig](https://sysdig.com/).
## Setting up Alerts
Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong.
The integrated Rancher monitoring already configures a default set of alerts that make sense in any Kubernetes cluster. You should extend these to cover your specific workloads and use cases.
When setting up alerts, configure them for all the workloads that are critical to the availability of your applications, but also make sure that they are not too noisy. Ideally, every alert you receive should be for a problem that needs your attention and needs to be fixed. If you have alerts that fire all the time but are not that critical, there is a danger that you start ignoring your alerts altogether and then miss the really important ones. Less may be more here: focus on the most important metrics first, for example, alerting if your application is offline. Fix the problems that pop up, and then start to create more detailed alerts.
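For example, a minimal PrometheusRule that fires when a workload's scrape target disappears could look like the following; the job label, duration, and severity are hypothetical:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts               # hypothetical name
  namespace: my-app-namespace       # hypothetical namespace
spec:
  groups:
    - name: my-app.availability
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0   # fires when Prometheus can no longer scrape the target
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "my-app has been unreachable for 5 minutes"
```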
If an alert starts firing, but there is nothing you can do about it at the moment, it's also fine to silence the alert for a certain amount of time, so that you can look at it later.
You can find more information on how to set up alerts and notification channels in the [Rancher Documentation](../../../pages-for-subheaders/monitoring-and-alerting.md).
@@ -0,0 +1,62 @@
---
title: Best Practices for Rancher Managed vSphere Clusters
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/rancher-managed-clusters-in-vsphere"/>
</head>
This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
- [1. VM Considerations](#1-vm-considerations)
- [2. Network Considerations](#2-network-considerations)
- [3. Storage Considerations](#3-storage-considerations)
- [4. Backups and Disaster Recovery](#4-backups-and-disaster-recovery)
<figcaption>Solution Overview</figcaption>
![Solution Overview](/img/solution_overview.drawio.svg)
## 1. VM Considerations
### Leverage VM Templates to Construct the Environment
To ensure consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customization options.
### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts
Doing so ensures node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.
### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores
Doing so ensures node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.
### Configure VMs as Appropriate for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
## 2. Network Considerations
### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
### Consistent IP Addressing for VMs
Each node should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to ensure the node is always allocated the same IP address.
## 3. Storage Considerations
### Leverage SSD Drives for ETCD Nodes
etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.
## 4. Backups and Disaster Recovery
### Perform Regular Downstream Cluster Backups
Kubernetes uses etcd to store all of its data, including configuration, state, and metadata. Backing it up is crucial for disaster recovery.
### Back up Downstream Node VMs
Incorporate the Rancher downstream node VMs within a standard VM backup policy.
@@ -0,0 +1,56 @@
---
title: Tips for Setting Up Containers
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-managed-clusters/tips-to-set-up-containers"/>
</head>
Running well-built containers can greatly impact the overall performance and security of your environment.
Below are a few tips for setting up your containers.
For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security.](https://rancher.com/complete-guide-container-security)
### Use a Common Container OS
When possible, you should try to standardize on a common container base OS.
Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface.
Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality.
### Start with a FROM scratch container
If your microservice is a standalone static binary, you should use a FROM scratch container.
The FROM scratch container is an [official Docker image](https://hub.docker.com/_/scratch) that is empty so that you can use it to design minimal images.
This will have the smallest attack surface and smallest image size.
### Run Container Processes as Unprivileged
When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. Inadvertent or accidental host mounts can also be impacted if the container is running as root. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
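A minimal sketch of a pod that drops root privileges; the image name is hypothetical and the image must be built to work as a non-root user:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-example
spec:
  securityContext:
    runAsNonRoot: true              # reject the pod if the image would run as root
    runAsUser: 1000
    runAsGroup: 1000
  containers:
    - name: app
      image: my-app:latest          # hypothetical image built to run as a non-root user
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```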
### Define Resource Limits
Apply CPU and memory limits to your pods. This helps manage the resources on your worker nodes and prevents a malfunctioning microservice from impacting other microservices.
In standard Kubernetes, you can set resource limits on the namespace level. In Rancher, you can set resource limits on the project level and they will propagate to all the namespaces within the project. For details, refer to the Rancher docs.
When setting resource quotas, if you set anything related to CPU or Memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or Memory field set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace.
The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the namespace level.
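For example, a default container resource limit on a namespace can be expressed with a LimitRange; the namespace and values here are hypothetical:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: my-project-namespace   # hypothetical namespace
spec:
  limits:
    - type: Container
      default:                      # limits applied to containers that define none
        cpu: 500m
        memory: 256Mi
      defaultRequest:               # requests applied to containers that define none
        cpu: 100m
        memory: 128Mi
```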
### Define Resource Requirements
You should apply CPU and memory requirements to your pods. This is crucial for informing the scheduler which type of compute node your pod needs to be placed on, and for ensuring the node is not over-provisioned. In Kubernetes, you can set a resource requirement by defining `resources.requests` in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
:::note
If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.
:::
It is recommended to define resource requirements on the container level because otherwise, the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.
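A minimal sketch of container-level requests and limits, with a hypothetical image and values:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-requests-example
spec:
  containers:
    - name: app
      image: my-app:latest          # hypothetical image
      resources:
        requests:                   # used by the scheduler to place the pod
          cpu: 250m
          memory: 128Mi
        limits:                     # hard ceiling enforced at runtime
          cpu: "1"
          memory: 512Mi
```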
### Liveness and Readiness Probes
Set up liveness and readiness probes for your containers. Unless your container crashes outright, Kubernetes will not know it is unhealthy unless you create an endpoint or mechanism that can report container status. Alternatively, make sure your container halts and crashes if it becomes unhealthy.
The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
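A minimal sketch, assuming a hypothetical application that serves health endpoints on port 8080:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-example
spec:
  containers:
    - name: app
      image: my-app:latest          # hypothetical image serving /healthz and /ready on port 8080
      ports:
        - containerPort: 8080
      livenessProbe:                # container is restarted if this probe fails
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:               # pod is removed from Service endpoints until this probe succeeds
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```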
@@ -0,0 +1,88 @@
---
title: Installing Rancher in a vSphere Environment
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/on-premises-rancher-in-vsphere"/>
</head>
This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
<figcaption>Solution Overview</figcaption>
![Solution Overview](/img/rancher-on-prem-vsphere.svg)
## 1. Load Balancer Considerations
A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes.
### Leverage Fault Tolerance and High Availability
Leverage the use of an external (hardware or software) load balancer that has inherent high-availability functionality (F5, NSX-T, Keepalived, etc.).
### Back Up Load Balancer Configuration
In the event of a disaster recovery activity, having the load balancer configuration available will expedite the recovery process.
### Configure Health Checks
Configure the load balancer to automatically mark nodes as unavailable if they fail a health check. For example, NGINX can facilitate this with:
`max_fails=3 fail_timeout=5s`
### Leverage an External Load Balancer
Avoid implementing a software load balancer within the management cluster.
### Secure Access to Rancher
Configure appropriate firewall / ACL rules to restrict access to Rancher to only those sources that require it.
## 2. VM Considerations
### Size the VMs According to Rancher Documentation
See [Installation Requirements](../../../pages-for-subheaders/installation-requirements.md).
### Leverage VM Templates to Construct the Environment
To ensure consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customization options.
### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts
Doing so ensures node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.
### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores
Doing so ensures node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.
### Configure VMs as Appropriate for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
## 3. Network Considerations
### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
### Consistent IP Addressing for VMs
Each node should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to ensure the node is always allocated the same IP address.
## 4. Storage Considerations
### Leverage SSD Drives for ETCD Nodes
etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.
## 5. Backups and Disaster Recovery
### Perform Regular Management Cluster Backups
Rancher stores its data in the etcd datastore of the Kubernetes cluster it resides on. As with any Kubernetes cluster, perform frequent, tested backups of this cluster.
### Back up Rancher Cluster Node VMs
Incorporate the Rancher management node VMs within a standard VM backup policy.
@@ -0,0 +1,40 @@
---
title: Rancher Deployment Strategy
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/rancher-deployment-strategy"/>
</head>
There are two recommended deployment strategies for a Rancher instance that manages downstream Kubernetes clusters. Each one has its own pros and cons. Read more about which one would fit best for your use case.
## Hub & Spoke Strategy
---
In this deployment scenario, a single Rancher instance manages Kubernetes clusters across the globe. The Rancher instance runs on a high-availability Kubernetes cluster, and you need to account for the impact of latency between Rancher and the downstream clusters.
### Pros
* Single control plane interface to view/see all regions and environments.
* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher instance.
### Cons
* Subject to network latencies.
* If Rancher goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually.
## Regional Strategy
---
In the regional deployment model a Rancher instance is deployed in close proximity to the downstream Kubernetes clusters.
### Pros
* Rancher functionality in one region stays operational if a Rancher instance in another region goes down.
* Network latency between Rancher and downstream clusters is greatly reduced, improving the performance of functionality in Rancher.
* Upgrades of Rancher can be done independently per region.
### Cons
* Overhead of managing multiple Rancher installations.
* Visibility into Kubernetes clusters in different regions requires multiple interfaces/panes of glass.
* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server.
@@ -0,0 +1,40 @@
---
title: Tips for Running Rancher
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tips-for-running-rancher"/>
</head>
This guide is geared toward use cases where Rancher is used to manage downstream Kubernetes clusters. The high-availability setup is intended to prevent losing access to downstream clusters if the Rancher server is not available.
A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes provide a level of availability that cannot be accomplished with a single-node environment.
If you are installing Rancher in a vSphere environment, refer to the best practices documented [here.](on-premises-rancher-in-vsphere.md)
When you set up your high-availability Rancher installation, consider the following:
### Run Rancher on a Separate Cluster
Don't run other workloads or microservices in the Kubernetes cluster that Rancher is installed on.
### Make Sure Nodes Are Configured Correctly for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, double-checking you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node, checking that all required ports are open, and deploying with SSD-backed etcd. More details can be found in the [Kubernetes docs](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin) and [etcd's performance op guide](https://etcd.io/docs/v3.4/op-guide/performance/).
### When using RKE: Back up the Statefile
RKE keeps a record of the cluster state in a file called `cluster.rkestate`. This file is important for the recovery of a cluster and/or its continued maintenance through RKE. Because this file contains certificate material, we strongly recommend encrypting it before backing it up. Back up the state file after each run of `rke up`.
### Run All Nodes in the Cluster in the Same Datacenter
For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c.
### Development and Production Environments Should be Similar
It's strongly recommended to have a "staging" or "pre-production" environment of the Kubernetes cluster that Rancher runs on. This environment should mirror your production environment as closely as possible in terms of software and hardware configuration.
### Monitor Your Clusters to Plan Capacity
The Rancher server's Kubernetes cluster should run within the [system and hardware requirements](../../../pages-for-subheaders/installation-requirements.md) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take.
However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types.
Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus.
After you [enable monitoring](../../../pages-for-subheaders/monitoring-and-alerting.md) in the cluster, you can set up alerts to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
@@ -0,0 +1,65 @@
---
title: Tips for Scaling Rancher
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tips-for-scaling-rancher"/>
</head>
This guide introduces the approaches that should be considered when scaling Rancher setups, and the challenges associated with doing so. As systems grow, performance will naturally decrease, but there are steps we can take to minimize the load put on Rancher, as well as to optimize Rancher's ability to handle larger setups.
## General Tips on Optimizing Rancher's Performance
* It is advisable to keep Rancher up to date with patch releases. Performance improvements and bug fixes are made throughout the life of a minor release. You can review the release notes to help inform your own decisions on whether an upgrade is necessary but we recommend keeping yourself up to date in most cases.
* Performance will be negatively impacted by increased latency between Rancher's infrastructure and a downstream cluster's infrastructure (e.g. geographic distance). If a user or organization requires clusters/nodes all over the world or spread across many regions, it is best to use multiple Rancher installations.
* Always try to scale up gradually, monitoring and observing any change in behavior while doing so. It is usually easier to resolve performance problems as soon as they surface, before other problems obscure the symptoms.
## Minimizing Load on the Local Cluster
The largest bottleneck when scaling Rancher is resource growth in the local Kubernetes cluster. The local cluster contains information for all downstream clusters. Many operations that apply to downstream clusters will create new objects in the local cluster and require computation from handlers running in the local cluster.
### Managing Your Object Counts
etcd eventually encounters limitations on the number of objects of a single Kubernetes resource type it can store. These exact numbers are not well documented. From internal observations, we usually see performance issues once a single resource type's object count exceeds 60k, and often that type is RoleBindings.
RoleBindings are created in the local cluster as a side effect of many operations.
Considerations when attempting to reduce RoleBindings in the local cluster:
* Only add users to clusters and projects when necessary
* Remove clusters and projects when they are no longer needed
* Only use custom roles if necessary
* Use as few rules as possible in custom roles
* Consider whether adding a role to a user is redundant
* Consider that using fewer, but more powerful, clusters may be more efficient
* Experiment to see whether creating new projects or creating new clusters results in fewer RoleBindings for your specific use case.
### Using New Apps Over Legacy Apps
There are two app Kubernetes resources that Rancher uses: apps.projects.cattle.io and apps.catalog.cattle.io. The legacy apps, apps.projects.cattle.io, were introduced first in the Cluster Manager and are now outdated. The new apps, apps.catalog.cattle.io, are found in the Cluster Explorer for their respective cluster. The new apps are preferable because they live in the downstream cluster, while the legacy apps live in the local cluster.
We recommend removing apps that appear in the Cluster Manager, replacing them with apps in the Cluster Explorer for their target cluster if necessary, and creating any future apps only in the cluster's Cluster Explorer.
### Using the Authorized Cluster Endpoint (ACE)
There is an _Authorized Cluster Endpoint_ option for Rancher-provisioned RKE1, RKE2, and K3s clusters. When enabled, this adds a context to the kubeconfigs generated for the cluster that uses a direct endpoint to the cluster and bypasses Rancher. However, it is not enough to only enable this option. The user of the kubeconfig needs to run `kubectl config use-context <ace context name>` in order to start using it.
Without using ACE, all kubeconfig requests first route through Rancher.
### Experimental: Option to Reduce Event Handler Executions
The bulk of Rancher's logic occurs in event handlers. These event handlers run on an object whenever the object is updated, and when Rancher is started. Additionally, they run every 15 hours when caches are synced. In scaled setups, these scheduled runs come with huge performance costs because every handler is run on every applicable object. However, this scheduled execution of handlers can be disabled using the CATTLE_SYNC_ONLY_CHANGED_OBJECTS environment variable. If resource allocation spikes are seen on an interval of about 15 hours, this setting may help.
The value for the environment variable can be a comma-separated list of the following options. The values refer to types of controllers (the structures that contain and run handlers) and their handlers. Adding the controller types to the variable will prevent that set of controllers from running their handlers as part of cache resyncing.
* `mgmt` refers to management controllers which only run on one Rancher node.
* `user` refers to user controllers which run for every cluster. Some of these run on the same node as the management controllers, while others run in the downstream cluster. This option targets the former.
* `scaled` refers to scaled controllers which run on every Rancher node. Setting this is not recommended because of the critical functionality the scaled handlers are responsible for.
In short, if you notice CPU usage peaks every 15 hours, add the CATTLE_SYNC_ONLY_CHANGED_OBJECTS environment variable to your Rancher deployment with the value `mgmt,user`.
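A minimal sketch of that change, assuming the standard `rancher` Deployment in the `cattle-system` namespace and showing only the relevant fragment:
```yaml
# Fragment of the Rancher Deployment (cattle-system/rancher); only the env addition is shown
spec:
  template:
    spec:
      containers:
        - name: rancher
          env:
            - name: CATTLE_SYNC_ONLY_CHANGED_OBJECTS
              value: "mgmt,user"
```
The same change can also be applied with `kubectl -n cattle-system set env deployment/rancher CATTLE_SYNC_ONLY_CHANGED_OBJECTS=mgmt,user`.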
## Optimizations Outside of Rancher
A large component of performance is the local cluster and how it was configured. This cluster can introduce a bottleneck before the Rancher software ever runs. When Rancher nodes experience high resource usage, you can use the `top` command to identify whether it is Rancher or a Kubernetes component that is consuming resources in excess.
### Keeping Kubernetes Versions Up to Date
Similar to Rancher versions, it is advisable to keep your Kubernetes cluster up to date. This ensures that your cluster contains any available performance enhancements and bug fixes.
### Optimizing ETCD
The two main bottlenecks for [etcd performance](https://etcd.io/docs/v3.4/op-guide/performance/) are disk speed and network speed, and optimizing either should improve performance. For information regarding etcd performance, see [Slow etcd performance (performance testing and optimization)](https://www.suse.com/support/kb/doc/?id=000020100) and [Tuning etcd for Large Installations](https://docs.ranchermanager.rancher.io/how-to-guides/advanced-user-guides/tune-etcd-for-large-installs). Information on disks can also be found [in our docs](https://docs.ranchermanager.rancher.io/v2.5/pages-for-subheaders/installation-requirements#disks).
Theoretically, the more nodes in an etcd cluster, the slower it will be, due to replication requirements ([source](https://etcd.io/docs/v3.3/faq)). This may be counter-intuitive to common scaling approaches. It can also be inferred that etcd performance will be inversely affected by the distance between nodes, as that slows down network communication.