Merge pull request #896 from moio/revise_hardware_recommendations

Revise hardware recommendations
Billy Tat
2023-10-12 09:09:41 -07:00
committed by GitHub
28 changed files with 1040 additions and 473 deletions
@@ -1,65 +0,0 @@
---
title: Tips for Scaling Rancher
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tips-for-scaling-rancher"/>
</head>
This guide aims to introduce the approaches that should be considered to scale Rancher setups, and associated challenges with doing so. As systems grow performance will naturally reduce, but there are steps we can take to minimize the load put on Rancher, as well as optimize Rancher's ability to handle these larger setups.
## General Tips on Optimizing Rancher's Performance
* It is advisable to keep Rancher up to date with patch releases. Performance improvements and bug fixes are made throughout the life of a minor release. You can review the release notes to help inform your own decisions on whether an upgrade is necessary but we recommend keeping yourself up to date in most cases.
* Performance will be negatively impacted by increased latency between Rancher's infrastructure and a downstream cluster's infrastructure (e.g., geographic distance). If a user or organization requires clusters/nodes all over the world or spread across many regions, it is best to use multiple Rancher installations.
* Please always try to scale up gradually, monitoring and observing any change in behavior while doing so. It is usually easier to resolve performance problems as soon as they surface, and before other problems confuse symptoms.
## Minimizing Load on the local cluster
The largest bottleneck when scaling Rancher is resource growth in the local Kubernetes cluster. The local cluster contains information for all downstream clusters. Many operations that apply to downstream clusters will create new objects in the local cluster and require computation from handlers running in the local cluster.
### Managing Your Object Counts
ETCD eventually encounters limitations to the number of a single Kubernetes resource type it can store. These exact numbers are not well documented. From internal observations we usually see performance issues once a single resource type's object count exceeds 60k, and often that type is Rolebindings.
Rolebindings are created in the local cluster as a side effect of many operations.
Considerations when attempting to reduce rolebindings in the local cluster:
* Only add users to clusters and projects when necessary
* Remove clusters and projects when they are no longer needed
* Only use custom roles if necessary
* Use as few rules as possible in custom roles
* Consider whether adding a role to a user is redundant
* Consider that using fewer, but more powerful, clusters may be more efficient
* Experiment to see if creating new projects or creating new clusters manifests in fewer rolebindings for your specific use case.
### Using New Apps Over Legacy Apps
There are two app Kubernetes resources that Rancher uses: apps.projects.cattle.io and apps.catalog.cattle.io. The legacy apps, apps.projects.cattle.io, were introduced first in the Cluster Manager and are now outdated. The new apps, apps.catalog.cattle.io, are found in the Cluster Explorer for their respective cluster. The new apps are preferable because they live in the downstream cluster while the legacy apps live in the local cluster.
We recommend removing apps that appear in the Cluster Manager, replacing them with apps in the Cluster Explorer for their target cluster if necessary and creating any future apps in the cluster's Cluster Explorer only.
### Using the Authorized Cluster Endpoint (ACE)
There is an _Authorized Cluster Endpoint_ option for Rancher-provisioned RKE1, RKE2, and K3s clusters. When enabled, this adds a context to kubeconfigs generated for the cluster that uses a direct endpoint to the cluster and bypasses Rancher. However, it is not enough to only enable this option. The user of the kubeconfig needs to run `kubectl config use-context <ace context name>` in order to start using it.
Without using ACE, all kubeconfig requests first route through Rancher.
### Experimental: Option to Reduce Event Handler Executions
The bulk of Rancher's logic occurs on event handlers. These event handlers run on an object whenever the object is updated, and when Rancher is started. Additionally, they run every 15 hours when caches are synced. In scaled setups these scheduled runs come with huge performance costs because every handler is being run on every applicable object. However, this scheduled execution of handlers can be disabled using the CATTLE_SYNC_ONLY_CHANGED_OBJECTS environment variable. If resource allocation spikes are seen on an interval of about 15 hours it is possible this setting can help.
The value for the environment variable can be a comma separated list of the following options. The values refer to types of controllers (the structures that contain and run handlers) and their handlers. Adding the controller types to the variable will disable that set of controllers from running their handlers as part of cache resyncing.
* `mgmt` refers to management controllers which only run on one Rancher node.
* `user` refers to user controllers which run for every cluster. Some of these are run on the same node as management controllers, while others run in the downstream cluster. This option targets the former.
* `scaled` refers to scaled controllers which run on every Rancher node. This is not recommended to be set due to the critical functionality the scaled handlers are responsible for.
In short, if you notice CPU usage peaks every 15 hours, add the CATTLE_SYNC_ONLY_CHANGED_OBJECTS environment variable to your rancher deployment with the value `mgmt,user`.
## Optimizations Outside of Rancher
A large component of performance is the local cluster and how it was configured. This cluster can introduce a bottleneck before Rancher software ever runs. When Rancher nodes experience high resource usage, you can use the command "top" to identify whether it is Rancher or a Kubernetes component that is consuming the resource in excess.
### Keeping Kubernetes Versions Up to Date
Similar to Rancher versions, it is advisable to keep your kubernetes cluster up to date. This will ensure that your cluster contains any available performance enhancements or bug fixes.
### Optimizing ETCD
The two main bottlenecks to [ETCD performance](https://etcd.io/docs/v3.4/op-guide/performance/) are disk speed and network speed. Optimization to either should improve performance. For information regarding ETCD performance see [Slow etcd performance (performance testing and optimization)](https://www.suse.com/support/kb/doc/?id=000020100) and [Tuning etcd for Large Installations](https://docs.ranchermanager.rancher.io/how-to-guides/advanced-user-guides/tune-etcd-for-large-installs). Information on disks can also be found [in our docs](https://docs.Ranchermanager.Rancher.io/v2.5/pages-for-subheaders/installation-requirements#disks).
Theoretically, the more nodes in an ETCD cluster the slower it will be due to replication requirements [source](https://etcd.io/docs/v3.3/faq). This may be counter-intuitive to common scaling approaches. It can also be inferred that ETCD performance will be inversely affected by distance between nodes as that will slow down network communication.
@@ -0,0 +1,100 @@
---
title: Tuning and Best Practices for Rancher at Scale
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/reference-guides/best-practices/rancher-server/tuning-and-best-practices-for-rancher-at-scale"/>
</head>
This guide describes the best practices and tuning approaches to scale Rancher setups and the associated challenges with doing so. As systems grow, performance will naturally reduce, but there are steps that can minimize the load put on Rancher and optimize Rancher's ability to manage larger infrastructures.
## Optimizing Rancher Performance
* Keep Rancher up to date with patch releases. We are continuously improving Rancher with performance enhancements and bug fixes. The latest Rancher release contains all accumulated improvements to performance and stability, plus updates based on developer experience and user feedback.
* Always scale up gradually, and monitor and observe any changes in behavior while doing so. It is usually easier to resolve performance problems as soon as they surface, before other problems obscure the root cause.
* Reduce network latency between the upstream Rancher cluster and downstream clusters to the extent possible. Note that latency is, among other factors, a function of geographic distance. If you require clusters or nodes spread across the world, consider multiple Rancher installations.
## Minimizing Load on the Upstream Cluster
When scaling up Rancher, one typical bottleneck is resource growth in the upstream (local) Kubernetes cluster. The upstream cluster contains information for all downstream clusters. Many operations that apply to downstream clusters create new objects in the upstream cluster and require computation from handlers running in the upstream cluster.
### Managing Your Object Counts
Etcd is the backing database for Kubernetes and for Rancher. The database may eventually encounter limitations to the number of a single Kubernetes resource type it can store. Exact limits vary and depend on a number of factors. However, experience indicates that performance issues frequently arise once a single resource type's object count exceeds 60,000. Often that type is `RoleBinding`.
This is typical in Rancher, as many operations create new `RoleBinding` objects in the upstream cluster as a side effect.
You can reduce the number of `RoleBindings` in the upstream cluster in the following ways:
* Limit the use of the [Restricted Admin](../../../how-to-guides/new-user-guides/authentication-permissions-and-global-configuration/manage-role-based-access-control-rbac/global-permissions#restricted-admin) role. Apply other roles wherever possible.
* If you use [external authentication](../../../pages-for-subheaders/authentication-config), use groups to assign roles.
* Only add users to clusters and projects when necessary.
* Remove clusters and projects when they are no longer needed.
* Only use custom roles if necessary.
* Use as few rules as possible in custom roles.
* Consider whether adding a role to a user is redundant.
* Consider using fewer, but more powerful, clusters.
* Kubernetes permissions are always "additive" (allow-list) rather than "subtractive" (deny-list). Try to minimize configurations that give access to all but one aspect of a cluster, project, or namespace, as that will result in the creation of a high number of `RoleBinding` objects.
* Experiment to see if creating new projects or clusters manifests in fewer `RoleBindings` for your specific use case.
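To track whether any resource type is approaching the range where problems tend to appear, you can count objects per kind. The following is a minimal sketch, not a Rancher tool. It assumes you have already exported objects as a list of dictionaries that each carry a `kind` field (for example, the `items` array of a JSON listing). The function name and the threshold constant are illustrative, and the 60,000 figure is only the rule of thumb mentioned above.

```python
from collections import Counter

# Rule-of-thumb figure from this guide, not a hard etcd limit.
OBJECT_COUNT_WARNING_THRESHOLD = 60_000


def kinds_over_threshold(items, threshold=OBJECT_COUNT_WARNING_THRESHOLD):
    """Return {kind: count} for every kind whose object count exceeds threshold.

    `items` is assumed to be an iterable of dicts with a "kind" key;
    objects without one are grouped under "Unknown".
    """
    counts = Counter(item.get("kind", "Unknown") for item in items)
    return {kind: n for kind, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Tiny synthetic sample with a lowered threshold, for illustration only.
    sample = [{"kind": "RoleBinding"}] * 5 + [{"kind": "ConfigMap"}] * 2
    print(kinds_over_threshold(sample, threshold=3))
```

Running such a check periodically makes it easier to notice growth early, in line with the advice to scale up gradually.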
### RoleBinding Count Estimation
Predicting how many `RoleBinding` objects a given configuration will create is complicated. However, the following considerations can offer a rough estimate:
* For a minimum estimate, use the formula `32C + U + 2UaC + 8P + 5Pa`.
* `C` is the total number of clusters.
* `U` is the total number of users.
* `Ua` is the average number of users with a membership on a cluster.
* `P` is the total number of projects.
* `Pa` is the average number of users with a membership on a project.
* The Restricted Admin role follows a different formula, as every user with this role results in at least `7C + 2P + 2` additional `RoleBinding` objects.
* The number of `RoleBindings` increases linearly with the number of clusters, projects, and users.
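The estimate above can be worked through in a few lines of code. This is a rough sketch that reads the formula literally, taking `2UaC` as `2 * Ua * C` and `5Pa` as `5 * Pa`. The function and parameter names are illustrative, and the input numbers are made up for the example.

```python
def estimate_min_rolebindings(
    clusters,
    users,
    avg_users_per_cluster,
    projects,
    avg_users_per_project,
    restricted_admins=0,
):
    """Minimum RoleBinding estimate: 32C + U + 2UaC + 8P + 5Pa.

    Each Restricted Admin user adds at least 7C + 2P + 2 more objects.
    """
    base = (
        32 * clusters
        + users
        + 2 * avg_users_per_cluster * clusters
        + 8 * projects
        + 5 * avg_users_per_project
    )
    per_restricted_admin = 7 * clusters + 2 * projects + 2
    return base + restricted_admins * per_restricted_admin


if __name__ == "__main__":
    # Hypothetical setup: 10 clusters, 100 users, 20 users per cluster on
    # average, 50 projects, 5 users per project on average.
    print(estimate_min_rolebindings(10, 100, 20, 50, 5))                       # 1245
    print(estimate_min_rolebindings(10, 100, 20, 50, 5, restricted_admins=2))  # 1589
```

In this example, two Restricted Admin users add 344 objects on top of the base estimate of 1,245, which illustrates why limiting that role helps keep counts down.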
### Using New Apps Over Legacy Apps
Rancher uses two Kubernetes app resources: `apps.projects.cattle.io` and `apps.catalog.cattle.io`. Legacy apps, represented by `apps.projects.cattle.io`, were introduced with the former Cluster Manager UI and are now outdated. Current apps, represented by `apps.catalog.cattle.io`, are found in the Cluster Explorer UI for their respective cluster. `apps.catalog.cattle.io` apps are preferable because their data resides in downstream clusters, which frees up resources in the upstream cluster.
You should remove any remaining legacy apps that appear in the Cluster Manager UI, and replace them with apps in the Cluster Explorer UI. Create any new apps only in the Cluster Explorer UI.
### Using the Authorized Cluster Endpoint (ACE)
An [Authorized Cluster Endpoint](../../../reference-guides/rancher-manager-architecture/communicating-with-downstream-user-clusters#4-authorized-cluster-endpoint) (ACE) provides access to the Kubernetes API of Rancher-provisioned RKE, RKE2, and K3s clusters. When enabled, the ACE adds a context to kubeconfig files generated for the cluster. The context uses a direct endpoint to the cluster, thereby bypassing Rancher. This reduces load on Rancher for cases where unmediated API access is acceptable or preferable. See [Authorized Cluster Endpoint](../../../reference-guides/rancher-manager-architecture/communicating-with-downstream-user-clusters#4-authorized-cluster-endpoint) for more information and configuration instructions.
### Reducing Event Handler Executions
The bulk of Rancher's logic occurs on event handlers. These event handlers run on an object whenever the object is updated, and when Rancher is started. Additionally, they run every 15 hours when Rancher syncs caches. In scaled setups these scheduled runs come with huge performance costs because every handler is being run on every applicable object. However, the scheduled handler execution can be disabled with the `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` environment variable. If resource allocation spikes are seen every 15 hours, this setting can help.
The value for `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` can be a comma-separated list of the following options. The values refer to types of handlers and controllers (the structures that contain and run handlers). Adding the controller types to the variable disables that set of controllers from running their handlers as part of cache resyncing.
* `mgmt` refers to management controllers which only run on one Rancher node.
* `user` refers to user controllers which run for every cluster. Some of these run on the same node as management controllers, while others run in the downstream cluster. This option targets the former.
* `scaled` refers to scaled controllers which run on every Rancher node. You should avoid setting this value, as the scaled handlers are responsible for critical functions and changes may disrupt cluster stability.
In short, if you notice CPU usage peaks every 15 hours, add the `CATTLE_SYNC_ONLY_CHANGED_OBJECTS` environment variable to your Rancher deployment (in the `spec.template.spec.containers[].env` list) with the value `mgmt,user`.
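As an illustration of how the value is composed, the sketch below builds the `name`/`value` entry that a container `env` list expects and rejects unknown controller types. The helper name and the `allow_scaled` flag are hypothetical and not part of Rancher. The guard simply reflects the advice above to avoid `scaled`.

```python
ENV_NAME = "CATTLE_SYNC_ONLY_CHANGED_OBJECTS"
KNOWN_CONTROLLER_TYPES = ("mgmt", "user", "scaled")


def sync_only_changed_objects_env(controller_types, allow_scaled=False):
    """Return a container env entry for the given controller types.

    Raises ValueError for unknown types, and for "scaled" unless the
    caller opts in explicitly with allow_scaled=True.
    """
    types = list(controller_types)
    unknown = [t for t in types if t not in KNOWN_CONTROLLER_TYPES]
    if unknown:
        raise ValueError(f"unknown controller types: {unknown}")
    if "scaled" in types and not allow_scaled:
        raise ValueError("'scaled' is not recommended; pass allow_scaled=True to override")
    return {"name": ENV_NAME, "value": ",".join(types)}


if __name__ == "__main__":
    print(sync_only_changed_objects_env(["mgmt", "user"]))
```

The resulting dictionary has the same shape as one item of a container's `env` list, so it can be serialized to YAML or JSON and merged into the deployment manifest.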
## Optimizations Outside of Rancher
The underlying cluster's own performance and configuration are important influencing factors. A misconfigured upstream cluster can introduce a bottleneck that Rancher software cannot resolve.
### Manage Upstream Cluster Nodes Directly with RKE2
As Rancher can be very demanding on the upstream cluster, especially at scale, you should have full administrative control of the cluster's configuration and nodes. To identify the root cause of excess resource consumption, use standard Linux troubleshooting techniques and tools. This can help you determine whether Rancher, Kubernetes, or operating system components are causing issues.
Although managed Kubernetes services make it easier to deploy and run Kubernetes clusters, they are discouraged for the upstream cluster in high scale scenarios. Managed Kubernetes services typically limit access to configuration and insights on individual nodes and services.
Use RKE2 for large scale use cases.
### Keeping Kubernetes Versions Up to Date
You should keep the local Kubernetes cluster up to date. This will ensure that your cluster has all available performance enhancements and bug fixes.
### Optimizing etcd
Etcd is the backend database for Kubernetes and for Rancher. It plays a very important role in Rancher performance.
The two main bottlenecks to [etcd performance](https://etcd.io/docs/v3.4/op-guide/performance/) are disk and network speed. Etcd should run on dedicated nodes with a fast network setup and with SSDs that have high input/output operations per second (IOPS). For more information regarding etcd performance, see [Slow etcd performance (performance testing and optimization)](https://www.suse.com/support/kb/doc/?id=000020100) and [Tuning etcd for Large Installations](../../../how-to-guides/advanced-user-guides/tune-etcd-for-large-installs). Information on disks can also be found in the [Installation Requirements](../../../pages-for-subheaders/installation-requirements#disks).
It's best to run etcd on exactly three nodes, as adding more nodes will reduce operation speed. This may be counter-intuitive to common scaling approaches, but it's due to etcd's [replication mechanisms](https://etcd.io/docs/v3.5/faq/#what-is-maximum-cluster-size).
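The arithmetic behind this recommendation can be sketched as follows. Etcd commits a write only after a majority (quorum) of members has acknowledged it, so a larger cluster means more acknowledgements per write. The function names below are illustrative.

```python
def quorum_size(members):
    """Number of members that must acknowledge a write (strict majority)."""
    return members // 2 + 1


def failure_tolerance(members):
    """Number of members that can fail while a quorum is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"{n} members: quorum {quorum_size(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

Three members tolerate one failure with a quorum of two. Moving to four members raises the quorum to three without tolerating any more failures, and five members tolerate two failures at the cost of a quorum of three for every write.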
Network latency between nodes also degrades etcd performance, because each write must be replicated to other members before it is committed. Etcd nodes should be located together with Rancher nodes.