Merge pull request #981 from superseb/prod_setup

Add production ready cluster setup
This commit is contained in:
Denise
2018-11-13 22:26:42 -08:00
committed by GitHub
3 changed files with 140 additions and 0 deletions
@@ -0,0 +1,137 @@
---
title: Production Ready Cluster
weight: 2510
---
While Rancher makes it easy to create Kubernetes clusters, a production-ready cluster takes more consideration and planning. Three roles can be assigned to nodes: `etcd`, `controlplane`, and `worker`. The following sections describe each role in more detail.
When designing your cluster(s), you have two options:
* Use dedicated nodes for each role. This ensures resource availability for the components needed for the specified role. It also strictly isolates network traffic between each of the roles according to the [Port Requirements]({{< baseurl >}}/rancher/v2.x/en/installation/references/).
* Assign the `etcd` and `controlplane` roles to the same nodes. These nodes must meet the hardware requirements for both roles.
>**Note:** Do not add the `worker` role to any node configured with either the `etcd` or `controlplane` role. This will make the nodes schedulable for regular workloads, which could interfere with critical cluster components running on the nodes with the `etcd` or `controlplane` role.
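As an illustration, the dedicated-role layout could be expressed in the `nodes` section of an RKE `cluster.yml` roughly as follows. The addresses and SSH user below are placeholders, not values from this guide:

```yaml
nodes:
  # Dedicated etcd node (placeholder address and user)
  - address: 10.0.1.10
    user: rancher
    role: [etcd]
  # Dedicated controlplane node
  - address: 10.0.1.20
    user: rancher
    role: [controlplane]
  # Dedicated worker node
  - address: 10.0.1.30
    user: rancher
    role: [worker]
```

For the second option, a node would list `role: [etcd, controlplane]` instead; per the note above, `worker` should not be combined with either of those roles.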
## etcd
Nodes with the `etcd` role run etcd, which is a consistent and highly available key value store used as Kubernetes backing store for all cluster data. etcd replicates the data to each node.
>**Note:** Nodes with the `etcd` role are shown as `Unschedulable` in the UI, meaning no pods will be scheduled to these nodes by default.
### Hardware Requirements
Please see [Kubernetes: Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/) and [etcd: Hardware Recommendations](https://coreos.com/etcd/docs/latest/op-guide/hardware.html) for the hardware requirements.
### Count of etcd Nodes
The number of nodes that you can lose at once while maintaining cluster availability is determined by the number of nodes assigned the `etcd` role. For a cluster with n members, a majority (quorum) of (n/2)+1 nodes is required. For example, a cluster with 3 `etcd` nodes needs 2 available nodes and can therefore tolerate the loss of 1. We recommend placing one `etcd` node in each of 3 different availability zones so the cluster survives the loss of an entire zone within a region. With only two zones, the cluster only survives the loss of the zone that does not hold the majority of nodes.
| Nodes with `etcd` role | Majority | Failure Tolerance |
|--------------|------------|-------------------|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | **1** |
| 4 | 3 | 1 |
| 5 | 3 | **2** |
| 6 | 4 | 2 |
| 7 | 4 | **3** |
| 8 | 5 | 3 |
| 9 | 5 | **4** |
References:
* [etcd cluster size](https://coreos.com/etcd/docs/latest/v2/admin_guide.html#optimal-cluster-size)
* [Operating etcd clusters for Kubernetes](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
### Network Latency
Rancher recommends minimizing latency between the etcd nodes. The default setting for `heartbeat-interval` is `500` milliseconds, and the default setting for `election-timeout` is `5000` milliseconds. These settings allow etcd to run in most networks, except networks with very high latency.
References:
* [etcd Tuning](https://coreos.com/etcd/docs/latest/tuning.html)
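If your network characteristics call for different values, the etcd timing parameters mentioned above can be passed through in an RKE `cluster.yml` via `extra_args`. This is only a sketch; the values shown are the defaults quoted above, and you should consult the etcd tuning guide before changing them:

```yaml
services:
  etcd:
    extra_args:
      # Both values are in milliseconds; these are the defaults
      # mentioned above, shown only for illustration.
      heartbeat-interval: 500
      election-timeout: 5000
```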
### Backups
etcd is the location where the state of your cluster is stored. Losing etcd data means losing your cluster. Make sure you configure [etcd Recurring Snapshots]({{< baseurl >}}/rancher/v2.x/en/backups/backups/ha-backups/#option-a-recurring-snapshots) for your cluster(s), and make sure the snapshots are stored externally (off the node) as well.
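For an RKE-built cluster, recurring snapshots can be sketched in `cluster.yml` as shown below. The intervals are examples, not recommendations, and note that these snapshots are written to the etcd nodes themselves, so they still need to be copied off the node to satisfy the external-storage advice above:

```yaml
services:
  etcd:
    snapshot: true    # enable recurring snapshots
    creation: 6h      # example: take a snapshot every 6 hours
    retention: 24h    # example: keep snapshots for 24 hours
```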
## controlplane
Nodes with the `controlplane` role run the Kubernetes master components (excluding `etcd`, as it's a separate role). See [Kubernetes: Master Components](https://kubernetes.io/docs/concepts/overview/components/#master-components) for a detailed list of components.
>**Note:** Nodes with the `controlplane` role are shown as `Unschedulable` in the UI, meaning no pods will be scheduled to these nodes by default.
References:
* [Kubernetes: Master Components](https://kubernetes.io/docs/concepts/overview/components/#master-components)
### Hardware Requirements
Please see [Kubernetes: Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/) for the hardware requirements.
### Count of controlplane Nodes
Adding more than one node with the `controlplane` role makes every master component highly available. See below for a breakdown of how high availability is achieved per component.
#### kube-apiserver
The Kubernetes API server (`kube-apiserver`) scales horizontally. Each node with the `controlplane` role is added to the NGINX proxy on the nodes running components that need to access the Kubernetes API server. If a node becomes unreachable, the local NGINX proxy forwards the request to another Kubernetes API server in the list.
#### kube-controller-manager
The Kubernetes controller manager uses leader election based on an endpoint in Kubernetes. One instance of `kube-controller-manager` creates an entry in the Kubernetes endpoints and renews that entry at a configured interval. The other instances see an active leader and wait for that entry to expire (for example, when the leader's node is unresponsive) before attempting to become leader.
#### kube-scheduler
The Kubernetes scheduler uses leader election based on an endpoint in Kubernetes. One instance of `kube-scheduler` creates an entry in the Kubernetes endpoints and renews that entry at a configured interval. The other instances see an active leader and wait for that entry to expire (for example, when the leader's node is unresponsive) before attempting to become leader.
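The leader-election behavior described above is controlled by the components' `--leader-elect-*` flags. If you ever needed to tune it in an RKE `cluster.yml`, it might look like the following sketch; leader election is enabled by default, and the durations shown are the upstream Kubernetes defaults, included only for illustration:

```yaml
services:
  kube-controller:
    extra_args:
      leader-elect: "true"                 # default; shown for illustration
      leader-elect-lease-duration: "15s"   # Kubernetes default
      leader-elect-renew-deadline: "10s"   # Kubernetes default
  scheduler:
    extra_args:
      leader-elect: "true"                 # default; shown for illustration
```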
## worker
Nodes with the `worker` role run the Kubernetes node components. See [Kubernetes: Node Components](https://kubernetes.io/docs/concepts/overview/components/#node-components) for a detailed list of components.
References:
* [Kubernetes: Node Components](https://kubernetes.io/docs/concepts/overview/components/#node-components)
### Hardware Requirements
The hardware requirements for nodes with the `worker` role depend mostly on your workloads. The minimum to run the Kubernetes node components is 1 CPU (core) and 1 GB of memory.
### Count of worker Nodes
Adding more than one node with the `worker` role will make sure your workloads can be rescheduled if a node fails.
## Networking
Cluster nodes should be located within a single region. Most cloud providers offer multiple availability zones within a region, which can be used to achieve higher availability for your cluster. Using multiple availability zones is fine for nodes with any role. If you are using [Kubernetes Cloud Provider]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/options/cloud-providers/) resources, consult the documentation for any restrictions (e.g., zone-specific storage restrictions).
## Cluster Diagram
This diagram is applicable to Kubernetes clusters built using RKE or [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/).
![Cluster diagram]({{< baseurl >}}/img/rancher/clusterdiagram.svg)<br/>
<sup>Lines show the traffic flow between components. Colors are used purely as a visual aid.</sup>
## Production checklist
* Nodes should have one of the following role configurations:
* `etcd`
* `controlplane`
* `etcd` and `controlplane`
  * `worker` (the `worker` role should never be added to nodes with the `etcd` or `controlplane` role)
* Allow network traffic between nodes only as described in the [Port Requirements]({{< baseurl >}}/rancher/v2.x/en/installation/references/).
* Have at least three nodes with the `etcd` role to survive losing one node. Increase this count for higher fault tolerance, and spread the nodes across (availability) zones for even better fault tolerance.
* Assign two or more nodes the `controlplane` role for master component high availability.
* Assign two or more nodes the `worker` role for workload rescheduling upon node failure.
* Enable etcd snapshots. Verify that snapshots are being created, and run a disaster recovery scenario to verify the snapshots are valid.
* Perform load tests on your cluster to verify that its hardware can support your workloads.
* Configure alerts/notifiers for Kubernetes components (System Service).
* Configure logging for cluster analysis and post-mortems.
## RKE cluster running Rancher HA
You may have noticed that our [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/) instructions do not meet our definition of a production-ready cluster, as there are no dedicated nodes for the `worker` role. However, this three-node cluster is valid for a Rancher installation because:
* It allows one `etcd` node failure.
* It maintains multiple instances of the master components by having multiple `controlplane` nodes.
* No workloads other than Rancher itself should be created on this cluster.
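The three-node Rancher HA layout described above could be expressed in `cluster.yml` roughly as follows; the addresses and SSH user are placeholders:

```yaml
nodes:
  # All three nodes carry all three roles; acceptable here because
  # this cluster runs only Rancher itself.
  - address: 10.0.2.10
    user: rancher
    role: [etcd, controlplane, worker]
  - address: 10.0.2.11
    user: rancher
    role: [etcd, controlplane, worker]
  - address: 10.0.2.12
    user: rancher
    role: [etcd, controlplane, worker]
```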