diff --git a/content/rancher/v2.x/en/cluster-provisioning/production/_index.md b/content/rancher/v2.x/en/cluster-provisioning/production/_index.md
new file mode 100644
index 00000000000..4c05895ede2
--- /dev/null
+++ b/content/rancher/v2.x/en/cluster-provisioning/production/_index.md
@@ -0,0 +1,137 @@
+---
+title: Production Ready Cluster
+weight: 2510
+---
+
+While Rancher makes it easy to create Kubernetes clusters, a production-ready cluster takes more consideration and planning. There are three roles that can be assigned to nodes: `etcd`, `controlplane`, and `worker`. The following sections describe each role in more detail.
+
+When designing your cluster(s), you have two options:
+
+* Use dedicated nodes for each role. This ensures resource availability for the components needed for the specified role. It also strictly isolates network traffic between each of the roles according to the [Port Requirements]({{< baseurl >}}/rancher/v2.x/en/installation/references/).
+* Assign the `etcd` and `controlplane` roles to the same nodes. These nodes must meet the hardware requirements for both roles.
+
+>**Note:** Do not add the `worker` role to any node configured with either the `etcd` or `controlplane` role. Doing so makes the nodes schedulable for regular workloads, which could interfere with critical cluster components running on the nodes with the `etcd` or `controlplane` role.
+
+## etcd
+
+Nodes with the `etcd` role run etcd, which is a consistent and highly available key-value store used as Kubernetes’ backing store for all cluster data. etcd replicates the data to each node.
+
+>**Note:** Nodes with the `etcd` role are shown as `Unschedulable` in the UI, meaning no pods will be scheduled to these nodes by default.
+
+### Hardware Requirements
+
+Please see [Kubernetes: Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/) and [etcd: Hardware Recommendations](https://coreos.com/etcd/docs/latest/op-guide/hardware.html) for the hardware requirements.
+
+### Count of etcd Nodes
+
+The number of nodes that you can lose at once while maintaining cluster availability is determined by the number of nodes assigned the `etcd` role. For a cluster with n members, a majority of (n/2)+1 members (with n/2 rounded down) must remain available. Therefore, we recommend creating one `etcd` node in each of three different availability zones to survive the loss of one availability zone within a region. If you use only two zones, you can only survive the loss of the zone that does not hold the majority of `etcd` nodes.
+
+| Nodes with `etcd` role | Majority | Failure Tolerance |
+|------------------------|----------|-------------------|
+| 1 | 1 | 0 |
+| 2 | 2 | 0 |
+| 3 | 2 | **1** |
+| 4 | 3 | 1 |
+| 5 | 3 | **2** |
+| 6 | 4 | 2 |
+| 7 | 4 | **3** |
+| 8 | 5 | 3 |
+| 9 | 5 | **4** |
+
+References:
+
+* [etcd cluster size](https://coreos.com/etcd/docs/latest/v2/admin_guide.html#optimal-cluster-size)
+* [Operating etcd clusters for Kubernetes](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
+
+### Network Latency
+
+Rancher recommends minimizing latency between the etcd nodes. The default setting for `heartbeat-interval` is `500` (milliseconds), and the default setting for `election-timeout` is `5000` (milliseconds). These settings allow etcd to run in most networks (except very high-latency networks).
+
+References:
+
+* [etcd Tuning](https://coreos.com/etcd/docs/latest/tuning.html)
+
+### Backups
+
+etcd is the location where the state of your cluster is stored. Losing etcd data means losing your cluster. Make sure you configure [etcd Recurring Snapshots]({{< baseurl >}}/rancher/v2.x/en/backups/backups/ha-backups/#option-a-recurring-snapshots) for your cluster(s), and make sure the snapshots are also stored externally (off the node).
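The quorum arithmetic from the "Count of etcd Nodes" table can be reproduced with a short calculation. The sketch below is illustrative only: the function names `etcd_quorum` and `zone_loss_survivable` and the zone labels are assumptions made for this example, not part of Rancher, RKE, or etcd.

```python
from typing import Dict, Tuple


def etcd_quorum(members: int) -> Tuple[int, int]:
    """Return (majority, failure_tolerance) for `members` etcd nodes."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    majority = members // 2 + 1  # (n/2)+1, with n/2 rounded down
    failure_tolerance = members - majority
    return majority, failure_tolerance


def zone_loss_survivable(nodes_per_zone: Dict[str, int]) -> Dict[str, bool]:
    """For each zone, report whether a majority remains if that zone is lost.

    `nodes_per_zone` maps a (hypothetical) zone label to its etcd node count.
    """
    total = sum(nodes_per_zone.values())
    majority, _ = etcd_quorum(total)
    return {
        zone: (total - count) >= majority
        for zone, count in nodes_per_zone.items()
    }


if __name__ == "__main__":
    # Reproduce the rows of the table for 1 to 9 etcd nodes.
    for n in range(1, 10):
        majority, tolerance = etcd_quorum(n)
        print(f"{n} etcd nodes: majority={majority}, failure tolerance={tolerance}")

    # One node in each of three zones versus three nodes split over two zones.
    print(zone_loss_survivable({"zone-a": 1, "zone-b": 1, "zone-c": 1}))
    print(zone_loss_survivable({"zone-a": 2, "zone-b": 1}))
```

With three nodes split over two zones, only the loss of the zone holding the single node leaves a majority, which matches the two-zone caveat described in the text.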
+
+## controlplane
+
+Nodes with the `controlplane` role run the Kubernetes master components (excluding `etcd`, as it's a separate role). See [Kubernetes: Master Components](https://kubernetes.io/docs/concepts/overview/components/#master-components) for a detailed list of components.
+
+>**Note:** Nodes with the `controlplane` role are shown as `Unschedulable` in the UI, meaning no pods will be scheduled to these nodes by default.
+
+References:
+
+* [Kubernetes: Master Components](https://kubernetes.io/docs/concepts/overview/components/#master-components)
+
+### Hardware Requirements
+
+Please see [Kubernetes: Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/) for the hardware requirements.
+
+### Count of controlplane Nodes
+
+Adding more than one node with the `controlplane` role makes every master component highly available. See below for a breakdown of how high availability is achieved per component.
+
+#### kube-apiserver
+
+The Kubernetes API server (`kube-apiserver`) scales horizontally. Each node with the `controlplane` role is added to the NGINX proxy on the nodes running components that need to access the Kubernetes API server. If a `controlplane` node becomes unreachable, the local NGINX proxy on the node forwards the request to another Kubernetes API server in the list.
+
+#### kube-controller-manager
+
+The Kubernetes controller manager performs leader election using an endpoint in Kubernetes. One instance of the `kube-controller-manager` creates an entry in the Kubernetes endpoints and updates that entry at a configured interval. Other instances see an active leader and wait for that entry to expire (for example, when a node is unresponsive).
+
+#### kube-scheduler
+
+The Kubernetes scheduler performs leader election using an endpoint in Kubernetes. One instance of the `kube-scheduler` creates an entry in the Kubernetes endpoints and updates that entry at a configured interval. Other instances see an active leader and wait for that entry to expire (for example, when a node is unresponsive).
+
+## worker
+
+Nodes with the `worker` role run the Kubernetes node components. See [Kubernetes: Node Components](https://kubernetes.io/docs/concepts/overview/components/#node-components) for a detailed list of components.
+
+References:
+
+* [Kubernetes: Node Components](https://kubernetes.io/docs/concepts/overview/components/#node-components)
+
+### Hardware Requirements
+
+The hardware requirements for nodes with the `worker` role mostly depend on your workloads. The minimum to run the Kubernetes node components is 1 CPU (core) and 1GB of memory.
+
+### Count of worker Nodes
+
+Adding more than one node with the `worker` role ensures that your workloads can be rescheduled if a node fails.
+
+## Networking
+
+Cluster nodes should be located within a single region. Most cloud providers offer multiple availability zones within a region, which can be used to create higher availability for your cluster. Using multiple availability zones is fine for nodes with any role. If you are using [Kubernetes Cloud Provider]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/options/cloud-providers/) resources, consult the documentation for any restrictions (for example, zone storage restrictions).
+
+## Cluster Diagram
+
+This diagram is applicable to Kubernetes clusters built using RKE or [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/).
+
+![Cluster diagram]({{< baseurl >}}/img/rancher/clusterdiagram.svg)
+Lines show the traffic flow between components. Colors are used purely as a visual aid.
+
+## Production checklist
+
+* Nodes should have one of the following role configurations:
+    * `etcd`
+    * `controlplane`
+    * `etcd` and `controlplane`
+    * `worker` (the `worker` role should not be used or added on nodes with the `etcd` or `controlplane` role)
+* Network traffic is allowed strictly according to [Port Requirements]({{< baseurl >}}/rancher/v2.x/en/installation/references/).
+* Have at least three nodes with the role `etcd` to survive losing one node. Increase this count for higher node fault tolerance, and spread them across (availability) zones to provide even better fault tolerance.
+* Assign two or more nodes the `controlplane` role for master component high availability.
+* Assign two or more nodes the `worker` role for workload rescheduling upon node failure.
+* Enable etcd snapshots. Verify that snapshots are being created, and run a disaster recovery scenario to verify the snapshots are valid.
+* Perform load tests on your cluster to verify that its hardware can support your workloads.
+* Configure alerts/notifiers for Kubernetes components (System Service).
+* Configure logging for cluster analysis and post-mortems.
+
+## RKE cluster running Rancher HA
+
+You may have noticed that our [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/) instructions do not meet our definition of a production-ready cluster, as there are no dedicated nodes for the `worker` role. However, for your Rancher installation, this three-node cluster is valid, as:
+
+* It allows one `etcd` node failure.
+* It maintains multiple instances of the master components by having multiple `controlplane` nodes.
+* No workloads other than Rancher itself should be created on this cluster.
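The role-related items of the production checklist lend themselves to an automated check. The following is a minimal sketch under stated assumptions: the function `check_node_roles`, the node names, and the node-to-roles mapping are hypothetical inputs invented for this example, and the code does not call any Rancher or RKE API.

```python
from typing import Dict, List, Set


def check_node_roles(nodes: Dict[str, Set[str]]) -> List[str]:
    """Return checklist violations for a mapping of node name to assigned roles."""
    problems: List[str] = []

    # The worker role should not be combined with etcd or controlplane.
    for name, roles in nodes.items():
        if "worker" in roles and roles & {"etcd", "controlplane"}:
            problems.append(f"{name}: worker role combined with etcd/controlplane")

    # Minimum node counts per role, taken from the checklist.
    minimums = {"etcd": 3, "controlplane": 2, "worker": 2}
    for role, minimum in minimums.items():
        count = sum(1 for roles in nodes.values() if role in roles)
        if count < minimum:
            problems.append(f"{role}: {count} node(s), checklist asks for at least {minimum}")

    return problems


if __name__ == "__main__":
    # Hypothetical cluster with etcd and controlplane sharing nodes.
    cluster = {
        "node1": {"etcd", "controlplane"},
        "node2": {"etcd", "controlplane"},
        "node3": {"etcd", "controlplane"},
        "node4": {"worker"},
        "node5": {"worker"},
    }
    print(check_node_roles(cluster))
```

A three-node cluster in which every node carries all three roles, like the Rancher HA install described above, would be reported by this check, which is consistent with the text's point that such a cluster is only acceptable for running Rancher itself.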
diff --git a/src/diagrams/clusterdiagram.xml b/src/diagrams/clusterdiagram.xml new file mode 100644 index 00000000000..9606ecb5514 --- /dev/null +++ b/src/diagrams/clusterdiagram.xml @@ -0,0 +1 @@ +7Z1bb6u4FoB/TR5T4QsYHtvu2WdGoy1V6sPMfqSJm6JNQw6ht/Prj0mAwLIJlxhDZmilNjUGA+vzYt1wF+T+9fM/sb97+RGtebjA1vpzQb4tMEaUWOJX2vKVtbgeObZs4mCdtZ0aHoP/8awx23HzFqz5vtIxiaIwCXbVxlW03fJVUmnz4zj6qHZ7jsLqqDt/w6WGx5Ufyq1/Bevk5djqYnZq/50Hm5d8ZOR4xy1P/urXJo7ettl4C0yeD1/Hza9+fqzsQvcv/jr6KDWR3xbkPo6i5Pjp9fOeh+nNzW/bcb/vNVuL8475Nmmzg0OPe7z74RvPT9kJxb536+A9PcHkK7spzn/f0rO6S/hnsvTDYLNdkNtUBmIoHosPYtvhXm+T5f4g0XQrcnefp33Fp032+zBG2vkgxjCKK4OI+0bW6be8719R/Csd73gEcWnHg1QPLJoP55+34sql4IOEeHoPkNj88RIk/HHnr9KtHwJp0faSvIbZ5kJG6R+r6DVYic/p1e6TOPrF74uTJ8y99+6/F1tydHB6pUEYlnpaFrNvXdGe3chv2V0kd+88TgJB4m22IYl22Y367r8GYTqjdtFuF2wF4neh/8TDu4K5fIBttBWXcheJSwqSdA9iFXchPT7/rMUFFRCK2c2jV57EX6JLtgPBxz2yee24GcYfp0nCHPfY9lKaIOJ6s8mZTcxNcegTnOJDxmcN3LQeVhUDT7GCimrHqWHBV8/WM4LyjqOnKImy1lJvQqln/6aCSIeokXtjV6TNqJe3lOSNc9GW5V0o8kvkja9ONd2LfeIovYaH0BdTsJuO6ndtdeevpH/WiQPoRDhPsEor2opZQhxy+Swhs040phOFZVuRNEFIkjRymULSyL1c0qyTpPvorz/fnni85YkwvSVebh/+EId75PF7o/F1hZB5nk98tzVkKP1WQbaJ/XXAT50zraODPYaBniFE1jOIKvQM1fAwdlvBVwbIknHIno/h4Zn8w98KNyqeoTELjWcpjLjBsPE6YwNpeFy98PVbOINiGBSqtPYHAyV3GC6w9kP+nJiz9XmyWs9BiEkY3EvktLS4qZY4RL1jWm+IHdt+j/bJP894mpCFLjQnhMOT2VDHLLSose52EhR7ui3kyRXzcI2PO9tyzT3siOzKbTfB9nO5i6PPL5Xkpybi4tHRSsSFgjAoYkKhJlCEZQaTsO3MeuAK9IAMyRIb1AN2u5BO2dScMTCDgVFlcbmX/FD34JgZGYwRbNA9duwZkckjorAqHdtkEMXpliK4Qgim5GpaXkXWjLZ1NHUkxxmaDcwrUAlEClYxh5lUCaxdsGp+cEyLEhdTg4zINQRzOGJ4ETOjCRZXFZEAcuTb9W1aPZvektDf71PRlEXHP4Pk79Lnn6nkxEVI8lj73H1eqWTqrFz+9JzuvxUXUBws/eN0tOOJ8bVUpNt4t0t3UhX/z9tiHvpJ8F49vOruZiM8RMEhQ5QJk7r2jWszRPHhJ0EVwS6RdZNvEj8d1wVlPvvoLV7x7JgnCXYcxvOqG3F1kMSPNzyRBhHy9b9K3XZph339pdqWAy/u7DnD/siq9BcfjmdwYraQXjuMVQH2GeN+GDP3htTyxVxXRV9niM8O4npnB9EEMfUkK8z1zp+1tIfrVfa4GGRPZcB3BLkHrhn6loR+AbJ1fSATis8wRoXBzU4K2fWcfiSfHwVZyIQ+Jlgq1HDPn7W0A7KwXpJVJajDk3yG13rK25IsTv5AxaLiRB1lWDaaxyIeEbeqnTAwEdtCjdzqgRgDB9LFLQInnL3MUYst6O9ptiOQ3S
5TP1dxjBBasykbt4pDkZGZPWLNHrHNoJBN5tyKkrE5gDrp0JhMidEMPXLaxdnnFL15DsyqC3yxupjj6OYhMZmjR44GR2iOTR17MOSZCLGeH8ZMiLVIHLcMscL+ukOsSLkoxBya6sWxzRwDoanzoxgKTdm20y00Je+gOzSFWqa/Zy9/BC//lP8cx8mf094GSuIsKGSjVjtrt2LM7OSPXDgpUWLWyWftqmtnJ988B2bVxeVv7cxOvnlIjDr5TFVjPTv5vZwjbKsdE91efsM4Ztz8Iovc0s2H/bW7+UxDKdXs5ufCAVVOVdnp8vMbhjHk6LuwNqrBz5f6a3fzexe37sUtSc4SXjxkLenxqZgAlmVbvIJ5UZDys0z231Xqfy5qHtzF8dpOgGrhSmHZlitXspzWaEofy+t3wCdza8XO5GN54FiaoEdyuALBxXXBLo5cxgB20UB+X3NkOPJPfDMJ8NO0wHBe6GY/f7m6gv5xPoyF/tJBNdGQruAvETwSGQZ7ophhVoPhgmjDLpdjn0/yCWFfkM3OaHw8uMpXYj+uxpdWEu6r7h0Fi4NQT2ln6h0E62wHoH5+v9dceoJ61QpU5WKvnmLWaFnsFXl96lPn5V7/KVEuG4242Cvyuq9IMi/3OgVo4Bolhpd7LbK184KvV4eK4QVfsTWvYHGdoBhdKhFnocI5J3dlkJhd+QhbGmpvu3n+3w9fiz5vIuILXkVcTC6ohXLTNLc5+rr3RbI/z+sOFMp11Odb69mD8lxHr1ePLQ3ltn3RVSaicSUbd0FeGSQjcpd2QqGp4nXXwsnvyW7xAm7OrjMMu9IJN8AL++unV1U7Nb7ibY9omcZj8HQsGIv/kgd9/s6K1AYHgsve60oOWEPDpWHRokvhUui/U6zfIl5FdVpFWuuBx4G41NTw+mZp1pnjPvARfEz3TWJhYDgMlcOykaccp/a8PPV5te2vfxpoqDkbYBpco461obHaV8dSOAug1asLXjI0XKpY7/gP8JOO9SoJ1aV+JTu9jClxNKlYisyoWLgYLm7IltrgHcau/bXPAuX/nxp0FmT/F7wuRsDOlUQeNioqB9Kv/tOAyPNgVF29hK63tDhRa4MYg3nA4HpJw4QWvIYaXxiJ0E61hkXm+lJdYziwsnJHMvImqM6s6rG4doHbZPfF2gV+nj0Q1TCI0IS1FKTQzjU2zXVR2V5fv95vrbp+VDuKut5xdbXTgGL7ysbqgehAC9IhOHeyGvLa0nsYU7F1M208gzH62xog1qEIdYzLNGLaoMYjUY3Pv5oxPNXGkxsTo9qbGtRA4hQm2RqZLupFzCCcT5XxEDae4ZgYwtPTyyAaYMP8Rmu9DCK31DbEdOaYjce08YiyZEDLLz+fizXXz4R+TBd1RRPyCx1g+tqdVXN+IGiMD6SqHRADb7KhYX/9WGt4pfnKsVZkXEbGmkBK+tZYUJAKoQPVWEhYNxjRw2NtPEUzOaytyWG9hDT2ToOD4wyUooGUNtnVg1OdX/e/luriNZsJUU3VkHSGGhrWQ+VbINQNhvXwUBuP4nFbuEPVd+vL+XJUhtq6sWxaATv/A2bSJdaLYXqyPrnsInQiMfT9Wi/KAsuFWlbb9cLLeDitG16ezcbBy56eKnUgYL1fe6fgTeSWyrQXYMaDXQrA8BnCmhdQGAKlcWvWYDVQb5BAgGGotXKcPJcFxqm1gOnZ/hqeysbriBVUs1GpnmAdhAMfnH3Ta9KK+ANl1zB4KR81uFC2e7a/Bq6Nh3E7aes2PtSFXBNF8ea4ZiUo7ulduglzAMOt8FQt3UQNbxQRONk69tfvchkP+nZT7uNMgrGDvtdWwUxpt2lAvW7TAPa/cBqIP+MoXc3n1D32dy8/ojVPe/wf \ No newline at end of file diff --git a/src/img/rancher/clusterdiagram.svg b/src/img/rancher/clusterdiagram.svg new file 
mode 100644 index 00000000000..0a12d697a61 --- /dev/null +++ b/src/img/rancher/clusterdiagram.svg @@ -0,0 +1,2 @@ + +
[SVG markup not recoverable; text labels in the cluster diagram: Worker, Control Plane, Host, etcd, Kubernetes API Server, Kubernetes Controller Manager, Kubernetes Scheduler, Kubernetes Kubelet, Kubernetes Proxy, nginx-proxy]
\ No newline at end of file