From 5cf490dd4c8695b39cd16c593bc4a3bfa35a65f3 Mon Sep 17 00:00:00 2001
From: Prachi Damle
Date: Fri, 16 Oct 2020 23:32:30 -0700
Subject: [PATCH 1/9] Edit docs to remove cis-edit role

---
 .../rancher/v2.x/en/cis-scans/v2.5/rbac/_index.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/content/rancher/v2.x/en/cis-scans/v2.5/rbac/_index.md b/content/rancher/v2.x/en/cis-scans/v2.5/rbac/_index.md
index 79046fa30ef..078c859a6f2 100644
--- a/content/rancher/v2.x/en/cis-scans/v2.5/rbac/_index.md
+++ b/content/rancher/v2.x/en/cis-scans/v2.5/rbac/_index.md
@@ -10,13 +10,16 @@ This section describes the permissions required to use the rancher-cis-benchmark

 The rancher-cis-benchmark is a cluster-admin only feature by default.

-However, the `rancher-cis-benchmark` chart installs three default `ClusterRoles`:
+However, the `rancher-cis-benchmark` chart installs these two default `ClusterRoles`:

 - cis-admin
-- cis-edit
 - cis-view

 In Rancher, only cluster owners and global administrators have `cis-admin` access by default.

+Note: The `cis-edit` role that was added in the Rancher v2.5 setup has been removed as of
+Rancher v2.5.2, because it was essentially the same as `cis-admin`. If you created any `ClusterRoleBindings`
+for `cis-edit`, please update them to use the `cis-admin` ClusterRole instead.
+
 # Cluster-Admin Access

 Rancher CIS Scans is a cluster-admin only feature by default.
@@ -37,11 +40,12 @@ The rancher-cis-benchmark creates three `ClusterRoles` and adds the CIS Benchmar

 | ClusterRole created by chart | Default K8s ClusterRole | Permissions given with Role
 | ------------------------------| ---------------------------| ---------------------------|
 | `cis-admin` | `admin`| Ability to CRUD clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CR
-| `cis-edit`| `edit` | Ability to CRUD clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CR
 | `cis-view` | `view `| Ability to List(R) clusterscanbenchmarks, clusterscanprofiles, clusterscans, clusterscanreports CR
+
 By default, only the cluster-owner role will have the ability to manage and use the `rancher-cis-benchmark` feature.
-The other Rancher roles (cluster-member, project-owner, project-member) do not have default permissions to manage and use rancher-cis-benchmark resources.
+The other Rancher roles (cluster-member, project-owner, project-member) do not have any default permissions to manage and use rancher-cis-benchmark resources.

-But if a cluster-owner wants to delegate access to other users, they can do so by creating ClusterRoleBindings between these users and the CIS ClusterRoles manually.
+But if a cluster-owner wants to delegate access to other users, they can do so by creating ClusterRoleBindings between these users and the above CIS ClusterRoles manually.
+There is no automatic role aggregation supported for the `rancher-cis-benchmark` ClusterRoles.
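+
+As a sketch of that manual delegation, assuming a hypothetical user named `jsmith` (substitute a real user or group from your authentication provider), such a binding could look like this:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: cis-admin-jsmith            # hypothetical binding name
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: cis-admin                   # ClusterRole installed by the rancher-cis-benchmark chart
+subjects:
+- apiGroup: rbac.authorization.k8s.io
+  kind: User
+  name: jsmith                      # hypothetical user being granted access
+```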
From 39cd4ec3d5c3e05d6a24689d61712846f159baf5 Mon Sep 17 00:00:00 2001 From: catherineluse Date: Mon, 26 Oct 2020 12:00:27 -0700 Subject: [PATCH 2/9] Delete inaccurate paragraphs from logging page --- content/rancher/v2.x/en/logging/v2.5/_index.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/content/rancher/v2.x/en/logging/v2.5/_index.md b/content/rancher/v2.x/en/logging/v2.5/_index.md index 329b5fce0c3..f186543081b 100644 --- a/content/rancher/v2.x/en/logging/v2.5/_index.md +++ b/content/rancher/v2.x/en/logging/v2.5/_index.md @@ -5,13 +5,11 @@ weight: 1 --- - [Changes in Rancher v2.5](#changes-in-rancher-v2-5) -- [Configuring the Logging Output for the Rancher Kubernetes Cluster](#configuring-the-logging-output-for-the-rancher-kubernetes-cluster) - [Enabling Logging for Rancher Managed Clusters](#enabling-logging-for-rancher-managed-clusters) - [Uninstall Logging](#uninstall-logging) - [Configuring the Logging Application](#configuring-the-logging-application) - [Working with Taints and Tolerations](#working-with-taints-and-tolerations) - ### Changes in Rancher v2.5 The following changes were introduced to logging in Rancher v2.5: @@ -30,12 +28,6 @@ The following figure from the [Banzai documentation](https://banzaicloud.com/doc ![How the Banzai Cloud Logging Operator Works with Fluentd]({{}}/img/rancher/banzai-cloud-logging-operator.png) -### Configuring the Logging Output for the Rancher Kubernetes Cluster - -If you install Rancher as a Helm chart, you'll configure the Helm chart options to select a logging output for all the logs in the local Kubernetes cluster. - -If you install Rancher using the Rancher CLI on an Linux OS, the Rancher Helm chart will be installed on a Kubernetes cluster with default options. Then when the Rancher UI is available, you'll enable the logging app from the Apps section of the UI. Then during the process of installing the logging application, you will configure the logging output. - ### Enabling Logging for Rancher Managed Clusters You can enable the logging for a Rancher managed cluster by going to the Apps page and installing the logging app. From 81cbba45726a50820f9a807808eaead430451ffe Mon Sep 17 00:00:00 2001 From: haydndup Date: Sun, 25 Oct 2020 14:27:22 +0100 Subject: [PATCH 3/9] Fix small typo --- .../v2.x/en/installation/resources/troubleshooting/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/rancher/v2.x/en/installation/resources/troubleshooting/_index.md b/content/rancher/v2.x/en/installation/resources/troubleshooting/_index.md index c4fe4af00c9..df6aa08e731 100644 --- a/content/rancher/v2.x/en/installation/resources/troubleshooting/_index.md +++ b/content/rancher/v2.x/en/installation/resources/troubleshooting/_index.md @@ -76,7 +76,7 @@ kubectl -n cattle-system logs -f rancher-784d94f59b-vgqzh Use your browser to check the certificate details. If it says the Common Name is "Kubernetes Ingress Controller Fake Certificate", something may have gone wrong with reading or issuing your SSL cert. -> **Note:** if you are using LetsEncrypt to issue certs it can sometimes take a few minuets to issue the cert. +> **Note:** if you are using LetsEncrypt to issue certs it can sometimes take a few minutes to issue the cert. 
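+
+As a quick sanity check, you can also inspect the certificate Rancher is actually serving from the command line. This is only a sketch; `rancher.example.com` is a placeholder for your own Rancher hostname:
+
+```
+# Print the subject, issuer and validity dates of the certificate being served
+openssl s_client -connect rancher.example.com:443 -servername rancher.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates
+```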
### Checking for issues with cert-manager issued certs (Rancher Generated or LetsEncrypt)

From e3792c5e5b4a974685c1885322ffa0928e8a2f0a Mon Sep 17 00:00:00 2001
From: catherineluse
Date: Mon, 26 Oct 2020 13:15:06 -0700
Subject: [PATCH 4/9] Edit Rancher v2.5 logging docs

---
 .../rancher/v2.x/en/logging/v2.5/_index.md | 48 +++++++++++++------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/content/rancher/v2.x/en/logging/v2.5/_index.md b/content/rancher/v2.x/en/logging/v2.5/_index.md
index f186543081b..cc8b7a446c6 100644
--- a/content/rancher/v2.x/en/logging/v2.5/_index.md
+++ b/content/rancher/v2.x/en/logging/v2.5/_index.md
@@ -7,10 +7,12 @@ weight: 1
 - [Changes in Rancher v2.5](#changes-in-rancher-v2-5)
 - [Enabling Logging for Rancher Managed Clusters](#enabling-logging-for-rancher-managed-clusters)
 - [Uninstall Logging](#uninstall-logging)
+- [Role-based Access Control](#role-based-access-control)
 - [Configuring the Logging Application](#configuring-the-logging-application)
 - [Working with Taints and Tolerations](#working-with-taints-and-tolerations)

-### Changes in Rancher v2.5
+
+# Changes in Rancher v2.5

 The following changes were introduced to logging in Rancher v2.5:

@@ -28,7 +30,7 @@ The following figure from the [Banzai documentation](https://banzaicloud.com/doc

 ![How the Banzai Cloud Logging Operator Works with Fluentd]({{}}/img/rancher/banzai-cloud-logging-operator.png)

-### Enabling Logging for Rancher Managed Clusters
+# Enabling Logging for Rancher Managed Clusters

 You can enable the logging for a Rancher managed cluster by going to the Apps page and installing the logging app.

@@ -39,7 +41,7 @@ You can enable the logging for a Rancher managed cluster by going to the Apps pa

 **Result:** The logging app is deployed in the `cattle-logging-system` namespace.

-### Uninstall Logging
+# Uninstall Logging

 1. From the **Cluster Explorer,** click **Apps & Marketplace.**
 1. Click **Installed Apps.**
@@ -49,7 +51,27 @@ You can enable the logging for a Rancher managed cluster by going to the Apps pa

 **Result:** `rancher-logging` is uninstalled.

-### Configuring the Logging Application
+# Role-based Access Control
+
+Rancher logging has two roles, `logging-admin` and `logging-view`.
+
+`logging-admin` allows users full access to namespaced flows and outputs.
+
+The `logging-view` role allows users to view namespaced flows and outputs, and cluster flows and outputs.
+
+Edit access to the cluster flow and cluster output resources is powerful, because it gives any user with that access control of all the logs in the cluster.
+
+In Rancher, the cluster administrator role is the only role with full access to all rancher-logging resources.
+
+Cluster members are not able to edit or read any logging resources.
+
+Project owners are able to create namespaced flows and outputs in the namespaces under their projects. This means that project owners can collect logs from anything in their project namespaces. Project members are able to view the flows and outputs in the namespaces under their projects. Project owners and project members require at least one namespace in their project to use logging. If they do not have at least one namespace in their project, they may not see the logging button in the top nav dropdown.
+
+# Configuring the Logging Application
+
+To configure the logging application, go to the **Cluster Explorer** in the Rancher UI.
In the upper left corner, click **Cluster Explorer > Logging.**
+
+### Overview of Logging Custom Resources

 The following Custom Resource Definitions are used to configure logging:

@@ -60,11 +82,7 @@ According to the [Banzai Cloud documentation,](https://banzaicloud.com/docs/one-

 > You can define `outputs` (destinations where you want to send your log messages, for example, Elasticsearch, or an Amazon S3 bucket), and `flows` that use filters and selectors to route log messages to the appropriate outputs. You can also define cluster-wide outputs and flows, for example, to use a centralized output that namespaced users cannot modify.

-**RBAC**
-
-Rancher logging has two roles, `logging-admin` and `logging-view`. `logging-admin` allows users full access to namespaced flows and outputs. The `logging-view` role allows users to view namespaced flows and outputs, and cluster flows and outputs. Edit access to the cluster flow and cluster output resources is powerful as it allows any user with edit access control of all logs in the cluster. Cluster admin is the only role with full access to all rancher-logging resources. Cluster members are not able to edit or read any logging resources. Project owners are able to create namespaced flows and outputs in the namespaces under their projects. This means that project owners can collect logs from anything in their project namespaces. Project members are able to view the flows and outputs in the namespaces under their projects. Project owners and project members require at least 1 namespace in their project to use logging. If they do not have at least one namespace in their project they may not see the logging button in the top nav dropdown.
-
-**Examples**
+### Examples

 Let's say you wanted to send all logs in your cluster to an elasticsearch cluster.

@@ -249,7 +267,7 @@ spec:

 If we break down what is happening: first, we create a deployment of a container that has the additional syslog plugin and accepts logs forwarded from another fluentd. Next, we create an output configured as a forwarder to our deployment. The deployment's fluentd will then forward all logs to the configured syslog destination.

-### Working with Taints and Tolerations
+# Working with Taints and Tolerations

 "Tainting" a Kubernetes node causes pods to be repelled from running on that node.
 Unless the pods have a ```toleration``` for that node's taint, they will run on other nodes in the cluster.

 Using ```nodeSelector``` gives pods an affinity towards certain nodes.
 Both provide a choice for what node(s) the pod will run on.

-**Default Implementation in Rancher's Logging Stack**
+### Default Implementation in Rancher's Logging Stack

 By default, Rancher taints all Linux nodes with ```cattle.io/os=linux```, and does not taint Windows nodes.
 The logging stack pods have ```tolerations``` for this taint, which enables them to run on Linux nodes.

@@ -282,14 +300,14 @@ spec:

 In the above example, we ensure that our pod only runs on Linux nodes, and we add a ```toleration``` for the taint we have on all of our Linux nodes.
 You can do the same with Rancher's existing taints, or with your own custom ones.

-**Are clusters with Windows worker nodes supported?**
+### Windows Support

-Yes, clusters with Windows worker support logging with some small caveats...
+Clusters with Windows worker nodes support logging, with some small caveats:

 1. Windows node logs are currently unable to be exported.
 2. 
```fluentd-configcheck``` pod(s) will fail due to an [upstream issue](https://github.com/banzaicloud/logging-operator/issues/592), where ```tolerations``` and ```nodeSelector``` settings are not inherited from the ```logging-operator```.

-**Adding NodeSelector Settings and Tolerations for Custom Taints**
+### Adding NodeSelector Settings and Tolerations for Custom Taints

 If you would like to add your own ```nodeSelector``` settings, or if you would like to add ```tolerations``` for additional taints, you can pass the following to the chart's values.

@@ -308,4 +326,4 @@ However, if you would like to add tolerations for *only* the ```fluentbit``` con
 ```yaml
 fluentbit_tolerations:
 # insert tolerations list for fluentbit containers only
-```
+```
\ No newline at end of file

From 1927e15636c365857f13c817dd20d5f280818fe3 Mon Sep 17 00:00:00 2001
From: Chris Kim
Date: Tue, 27 Oct 2020 15:02:54 -0400
Subject: [PATCH 5/9] Add selinux documentation

Signed-off-by: Chris Kim
---
 content/k3s/latest/en/advanced/_index.md | 14 ++++++++++++--
 .../en/installation/install-options/_index.md | 6 ++++--
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/content/k3s/latest/en/advanced/_index.md b/content/k3s/latest/en/advanced/_index.md
index 8d39be5ea44..6ae8c6946cb 100644
--- a/content/k3s/latest/en/advanced/_index.md
+++ b/content/k3s/latest/en/advanced/_index.md
@@ -306,14 +306,24 @@ sudo reboot

 # Experimental SELinux Support

-As of release v1.17.4+k3s1, experimental support for SELinux has been added to K3s's embedded containerd. If you are installing K3s on a system where SELinux is enabled by default (such as CentOS), you must ensure the proper SELinux policies have been installed. The [install script]({{}}/k3s/latest/en/installation/install-options/#installation-script-options) will fail if they are not. The necessary policies can be installed with the following commands:
+As of release v1.17.4+k3s1, experimental support for SELinux has been added to K3s's embedded containerd. If you are installing K3s on a system where SELinux is enabled by default (such as CentOS), you must ensure the proper SELinux policies have been installed.
+
+{{% tabs %}}
+{{% tab "automatic installation" %}}
+As of release v1.19.3+k3s2, the [install script]({{}}/k3s/latest/en/installation/install-options/#installation-script-options) will automatically install the SELinux RPM from the Rancher RPM repository on a compatible system, unless an air-gapped install is being performed. Automatic installation can be skipped by setting `INSTALL_K3S_SKIP_SELINUX_RPM=true`.
+{{%/tab%}}
+{{% tab "manual installation" %}}
+The necessary policies can be installed with the following commands:
 ```
 yum install -y container-selinux selinux-policy-base
-rpm -i https://rpm.rancher.io/k3s-selinux-0.1.1-rc1.el7.noarch.rpm
+yum install -y https://rpm.rancher.io/k3s/latest/common/centos/7/noarch/k3s-selinux-0.2-1.el7_8.noarch.rpm
 ```

 To force the install script to log a warning rather than fail, you can set the following environment variable: `INSTALL_K3S_SELINUX_WARN=true`.
+{{%/tab%}}
+{{% /tabs %}}
+
 The way that SELinux enforcement is enabled or disabled depends on the K3s version. Prior to v1.19.x, SELinux enablement for the builtin containerd was automatic but could be disabled by passing `--disable-selinux`. With v1.19.x and beyond, enabling SELinux must be affirmatively configured via the `--selinux` flag or config file entry.
Servers and agents that specify both the `--selinux` and (deprecated) `--disable-selinux` flags will fail to start. Using a custom `--data-dir` under SELinux is not supported. To customize it, you would most likely need to write your own custom policy. For guidance, you could refer to the [containers/container-selinux](https://github.com/containers/container-selinux) repository, which contains the SELinux policy files for Container Runtimes, and the [rancher/k3s-selinux](https://github.com/rancher/k3s-selinux) repository, which contains the SELinux policy for K3s.

diff --git a/content/k3s/latest/en/installation/install-options/_index.md b/content/k3s/latest/en/installation/install-options/_index.md
index 096ba7da4a6..cffd0b436d5 100644
--- a/content/k3s/latest/en/installation/install-options/_index.md
+++ b/content/k3s/latest/en/installation/install-options/_index.md
@@ -40,8 +40,10 @@ When using this method to install K3s, the following environment variables can b
 | `INSTALL_K3S_SYSTEMD_DIR` | Directory to install systemd service and environment files to, or use `/etc/systemd/system` as the default. |
 | `INSTALL_K3S_EXEC` | Command with flags to use for launching K3s in the service. If the command is not specified, and the `K3S_URL` is set, it will default to "agent." If `K3S_URL` is not set, it will default to "server." For help, refer to [this example.]({{}}/k3s/latest/en/installation/install-options/how-to-flags/#example-b-install-k3s-exec) |
 | `INSTALL_K3S_NAME` | Name of systemd service to create, will default to 'k3s' if running k3s as a server and 'k3s-agent' if running k3s as an agent. If specified, the name will be prefixed with 'k3s-'. |
-| `INSTALL_K3S_TYPE` | Type of systemd service to create, will default from the K3s exec command if not specified.
-| `INSTALL_K3S_CHANNEL_URL` | Channel URL for fetching K3s download URL. Defaults to https://update.k3s.io/v1-release/channels.
+| `INSTALL_K3S_TYPE` | Type of systemd service to create, will default from the K3s exec command if not specified. |
+| `INSTALL_K3S_SELINUX_WARN` | If set to true, will continue if the k3s-selinux policy is not found. |
+| `INSTALL_K3S_SKIP_SELINUX_RPM` | If set to true, will skip automatic installation of the k3s-selinux RPM. |
+| `INSTALL_K3S_CHANNEL_URL` | Channel URL for fetching K3s download URL. Defaults to https://update.k3s.io/v1-release/channels. |
 | `INSTALL_K3S_CHANNEL` | Channel to use for fetching K3s download URL. Defaults to "stable". Options include: `stable`, `latest`, `testing`. |

From 47f83f0a60c53e70df63748523f31647bafd65bc Mon Sep 17 00:00:00 2001
From: Catherine Luse
Date: Tue, 27 Oct 2020 15:42:46 -0700
Subject: [PATCH 6/9] Add spaces

---
 content/k3s/latest/en/advanced/_index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/k3s/latest/en/advanced/_index.md b/content/k3s/latest/en/advanced/_index.md
index 6ae8c6946cb..178e58c4e38 100644
--- a/content/k3s/latest/en/advanced/_index.md
+++ b/content/k3s/latest/en/advanced/_index.md
@@ -311,7 +311,7 @@ As of release v1.17.4+k3s1, experimental support for SELinux has been added to K
 {{% tabs %}}
 {{% tab "automatic installation" %}}
 As of release v1.19.3+k3s2, the [install script]({{}}/k3s/latest/en/installation/install-options/#installation-script-options) will automatically install the SELinux RPM from the Rancher RPM repository on a compatible system, unless an air-gapped install is being performed. Automatic installation can be skipped by setting `INSTALL_K3S_SKIP_SELINUX_RPM=true`.
-{{%/tab%}} +{{% /tab %}} {{% tab "manual installation" %}} The necessary policies can be installed with the following commands: ``` @@ -321,7 +321,7 @@ yum install -y https://rpm.rancher.io/k3s/latest/common/centos/7/noarch/k3s-seli To force the install script to log a warning rather than fail, you can set the following environment variable: `INSTALL_K3S_SELINUX_WARN=true`. -{{%/tab%}} +{{% /tab %}} {{% /tabs %}} The way that SELinux enforcement is enabled or disabled depends on the K3s version. Prior to v1.19.x, SELinux enablement for the builtin containerd was automatic but could be disabled by passing `--disable-selinux`. With v1.19.x and beyond, enabling SELinux must be affirmatively configured via the `--selinux` flag or config file entry. Servers and agents that specify both the `--selinux` and (deprecated) `--disable-selinux` flags will fail to start. From 06eaa70e3df0429181f1789d678d385080c5809b Mon Sep 17 00:00:00 2001 From: Ansil H Date: Wed, 28 Oct 2020 23:04:17 +0530 Subject: [PATCH 7/9] Removed yaml file names Removed YAML file names as the link point to a new page --- .../resources/advanced/rke-add-on/layer-4-lb/_index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/rancher/v2.x/en/installation/resources/advanced/rke-add-on/layer-4-lb/_index.md b/content/rancher/v2.x/en/installation/resources/advanced/rke-add-on/layer-4-lb/_index.md index c6e43f7248a..0c5000fd3d6 100644 --- a/content/rancher/v2.x/en/installation/resources/advanced/rke-add-on/layer-4-lb/_index.md +++ b/content/rancher/v2.x/en/installation/resources/advanced/rke-add-on/layer-4-lb/_index.md @@ -166,8 +166,8 @@ RKE uses a `.yml` config file to install and configure your Kubernetes cluster. 1. Download one of following templates, depending on the SSL certificate you're using. - - [Template for self-signed certificate
`3-node-certificate.yml`]({{}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate) - - [Template for certificate signed by recognized CA
`3-node-certificate-recognizedca.yml`]({{}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate-recognizedca) + - [Template for self-signed certificate
]({{}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate) + - [Template for certificate signed by recognized CA
]({{}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate-recognizedca) From cd5668f0fce6ae277a9e95f18ce940bb4585cefa Mon Sep 17 00:00:00 2001 From: catherineluse Date: Thu, 29 Oct 2020 08:50:34 -0700 Subject: [PATCH 8/9] Add content to Best Practices Guide --- .../rancher/v2.x/en/best-practices/_index.md | 18 +-- .../en/best-practices/v2.0-v2.4/_index.md | 21 +++ .../{ => v2.0-v2.4}/containers/_index.md | 2 + .../deployment-strategies/_index.md | 2 + .../deployment-types/_index.md | 2 + .../{ => v2.0-v2.4}/management/_index.md | 2 + .../v2.x/en/best-practices/v2.5/_index.md | 21 +++ .../v2.5/rancher-managed/_index.md | 21 +++ .../v2.5/rancher-managed/containers/_index.md | 51 +++++++ .../v2.5/rancher-managed/logging/_index.md | 85 ++++++++++++ .../rancher-managed/managed-vsphere/_index.md | 54 ++++++++ .../v2.5/rancher-managed/monitoring/_index.md | 112 +++++++++++++++ .../v2.5/rancher-server/_index.md | 19 +++ .../deployment-strategies/_index.md | 45 ++++++ .../rancher-server/deployment-types/_index.md | 39 ++++++ .../rancher-in-vsphere/_index.md | 85 ++++++++++++ .../img/rancher/rancher-on-prem-vsphere.svg | 128 ++++++++++++++++++ .../img/rancher/solution_overview.drawio.svg | 3 + 18 files changed, 695 insertions(+), 15 deletions(-) create mode 100644 content/rancher/v2.x/en/best-practices/v2.0-v2.4/_index.md rename content/rancher/v2.x/en/best-practices/{ => v2.0-v2.4}/containers/_index.md (98%) rename content/rancher/v2.x/en/best-practices/{ => v2.0-v2.4}/deployment-strategies/_index.md (96%) rename content/rancher/v2.x/en/best-practices/{ => v2.0-v2.4}/deployment-types/_index.md (98%) rename content/rancher/v2.x/en/best-practices/{ => v2.0-v2.4}/management/_index.md (99%) create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/containers/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-server/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-strategies/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-types/_index.md create mode 100644 content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md create mode 100644 static/img/rancher/rancher-on-prem-vsphere.svg create mode 100644 static/img/rancher/solution_overview.drawio.svg diff --git a/content/rancher/v2.x/en/best-practices/_index.md b/content/rancher/v2.x/en/best-practices/_index.md index 37fa10ac32a..8e90b638a89 100644 --- a/content/rancher/v2.x/en/best-practices/_index.md +++ b/content/rancher/v2.x/en/best-practices/_index.md @@ -3,20 +3,8 @@ title: Best Practices Guide weight: 4 --- -> The Best Practices Guide will be updated for Rancher v2.5. +The purpose of this section is to consolidate best practices for Rancher implementations. -The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. 
The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers. +If you are using Rancher v2.0-v2.4, refer to the Best Practices Guide [here.](./v2.0-v2.4) -If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support. - -Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server. - -For more guidance on best practices, you can consult these resources: - -- [Security]({{}}/rancher/v2.x/en/security/) -- [Rancher Blog](https://rancher.com/blog/) - - [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/) - - [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/) -- [Rancher Forum](https://forums.rancher.com/) -- [Rancher Users Slack](https://slack.rancher.io/) -- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured) +If you are using Rancher v2.5, refer to the Best Practices Guide [here.](./v2.5) \ No newline at end of file diff --git a/content/rancher/v2.x/en/best-practices/v2.0-v2.4/_index.md b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/_index.md new file mode 100644 index 00000000000..712e6daaf14 --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/_index.md @@ -0,0 +1,21 @@ +--- +title: Best Practices Guide for Rancher v2.0-v2.4 +shortTitle: v2.0-v2.4 +weight: 2 +--- + +The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers. + +If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support. + +Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server. 
+ +For more guidance on best practices, you can consult these resources: + +- [Security]({{}}/rancher/v2.x/en/security/) +- [Rancher Blog](https://rancher.com/blog/) + - [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/) + - [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/) +- [Rancher Forum](https://forums.rancher.com/) +- [Rancher Users Slack](https://slack.rancher.io/) +- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured) diff --git a/content/rancher/v2.x/en/best-practices/containers/_index.md b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/containers/_index.md similarity index 98% rename from content/rancher/v2.x/en/best-practices/containers/_index.md rename to content/rancher/v2.x/en/best-practices/v2.0-v2.4/containers/_index.md index 83f1cc182ec..6a66698e266 100644 --- a/content/rancher/v2.x/en/best-practices/containers/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/containers/_index.md @@ -1,6 +1,8 @@ --- title: Tips for Setting Up Containers weight: 100 +aliases: + - /rancher/v2.x/en/best-practices/containers --- Running well built containers can greatly impact the overall performance and security of your environment. diff --git a/content/rancher/v2.x/en/best-practices/deployment-strategies/_index.md b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-strategies/_index.md similarity index 96% rename from content/rancher/v2.x/en/best-practices/deployment-strategies/_index.md rename to content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-strategies/_index.md index cd6d01bb1c4..e142a45fe9d 100644 --- a/content/rancher/v2.x/en/best-practices/deployment-strategies/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-strategies/_index.md @@ -1,6 +1,8 @@ --- title: Rancher Deployment Strategies weight: 100 +aliases: + - /rancher/v2.x/en/best-practices/deployment-strategies --- There are two recommended deployment strategies. Each one has its own pros and cons. Read more about which one would fit best for your use case: diff --git a/content/rancher/v2.x/en/best-practices/deployment-types/_index.md b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-types/_index.md similarity index 98% rename from content/rancher/v2.x/en/best-practices/deployment-types/_index.md rename to content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-types/_index.md index d953b9f6393..41407964192 100644 --- a/content/rancher/v2.x/en/best-practices/deployment-types/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/deployment-types/_index.md @@ -1,6 +1,8 @@ --- title: Tips for Running Rancher weight: 100 +aliases: + - /rancher/v2.x/en/best-practices/deployment-types --- A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single node environment. 
diff --git a/content/rancher/v2.x/en/best-practices/management/_index.md b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/management/_index.md similarity index 99% rename from content/rancher/v2.x/en/best-practices/management/_index.md rename to content/rancher/v2.x/en/best-practices/v2.0-v2.4/management/_index.md index 210edfbdb9c..4a500287193 100644 --- a/content/rancher/v2.x/en/best-practices/management/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.0-v2.4/management/_index.md @@ -1,6 +1,8 @@ --- title: Tips for Scaling, Security and Reliability weight: 101 +aliases: + - /v2.x/en/best-practices/management --- Rancher allows you to set up numerous combinations of configurations. Some configurations are more appropriate for development and testing, while there are other best practices for production environments for maximum availability and fault tolerance. The following best practices should be followed for production. diff --git a/content/rancher/v2.x/en/best-practices/v2.5/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/_index.md new file mode 100644 index 00000000000..ffdc1777b9f --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/_index.md @@ -0,0 +1,21 @@ +--- +title: Best Practices Guide for Rancher v2.5 +shortTitle: v2.5 +weight: 1 +--- + +The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers. + +If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support. + +Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server. + +For more guidance on best practices, you can consult these resources: + +- [Security]({{}}/rancher/v2.x/en/security/) +- [Rancher Blog](https://rancher.com/blog/) + - [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/) + - [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/) +- [Rancher Forum](https://forums.rancher.com/) +- [Rancher Users Slack](https://slack.rancher.io/) +- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured) diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/_index.md new file mode 100644 index 00000000000..cea49be14c0 --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/_index.md @@ -0,0 +1,21 @@ +--- +title: Best Practices for Rancher Managed Clusters +shortTitle: Rancher Managed Clusters +weight: 2 +--- + +### Logging + +Refer to [this guide](./logging) for our recommendations for cluster-level logging and application logging. + +### Monitoring + +Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. Refer to this [guide](./monitoring) for our recommendations. + +### Tips for Setting Up Containers + +Running well built containers can greatly impact the overall performance and security of your environment. Refer to this [guide](./containers) for tips. 
+ +### Best Practices for Rancher Managed vSphere Clusters + +This [guide](./managed-vsphere) outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware. \ No newline at end of file diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/containers/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/containers/_index.md new file mode 100644 index 00000000000..6a66698e266 --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/containers/_index.md @@ -0,0 +1,51 @@ +--- +title: Tips for Setting Up Containers +weight: 100 +aliases: + - /rancher/v2.x/en/best-practices/containers +--- + +Running well built containers can greatly impact the overall performance and security of your environment. + +Below are a few tips for setting up your containers. + +For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security.](https://rancher.com/complete-guide-container-security) + +### Use a Common Container OS + +When possible, you should try to standardize on a common container base OS. + +Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface. + +Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality. + +### Start with a FROM scratch container +If your microservice is a standalone static binary, you should use a FROM scratch container. + +The FROM scratch container is an [official Docker image](https://hub.docker.com/_/scratch) that is empty so that you can use it to design minimal images. + +This will have the smallest attack surface and smallest image size. + +### Run Container Processes as Unprivileged +When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. Inadvertent or accidental host mounts can also be impacted if the container is running as root. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/). + +### Define Resource Limits +Apply CPU and memory limits to your pods. This can help manage the resources on your worker nodes and avoid a malfunctioning microservice from impacting other microservices. + +In standard Kubernetes, you can set resource limits on the namespace level. In Rancher, you can set resource limits on the project level and they will propagate to all the namespaces within the project. For details, refer to the Rancher docs. + +When setting resource quotas, if you set anything related to CPU or Memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or Memory field set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace. + +The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the namespace level. + +### Define Resource Requirements +You should apply CPU and memory requirements to your pods. 
This is crucial for informing the scheduler which type of compute node your pod needs to be placed on, and for ensuring that it does not over-provision that node. In Kubernetes, you can set a resource requirement by defining `resources.requests` in the resource requests field in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
+
+> **Note:** If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.
+
+It is recommended to define resource requirements on the container level because otherwise, the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.
+
+### Liveness and Readiness Probes
+Set up liveness and readiness probes for your container. Unless your container completely crashes, Kubernetes will not know it's unhealthy unless you create an endpoint or mechanism that can report container status. Alternatively, make sure your container halts and crashes if unhealthy.
+
+The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md
new file mode 100644
index 00000000000..1448d78b4b5
--- /dev/null
+++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md
@@ -0,0 +1,85 @@
+---
+title: Logging Best Practices
+weight: 1
+---
+In this guide, we recommend best practices for cluster-level logging and application logging.
+# Logging Before and After v2.5
+
+Logging in Rancher has historically been a fairly static integration. There was a fixed list of aggregators to choose from (Elasticsearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose from (cluster-level and project-level).
+
+Logging in 2.5 has been completely overhauled to provide a more flexible experience for log aggregation. With the new logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.
+
+"Under the hood", Rancher logging uses the Banzai Cloud logging operator. We provide manageability of this operator (and its resources), and tie that experience in with managing your Rancher clusters.
+
+# Cluster-level Logging
+
+### Cluster-wide Scraping
+
+For some users, it is desirable to scrape logs from every container running in the cluster. This usually coincides with your security team's request (or requirement) to collect all logs from all points of execution.
+
+In this scenario, it is recommended to create at least two _ClusterOutput_ objects - one for your security team (if you have that requirement), and one for yourselves, the cluster administrators. When creating these objects, take care to choose an output endpoint that can handle the significant log traffic coming from the entire cluster. Also make sure to choose an appropriate index to receive all these logs.
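+
+As a minimal sketch of such an object (the Elasticsearch host, port and index name below are assumptions to be replaced with your own values), a _ClusterOutput_ might look like this:
+
+```yaml
+apiVersion: logging.banzaicloud.io/v1beta1
+kind: ClusterOutput
+metadata:
+  name: all-cluster-logs           # hypothetical name
+  namespace: cattle-logging-system
+spec:
+  elasticsearch:
+    host: es.example.com           # assumed Elasticsearch endpoint
+    port: 9200
+    index_name: cluster-all-logs   # assumed index for cluster-wide logs
+```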
+
+Once you have created these _ClusterOutput_ objects, create a _ClusterFlow_ to collect all the logs. Do not define any _Include_ or _Exclude_ rules on this flow. This will ensure that all logs from across the cluster are collected. If you have two _ClusterOutputs_, make sure to send logs to both of them.
+
+### Kubernetes Components
+
+_ClusterFlows_ have the ability to collect logs from all containers on all hosts in the Kubernetes cluster. This works well in cases where those containers are part of a Kubernetes pod; however, RKE containers exist outside of the scope of Kubernetes.
+
+Currently (as of v2.5.1) the logs from RKE containers are collected, but cannot easily be filtered. This is because those logs do not contain information as to the source container (e.g. `etcd` or `kube-apiserver`).
+
+A future release of Rancher will include the source container name, which will enable filtering of these component logs. Once that change is made, you will be able to customize a _ClusterFlow_ to retrieve **only** the Kubernetes component logs, and direct them to an appropriate output.
+
+# Application Logging
+
+A best practice, not only in Kubernetes but in all container-based applications, is to direct application logs to `stdout`/`stderr`. The container runtime will then trap these logs and do **something** with them - typically writing them to a file. Depending on the container runtime (and its configuration), these logs can end up in any number of locations.
+
+In the case of writing the logs to a file, Kubernetes helps by creating a `/var/log/containers` directory on each host. This directory symlinks the log files to their actual destination (which can differ based on configuration or container runtime).
+
+Rancher logging will read all log entries in `/var/log/containers`, ensuring that all log entries from all containers (assuming a default configuration) will have the opportunity to be collected and processed.
+
+### Specific Log Files
+
+Log collection only retrieves `stdout`/`stderr` logs from pods in Kubernetes. But what if we want to collect logs from other files that are generated by applications? Here, a log streaming sidecar (or two) may come in handy.
+
+The goal of setting up a streaming sidecar is to take log files that are written to disk, and have their contents streamed to `stdout`. This way, the Banzai Logging Operator can pick up those logs and send them to your desired output.
+
+To set this up, edit your workload resource (e.g. Deployment) and add the following sidecar definition:
+
+```
+...
+containers:
+- args:
+  - -F
+  - /path/to/your/log/file.log
+  command:
+  - tail
+  image: busybox
+  name: stream-log-file-[name]
+  volumeMounts:
+  - mountPath: /path/to/your/log
+    name: mounted-log
+...
+```
+
+This will add a container to your workload definition that will now stream the contents of (in this example) `/path/to/your/log/file.log` to `stdout`.
+
+This log stream is then automatically collected according to any _Flows_ or _ClusterFlows_ you have set up. You may also wish to consider creating a _Flow_ specifically for this log file by targeting the name of the container. See example:
+
+```
+...
+spec:
+  match:
+  - select:
+      container_names:
+      - stream-log-file-name
+...
+```
+
+
+## General Best Practices
+
+- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier, as there are already parsers written for these formats.
- Try to provide the name of the application that is creating the log entry, in the entry itself. This can make troubleshooting easier, as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98`, which does not provide much information about the application running inside the container.
+- Except in the case of collecting all logs cluster-wide, try to scope your _Flow_ and _ClusterFlow_ objects tightly. This makes it easier to troubleshoot when problems arise, and also helps ensure unrelated log entries do not show up in your aggregator. An example of tight scoping would be to constrain a _Flow_ to a single _Deployment_ in a namespace, or perhaps even a single container within a _Pod_.
+- Keep the log verbosity down except when troubleshooting. High log verbosity poses a number of issues, chief among them being **noise**: significant events can be drowned out in a sea of `DEBUG` messages. This is somewhat mitigated with automated alerting and scripting, but highly verbose logging still places an inordinate amount of stress on the logging infrastructure.
+- Where possible, try to provide a transaction or request ID with the log entry. This can make tracing application activity across multiple log sources easier, especially when dealing with distributed applications.
\ No newline at end of file
diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md
new file mode 100644
index 00000000000..2f9696c2c33
--- /dev/null
+++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md
@@ -0,0 +1,54 @@
+---
+title: Best Practices for Rancher Managed vSphere Clusters
+shortTitle: Rancher Managed Clusters in vSphere
+---
+
+This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
+
+## Solution Overview
+
+![Solution Overview](./img/rancher/solution_overview.drawio.svg)
+
+# 1 - VM Considerations
+
+## Leverage VM Templates to Construct the Environment
+
+To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.
+
+## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts
+
+Doing so will ensure node VMs are spread across multiple ESXi hosts - preventing a single point of failure at the host level.
+
+## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores
+
+Doing so will ensure node VMs are spread across multiple datastores - preventing a single point of failure at the datastore level.
+
+## Configure VMs as Appropriate for Kubernetes
+
+It’s important to follow K8s and etcd best practices when deploying your nodes, including disabling swap, double-checking that you have full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
+
+# 2 - Network Considerations
+
+## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
+
+Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning.
For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
+
+## Consistent IP Addressing for VMs
+
+Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
+
+# 3 - Storage Considerations
+
+## Leverage SSD Drives for ETCD Nodes
+
+etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.
+
+# 4 - Backup and Disaster Recovery
+
+## Perform Regular Downstream Cluster Backups
+
+Kubernetes uses etcd to store all its data - from configuration, state and metadata. Backing this up is crucial for disaster recovery.
+
+## Back up Downstream Node VMs
+
+Incorporate the Rancher downstream node VMs within a standard VM backup policy.
\ No newline at end of file
diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md
new file mode 100644
index 00000000000..3b837d7701e
--- /dev/null
+++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md
@@ -0,0 +1,112 @@
+---
+title: Monitoring Best Practices
+weight: 2
+---
+
+Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. This is no different when using Kubernetes and Rancher. Fortunately, the integrated monitoring and alerting functionality makes this whole process a lot easier.
+
+The [Rancher Documentation]({{}}/rancher/v2.x/en/monitoring-alerting/v2.5/) describes in detail how you can set up a complete Prometheus and Grafana stack. Out of the box, this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your own specific use cases and cluster sizes. This document aims to give you best practices for this.
+
+## What to monitor
+
+Kubernetes itself, as well as applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/), especially the chapter about [Monitoring distributed systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/).
+
+## Configuring Prometheus Resource Usage
+
+When installing the integrated monitoring stack, Rancher allows you to configure several settings that are dependent on the size of your cluster and the workloads running in it. This chapter covers these in more detail.
+
+### Storage and Data Retention
+
+The amount of storage needed for Prometheus directly correlates to the number of time series and labels that you store and the data retention you have configured. It is important to note that Prometheus is not meant to be used as long-term metrics storage. Data retention time is usually only a couple of days, not weeks or months. The reason for this is that Prometheus does not perform any aggregation on its stored metrics.
This is great because aggregation can dilute data, but it also means that, without retention, the needed storage grows linearly over time.
+
+One way to calculate the necessary storage is to look at the average size of a storage chunk in Prometheus with this query:
+
+```
+rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
+```
+
+Next, find out your data ingestion rate per second:
+
+```
+rate(prometheus_tsdb_head_samples_appended_total[1h])
+```
+
+and then multiply this by the retention time, adding a few percentage points as a buffer:
+
+```
+average chunk size in bytes * ingestion rate per second * retention time in seconds * 1.1 = necessary storage in bytes
+```
+
+You can find more information about how to calculate the necessary storage in this [blog post](https://www.robustperception.io/how-much-disk-space-do-prometheus-blocks-use).
+
+You can read more about the Prometheus storage concept in the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/storage).
+
+### CPU and Memory Requests and Limits
+
+In larger Kubernetes clusters, Prometheus can consume quite a bit of memory. The amount of memory Prometheus needs directly correlates to the number of time series and number of labels it stores and the scrape interval in which these are filled.
+
+You can find more information about how to calculate the necessary memory in this [blog post](https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion).
+
+The number of CPUs necessary correlates with the number of queries you are performing.
+
+### Federation and long-term Storage
+
+Prometheus is not meant to store metrics for a long period of time, and should only be used for short-term storage.
+
+In order to store some, or all, metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).
+
+## Scraping Custom Workloads
+
+While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. For that, you can configure Prometheus to make an HTTP request to an endpoint of your applications in a certain interval. These endpoints should then return their metrics in a Prometheus format.
+
+In general, you want to scrape data from all the workloads running in your cluster so that you can use them for alerts or debugging issues. Oftentimes, you only realize that you need some data when you actually need the metrics during an incident. It is good if the data is already scraped and stored. Since Prometheus is only meant to be short-term metrics storage, scraping and keeping lots of data is usually not that expensive. If you are using a long-term storage solution with Prometheus, you can then still decide which data you are actually persisting and keeping there.
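+
+For reference, a scrape target simply serves plain text in the Prometheus exposition format over HTTP. A minimal, hypothetical response (the `myapp_` metric name is an illustrative assumption) could look like:
+
+```
+# HELP myapp_http_requests_total Total number of HTTP requests handled.
+# TYPE myapp_http_requests_total counter
+myapp_http_requests_total{method="get",code="200"} 1027
+myapp_http_requests_total{method="post",code="500"} 3
+```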
+
+### About Prometheus Exporters
+
+A lot of third-party workloads like databases, queues or web servers either already support exposing metrics in a Prometheus format, or there are so-called exporters available that translate between the tool's metrics and the format that Prometheus understands. Usually you can add these exporters as additional sidecar containers to the workload's Pods. A lot of Helm charts already include options to deploy the correct exporter. Additionally, you can find a curated list of exporters by Sysdig on [promcat.io](https://promcat.io/) and on [ExporterHub](https://exporterhub.io/).
+
+### Prometheus support in Programming Languages and Frameworks
+
+To get your own custom application metrics into Prometheus, you have to collect and expose these metrics directly from your application's code. Fortunately, there are already libraries and integrations available to help with this for most popular programming languages and frameworks. One example for this is the Prometheus support in the [Spring Framework](https://docs.spring.io/spring-metrics/docs/current/public/prometheus).
+
+### ServiceMonitors and PodMonitors
+
+Once all your workloads expose metrics in a Prometheus format, you have to configure Prometheus to scrape them. Under the hood, Rancher is using the [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator). This makes it easy to add additional scraping targets with ServiceMonitors and PodMonitors. A lot of Helm charts already include an option to create these monitors directly. You can also find more information in the [Rancher Documentation](TODO).
+
+### Prometheus Push Gateway
+
+There are some workloads that are traditionally hard for Prometheus to scrape. Examples of these are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between individual handled incoming requests, like PHP applications.
+
+To still get metrics for these use cases, you can set up [prometheus-pushgateways](https://github.com/prometheus/pushgateway). The CronJob or PHP application would push metric updates to the pushgateway. The pushgateway aggregates and exposes them through an HTTP endpoint, which then can be scraped by Prometheus.
+
+### Prometheus Blackbox Monitor
+
+Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter), which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.
+
+## Monitoring in a (Micro)Service Architecture
+
+If you have a (micro)service architecture where multiple individual workloads within your cluster are communicating with each other, it is really important to have detailed metrics and traces about this traffic to understand how all these workloads are communicating with each other and where a problem or bottleneck may be.
+
+Of course you can monitor all this internal traffic in all your workloads and expose these metrics to Prometheus. But this can quickly become quite work-intensive. Service meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services.
+
+## Real User Monitoring
+
+Monitoring the availability and performance of all your internal workloads is vitally important for running stable, reliable and fast applications. But these metrics only show you parts of the picture.
+ +### Prometheus Push Gateway + +There are some workloads that are traditionally hard for Prometheus to scrape. Examples are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between the individually handled incoming requests, like PHP applications. + +To still get metrics for these use cases, you can set up [prometheus-pushgateways](https://github.com/prometheus/pushgateway). The CronJob or PHP application would push metric updates to the pushgateway. The pushgateway aggregates and exposes them through an HTTP endpoint, which can then be scraped by Prometheus. + +### Prometheus Blackbox Monitor + +Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP. + +## Monitoring in a (Micro)Service Architecture + +If you have a (micro)service architecture where multiple individual workloads within your cluster are communicating with each other, it is really important to have detailed metrics and traces for this traffic, so you can understand how these workloads communicate with each other and where a problem or bottleneck may be. + +Of course you can monitor all this internal traffic in all your workloads and expose these metrics to Prometheus. But this can quickly become quite work-intensive. Service Meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services. + +## Real User Monitoring + +Monitoring the availability and performance of all your internal workloads is vitally important to run stable, reliable and fast applications. But these metrics only show you parts of the picture. To get a complete view it is also necessary to know how your end users are actually perceiving it. For this you can look into various [Real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring). + +## Security Monitoring + +In addition to monitoring workloads to detect performance, availability or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans]({{}}/rancher/v2.x/en/cis-scans/v2.5/) which check if the cluster is configured according to security best practices. + +For the workloads, you can have a look at Kubernetes and container security solutions like [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/) or [SysDig](https://sysdig.com/). + +## Setting up Alerts + +Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong. + +The integrated Rancher monitoring already configures a sensible set of alerts that make sense in any Kubernetes cluster. You should extend these to cover your specific workloads and use cases. + +When setting up alerts, configure them for all the workloads that are critical to the availability of your applications. But also make sure that they are not too noisy. Ideally, every alert you receive should point to a problem that needs your attention and needs to be fixed. If you have alerts that are firing all the time but are not that critical, there is a danger that you start ignoring your alerts altogether and then miss the really important ones. Less may be more here. Focus on the really important metrics first, for example alert if your application is offline. Fix all the problems that start to pop up and then start to create more detailed alerts. + +If an alert starts firing, but there is nothing you can do about it at the moment, it's also fine to silence the alert for a certain amount of time, so that you can look at it later. + +You can find more information on how to set up alerts and notification channels in the [Rancher Documentation]({{}}/rancher/v2.x/en/monitoring-alerting/v2.5). \ No newline at end of file diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/_index.md new file mode 100644 index 00000000000..32786386a3b --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/_index.md @@ -0,0 +1,19 @@ +--- +title: Best Practices for the Rancher Server +shortTitle: Rancher Server +weight: 1 +--- + +This guide contains our recommendations for running the Rancher server, and is intended to be used in situations in which Rancher manages downstream Kubernetes clusters. + +### Recommended Architecture and Infrastructure + +Refer to this [guide](./deployment-types) for our general advice for setting up the Rancher server on a high-availability Kubernetes cluster. + +### Deployment Strategies + +This [guide](./deployment-strategies) is designed to help you choose whether a regional deployment strategy or a hub-and-spoke deployment strategy is better for a Rancher server that manages downstream Kubernetes clusters.
+ +### Installing Rancher in a vSphere Environment + +This [guide](./rancher-in-vsphere) outlines a reference architecture for installing Rancher in a vSphere environment, in addition to standard vSphere best practices as documented by VMware. \ No newline at end of file diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-strategies/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-strategies/_index.md new file mode 100644 index 00000000000..35a1e08b2b6 --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-strategies/_index.md @@ -0,0 +1,45 @@ +--- +title: Rancher Deployment Strategy +weight: 100 +--- + +There are two recommended deployment strategies for a Rancher server that manages downstream Kubernetes clusters. Each one has its own pros and cons. Read more about which one would fit best for your use case: + +* [Hub and Spoke](#hub-and-spoke-strategy) +* [Regional](#regional-strategy) + +# Hub and Spoke Strategy +--- + +In this deployment scenario, there is a single Rancher control plane managing Kubernetes clusters across the globe. The control plane would be run on a high-availability Kubernetes cluster, and connections to clusters in distant regions would be subject to network latencies. + +{{< img "/img/rancher/bpg/hub-and-spoke.png" "Hub and Spoke Deployment">}} + +### Pros + +* Environments could have nodes and network connectivity across regions. +* A single control plane interface to view all regions and environments. +* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher control plane. + +### Cons + +* Subject to network latencies. +* If the control plane goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually. + +# Regional Strategy +--- +In the regional deployment model, a control plane is deployed in close proximity to the compute nodes. + +{{< img "/img/rancher/bpg/regional.png" "Regional Deployment">}} + +### Pros + +* Rancher functionality in a region stays operational if a control plane in another region goes down. +* Network latency is greatly reduced, improving the performance of functionality in Rancher. +* Upgrades of the Rancher control plane can be done independently per region. + +### Cons + +* Overhead of managing multiple Rancher installations. +* Visibility across global Kubernetes clusters requires multiple interfaces/panes of glass. +* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server. diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-types/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-types/_index.md new file mode 100644 index 00000000000..6cc5b883f75 --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/deployment-types/_index.md @@ -0,0 +1,39 @@ +--- +title: Tips for Running Rancher +weight: 100 +aliases: + - /rancher/v2.x/en/best-practices/deployment-types +--- + +This guide is geared toward use cases where Rancher is used to manage downstream Kubernetes clusters. The high-availability setup is intended to prevent losing access to downstream clusters if the Rancher server is not available. + +A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important."
Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single-node environment. + +If you are installing Rancher in a vSphere environment, refer to the best practices documented [here](../rancher-in-vsphere). + +When you set up your high-availability Rancher installation, consider the following: + +### Run Rancher on a Separate Cluster +Don't run other workloads or microservices in the Kubernetes cluster that Rancher is installed on. + +### Make Sure Nodes Are Configured Correctly for Kubernetes +It's important to follow K8s and etcd best practices when deploying your nodes, including disabling swap, double-checking that you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node, checking that all the required ports are open, and deploying with SSD-backed etcd. More details can be found in the [Kubernetes docs](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin) and [etcd's performance op guide](https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/performance.md). + +### When Using RKE: Back Up the Statefile +RKE keeps a record of the cluster state in a file called `cluster.rkestate`. This file is important for the recovery of a cluster and/or the continued maintenance of the cluster through RKE. Because this file contains certificate material, we strongly recommend encrypting it before backing it up. After each run of `rke up` you should back up the state file. + +### Run All Nodes in the Cluster in the Same Datacenter +For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c. + +### Development and Production Environments Should be Similar +It's strongly recommended to have a "staging" or "pre-production" environment of the Kubernetes cluster that Rancher runs on. This environment should mirror your production environment as closely as possible in terms of software and hardware configuration. + +### Monitor Your Clusters to Plan Capacity +The Rancher server's Kubernetes cluster should run within the [system and hardware requirements]({{}}/rancher/v2.x/en/installation/requirements/) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take. + +However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types. + +Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus. + +After you [enable monitoring]({{}}/rancher/v2.x/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/) in the cluster, you can set up [a notification channel]({{}}/rancher/v2.x/en/cluster-admin/tools/notifiers/) and [cluster alerts]({{}}/rancher/v2.x/en/cluster-admin/tools/alerts/) to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
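+As a hedged example of such a baseline metric, and assuming the kube-state-metrics component shipped with the monitoring stack, the following PromQL expression shows what fraction of the cluster's allocatable CPU is already reserved by pod requests (metric names follow kube-state-metrics v1.x):
+
+```
+sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_allocatable_cpu_cores)
+```
+
+Alerting when this ratio approaches 1 gives you an early warning that the cluster is running out of schedulable capacity.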
+ diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md new file mode 100644 index 00000000000..9e77589ac3f --- /dev/null +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md @@ -0,0 +1,85 @@ +--- +title: Installing Rancher in a vSphere Environment +shortTitle: On-Premises Rancher in vSphere +weight: 3 +--- + +This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware. + +## Solution Overview + +![Solution Overview](/img/rancher/rancher-on-prem-vsphere.svg) + +# 1 - Load Balancer Considerations + +A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes. + +## Leverage Fault Tolerance and High Availability + +Leverage the use of an external (hardware or software) load balancer that has inherent high-availability functionality (F5, NSX-T, Keepalived, etc). + +## Back Up Load Balancer Configuration + +In the event of a Disaster Recovery activity, availability of the load balancer configuration will expedite the recovery process. + +## Configure Health Checks + +Configure the load balancer to automatically mark nodes as unavailable if a health check fails. For example, NGINX can facilitate this with: + +`max_fails=3 fail_timeout=5s` + +## Leverage an External Load Balancer + +Avoid implementing a software load balancer within the management cluster. + +## Secure Access to Rancher + +Configure appropriate firewall / ACL rules to only expose access to Rancher. + +# 2 - VM Considerations + +## Size the VM's According to Rancher Documentation + +Size the VM's according to the [Rancher documentation](https://rancher.com/docs/rancher/v2.x/en/installation/requirements/). + +## Leverage VM Templates to Construct the Environment + +To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options. + +## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts + +Doing so will ensure node VM's are spread across multiple ESXi hosts - preventing a single point of failure at the host level. + +## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores + +Doing so will ensure node VM's are spread across multiple datastores - preventing a single point of failure at the datastore level. + +## Configure VM's as Appropriate for Kubernetes + +It’s important to follow K8s and etcd best practices when deploying your nodes, including disabling swap, double-checking you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node. + +# 3 - Network Considerations + +## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes + +Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup. + +## Consistent IP Addressing for VM's + +Each node used should have a static IP configured.
In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated. + +# 4 - Storage Considerations + +## Leverage SSD Drives for ETCD Nodes + +ETCD is very sensitive to write latency. Therefore, leverage SSD disks where possible. + +# 5 - Backup and Disaster Recovery + +## Perform Regular Management Cluster Backups + +Rancher stores its data in the ETCD datastore of the Kubernetes cluster it resides on. Like with any Kubernetes cluster, perform frequent, tested backups of this cluster. + +## Back up Rancher Cluster Node VM's + +Incorporate the Rancher management node VM's within a standard VM backup policy. \ No newline at end of file diff --git a/static/img/rancher/rancher-on-prem-vsphere.svg b/static/img/rancher/rancher-on-prem-vsphere.svg new file mode 100644 index 00000000000..0cff7674904 --- /dev/null +++ b/static/img/rancher/rancher-on-prem-vsphere.svg @@ -0,0 +1,128 @@ + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+ DRS Anti-Affinity group +
+
+
+
+ + DRS Anti-Affinity group + +
+
+ + + + + + + + + + + + + + + + + + + +
+
+
+ Shared Storage / vSAN +
+
+
+
+ + Shared Storage / vSAN + +
+
+ + + + + + + + + + + + + + + + + +
+
+
+ Load Balancer +
+
+
+
+ + Load Balancer + +
+
+ + + + + + +
+
+
+ https://rancher.domain +
+
+
+
+ + https://rancher.domain + +
+
+
+ + + + + Viewer does not support full SVG 1.1 + + + +
\ No newline at end of file diff --git a/static/img/rancher/solution_overview.drawio.svg b/static/img/rancher/solution_overview.drawio.svg new file mode 100644 index 00000000000..05a90fcc315 --- /dev/null +++ b/static/img/rancher/solution_overview.drawio.svg @@ -0,0 +1,3 @@ + + +
DRS Anti-Affinity group
DRS Anti-Affinity group
Shared Storage / vSAN
Shared Storage / vSAN
Viewer does not support full SVG 1.1
\ No newline at end of file From 7be2614dfb5ae61686bfe5281f14775b2cec03bd Mon Sep 17 00:00:00 2001 From: cluse Date: Thu, 29 Oct 2020 11:04:08 -0700 Subject: [PATCH 9/9] Edit new content in Best Practices Guide --- .../v2.5/rancher-managed/logging/_index.md | 12 +++-- .../rancher-managed/managed-vsphere/_index.md | 35 ++++++++------ .../v2.5/rancher-managed/monitoring/_index.md | 24 ++++++---- .../rancher-in-vsphere/_index.md | 48 +++++++++++-------- 4 files changed, 72 insertions(+), 47 deletions(-) diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md index 1448d78b4b5..32da6bbc48d 100644 --- a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/logging/_index.md @@ -3,9 +3,15 @@ title: Logging Best Practices weight: 1 --- In this guide, we recommend best practices for cluster-level logging and application logging. -# Pre-2.5 Logging, and post-2.5 -Logging in Rancher has historically been a pretty static integration. There were a fixed list of aggregators to choose from (ElasticSearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose (Cluster-level and Project-level). +- [Changes in Logging in Rancher v2.5](#changes-in-logging-in-rancher-v2-5) +- [Cluster-level Logging](#cluster-level-logging) +- [Application Logging](#application-logging) +- [General Best Practices](#general-best-practices) + +# Changes in Logging in Rancher v2.5 + +Prior to Rancher v2.5, logging in Rancher was a fairly static integration. There was a fixed list of aggregators to choose from (ElasticSearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose from (Cluster-level and Project-level). Logging in 2.5 has been completely overhauled to provide a more flexible experience for log aggregation. With the new logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options. @@ -76,7 +82,7 @@ spec: ``` -## General Best Practices +# General Best Practices - Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier as there are already parsers written for these formats. - Try to provide the name of the application that is creating the log entry, in the entry itself. This can make troubleshooting easier as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98` which does not provide much information about the application running inside the container. diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md index 2f9696c2c33..fa936c404e7 100644 --- a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/managed-vsphere/_index.md @@ -5,50 +5,55 @@ shortTitle: Rancher Managed Clusters in vSphere This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware. -## Solution Overview +- [1.
VM Considerations](#1-vm-considerations) +- [2. Network Considerations](#2-network-considerations) +- [3. Storage Considerations](#3-storage-considerations) +- [4. Backups and Disaster Recovery](#4-backups-and-disaster-recovery) -![Solution Overview](./img/rancher/solution_overview.drawio.svg) +
Solution Overview
-# 1 - VM Considerations +![Solution Overview](/img/rancher/solution_overview.drawio.svg) -## Leverage VM Templates to Construct the Environment +# 1. VM Considerations + +### Leverage VM Templates to Construct the Environment To facilitate consistency across the deployed Virtual Machines across the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options. -## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts +### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts Doing so will ensure node VM's are spread across multiple ESXi hosts - preventing a single point of failure at the host level. -## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores +### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores Doing so will ensure node VM's are spread across multiple datastores - preventing a single point of failure at the datastore level. -## Configure VM's as Appropriate for Kubernetes +### Configure VM's as Appropriate for Kubernetes It’s important to follow K8s and etcd best practices when deploying your nodes, including disabling swap, double-checking you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node. -# 2 - Network Considerations +# 2. Network Considerations -## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes +### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup. -## Consistent IP Addressing for VM's +### Consistent IP Addressing for VM's Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated. -# 3 - Storage Considerations +# 3. Storage Considerations -## Leverage SSD Drives for ETCD Nodes +### Leverage SSD Drives for ETCD Nodes ETCD is very sensitive to write latency. Therefore, leverage SSD disks where possible. -# 4 - Backup and Disaster Recovery +# 4. Backups and Disaster Recovery -## Perform Regular Downstream Cluster Backups +### Perform Regular Downstream Cluster Backups Kubernetes uses etcd to store all its data - from configuration, state and metadata. Backing this up is crucial in the event of disaster recovery. -## Back up Downstream Node VM's +### Back up Downstream Node VMs Incorporate the Rancher downstream node VM's within a standard VM backup policy. 
\ No newline at end of file diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md index 3b837d7701e..1e190eb3a7d 100644 --- a/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-managed/monitoring/_index.md @@ -7,11 +7,19 @@ Configuring sensible monitoring and alerting rules is vital for running any prod The [Rancher Documentation]({{}}/rancher/v2.x/en/monitoring-alerting/v2.5/) describes in detail how you can set up a complete Prometheus and Grafana stack. Out of the box this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your own specific use cases and cluster sizes. This document aims to give you best practices for this. -## What to monitor +- [What to Monitor](#what-to-monitor) +- [Configuring Prometheus Resource Usage](#configuring-prometheus-resource-usage) +- [Scraping Custom Workloads](#scraping-custom-workloads) +- [Monitoring in a (Micro)Service Architecture](#monitoring-in-a-micro-service-architecture) +- [Real User Monitoring](#real-user-monitoring) +- [Security Monitoring](#security-monitoring) +- [Setting up Alerts](#setting-up-alerts) + +# What to Monitor Kubernetes itself, as well as applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/), especially the chapter about [Monitoring distributed systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/). -## Configuring Prometheus Resource Usage +# Configuring Prometheus Resource Usage When installing the integrated monitoring stack, Rancher allows you to configure several settings that are dependent on the size of your cluster and the workloads running in it. This chapter covers these in more detail. @@ -49,13 +57,13 @@ You can find more information about how to calculate the necessary memory in thi The number of CPUs needed correlates with the number of queries you are performing. -### Federation and long-term Storage +### Federation and Long-term Storage Prometheus is not meant to store metrics for a long period of time; it should only be used for short-term storage. In order to store some, or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation). -## Scraping Custom Workloads +# Scraping Custom Workloads While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data.
For that you can configure Prometheus to make an HTTP request to an endpoint of your applications at a certain interval. These endpoints should then return their metrics in a Prometheus format. @@ -83,23 +91,23 @@ To still get metrics for these use cases, you can set up [prometheus-pushgateway Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP. -## Monitoring in a (Micro)Service Architecture +# Monitoring in a (Micro)Service Architecture If you have a (micro)service architecture where multiple individual workloads within your cluster are communicating with each other, it is really important to have detailed metrics and traces for this traffic, so you can understand how these workloads communicate with each other and where a problem or bottleneck may be. Of course you can monitor all this internal traffic in all your workloads and expose these metrics to Prometheus. But this can quickly become quite work-intensive. Service Meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services. -## Real User Monitoring +# Real User Monitoring Monitoring the availability and performance of all your internal workloads is vitally important to run stable, reliable and fast applications. But these metrics only show you parts of the picture. To get a complete view it is also necessary to know how your end users are actually perceiving it. For this you can look into various [Real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring). -## Security Monitoring +# Security Monitoring In addition to monitoring workloads to detect performance, availability or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans]({{}}/rancher/v2.x/en/cis-scans/v2.5/) which check if the cluster is configured according to security best practices. For the workloads, you can have a look at Kubernetes and container security solutions like [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/) or [SysDig](https://sysdig.com/). -## Setting up Alerts +# Setting up Alerts Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong. diff --git a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md index 9e77589ac3f..c54e5b1fbcf 100644 --- a/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md +++ b/content/rancher/v2.x/en/best-practices/v2.5/rancher-server/rancher-in-vsphere/_index.md @@ -6,80 +6,86 @@ weight: 3 This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware. -## Solution Overview +- [1. Load Balancer Considerations](#1-load-balancer-considerations) +- [2. VM Considerations](#2-vm-considerations) +- [3.
Network Considerations](#3-network-considerations) +- [4. Storage Considerations](#4-storage-considerations) +- [5. Backups and Disaster Recovery](#5-backups-and-disaster-recovery) + +
Solution Overview
![Solution Overview](/img/rancher/rancher-on-prem-vsphere.svg) -# 1 - Load Balancer Considerations +# 1. Load Balancer Considerations A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes. -## Leverage Fault Tolerance and High Availability +### Leverage Fault Tolerance and High Availability Leverage the use of an external (hardware or software) load balancer that has inherent high-availability functionality (F5, NSX-T, Keepalived, etc). -## Back Up Load Balancer Configuration +### Back Up Load Balancer Configuration In the event of a Disaster Recovery activity, availability of the load balancer configuration will expedite the recovery process. -## Configure Health Checks +### Configure Health Checks Configure the load balancer to automatically mark nodes as unavailable if a health check fails. For example, NGINX can facilitate this with: `max_fails=3 fail_timeout=5s` -## Leverage an External Load Balancer +### Leverage an External Load Balancer Avoid implementing a software load balancer within the management cluster. -## Secure Access to Rancher +### Secure Access to Rancher Configure appropriate firewall / ACL rules to only expose access to Rancher. -# 2 - VM Considerations +# 2. VM Considerations -## Size the VM's According to Rancher Documentation +### Size the VM's According to Rancher Documentation Size the VM's according to the [Rancher documentation](https://rancher.com/docs/rancher/v2.x/en/installation/requirements/). -## Leverage VM Templates to Construct the Environment +### Leverage VM Templates to Construct the Environment To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options. -## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts +### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts Doing so will ensure node VM's are spread across multiple ESXi hosts - preventing a single point of failure at the host level. -## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores +### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores Doing so will ensure node VM's are spread across multiple datastores - preventing a single point of failure at the datastore level. -## Configure VM's as Appropriate for Kubernetes +### Configure VM's as Appropriate for Kubernetes It’s important to follow K8s and etcd best practices when deploying your nodes, including disabling swap, double-checking you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node. -# 3 - Network Considerations +# 3. Network Considerations -## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes +### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup. -## Consistent IP Addressing for VM's +### Consistent IP Addressing for VM's Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
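+For an RKE-provisioned cluster, these static addresses are what you would put in the `nodes` section of `cluster.yml`. A sketch with illustrative values (addresses, user and roles are placeholders for your environment):
+
+```
+nodes:
+  # One entry per node; repeat for each cluster member
+  - address: 10.0.1.11
+    internal_address: 10.0.1.11
+    user: rancher
+    role: [controlplane, etcd, worker]
+```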
-# 4 - Storage Considerations +# 4. Storage Considerations -## Leverage SSD Drives for ETCD Nodes +### Leverage SSD Drives for ETCD Nodes ETCD is very sensitive to write latency. Therefore, leverage SSD disks where possible. -# 5 - Backup and Disaster Recovery +# 5. Backups and Disaster Recovery -## Perform Regular Management Cluster Backups +### Perform Regular Management Cluster Backups Rancher stores its data in the ETCD datastore of the Kubernetes cluster it resides on. Like with any Kubernetes cluster, perform frequent, tested backups of this cluster. -## Back up Rancher Cluster Node VM's +### Back up Rancher Cluster Node VMs Incorporate the Rancher management node VM's within a standard VM backup policy. \ No newline at end of file
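+One way to act on the backup recommendation above for an RKE-based management cluster is to configure recurring etcd snapshots in `cluster.yml`. The following is only a sketch; the interval, retention and S3 details are illustrative placeholders, not values prescribed by this guide:
+
+```
+services:
+  etcd:
+    backup_config:
+      enabled: true
+      # Take a snapshot every 6 hours and keep the last 30 snapshots
+      interval_hours: 6
+      retention: 30
+      # Optionally ship snapshots off-cluster for disaster recovery
+      s3backupconfig:
+        bucket_name: my-etcd-snapshots
+        region: us-west-2
+        endpoint: s3.amazonaws.com
+```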