Merge pull request #2807 from catherineluse/best-practices

Best practices
Catherine Luse
2020-10-29 08:53:08 -07:00
committed by GitHub
23 changed files with 746 additions and 44 deletions
+12 -2
@@ -306,14 +306,24 @@ sudo reboot
# Experimental SELinux Support
As of release v1.17.4+k3s1, experimental support for SELinux has been added to K3s's embedded containerd. If you are installing K3s on a system where SELinux is enabled by default (such as CentOS), you must ensure the proper SELinux policies have been installed. The [install script]({{<baseurl>}}/k3s/latest/en/installation/install-options/#installation-script-options) will fail if they are not. The necessary policies can be installed with the following commands:
As of release v1.17.4+k3s1, experimental support for SELinux has been added to K3s's embedded containerd. If you are installing K3s on a system where SELinux is enabled by default (such as CentOS), you must ensure the proper SELinux policies have been installed.
{{% tabs %}}
{{% tab "automatic installation" %}}
As of release v1.19.3+k3s2, the [install script]({{<baseurl>}}/k3s/latest/en/installation/install-options/#installation-script-options) automatically installs the SELinux RPM from the Rancher RPM repository when it runs on a compatible system and is not performing an air-gapped install. To skip the automatic installation, set `INSTALL_K3S_SKIP_SELINUX_RPM=true`.
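As a minimal sketch of the skip option, the following wraps an installer invocation in a shell function so that nothing runs until you call it. The installer URL `https://get.k3s.io` and the function name are assumptions for illustration; this section only names the environment variable.

```shell
# Hypothetical helper: run the K3s install script without the automatic
# k3s-selinux RPM step. Assumes the installer is served from https://get.k3s.io.
install_k3s_skip_selinux_rpm() {
    curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_SELINUX_RPM=true sh -
}

# Nothing is installed until the function is called explicitly:
# install_k3s_skip_selinux_rpm
```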
{{% /tab %}}
{{% tab "manual installation" %}}
The necessary policies can be installed with the following commands:
```
yum install -y container-selinux selinux-policy-base
rpm -i https://rpm.rancher.io/k3s-selinux-0.1.1-rc1.el7.noarch.rpm
yum install -y https://rpm.rancher.io/k3s/latest/common/centos/7/noarch/k3s-selinux-0.2-1.el7_8.noarch.rpm
```
To force the install script to log a warning rather than fail, you can set the following environment variable: `INSTALL_K3S_SELINUX_WARN=true`.
{{% /tab %}}
{{% /tabs %}}
The way that SELinux enforcement is enabled or disabled depends on the K3s version. Prior to v1.19.x, SELinux enablement for the builtin containerd was automatic but could be disabled by passing `--disable-selinux`. With v1.19.x and beyond, enabling SELinux must be affirmatively configured via the `--selinux` flag or config file entry. Servers and agents that specify both the `--selinux` and (deprecated) `--disable-selinux` flags will fail to start.
Using a custom `--data-dir` under SELinux is not supported. To customize it, you would most likely need to write your own custom policy. For guidance, you could refer to the [containers/container-selinux](https://github.com/containers/container-selinux) repository, which contains the SELinux policy files for container runtimes, and the [rancher/k3s-selinux](https://github.com/rancher/k3s-selinux) repository, which contains the SELinux policy for K3s.
@@ -40,8 +40,10 @@ When using this method to install K3s, the following environment variables can b
| `INSTALL_K3S_SYSTEMD_DIR` | Directory to install systemd service and environment files to, or use `/etc/systemd/system` as the default. |
| `INSTALL_K3S_EXEC` | Command with flags to use for launching K3s in the service. If the command is not specified, and the `K3S_URL` is set, it will default to "agent." If `K3S_URL` is not set, it will default to "server." For help, refer to [this example.]({{<baseurl>}}/k3s/latest/en/installation/install-options/how-to-flags/#example-b-install-k3s-exec) |
| `INSTALL_K3S_NAME` | Name of systemd service to create, will default to 'k3s' if running k3s as a server and 'k3s-agent' if running k3s as an agent. If specified the name will be prefixed with 'k3s-'. |
| `INSTALL_K3S_TYPE` | Type of systemd service to create, will default from the K3s exec command if not specified.
| `INSTALL_K3S_CHANNEL_URL` | Channel URL for fetching K3s download URL. Defaults to https://update.k3s.io/v1-release/channels.
| `INSTALL_K3S_TYPE` | Type of systemd service to create, will default from the K3s exec command if not specified. |
| `INSTALL_K3S_SELINUX_WARN` | If set to true, the installation continues when the k3s-selinux policy is not found. |
| `INSTALL_K3S_SKIP_SELINUX_RPM` | If set to true, automatic installation of the k3s RPM is skipped. |
| `INSTALL_K3S_CHANNEL_URL` | Channel URL for fetching K3s download URL. Defaults to https://update.k3s.io/v1-release/channels. |
| `INSTALL_K3S_CHANNEL` | Channel to use for fetching K3s download URL. Defaults to "stable". Options include: `stable`, `latest`, `testing`. |
@@ -3,20 +3,8 @@ title: Best Practices Guide
weight: 4
---
> The Best Practices Guide will be updated for Rancher v2.5.
The purpose of this section is to consolidate best practices for Rancher implementations.
The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.
If you are using Rancher v2.0-v2.4, refer to the Best Practices Guide [here.](./v2.0-v2.4)
If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.
Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.
For more guidance on best practices, you can consult these resources:
- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
If you are using Rancher v2.5, refer to the Best Practices Guide [here.](./v2.5)
@@ -0,0 +1,21 @@
---
title: Best Practices Guide for Rancher v2.0-v2.4
shortTitle: v2.0-v2.4
weight: 2
---
The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.
If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.
Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.
For more guidance on best practices, you can consult these resources:
- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -1,6 +1,8 @@
---
title: Tips for Setting Up Containers
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/containers
---
Running well built containers can greatly impact the overall performance and security of your environment.
@@ -1,6 +1,8 @@
---
title: Rancher Deployment Strategies
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-strategies
---
There are two recommended deployment strategies. Each one has its own pros and cons. Read more about which one would fit best for your use case:
@@ -1,6 +1,8 @@
---
title: Tips for Running Rancher
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-types
---
A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single node environment.
@@ -1,6 +1,8 @@
---
title: Tips for Scaling, Security and Reliability
weight: 101
aliases:
- /v2.x/en/best-practices/management
---
Rancher allows you to set up numerous combinations of configurations. Some configurations are more appropriate for development and testing, while there are other best practices for production environments for maximum availability and fault tolerance. The following best practices should be followed for production.
@@ -0,0 +1,21 @@
---
title: Best Practices Guide for Rancher v2.5
shortTitle: v2.5
weight: 1
---
The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.
If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.
Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.
For more guidance on best practices, you can consult these resources:
- [Security]({{<baseurl>}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -0,0 +1,21 @@
---
title: Best Practices for Rancher Managed Clusters
shortTitle: Rancher Managed Clusters
weight: 2
---
### Logging
Refer to [this guide](./logging) for our recommendations for cluster-level logging and application logging.
### Monitoring
Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. Refer to this [guide](./monitoring) for our recommendations.
### Tips for Setting Up Containers
Running well built containers can greatly impact the overall performance and security of your environment. Refer to this [guide](./containers) for tips.
### Best Practices for Rancher Managed vSphere Clusters
This [guide](./managed-vsphere) outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,51 @@
---
title: Tips for Setting Up Containers
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/containers
---
Running well built containers can greatly impact the overall performance and security of your environment.
Below are a few tips for setting up your containers.
For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security.](https://rancher.com/complete-guide-container-security)
### Use a Common Container OS
When possible, you should try to standardize on a common container base OS.
Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface.
Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality.
### Start with a FROM scratch container
If your microservice is a standalone static binary, you should use a FROM scratch container.
The FROM scratch container is an [official Docker image](https://hub.docker.com/_/scratch) that is empty so that you can use it to design minimal images.
This will have the smallest attack surface and smallest image size.
### Run Container Processes as Unprivileged
When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. Inadvertent or accidental host mounts can also be impacted if the container is running as root. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
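As an illustration of the advice above, the following sketch writes a Pod manifest whose container runs as a non-root user. The pod name, image, file name, and UID are placeholders chosen for this example, not values from this guide.

```shell
# Write an example Pod manifest whose container runs as a non-root user.
# The pod name, image, and UID are placeholders.
cat > nonroot-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-example
spec:
  securityContext:
    runAsNonRoot: true        # refuse to start containers that would run as UID 0
    runAsUser: 1000
  containers:
  - name: app
    image: registry.example.com/my-app:1.0
    securityContext:
      allowPrivilegeEscalation: false
EOF

# To try it against a cluster you control:
# kubectl apply -f nonroot-pod.yaml
```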
### Define Resource Limits
Apply CPU and memory limits to your pods. This can help manage the resources on your worker nodes and prevent a malfunctioning microservice from impacting other microservices.
In standard Kubernetes, you can set resource limits on the namespace level. In Rancher, you can set resource limits on the project level and they will propagate to all the namespaces within the project. For details, refer to the Rancher docs.
When setting resource quotas, if you set anything related to CPU or Memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or Memory field set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace.
The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the namespace level.
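A default container resource limit on the namespace can be expressed with a Kubernetes `LimitRange` object. The sketch below writes one to a file; the namespace name, object name, and the CPU and memory values are illustrative assumptions and should be sized for your own workloads.

```shell
# Write a LimitRange manifest that gives every container in the namespace
# default requests and limits when the container does not set its own.
# Namespace, name, and values are placeholders.
cat > limitrange.yaml <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: my-namespace
spec:
  limits:
  - type: Container
    default:            # used as the limit when a container sets none
      cpu: 500m
      memory: 256Mi
    defaultRequest:     # used as the request when a container sets none
      cpu: 100m
      memory: 128Mi
EOF

# To try it against a cluster you control:
# kubectl apply -f limitrange.yaml
```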
### Define Resource Requirements
You should apply CPU and memory requirements to your pods. This is crucial for informing the scheduler which type of compute node your pod needs to be placed on, and ensuring it does not over-provision that node. In Kubernetes, you can set a resource requirement by defining `resources.requests` in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
> **Note:** If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.
It is recommended to define resource requirements on the container level because otherwise, the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.
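A container-level definition could look like the following sketch, which sets `resources.requests` alongside matching limits. The pod name, image, and quantities are assumptions for illustration only.

```shell
# Write an example Pod manifest with container-level requests and limits.
# Name, image, and quantities are placeholders.
cat > requests-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: requests-example
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:1.0
    resources:
      requests:          # what the scheduler reserves on the node
        cpu: 250m
        memory: 128Mi
      limits:            # ceiling enforced at runtime
        cpu: 500m
        memory: 256Mi
EOF

# To try it against a cluster you control:
# kubectl apply -f requests-pod.yaml
```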
### Liveness and Readiness Probes
Set up liveness and readiness probes for your container. Unless your container completely crashes, Kubernetes will not know it is unhealthy if you have not created an endpoint or mechanism that can report container status. Alternatively, make sure your container halts and crashes if unhealthy.
The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
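For an application that exposes HTTP health endpoints, the probes could be sketched as follows. The paths `/healthz` and `/ready`, the port, and the timing values are assumptions; use whatever endpoints your application actually provides.

```shell
# Write an example Pod manifest with HTTP liveness and readiness probes.
# Endpoint paths, port, and timings are placeholders.
cat > probes-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probes-example
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:1.0
    ports:
    - containerPort: 8080
    livenessProbe:             # repeated failures restart the container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:            # failures remove the pod from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
EOF

# To try it against a cluster you control:
# kubectl apply -f probes-pod.yaml
```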
@@ -0,0 +1,85 @@
---
title: Logging Best Practices
weight: 1
---
In this guide, we recommend best practices for cluster-level logging and application logging.
# Pre-2.5 Logging, and post-2.5
Logging in Rancher has historically been a fairly static integration. There was a fixed list of aggregators to choose from (ElasticSearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose (Cluster-level and Project-level).
Logging in 2.5 has been completely overhauled to provide a more flexible experience for log aggregation. With the new logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.
"Under the hood", Rancher logging uses the Banzai Cloud logging operator. We provide manageability of this operator (and its resources), and tie that experience in with managing your Rancher clusters.
# Cluster-level Logging
### Cluster-wide Scraping
For some users, it is desirable to scrape logs from every container running in the cluster. This usually coincides with your security team's request (or requirement) to collect all logs from all points of execution.
In this scenario, it is recommended to create at least two _ClusterOutput_ objects - one for your security team (if you have that requirement), and one for yourselves, the cluster administrators. When creating these objects take care to choose an output endpoint that can handle the significant log traffic coming from the entire cluster. Also make sure to choose an appropriate index to receive all these logs.
Once you have created these _ClusterOutput_ objects, create a _ClusterFlow_ to collect all the logs. Do not define any _Include_ or _Exclude_ rules on this flow. This will ensure that all logs from across the cluster are collected. If you have two _ClusterOutputs_, make sure to send logs to both of them.
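The setup described above might be sketched as follows, assuming Elasticsearch endpoints and the `logging.banzaicloud.io/v1beta1` resources of the Banzai Cloud logging operator. The host names, index names, object names, and the `cattle-logging-system` namespace are assumptions; check the field names against the operator version installed in your cluster.

```shell
# Write two ClusterOutputs and one ClusterFlow without Include/Exclude rules.
# Hosts, indexes, names, and namespace are placeholders for illustration.
cat > cluster-logging.yaml <<'EOF'
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: security-team
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: es.security.example.com
    port: 9200
    index_name: cluster-all-security
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: cluster-admins
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: es.ops.example.com
    port: 9200
    index_name: cluster-all-ops
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: cattle-logging-system
spec:
  # No match rules: every log line is forwarded to both outputs.
  globalOutputRefs:
  - security-team
  - cluster-admins
EOF

# To try it against a cluster you control:
# kubectl apply -f cluster-logging.yaml
```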
### Kubernetes Components
_ClusterFlows_ have the ability to collect logs from all containers on all hosts in the Kubernetes cluster. This works well in cases where those containers are part of a Kubernetes pod; however, RKE containers exist outside of the scope of Kubernetes.
Currently (as of v2.5.1) the logs from RKE containers are collected, but cannot easily be filtered. This is because those logs do not contain information about the source container (e.g. `etcd` or `kube-apiserver`).
A future release of Rancher will include the source container name which will enable filtering of these component logs. Once that change is made, you will be able to customize a _ClusterFlow_ to retrieve **only** the Kubernetes component logs, and direct them to an appropriate output.
# Application Logging
Best practice not only in Kubernetes but in all container-based applications is to direct application logs to `stdout`/`stderr`. The container runtime will then trap these logs and do **something** with them - typically writing them to a file. Depending on the container runtime (and its configuration), these logs can end up in any number of locations.
In the case of writing the logs to a file, Kubernetes helps by creating a `/var/log/containers` directory on each host. This directory symlinks the log files to their actual destination (which can differ based on configuration or container runtime).
Rancher logging will read all log entries in `/var/log/containers`, ensuring that all log entries from all containers (assuming a default configuration) will have the opportunity to be collected and processed.
### Specific Log Files
Log collection only retrieves `stdout`/`stderr` logs from pods in Kubernetes. But what if we want to collect logs from other files that are generated by applications? Here, a log streaming sidecar (or two) may come in handy.
The goal of setting up a streaming sidecar is to take log files that are written to disk, and have their contents streamed to `stdout`. This way, the Banzai Logging Operator can pick up those logs and send them to your desired output.
To set this up, edit your workload resource (e.g. Deployment) and add the following sidecar definition:
```
...
containers:
- args:
  - -F
  - /path/to/your/log/file.log
  command:
  - tail
  image: busybox
  name: stream-log-file-[name]
  volumeMounts:
  - mountPath: /path/to/your/log
    name: mounted-log
...
```
This will add a container to your workload definition that will now stream the contents of (in this example) `/path/to/your/log/file.log` to `stdout`.
This log stream is then automatically collected according to any _Flows_ or _ClusterFlows_ you have set up. You may also wish to consider creating a _Flow_ specifically for this log file by targeting the name of the container. See example:
```
...
spec:
  match:
  - select:
      container_names:
      - stream-log-file-name
...
```
## General Best Practices
- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier as there are already parsers written for these formats.
- Try to provide the name of the application that is creating the log entry, in the entry itself. This can make troubleshooting easier as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98` which does not provide much information about the application running inside the container.
- Except in the case of collecting all logs cluster-wide, try to scope your _Flow_ and _ClusterFlow_ objects tightly. This makes it easier to troubleshoot when problems arise, and also helps ensure unrelated log entries do not show up in your aggregator. An example of tight scoping would be to constrain a _Flow_ to a single _Deployment_ in a namespace, or perhaps even a single container within a _Pod_.
- Keep the log verbosity down except when troubleshooting. High log verbosity poses a number of issues, chief among them being **noise**: significant events can be drowned out in a sea of `DEBUG` messages. This is somewhat mitigated with automated alerting and scripting, but highly verbose logging still places an inordinate amount of stress on the logging infrastructure.
- Where possible, try to provide a transaction or request ID with the log entry. This can make tracing application activity across multiple log sources easier, especially when dealing with distributed applications.
@@ -0,0 +1,54 @@
---
title: Best Practices for Rancher Managed vSphere Clusters
shortTitle: Rancher Managed Clusters in vSphere
---
This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
## Solution Overview
![Solution Overview](./img/rancher/solution_overview.drawio.svg)
# 1 - VM Considerations
## Leverage VM Templates to Construct the Environment
To keep the Virtual Machines deployed across the environment consistent, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this and adds greater customisation options.
## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts
Doing so will ensure node VMs are spread across multiple ESXi hosts - preventing a single point of failure at the host level.
## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores
Doing so will ensure node VMs are spread across multiple datastores - preventing a single point of failure at the datastore level.
## Configure VMs as Appropriate for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes. These include disabling swap, verifying that you have full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
# 2 - Network Considerations
## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
## Consistent IP Addressing for VMs
Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
# 3 - Storage Considerations
## Leverage SSD Drives for ETCD Nodes
Because etcd is very sensitive to write latency, use SSD disks where possible.
# 4 - Backup and Disaster Recovery
## Perform Regular Downstream Cluster Backups
Kubernetes uses etcd to store all its data, including configuration, state, and metadata. Backing this up is crucial for disaster recovery.
## Back up Downstream Node VMs
Incorporate the Rancher downstream node VMs within a standard VM backup policy.
@@ -0,0 +1,112 @@
---
title: Monitoring Best Practices
weight: 2
---
Configuring sensible monitoring and alerting rules is vital for running any production workloads securely and reliably. This is no different when using Kubernetes and Rancher. Fortunately, the integrated monitoring and alerting functionality makes this whole process a lot easier.
The [Rancher Documentation]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/v2.5/) describes in detail how you can set up a complete Prometheus and Grafana stack. Out of the box, this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts for them to get started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your own specific use cases and cluster sizes. This document aims to give you best practices for this.
## What to monitor
Kubernetes itself, as well as applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/), especially the chapter about [Monitoring distributed systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/).
## Configuring Prometheus Resource Usage
When installing the integrated monitoring stack, Rancher allows you to configure several settings that depend on the size of your cluster and the workloads running in it. This chapter covers these in more detail.
### Storage and Data Retention
The amount of storage needed for Prometheus directly correlates to the number of time series and labels that you store and the data retention you have configured. It is important to note that Prometheus is not meant to be used as long-term metrics storage. Data retention time is usually only a couple of days, not weeks or months. The reason for this is that Prometheus does not perform any aggregation on its stored metrics. This is great because aggregation can dilute data, but it also means that, without a retention limit, the needed storage grows linearly over time.
One way to calculate the necessary storage is to look at the average number of bytes that Prometheus stores per sample, which this query returns:
```
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
```
Next, find out your data ingestion rate in samples per second:
```
rate(prometheus_tsdb_head_samples_appended_total[1h])
```
Then multiply these two values by the retention time, adding about 10 percent as a buffer:
```
average bytes per sample * ingestion rate in samples per second * retention time in seconds * 1.1 = necessary storage in bytes
```
You can find more information about how to calculate the necessary storage in this [blog post](https://www.robustperception.io/how-much-disk-space-do-prometheus-blocks-use).
You can read more about the Prometheus storage concept in the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/storage).
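As a worked example of the storage formula in this section, the sketch below plugs in made-up numbers: an average size of 2 bytes per sample from the first query, 10,000 samples per second from the second query, and 15 days of retention. These inputs are assumptions for illustration, not measurements; substitute the values your own Prometheus returns.

```shell
# Illustrative inputs; replace them with the results of your own queries.
bytes_per_sample=2          # result of the first query, rounded up
samples_per_second=10000    # result of the second query
retention_days=15

retention_seconds=$(( retention_days * 24 * 60 * 60 ))

# Multiply by 11/10 to add the 10 percent buffer using integer arithmetic.
storage_bytes=$(( bytes_per_sample * samples_per_second * retention_seconds * 11 / 10 ))

echo "Estimated storage: ${storage_bytes} bytes"
```

With these inputs the estimate is 28,512,000,000 bytes, or roughly 28.5 GB.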
### CPU and Memory Requests and Limits
In larger Kubernetes clusters, Prometheus can consume quite a bit of memory. The amount of memory Prometheus needs directly correlates to the number of time series and labels it stores and the scrape interval at which these are filled.
You can find more information about how to calculate the necessary memory in this [blog post](https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion).
The number of CPUs needed correlates with the number of queries you are performing.
### Federation and long-term Storage
Prometheus is not meant to store metrics for a long time and should only be used for short-term storage.
In order to store some, or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).
## Scraping Custom Workloads
While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. To do so, you can configure Prometheus to send an HTTP request to an endpoint of your application at a regular interval. These endpoints should then return their metrics in a Prometheus format.
In general, you want to scrape data from all the workloads running in your cluster so that you can use them for alerts or debugging issues. Often, you only realize that you need certain metrics during an incident, and it helps if they have already been scraped and stored. Since Prometheus is only meant to be a short-term metrics storage, scraping and keeping lots of data is usually not that expensive. If you are using a long-term storage solution with Prometheus, you can still decide which data you actually persist there.
### About Prometheus Exporters
A lot of third-party workloads, such as databases, queues, or web servers, either already support exposing metrics in a Prometheus format, or there are so-called exporters available that translate between the tool's metrics and the format that Prometheus understands. Usually you can add these exporters as additional sidecar containers to the workload's Pods. A lot of Helm charts already include options to deploy the correct exporter. Additionally, you can find a curated list of exporters by Sysdig on [promcat.io](https://promcat.io/) and on [ExporterHub](https://exporterhub.io/).
### Prometheus support in Programming Languages and Frameworks
To get your own custom application metrics into Prometheus, you have to collect and expose these metrics directly from your application's code. Fortunately, there are already libraries and integrations available to help with this for most popular programming languages and frameworks. One example of this is the Prometheus support in the [Spring Framework](https://docs.spring.io/spring-metrics/docs/current/public/prometheus).
### ServiceMonitors and PodMonitors
Once all your workloads expose metrics in a Prometheus format, you have to configure Prometheus to scrape them. Under the hood, Rancher uses the [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator). This makes it easy to add additional scraping targets with ServiceMonitors and PodMonitors. A lot of Helm charts already include an option to create these monitors directly. You can also find more information in the [Rancher Documentation](TODO).
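A ServiceMonitor for a workload that exposes metrics through a Kubernetes Service could be sketched as follows. The names, namespace, label, port name, and interval are assumptions for illustration; the port must match a named port on your own Service.

```shell
# Write an example ServiceMonitor that scrapes Services labelled app=my-app.
# Names, namespace, label, and port name are placeholders.
cat > servicemonitor.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app          # selects the Service, not the Pods
  endpoints:
  - port: metrics          # name of the Service port that serves metrics
    path: /metrics
    interval: 30s
EOF

# To try it against a cluster you control:
# kubectl apply -f servicemonitor.yaml
```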
### Prometheus Push Gateway
There are some workloads that are traditionally hard for Prometheus to scrape. Examples are short-lived workloads like Jobs and CronJobs, or applications that do not allow sharing data between individually handled incoming requests, like PHP applications.
To still get metrics for these use cases, you can set up [prometheus-pushgateways](https://github.com/prometheus/pushgateway). The CronJob or PHP application would push metric updates to the pushgateway. The pushgateway aggregates and exposes them through an HTTP endpoint, which then can be scraped by Prometheus.
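Pushing a metric from a short-lived job can be sketched with `curl`, which posts a sample in the Prometheus text format to the Pushgateway's `/metrics/job/<job name>` path. The helper function names, the gateway URL, and the metric name below are hypothetical.

```shell
# Format one sample in the Prometheus text exposition format: "<name> <value>".
format_metric() {
    printf '%s %s\n' "$1" "$2"
}

# Push one sample to a Pushgateway under the given job name.
# $1 = Pushgateway base URL, $2 = job name, $3 = metric name, $4 = value
push_metric() {
    format_metric "$3" "$4" |
        curl --silent --show-error --fail --data-binary @- "$1/metrics/job/$2"
}

# Example call at the end of a CronJob script (hypothetical URL and names):
# push_metric http://pushgateway.example.org:9091 nightly_backup \
#     backup_last_success_timestamp_seconds "$(date +%s)"
```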
### Prometheus Blackbox Monitor
Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter) which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.
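For orientation, a blackbox-exporter configuration file defines named modules that describe how to probe a target. The sketch below defines one HTTP module and one TCP module. The module names and timeouts are arbitrary choices for this example.

```yaml
# Hypothetical blackbox-exporter configuration with two probe modules.
modules:
  # HTTP probe: succeeds on a 2xx response (the default) served over TLS.
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: ip4
      fail_if_not_ssl: true
  # TCP probe: succeeds if a connection can be established.
  tcp_connect:
    prober: tcp
    timeout: 5s
```

Prometheus selects a module per scrape by passing its name, together with the target, as query parameters to the exporter's `/probe` endpoint.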
## Monitoring in a (Micro)Service Architecture
If you have a (micro)service architecture where multiple individual workloads within your cluster communicate with each other, detailed metrics and traces about this traffic are essential for understanding how the workloads interact and where a problem or bottleneck may be.
Of course, you can instrument all this internal traffic in each of your workloads and expose the metrics to Prometheus, but this can quickly become labor-intensive. Service meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services.
## Real User Monitoring
Monitoring the availability and performance of all your internal workloads is vitally important for running stable, reliable, and fast applications. But these metrics only show you part of the picture. To get a complete view, you also need to know how your end users actually experience your applications. For this, you can look into various [Real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring).
## Security Monitoring
In addition to monitoring workloads to detect performance, availability, or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans]({{<baseurl>}}/rancher/v2.x/en/cis-scans/v2.5/), which check whether the cluster is configured according to security best practices.
For the workloads, you can have a look at Kubernetes and container security solutions like [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/), and [Sysdig](https://sysdig.com/).
## Setting up Alerts
Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted if something goes wrong.
The integrated Rancher monitoring already configures a set of alerts that are useful in any Kubernetes cluster. You should extend these to cover your specific workloads and use cases.
When setting up alerts, configure them for all the workloads that are critical to the availability of your applications, but make sure that they are not too noisy. Ideally, every alert you receive should point to a problem that needs your attention and needs to be fixed. If you have alerts that fire all the time but are not that critical, there is a danger that you start ignoring your alerts altogether and then miss the really important ones. Less may be more here. Focus on the most important metrics first, for example by alerting if your application is offline. Fix the problems that surface, and then start to create more detailed alerts.
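An "application is offline" alert of the kind described above could be sketched as a PrometheusRule like the following. The rule name, namespace, the `job="my-app"` label, and the five-minute threshold are assumptions that you would adapt to your own workload.

```yaml
# Hypothetical example: alert when all scrapes of my-app have failed for 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-availability
  namespace: my-namespace
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppDown
          # "up" is 0 when Prometheus fails to scrape a target.
          expr: up{job="my-app"} == 0
          # Require the condition to hold for a while to avoid noisy alerts.
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: my-app has been unreachable for 5 minutes
```

The `for` clause is one simple way to keep an alert quiet during brief restarts, so that it only fires for problems that actually need attention.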
If an alert starts firing, but there is nothing you can do about it at the moment, it's also fine to silence the alert for a certain amount of time, so that you can look at it later.
You can find more information on how to set up alerts and notification channels in the [Rancher Documentation]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/v2.5).
@@ -0,0 +1,19 @@
---
title: Best Practices for the Rancher Server
shortTitle: Rancher Server
weight: 1
---
This guide contains our recommendations for running the Rancher server, and is intended to be used in situations in which Rancher manages downstream Kubernetes clusters.
### Recommended Architecture and Infrastructure
Refer to this [guide](./deployment-types) for our general advice for setting up the Rancher server on a high-availability Kubernetes cluster.
### Deployment Strategies
This [guide](./deployment-strategies) is designed to help you choose whether a regional deployment strategy or a hub-and-spoke deployment strategy is better for a Rancher server that manages downstream Kubernetes clusters.
### Installing Rancher in a vSphere Environment
This [guide](./rancher-in-vsphere) outlines a reference architecture for installing Rancher in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
@@ -0,0 +1,45 @@
---
title: Rancher Deployment Strategy
weight: 100
---
There are two recommended deployment strategies for a Rancher server that manages downstream Kubernetes clusters. Each one has its own pros and cons. Read more about which one would fit best for your use case:
* [Hub and Spoke](#hub-and-spoke)
* [Regional](#regional)
# Hub & Spoke Strategy
---
In this deployment scenario, a single Rancher control plane manages Kubernetes clusters across the globe. The control plane runs on a high-availability Kubernetes cluster, and managing distant clusters is affected by network latency.
{{< img "/img/rancher/bpg/hub-and-spoke.png" "Hub and Spoke Deployment">}}
### Pros
* Environments could have nodes and network connectivity across regions.
* Single control plane interface to view all regions and environments.
* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher control plane.
### Cons
* Subject to network latencies.
* If the control plane goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually.
# Regional Strategy
---
In the regional deployment model a control plane is deployed in close proximity to the compute nodes.
{{< img "/img/rancher/bpg/regional.png" "Regional Deployment">}}
### Pros
* Rancher functionality in a region stays operational if a control plane in another region goes down.
* Network latency is greatly reduced, improving the performance of functionality in Rancher.
* Upgrades of the Rancher control plane can be done independently per region.
### Cons
* Overhead of managing multiple Rancher installations.
* Visibility across global Kubernetes clusters requires multiple interfaces/panes of glass.
* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server.
@@ -0,0 +1,39 @@
---
title: Tips for Running Rancher
weight: 100
aliases:
- /rancher/v2.x/en/best-practices/deployment-types
---
This guide is geared toward use cases where Rancher is used to manage downstream Kubernetes clusters. The high-availability setup is intended to prevent losing access to downstream clusters if the Rancher server is not available.
A high-availability Kubernetes installation, defined as an installation of Rancher on a Kubernetes cluster with at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single node environment.
If you are installing Rancher in a vSphere environment, refer to the best practices documented [here.](../rancher-in-vsphere)
When you set up your high-availability Rancher installation, consider the following:
### Run Rancher on a Separate Cluster
Don't run other workloads or microservices in the Kubernetes cluster that Rancher is installed on.
### Make Sure Nodes are Configured Correctly for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, double-checking that you have full network connectivity between all machines in the cluster, using unique hostnames, MAC addresses, and product_uuids for every node, checking that all required ports are open, and deploying with SSD-backed etcd. More details can be found in the [Kubernetes docs](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#before-you-begin) and [etcd's performance op guide](https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/performance.md).
### When using RKE: Back up the Statefile
RKE keeps a record of the cluster state in a file called `cluster.rkestate`. This file is important for the recovery of a cluster and/or the continued maintenance of the cluster through RKE. Because this file contains certificate material, we strongly recommend encrypting it before backing it up. You should back up the state file after each run of `rke up`.
### Run All Nodes in the Cluster in the Same Datacenter
For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c.
### Development and Production Environments Should be Similar
It's strongly recommended to have a "staging" or "pre-production" environment of the Kubernetes cluster that Rancher runs on. This environment should mirror your production environment as closely as possible in terms of software and hardware configuration.
### Monitor Your Clusters to Plan Capacity
The Rancher server's Kubernetes cluster should run within the [system and hardware requirements]({{<baseurl>}}/rancher/v2.x/en/installation/requirements/) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take.
However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types.
Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus.
After you [enable monitoring]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/legacy/monitoring/cluster-monitoring/) in the cluster, you can set up [a notification channel]({{<baseurl>}}/rancher/v2.x/en/cluster-admin/tools/notifiers/) and [cluster alerts]({{<baseurl>}}/rancher/v2.x/en/cluster-admin/tools/alerts/) to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
@@ -0,0 +1,85 @@
---
title: Installing Rancher in a vSphere Environment
shortTitle: On-Premises Rancher in vSphere
weight: 3
---
This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.
## Solution Overview
![Solution Overview](/img/rancher/rancher-on-prem-vsphere.svg)
# 1 - Load Balancer Considerations
A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes.
## Leverage Fault Tolerance and High Availability
Use an external (hardware or software) load balancer that has inherent high-availability functionality (F5, NSX-T, Keepalived, etc).
## Back Up Load Balancer Configuration
In the event of a disaster recovery activity, having the load balancer configuration available will expedite the recovery process.
## Configure Health Checks
Configure the load balancer to automatically mark nodes as unavailable if a health check fails. For example, NGINX can facilitate this with:
`max_fails=3 fail_timeout=5s`
## Leverage an External Load Balancer
Avoid implementing a software load balancer within the management cluster.
## Secure Access to Rancher
Configure appropriate firewall / ACL rules so that only access to Rancher is exposed.
# 2 - VM Considerations
## Size the VM's According to Rancher Documentation
https://rancher.com/docs/rancher/v2.x/en/installation/requirements/
## Leverage VM Templates to Construct the Environment
To keep the Virtual Machines deployed across the environment consistent, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.
## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts
Doing so will ensure node VMs are spread across multiple ESXi hosts - preventing a single point of failure at the host level.
## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores
Doing so will ensure node VMs are spread across multiple datastores - preventing a single point of failure at the datastore level.
## Configure VM's as Appropriate for Kubernetes
It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, double-checking that you have full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
# 3 - Network Considerations
## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.
## Consistent IP Addressing for VM's
Each node used should have a static IP configured. In the case of DHCP, each node should have a DHCP reservation to make sure the node gets the same IP allocated.
# 4 - Storage Considerations
## Leverage SSD Drives for ETCD Nodes
etcd is very sensitive to write latency. Therefore, use SSD disks where possible.
# 5 - Backup and Disaster Recovery
## Perform Regular Management Cluster Backups
Rancher stores its data in the etcd datastore of the Kubernetes cluster it resides on. As with any Kubernetes cluster, perform frequent, tested backups of this cluster.
## Back up Rancher Cluster Node VM's
Incorporate the Rancher management node VMs within a standard VM backup policy.
@@ -166,8 +166,8 @@ RKE uses a `.yml` config file to install and configure your Kubernetes cluster.
1. Download one of following templates, depending on the SSL certificate you're using.
- [Template for self-signed certificate<br/> `3-node-certificate.yml`]({{<baseurl>}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate)
- [Template for certificate signed by recognized CA<br/> `3-node-certificate-recognizedca.yml`]({{<baseurl>}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate-recognizedca)
- [Template for self-signed certificate<br/>]({{<baseurl>}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate)
- [Template for certificate signed by recognized CA<br/> ]({{<baseurl>}}/rancher/v2.x/en/installation/options/cluster-yml-templates/3-node-certificate-recognizedca)
@@ -76,7 +76,7 @@ kubectl -n cattle-system logs -f rancher-784d94f59b-vgqzh
Use your browser to check the certificate details. If it says the Common Name is "Kubernetes Ingress Controller Fake Certificate", something may have gone wrong with reading or issuing your SSL cert.
> **Note:** if you are using LetsEncrypt to issue certs it can sometimes take a few minuets to issue the cert.
> **Note:** if you are using LetsEncrypt to issue certs it can sometimes take a few minutes to issue the cert.
### Checking for issues with cert-manager issued certs (Rancher Generated or LetsEncrypt)
+32 -22
View File
@@ -5,14 +5,14 @@ weight: 1
---
- [Changes in Rancher v2.5](#changes-in-rancher-v2-5)
- [Configuring the Logging Output for the Rancher Kubernetes Cluster](#configuring-the-logging-output-for-the-rancher-kubernetes-cluster)
- [Enabling Logging for Rancher Managed Clusters](#enabling-logging-for-rancher-managed-clusters)
- [Uninstall Logging](#uninstall-logging)
- [Role-based Access Control](#role-based-access-control)
- [Configuring the Logging Application](#configuring-the-logging-application)
- [Working with Taints and Tolerations](#working-with-taints-and-tolerations)
### Changes in Rancher v2.5
# Changes in Rancher v2.5
The following changes were introduced to logging in Rancher v2.5:
@@ -30,13 +30,7 @@ The following figure from the [Banzai documentation](https://banzaicloud.com/doc
![How the Banzai Cloud Logging Operator Works with Fluentd]({{<baseurl>}}/img/rancher/banzai-cloud-logging-operator.png)
### Configuring the Logging Output for the Rancher Kubernetes Cluster
If you install Rancher as a Helm chart, you'll configure the Helm chart options to select a logging output for all the logs in the local Kubernetes cluster.
If you install Rancher using the Rancher CLI on an Linux OS, the Rancher Helm chart will be installed on a Kubernetes cluster with default options. Then when the Rancher UI is available, you'll enable the logging app from the Apps section of the UI. Then during the process of installing the logging application, you will configure the logging output.
### Enabling Logging for Rancher Managed Clusters
# Enabling Logging for Rancher Managed Clusters
You can enable the logging for a Rancher managed cluster by going to the Apps page and installing the logging app.
@@ -47,7 +41,7 @@ You can enable the logging for a Rancher managed cluster by going to the Apps pa
**Result:** The logging app is deployed in the `cattle-logging-system` namespace.
### Uninstall Logging
# Uninstall Logging
1. From the **Cluster Explorer,** click **Apps & Marketplace.**
1. Click **Installed Apps.**
@@ -57,7 +51,27 @@ You can enable the logging for a Rancher managed cluster by going to the Apps pa
**Result** `rancher-logging` is uninstalled.
### Configuring the Logging Application
# Role-based Access Control
Rancher logging has two roles, `logging-admin` and `logging-view`.
`logging-admin` allows users full access to namespaced flows and outputs.
The `logging-view` role allows users to view namespaced flows and outputs, and cluster flows and outputs.
Edit access to the cluster flow and cluster output resources is powerful because it gives a user control over all logs in the cluster.
In Rancher, the cluster administrator role is the only role with full access to all rancher-logging resources.
Cluster members are not able to edit or read any logging resources.
Project owners are able to create namespaced flows and outputs in the namespaces under their projects. This means that project owners can collect logs from anything in their project namespaces. Project members are able to view the flows and outputs in the namespaces under their projects. Project owners and project members require at least 1 namespace in their project to use logging. If they do not have at least one namespace in their project they may not see the logging button in the top nav dropdown.
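As a rough sketch of how such a role could be granted by hand, the RoleBinding below gives one user view access to logging resources in a single namespace. It assumes that `logging-view` is installed as a ClusterRole with exactly that name. The user name `jane` and the namespace `my-namespace` are placeholders. In practice, access is usually assigned through Rancher's own role management rather than with raw manifests.

```yaml
# Hypothetical example: grant view access to flows and outputs in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-logging-view
  namespace: my-namespace
subjects:
  - kind: User
    name: jane                      # placeholder user name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole                 # assumption: the role is a ClusterRole
  name: logging-view
  apiGroup: rbac.authorization.k8s.io
```

Binding a ClusterRole through a namespaced RoleBinding limits the granted permissions to that namespace.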
# Configuring the Logging Application
To configure the logging application, go to the **Cluster Explorer** in the Rancher UI. In the upper left corner, click **Cluster Explorer > Logging.**
### Overview of Logging Custom Resources
The following Custom Resource Definitions are used to configure logging:
@@ -68,11 +82,7 @@ According to the [Banzai Cloud documentation,](https://banzaicloud.com/docs/one-
> You can define `outputs` (destinations where you want to send your log messages, for example, Elasticsearch, or and Amazon S3 bucket), and `flows` that use filters and selectors to route log messages to the appropriate outputs. You can also define cluster-wide outputs and flows, for example, to use a centralized output that namespaced users cannot modify.
**RBAC**
Rancher logging has two roles, `logging-admin` and `logging-view`. `logging-admin` allows users full access to namespaced flows and outputs. The `logging-view` role allows users to view namespaced flows and outputs, and cluster flows and outputs. Edit access to the cluster flow and cluster output resources is powerful as it allows any user with edit access control of all logs in the cluster. Cluster admin is the only role with full access to all rancher-logging resources. Cluster members are not able to edit or read any logging resources. Project owners are able to create namespaced flows and outputs in the namespaces under their projects. This means that project owners can collect logs from anything in their project namespaces. Project members are able to view the flows and outputs in the namespaces under their projects. Project owners and project members require at least 1 namespace in their project to use logging. If they do not have at least one namespace in their project they may not see the logging button in the top nav dropdown.
**Examples**
### Examples
Let's say you wanted to send all logs in your cluster to an elasticsearch cluster.
@@ -257,7 +267,7 @@ spec:
If we break down what is happening, first we create a deployment of a container that has the additional syslog plugin and accepts logs forwarded from another fluentd. Next we create an output configured as a forwarder to our deployment. The deployment fluentd will then forward all logs to the configured syslog destination.
### Working with Taints and Tolerations
# Working with Taints and Tolerations
"Tainting" a Kubernetes node causes pods to repel running on that node.
Unless the pods have a ```toleration``` for that node's taint, they will run on other nodes in the cluster.
@@ -265,7 +275,7 @@ Unless the pods have a ```toleration``` for that node's taint, they will run on
Using ```nodeSelector``` gives pods an affinity towards certain nodes.
Both provide a way to choose which node(s) the pod will run on.
**Default Implementation in Rancher's Logging Stack**
### Default Implementation in Rancher's Logging Stack
By default, Rancher taints all Linux nodes with ```cattle.io/os=linux```, and does not taint Windows nodes.
The logging stack pods have ```tolerations``` for this taint, which enables them to run on Linux nodes.
@@ -290,14 +300,14 @@ spec:
In the above example, we ensure that our pod only runs on Linux nodes, and we add a ```toleration``` for the taint we have on all of our Linux nodes.
You can do the same with Rancher's existing taints, or with your own custom ones.
**Are clusters with Windows worker nodes supported?**
### Windows Support
Yes, clusters with Windows worker support logging with some small caveats...
Clusters with Windows workers support logging with some small caveats:
1. Windows node logs cannot currently be exported.
2. ```fluentd-configcheck``` pod(s) will fail due to an [upstream issue](https://github.com/banzaicloud/logging-operator/issues/592), where ```tolerations``` and ```nodeSelector``` settings are not inherited from the ```logging-operator```.
**Adding NodeSelector Settings and Tolerations for Custom Taints**
### Adding NodeSelector Settings and Tolerations for Custom Taints
If you would like to add your own ```nodeSelector``` settings, or if you would like to add ```tolerations``` for additional taints, you can pass the following to the chart's values.
@@ -316,4 +326,4 @@ However, if you would like to add tolerations for *only* the ```fluentbit``` con
```yaml
fluentbit_tolerations:
# insert tolerations list for fluentbit containers only
```
```