Merge pull request #1600 from rancher/staging

Best Practices Guide
Denise
2019-07-22 12:59:10 -07:00
committed by GitHub
7 changed files with 262 additions and 0 deletions
@@ -0,0 +1,23 @@
---
title: Best Practices Guide
weight: 1000
---
The purpose of this section is to consolidate best practices for Rancher implementations. This also includes recommendations for related technologies, such as Kubernetes, Docker, containers, and more. The objective is to improve the outcome of a Rancher implementation using the operational experience of Rancher and its customers.
If you have any questions about how these might apply to your use case, please contact your Customer Success Manager or Support.
Use the navigation bar on the left to find the current best practices for managing and deploying the Rancher Server.
For more guidance on best practices, you can consult these resources:
- [Rancher Docs]({{< baseurl >}})
- [Monitoring]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/)
- [Backups and Disaster Recovery]({{< baseurl >}}/rancher/v2.x/en/backups/)
- [Security]({{< baseurl >}}/rancher/v2.x/en/security/)
- [Rancher Blog](https://rancher.com/blog/)
- [Articles about best practices on the Rancher blog](https://rancher.com/tags/best-practices/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
- [Rancher Forum](https://forums.rancher.com/)
- [Rancher Users Slack](https://slack.rancher.io/)
- [Rancher Labs YouTube Channel - Online Meetups, Demos, Training, and Webinars](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA/featured)
@@ -0,0 +1,51 @@
---
title: Tips for Setting Up Containers
weight: 100
---
Running well-built containers can greatly improve the overall performance and security of your environment.
Below are a few tips for setting up your containers.
For a more detailed discussion of security for containers, you can also refer to Rancher's [Guide to Container Security](https://rancher.com/complete-guide-container-security).
### Use a Common Container OS
When possible, you should try to standardize on a common container base OS.
Smaller distributions such as Alpine and BusyBox reduce container image size and generally have a smaller attack/vulnerability surface.
Popular distributions such as Ubuntu, Fedora, and CentOS are more field-tested and offer more functionality.
Another option is RancherOS, an operating system composed entirely of Docker containers. Everything in RancherOS is a container managed by Docker. This includes system services such as udev and syslog. RancherOS includes only the bare minimum amount of software needed to run Docker, decreasing complexity and boot time. The small code base and decreased attack surface of RancherOS also improve security. For details, you can refer to the [RancherOS docs]({{< baseurl >}}/os/v1.x/en/).
### Start with a FROM scratch container
If your microservice is a standalone static binary, you should build its image `FROM scratch`.
`scratch` is an empty [official Docker image](https://hub.docker.com/_/scratch) that you can use as the base for minimal images.
An image built this way has the smallest attack surface and the smallest image size.
### Run Container Processes as Unprivileged
When possible, use a non-privileged user when running processes within your container. While container runtimes provide isolation, vulnerabilities and attacks are still possible. If the container is running as root, inadvertent or accidental host mounts can also be affected. For details on configuring a security context for a pod or container, refer to the [Kubernetes docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
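As a minimal sketch of the advice above, the following pod manifest asks Kubernetes to run the container as a non-root user. The pod name, image, and UID are placeholders chosen for illustration, and the image is assumed to work when run as that UID.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-example            # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true                  # the kubelet refuses to start a container that would run as UID 0
    runAsUser: 1000                     # assumed UID; the image must be able to run as this user
  containers:
    - name: app
      image: example.com/my-app:1.0.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false # block setuid binaries from gaining extra privileges
```

Settings under `spec.securityContext` apply to every container in the pod, while the container-level `securityContext` applies only to that container.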
### Define Resource Limits
Apply CPU and memory limits to your pods. This can help manage the resources on your worker nodes and prevent a malfunctioning microservice from impacting other microservices.
In standard Kubernetes, you can set resource limits at the namespace level. In Rancher, you can set resource limits at the project level, and they propagate to all the namespaces within the project. For details, refer to the [Rancher docs]({{<baseurl>}}/rancher/v2.x/en/project-admin/resource-quotas/).
When setting resource quotas, if you set anything related to CPU or Memory (i.e. limits or reservations) on a project or namespace, all containers will require a respective CPU or Memory field set during creation. To avoid setting these limits on each and every container during workload creation, a default container resource limit can be specified on the namespace.
The Kubernetes docs have more information on how resource limits can be set at the [container level](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) and the [namespace level](https://kubernetes.io/docs/concepts/policy/resource-quotas/).
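To illustrate how a namespace quota and a default container limit fit together in plain Kubernetes, here is a hedged sketch. The namespace `team-a`, the object names, and all of the quantities are assumptions for illustration, not recommended values.

```yaml
# Caps the total CPU and memory that all pods in the namespace may request or use.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # hypothetical name
  namespace: team-a          # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Supplies defaults so that containers created without explicit values are still admitted.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: team-a
spec:
  limits:
    - type: Container
      default:               # used as the limit when a container sets none
        cpu: 500m
        memory: 256Mi
      defaultRequest:        # used as the request when a container sets none
        cpu: 100m
        memory: 128Mi
```

With both objects in place, a container that omits its `resources` block receives the `LimitRange` defaults and is counted against the `ResourceQuota`.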
### Define Resource Requirements
You should apply CPU and memory requirements to your pods. This is crucial for informing the scheduler which type of compute node your pod needs to be placed on, and ensuring it does not over-provision that node. In Kubernetes, you can set a resource requirement by defining `resources.requests` in a pod's container spec. For details, refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container).
> **Note:** If you set a resource limit for the namespace that the pod is deployed in, and the container doesn't have a specific resource request, the pod will not be allowed to start. To avoid setting these fields on each and every container during workload creation, a default container resource limit can be specified on the namespace.
It is recommended to define resource requirements on the container level because otherwise, the scheduler makes assumptions that will likely not be helpful to your application when the cluster experiences load.
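A container-level definition might look like the following sketch. The pod name, image, and quantities are placeholders; appropriate values depend on how your application actually behaves under load.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resources-example               # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/my-app:1.0.0   # placeholder image
      resources:
        requests:                       # what the scheduler reserves on the node for this container
          cpu: 250m
          memory: 128Mi
        limits:                         # the ceiling enforced at runtime
          cpu: 500m
          memory: 256Mi
```

The scheduler places the pod using only `requests`; `limits` are enforced on the node after the pod is running.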
### Liveness and Readiness Probes
Set up liveness and readiness probes for your container. Unless your container crashes completely, Kubernetes will not know it is unhealthy; you need to provide an endpoint or mechanism that reports container status. Alternatively, make sure your container halts and crashes if unhealthy.
The Kubernetes docs show how to [configure liveness and readiness probes for containers.](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/)
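As a rough sketch, HTTP probes for a container could be declared as follows. The `/healthz` and `/ready` paths, port 8080, and the timing values are assumptions; your application has to expose whatever endpoints you configure here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probes-example                  # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/my-app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      livenessProbe:                    # repeated failures cause the kubelet to restart the container
        httpGet:
          path: /healthz                # assumed endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:                   # failures remove the pod from Service endpoints without restarting it
        httpGet:
          path: /ready                  # assumed endpoint
          port: 8080
        periodSeconds: 5
```

Keeping the two probes separate lets a slow or temporarily overloaded container stop receiving traffic without being restarted.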
@@ -0,0 +1,45 @@
---
title: Rancher Deployment Strategies
weight: 100
---
There are two recommended deployment strategies. Each one has its own pros and cons. Read more about which one would fit best for your use case:
* [Hub and Spoke](#hub-and-spoke)
* [Regional](#regional)
# Hub & Spoke Strategy
---
In this deployment scenario, there is a single Rancher control plane managing Kubernetes clusters across the globe. The control plane would be run in an HA (high-availability) configuration, and network latency between the control plane and distant clusters would have some impact.
![Hub and Spoke Deployment]({{< baseurl >}}/img/rancher/bpg/hub-and-spoke.png)
### Pros
* Environments could have nodes and network connectivity across regions.
* Single control plane interface to view all regions and environments.
* Kubernetes does not require Rancher to operate and can tolerate losing connectivity to the Rancher control plane.
### Cons
* Subject to network latencies.
* If the control plane goes down, global provisioning of new services is unavailable until it is restored. However, each Kubernetes cluster can continue to be managed individually.
# Regional Strategy
---
In the regional deployment model, a control plane is deployed in close proximity to the compute nodes.
![Regional Deployment]({{< baseurl >}}/img/rancher/bpg/regional.png)
### Pros
* Rancher functionality in a region stays operational if a control plane in another region goes down.
* Network latency is greatly reduced, improving the performance of functionality in Rancher.
* Upgrades of the Rancher control plane can be done independently per region.
### Cons
* Overhead of managing multiple Rancher installations.
* Visibility across global Kubernetes clusters requires multiple interfaces/panes of glass.
* Deploying multi-cluster apps in Rancher requires repeating the process for each Rancher server.
@@ -0,0 +1,32 @@
---
title: Tips for Running Rancher
weight: 100
---
A high-availability (HA) installation, defined as an installation of at least three nodes, should be used in any production installation of Rancher, as well as any installation deemed "important." Multiple Rancher instances running on multiple nodes ensure high availability that cannot be accomplished with a single-node environment.
When you set up your high-availability Rancher installation, consider the following:
### Run Rancher on a Separate Cluster
Don't run other workloads or microservices in your Rancher HA cluster.
### Don't Run Rancher on a Hosted Kubernetes Environment
Don't run Rancher HA in a hosted Kubernetes environment such as Google's GKE, Amazon's EKS, or Microsoft's AKS. These hosted Kubernetes solutions do not expose etcd to a degree that is manageable for Rancher, and their customizations can interfere with Rancher operations.
It is strongly recommended to use hosted infrastructure such as Amazon's EC2 or Google's GCE instead. When you create a cluster using RKE on an infrastructure provider, you can configure the cluster to create etcd snapshots as a backup. You can then [use RKE]({{<baseurl>}}/rke/latest/en/etcd-snapshots/) or [Rancher]({{<baseurl>}}/rancher/v2.x/en/backups/restorations/) to restore your cluster from one of these snapshots. In a hosted Kubernetes environment, this backup and restore functionality is not supported.
### Run All Nodes in the Cluster in the Same Datacenter
For best performance, run all three of your nodes in the same geographic datacenter. If you are running nodes in the cloud, such as AWS, run each node in a separate Availability Zone. For example, launch node 1 in us-west-2a, node 2 in us-west-2b, and node 3 in us-west-2c.
### Development and Production Environments Should be Similar
It's strongly recommended to have a "staging" or "pre-production" environment for your Rancher HA cluster that mirrors your production environment as closely as possible in terms of software and hardware configuration.
### Monitor Your Clusters to Plan Capacity
You should run Rancher HA within the [system and hardware requirements]({{< baseurl >}}/rancher/v2.x/en/installation/requirements/) as closely as possible. The more you deviate from the system and hardware requirements, the more risk you take.
However, metrics-driven capacity planning analysis should be the ultimate guidance for scaling Rancher, because the published requirements take into account a variety of workload types.
Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution, and Grafana, which lets you visualize the metrics from Prometheus.
After you [enable monitoring]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/) in the cluster, you can set up [a notification channel]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/notifiers/) and [cluster alerts]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/alerts/) to let you know if your cluster is approaching its capacity. You can also use the Prometheus and Grafana monitoring framework to establish a baseline for key metrics as you scale.
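Rancher's cluster alerts are configured through the pages linked above. Purely as an illustration of the kind of capacity condition worth alerting on, here is a sketch in the generic Prometheus alerting rule file format. The group name, alert name, and 85% threshold are assumptions, and the metric names assume a node exporter version that exposes `node_memory_MemAvailable_bytes`.

```yaml
groups:
  - name: capacity                      # hypothetical group name
    rules:
      - alert: NodeMemoryNearCapacity   # hypothetical alert name
        # Fraction of memory in use on each node, compared with an assumed 85% threshold.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
        for: 15m                        # fire only if the condition holds for 15 minutes
        labels:
          severity: warning
        annotations:
          summary: Node memory usage has stayed above 85% for 15 minutes
```

The `for` clause keeps short spikes from paging anyone; a baseline established in Grafana is a better source for the threshold than the value assumed here.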
@@ -0,0 +1,111 @@
---
title: Tips for Scaling, Security and Reliability
weight: 101
---
Rancher allows you to set up numerous combinations of configurations. Some configurations are more appropriate for development and testing, while production environments call for configurations that maximize availability and fault tolerance. The following best practices should be followed for production.
# Tips for Preventing and Handling Problems
These tips can help you solve problems before they happen.
### Run Rancher on a Supported OS and Supported Docker Version
Rancher is container-based and can potentially run on any Linux-based operating system. However, only operating systems listed in the [requirements documentation]({{< baseurl >}}/rancher/v2.x/en/installation/requirements/) should be used for running Rancher, along with a supported version of Docker. These versions have been most thoroughly tested and can be properly supported by the Rancher Support team.
### Upgrade Your Kubernetes Version
Keep your Kubernetes cluster up to date with a recent and supported version. Typically the Kubernetes community will support the current version and previous three minor releases (for example, 1.14.x, 1.13.x, 1.12.x, and 1.11.x). After a new version is released, the oldest of these supported versions reaches EOL (End of Life) status. Running on an EOL release can be a risk if security issues are found and patches are not available. The community typically makes minor releases every quarter (every three months).
Rancher's SLAs are not community dependent, but because Kubernetes is community-driven software, the quality of experience will degrade as you get farther away from the community's supported target.
### Kill Pods Randomly During Testing
Run chaoskube or a similar mechanism to randomly kill pods in your test environment. This will test the resiliency of your infrastructure and the ability of Kubernetes to self-heal. It's not recommended to run this in your production environment.
### Deploy Complicated Clusters with Terraform
Rancher's "Add Cluster" UI is preferable for getting started with Kubernetes cluster orchestration or for simple use cases. However, for more complex or demanding use cases, it is recommended to use a CLI/API-driven approach. [Terraform](https://www.terraform.io/) is recommended as the tooling to implement this. When you use Terraform with version control and a CI/CD environment, you can have high assurance of consistency and reliability when deploying Kubernetes clusters. This approach also gives you the most customization options.
Rancher [maintains a Terraform provider](https://rancher.com/blog/2019/rancher-2-terraform-provider/) for working with Rancher 2.0 Kubernetes. It is called the [Rancher2 Provider](https://www.terraform.io/docs/providers/rancher2/index.html).
### Upgrade Rancher in a Staging Environment
All upgrades, both patch and feature upgrades, should first be tested in a staging environment before production is upgraded. The more closely the staging environment mirrors production, the higher the chance that your production upgrade will be successful.
### Renew Certificates Before they Expire
Multiple people in your organization should set up calendar reminders for certificate renewal. Consider renewing the certificate two weeks to one month in advance. If you have multiple certificates to track, consider using [monitoring and alerting mechanisms]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/) to track certificate expiration.
Rancher-provisioned Kubernetes clusters will use certificates that expire in one year. Clusters provisioned by other means may have a longer or shorter expiration.
Certificates can be renewed for Rancher-provisioned clusters [through the Rancher user interface]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/certificate-rotation/).
### Enable Recurring Snapshots for Backing up and Restoring the Cluster
Make sure etcd recurring snapshots are enabled. Extend the snapshot retention to a period of time that meets your business needs. In the event of a catastrophic failure or deletion of data, this may be your only recourse for recovery. For details about configuring snapshots, refer to the [RKE documentation]({{<baseurl>}}/rke/latest/en/etcd-snapshots/) or the [Rancher documentation on backups]({{<baseurl>}}/rancher/v2.x/en/backups/).
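For clusters built with RKE, recurring snapshots are configured in `cluster.yml`. The fragment below is a sketch that assumes an RKE release supporting the `backup_config` section; the interval and retention values are examples only and should be set to match your recovery requirements.

```yaml
services:
  etcd:
    backup_config:
      interval_hours: 12   # take a snapshot every 12 hours (example value)
      retention: 6         # keep the 6 most recent snapshots before rotating (example value)
```

With these example values, the oldest retained snapshot is roughly three days old, which shows how the two settings together determine how far back you can restore.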
### Provision Clusters with Rancher
When possible, use Rancher to provision your Kubernetes cluster rather than importing a cluster. This will ensure the best compatibility and supportability.
### Use Stable and Supported Rancher Versions for Production
Do not upgrade production environments to alpha, beta, release candidate (rc), or "latest" versions. These early releases are often not stable and may not have a future upgrade path.
When installing or upgrading a non-production environment to an early release, anticipate problems such as features not working, data loss, outages, and inability to upgrade without a reinstall.
Make sure the feature version you are upgrading to is considered "stable" as determined by Rancher. Use the beta, release candidate, and "latest" versions in a testing, development, or demo environment to try out new features. Feature version upgrades, for example 2.1.x to 2.2.x, should be considered as and when they are released. Some bug fixes and most features are not backported into older versions.
Keep in mind that Rancher ends support for old versions when they reach End of Life, so you will eventually want to upgrade if you want to continue to receive patches.
For more detail on what happens during the Rancher product lifecycle, refer to the [Support Maintenance Terms](https://rancher.com/support-maintenance-terms/).
### Ask Rancher for Assistance Before Upgrades
Notify Rancher support of your upgrade plans so they can be on full alert during your maintenance window in the event you need their assistance.
# Network Topology
These tips can help Rancher work more smoothly with your network.
### Use Low-latency Networks for Communication Within Clusters
Kubernetes clusters are best served by low-latency networks. This is especially true for the control plane components and etcd, where lots of coordination and leader election traffic occurs. Networking between Rancher server and the Kubernetes clusters it manages is more tolerant of latency.
### Allow Rancher to Communicate Directly with Clusters
Limit the use of proxies or load balancers between Rancher server and Kubernetes clusters. Because Rancher maintains a long-lived WebSocket connection, these intermediaries can interfere with the connection lifecycle, as they are often not configured with this use case in mind.
# Tips for Scaling and Reliability
These tips can help you scale your cluster more easily.
### Use One Kubernetes Role Per Host
Separate the etcd, control plane, and worker roles onto different hosts. Don't assign multiple roles to the same host, such as a worker and control plane. This will give you maximum scalability.
### Run the Control Plane and etcd on Virtual Machines
Run your etcd and control plane nodes on virtual machines where you can scale vCPU and memory easily if needed in the future.
### Use at Least Three etcd Nodes
Provision three or five etcd nodes. Because etcd requires a quorum (a majority of nodes) to elect a leader, clusters with an even number of nodes are not recommended: a fourth node adds no failure tolerance over three, since both sizes can lose only one node and still keep quorum. Three etcd nodes are generally sufficient for smaller clusters, and five etcd nodes for large clusters.
### Use at Least Two Control Plane Nodes
Provision two or more control plane nodes. Some control plane components, such as the `kube-apiserver`, run in [active-active](https://www.jscape.com/blog/active-active-vs-active-passive-high-availability-cluster) mode and will give you more scalability. Other components, such as `kube-scheduler` and `kube-controller-manager`, run in active-passive mode (leader election) and give you more fault tolerance.
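If you provision with RKE, the role separation and node counts recommended in this section could be expressed in the `nodes` list of `cluster.yml` roughly as follows. The IP addresses and the SSH user are placeholders, and the number of worker nodes is arbitrary.

```yaml
nodes:
  # Three dedicated etcd nodes: quorum survives the loss of one node.
  - address: 10.0.1.11     # placeholder address
    user: ubuntu           # placeholder SSH user
    role: [etcd]
  - address: 10.0.1.12
    user: ubuntu
    role: [etcd]
  - address: 10.0.1.13
    user: ubuntu
    role: [etcd]
  # Two dedicated control plane nodes.
  - address: 10.0.2.11
    user: ubuntu
    role: [controlplane]
  - address: 10.0.2.12
    user: ubuntu
    role: [controlplane]
  # Worker nodes carry only the worker role.
  - address: 10.0.3.11
    user: ubuntu
    role: [worker]
  - address: 10.0.3.12
    user: ubuntu
    role: [worker]
```

Because each node holds exactly one role, each tier can be resized or scaled independently of the others.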
### Monitor Your Cluster
Closely monitor and scale your nodes as needed. You should [enable cluster monitoring]({{< baseurl >}}/rancher/v2.x/en/cluster-admin/tools/monitoring/) and use the Prometheus metrics and Grafana visualization options as a starting point.
# Tips for Security
Below are some basic tips for increasing security in Rancher. For more detailed information about securing your cluster, you can refer to these resources:
- Rancher's [security documentation and Kubernetes cluster hardening guide]({{< baseurl >}}/rancher/v2.x/en/security/)
- [101 More Security Best Practices for Kubernetes](https://rancher.com/blog/2019/2019-01-17-101-more-kubernetes-security-best-practices/)
### Update Rancher with Security Patches
Keep your Rancher installation up to date with the latest patches. Patch updates have important software fixes and sometimes have security fixes. When patches with security fixes are released, customers with Rancher licenses are notified by e-mail. These updates are also posted on Rancher's [forum](https://forums.rancher.com/).
### Report Security Issues Directly to Rancher
If you believe you have uncovered a security-related problem in Rancher, please communicate this immediately and discreetly to the Rancher support team (rancher@support.com). Posting security issues on public forums such as Twitter, Rancher Slack, GitHub, etc. can potentially compromise security for all Rancher customers. Reporting security issues discreetly allows Rancher to assess and mitigate the problem. Security patches are typically given high priority and released as quickly as possible.
### Only Upgrade One Component at a Time
In addition to Rancher software updates, closely monitor security fixes for related software, such as Docker, Linux, and any libraries used by your workloads. For production environments, try to avoid upgrading too many entities during a single maintenance window. Upgrading multiple components at once can make it difficult to identify the root cause of an issue in the event of a failure. As business requirements allow, upgrade one component at a time.
# Network Security
In general, you can use network security best practices in your Rancher and Kubernetes clusters. Consider the following:
### Use a Firewall Between your Hosts and the Internet
Firewalls should be used between your hosts and the Internet (or corporate Intranet). This could be enterprise firewall appliances in a datacenter or SDN constructs in the cloud, such as VPCs, security groups, ingress, and egress rules. Try to limit inbound access only to ports and IP addresses that require it. Outbound access can be shut off (air gap) if the environment holds sensitive information that requires this restriction. If available, use firewalls with intrusion detection and DDoS prevention.
### Run Periodic Security Scans
Run security and penetration scans on your environment periodically. Even with well-designed infrastructure, a poorly designed microservice could compromise the entire environment.
Binary file not shown (image, 274 KiB).
Binary file not shown (image, 260 KiB).