Edit new content in Best Practices Guide
@@ -3,9 +3,15 @@ title: Logging Best Practices
weight: 1
---

In this guide, we recommend best practices for cluster-level logging and application logging.

-# Pre-2.5 Logging, and post-2.5
-
-Logging in Rancher has historically been a pretty static integration. There were a fixed list of aggregators to choose from (ElasticSearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose (Cluster-level and Project-level).
+- [Changes in Logging in Rancher v2.5](#changes-in-logging-in-rancher-v2-5)
+- [Cluster-level Logging](#cluster-level-logging)
+- [Application Logging](#application-logging)
+- [General Best Practices](#general-best-practices)
+
+# Changes in Logging in Rancher v2.5
+
+Prior to Rancher v2.5, logging in Rancher was a fairly static integration. There was a fixed list of aggregators to choose from (Elasticsearch, Splunk, Kafka, Fluentd and Syslog), and only two configuration points to choose between (cluster-level and project-level).

Logging in 2.5 has been completely overhauled to provide a more flexible experience for log aggregation. With the new logging feature, administrators and users alike can deploy logging that meets fine-grained collection criteria while offering a wider array of destinations and configuration options.
@@ -76,7 +82,7 @@ spec:
```

-## General Best Practices
+# General Best Practices

- Where possible, output structured log entries (e.g. `syslog`, JSON). This makes handling of the log entry easier, as there are already parsers written for these formats (see the example after this list).
- Try to provide the name of the application that is creating the log entry in the entry itself. This can make troubleshooting easier, as Kubernetes objects do not always carry the name of the application as the object name. For instance, a pod ID may be something like `myapp-098kjhsdf098sdf98`, which does not provide much information about the application running inside the container.
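For example, a structured JSON log entry that carries the application name in the entry itself could look like this (the field names are illustrative, not a required schema):

```json
{
  "time": "2021-03-01T12:34:56Z",
  "level": "error",
  "app": "myapp",
  "pod": "myapp-098kjhsdf098sdf98",
  "msg": "failed to connect to database"
}
```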
@@ -5,50 +5,55 @@ shortTitle: Rancher Managed Clusters in vSphere

This guide outlines a reference architecture for provisioning downstream Rancher clusters in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.

## Solution Overview
+- [1. VM Considerations](#1-vm-considerations)
+- [2. Network Considerations](#2-network-considerations)
+- [3. Storage Considerations](#3-storage-considerations)
+- [4. Backups and Disaster Recovery](#4-backups-and-disaster-recovery)


<figcaption>Solution Overview</figcaption>

-# 1 - VM Considerations
+# 1. VM Considerations



-## Leverage VM Templates to Construct the Environment
+### Leverage VM Templates to Construct the Environment

To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.

-## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across ESXi Hosts

Doing so will ensure node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.

-## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Downstream Cluster Nodes Across Datastores

Doing so will ensure node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.

-## Configure VM's as Appropriate for Kubernetes
+### Configure VMs as Appropriate for Kubernetes

It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.
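A minimal sketch of host-level checks for these requirements (the paths are standard on most Linux distributions, but verify them for your OS image):

```bash
# Disable swap immediately and keep it disabled across reboots
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Verify the values that must be unique for every node
hostname                                   # unique hostname
cat /sys/class/net/*/address               # MAC addresses
sudo cat /sys/class/dmi/id/product_uuid    # product_uuid
```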
-# 2 - Network Considerations
+# 2. Network Considerations

-## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
+### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes

Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.

-## Consistent IP Addressing for VM's
+### Consistent IP Addressing for VMs

Each node should have a static IP configured. If DHCP is used, each node should have a DHCP reservation to ensure that it is always allocated the same IP address.

-# 3 - Storage Considerations
+# 3. Storage Considerations

-## Leverage SSD Drives for ETCD Nodes
+### Leverage SSD Drives for ETCD Nodes

etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.
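To validate that a disk's write latency is adequate for etcd, you can benchmark `fdatasync` latency with `fio`, a check commonly recommended in the etcd community (the directory and sizes below are illustrative; point it at the disk that will back etcd):

```bash
# Benchmark fdatasync latency on the disk that will host the etcd data.
# etcd guidance suggests a 99th percentile fdatasync latency below ~10ms.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-benchmark --size=22m --bs=2300 \
    --name=etcd-disk-check
```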
-# 4 - Backup and Disaster Recovery
+# 4. Backups and Disaster Recovery

-## Perform Regular Downstream Cluster Backups
+### Perform Regular Downstream Cluster Backups

Kubernetes uses etcd to store all of its data, from configuration to state and metadata. Backing this up is crucial for disaster recovery.
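As an example, for an RKE CLI-managed downstream cluster you can take, and later restore, a one-time etcd snapshot (the snapshot name is a placeholder; Rancher-launched clusters can also schedule recurring snapshots in their cluster options):

```bash
# Take a one-time snapshot of etcd on all etcd nodes
rke etcd snapshot-save --config cluster.yml --name before-upgrade

# Restore the cluster from that snapshot if disaster strikes
rke etcd snapshot-restore --config cluster.yml --name before-upgrade
```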
-## Back up Downstream Node VM's
+### Back up Downstream Node VMs

Incorporate the Rancher downstream node VMs within a standard VM backup policy.
@@ -7,11 +7,19 @@ Configuring sensible monitoring and alerting rules is vital for running any prod

The [Rancher Documentation]({{<baseurl>}}/rancher/v2.x/en/monitoring-alerting/v2.5/) describes in detail how you can set up a complete Prometheus and Grafana stack. Out of the box, this will scrape monitoring data from all system and Kubernetes components in your cluster and provide sensible dashboards and alerts to get you started. But for a reliable setup, you also need to monitor your own workloads and adapt Prometheus and Grafana to your specific use cases and cluster sizes. This document aims to give you best practices for this.

-## What to monitor
+- [What to Monitor](#what-to-monitor)
+- [Configuring Prometheus Resource Usage](#configuring-prometheus-resource-usage)
+- [Scraping Custom Workloads](#scraping-custom-workloads)
+- [Monitoring in a (Micro)Service Architecture](#monitoring-in-a-micro-service-architecture)
+- [Real User Monitoring](#real-user-monitoring)
+- [Security Monitoring](#security-monitoring)
+- [Setting up Alerts](#setting-up-alerts)
+
+# What to Monitor

Kubernetes itself, as well as the applications running inside of it, form a distributed system where different components interact with each other. For the whole system and each individual component, you have to ensure performance, availability, reliability and scalability. A good resource with more details and information is Google's free [Site Reliability Engineering Book](https://landing.google.com/sre/sre-book/), especially the chapter about [Monitoring distributed systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/).

-## Configuring Prometheus Resource Usage
+# Configuring Prometheus Resource Usage

When installing the integrated monitoring stack, Rancher allows you to configure several settings that depend on the size of your cluster and the workloads running in it. This chapter covers these in more detail.
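For example, when installing the rancher-monitoring chart, the Prometheus resource requests, limits and retention can be overridden through Helm values. The following is a sketch assuming the chart exposes the standard kube-prometheus-stack values; the numbers are placeholders to adjust to your cluster size:

```yaml
# values.yaml for the rancher-monitoring chart (illustrative numbers)
prometheus:
  prometheusSpec:
    retention: 15d            # how long Prometheus keeps metrics locally
    resources:
      requests:
        cpu: "1"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 8Gi
```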
@@ -49,13 +57,13 @@ You can find more information about how to calculate the necessary memory in thi

The number of CPUs necessary correlates with the number of queries you are performing.

-### Federation and long-term Storage
+### Federation and Long-term Storage

Prometheus is not meant to store metrics for a long period of time; it should only be used for short-term storage.

In order to store some or all metrics for a long time, you can leverage Prometheus' [remote read/write](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations) capabilities to connect it to storage systems like [Thanos](https://thanos.io/), [InfluxDB](https://www.influxdata.com/), [M3DB](https://www.m3db.io/), or others. You can find an example setup in this [blog post](https://rancher.com/blog/2020/prometheus-metric-federation).
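As a sketch, with the Prometheus Operator-based stack this can be configured through the `remoteWrite` field of the Prometheus spec (the URL is a placeholder for your Thanos, InfluxDB or M3DB receive endpoint):

```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: "http://metrics-store.example.com/api/v1/receive"  # placeholder endpoint
        # Optionally restrict which series get shipped to long-term storage
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "kube_.*|node_.*"
            action: keep
```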
-## Scraping Custom Workloads
+# Scraping Custom Workloads

While the integrated Rancher Monitoring already scrapes system metrics from a cluster's nodes and system components, the custom workloads that you deploy on Kubernetes should also be scraped for data. For that, you can configure Prometheus to perform an HTTP request to an endpoint of your applications at a certain interval. These endpoints should return their metrics in the Prometheus format.
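Since Rancher Monitoring in v2.5 is based on the Prometheus Operator, the usual way to configure this is a ServiceMonitor that selects your application's Service. A minimal sketch (the names, labels and port are placeholders for your own application):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp                # placeholder name
  namespace: myapp
spec:
  selector:
    matchLabels:
      app: myapp             # must match the labels on your Service
  endpoints:
    - port: metrics          # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```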
@@ -83,23 +91,23 @@ To still get metrics for these use cases, you can set up [prometheus-pushgateway

Sometimes it is useful to monitor workloads from the outside. For this, you can use the [Prometheus blackbox-exporter](https://github.com/prometheus/blackbox_exporter), which allows probing any kind of endpoint over HTTP, HTTPS, DNS, TCP and ICMP.

-## Monitoring in a (Micro)Service Architecture
+# Monitoring in a (Micro)Service Architecture

If you have a (micro)service architecture where multiple individual workloads within your cluster are communicating with each other, it is important to have detailed metrics and traces about this traffic, to understand how all these workloads communicate with each other and where a problem or bottleneck may be.

Of course, you can monitor all this internal traffic in your workloads yourself and expose these metrics to Prometheus, but this can quickly become quite labor-intensive. Service meshes like Istio, which can be installed with [a click](https://rancher.com/docs/rancher/v2.x/en/cluster-admin/tools/istio/) in Rancher, can do this automatically and provide rich telemetry about the traffic between all services.

-## Real User Monitoring
+# Real User Monitoring

Monitoring the availability and performance of all your internal workloads is vitally important for running stable, reliable and fast applications. But these metrics only show you part of the picture. To get a complete view, it is also necessary to know how your end users actually perceive your applications. For this, you can look into the various [real user monitoring solutions](https://en.wikipedia.org/wiki/Real_user_monitoring) that are available.

-## Security Monitoring
+# Security Monitoring

In addition to monitoring workloads to detect performance, availability or scalability problems, the cluster and the workloads running in it should also be monitored for potential security problems. A good starting point is to frequently run and alert on [CIS Scans]({{<baseurl>}}/rancher/v2.x/en/cis-scans/v2.5/), which check if the cluster is configured according to security best practices.

For the workloads, you can have a look at Kubernetes and container security solutions like [Falco](https://falco.org/), [Aqua Kubernetes Security](https://www.aquasec.com/solutions/kubernetes-container-security/), or [Sysdig](https://sysdig.com/).

-## Setting up Alerts
+# Setting up Alerts

Getting all the metrics into a monitoring system and visualizing them in dashboards is great, but you also want to be proactively alerted when something goes wrong.
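With the Prometheus Operator-based stack, alerting rules can be declared as PrometheusRule resources. A minimal sketch (the metric, threshold and labels are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: myapp
spec:
  groups:
    - name: myapp.rules
      rules:
        - alert: MyAppHighErrorRate
          # Fire if the app returns more than one 5xx response per second for 10 minutes
          expr: sum(rate(http_requests_total{job="myapp",code=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "myapp is returning many 5xx responses"
```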
@@ -6,80 +6,86 @@ weight: 3

This guide outlines a reference architecture for installing Rancher on an RKE Kubernetes cluster in a vSphere environment, in addition to standard vSphere best practices as documented by VMware.

## Solution Overview
+- [1. Load Balancer Considerations](#1-load-balancer-considerations)
+- [2. VM Considerations](#2-vm-considerations)
+- [3. Network Considerations](#3-network-considerations)
+- [4. Storage Considerations](#4-storage-considerations)
+- [5. Backups and Disaster Recovery](#5-backups-and-disaster-recovery)

<figcaption>Solution Overview</figcaption>



-# 1 - Load Balancer Considerations
+# 1. Load Balancer Considerations

A load balancer is required to direct traffic to the Rancher workloads residing on the RKE nodes.

-## Leverage Fault Tolerance and High Availability
+### Leverage Fault Tolerance and High Availability

Leverage the use of an external (hardware or software) load balancer that has inherent high-availability functionality (F5, NSX-T, Keepalived, etc.).

-## Back Up Load Balancer Configuration
+### Back Up Load Balancer Configuration

In the event of a disaster recovery activity, having the load balancer configuration available will expedite the recovery process.

-## Configure Health Checks
+### Configure Health Checks

Configure the load balancer to automatically mark nodes as unavailable if a health check fails. For example, NGINX can facilitate this with:

`max_fails=3 fail_timeout=5s`
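A minimal sketch of what this looks like in an NGINX `upstream` block (the IP addresses are placeholders for your RKE nodes):

```nginx
upstream rancher_servers {
    least_conn;
    # Mark a node unavailable for 5s after 3 failed checks
    server 192.168.10.11:443 max_fails=3 fail_timeout=5s;
    server 192.168.10.12:443 max_fails=3 fail_timeout=5s;
    server 192.168.10.13:443 max_fails=3 fail_timeout=5s;
}
```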
-## Leverage an External Load Balancer
+### Leverage an External Load Balancer

Avoid implementing a software load balancer within the management cluster.

-## Secure Access to Rancher
+### Secure Access to Rancher

Configure appropriate firewall/ACL rules so that access to Rancher is exposed only where required.

-# 2 - VM Considerations
+# 2. VM Considerations

-## Size the VM's According to Rancher Documentation
+### Size the VMs According to the Rancher Documentation

https://rancher.com/docs/rancher/v2.x/en/installation/requirements/

-## Leverage VM Templates to Construct the Environment
+### Leverage VM Templates to Construct the Environment

To facilitate consistency across the Virtual Machines deployed in the environment, consider the use of "Golden Images" in the form of VM templates. Packer can be used to accomplish this, adding greater customisation options.

-## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across ESXi Hosts

Doing so will ensure node VMs are spread across multiple ESXi hosts, preventing a single point of failure at the host level.

-## Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores
+### Leverage DRS Anti-Affinity Rules (Where Possible) to Separate Rancher Cluster Nodes Across Datastores

Doing so will ensure node VMs are spread across multiple datastores, preventing a single point of failure at the datastore level.

-## Configure VM's as Appropriate for Kubernetes
+### Configure VMs as Appropriate for Kubernetes

It's important to follow Kubernetes and etcd best practices when deploying your nodes, including disabling swap, verifying full network connectivity between all machines in the cluster, and using unique hostnames, MAC addresses, and product_uuids for every node.

-# 3 - Network Considerations
+# 3. Network Considerations

-## Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes
+### Leverage Low Latency, High Bandwidth Connectivity Between ETCD Nodes

Deploy etcd members within a single data center where possible to avoid latency overheads and reduce the likelihood of network partitioning. For most setups, 1Gb connections will suffice. For large clusters, 10Gb connections can reduce the time taken to restore from backup.

-## Consistent IP Addressing for VM's
+### Consistent IP Addressing for VMs

Each node should have a static IP configured. If DHCP is used, each node should have a DHCP reservation to ensure that it is always allocated the same IP address.

-# 4 - Storage Considerations
+# 4. Storage Considerations

-## Leverage SSD Drives for ETCD Nodes
+### Leverage SSD Drives for ETCD Nodes

etcd is very sensitive to write latency. Therefore, leverage SSD disks where possible.

-# 5 - Backup and Disaster Recovery
+# 5. Backups and Disaster Recovery

-## Perform Regular Management Cluster Backups
+### Perform Regular Management Cluster Backups

Rancher stores its data in the etcd datastore of the Kubernetes cluster it resides on. As with any Kubernetes cluster, perform frequent, tested backups of this cluster.
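In addition to etcd-level backups, the rancher-backup operator available in Rancher v2.5 can back up Rancher's own resources on a schedule. A sketch of a recurring Backup custom resource (the name, schedule and retention values are placeholders):

```yaml
apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-nightly-backup
spec:
  resourceSetName: rancher-resource-set   # default resource set shipped with the operator
  schedule: "0 2 * * *"                   # run every night at 02:00
  retentionCount: 7                       # keep the last seven backups
```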
-## Back up Rancher Cluster Node VM's
+### Back up Rancher Cluster Node VMs

Incorporate the Rancher management node VMs within a standard VM backup policy.