Add troubleshooting section

Sebastiaan van Steenis
2018-11-23 23:48:50 +01:00
committed by Denise
parent 13e6ae822b
commit 2fc866c67f
7 changed files with 804 additions and 0 deletions
@@ -121,3 +121,13 @@ kubectl -n ingress-nginx logs -f nginx-ingress-controller-rfjrq nginx-ingress-co
...
W0705 23:04:58.240571 7 backend_ssl.go:49] error obtaining PEM from secret cattle-system/tls-rancher-ingress: error retrieving secret cattle-system/tls-rancher-ingress: secret cattle-system/tls-rancher-ingress was not found
```
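To confirm whether the secret referenced in the error exists, you can query it directly (the namespace and secret name are taken from the log message above):
```
kubectl -n cattle-system get secret tls-rancher-ingress
```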
### no matches for kind "Issuer"
The [SSL configuration]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#choose-your-ssl-configuration) option you have chosen requires [cert-manager]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#optional-install-cert-manager) to be installed before installing Rancher. If cert-manager is missing, the following error is shown:
```
Error: validation failed: unable to recognize "": no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"
```
Install [cert-manager]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#optional-install-cert-manager) and try installing Rancher again.
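One way to verify that cert-manager is installed is to check for its CustomResourceDefinitions. The API group below is an assumption based on the `certmanager.k8s.io/v1alpha1` version shown in the error:
```
kubectl get crd | grep certmanager.k8s.io
```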
@@ -0,0 +1,7 @@
---
title: Troubleshooting
weight: 8100
---
Troubleshooting
@@ -0,0 +1,62 @@
---
title: Imported clusters
weight: 105
---
The commands/steps listed on this page can be used to check clusters that you are importing into Rancher or that have already been imported.
Make sure you have configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kubeconfig_from_imported_cluster.yml`).
### Rancher agents
Communication to the cluster (Kubernetes API via cattle-cluster-agent) and communication to the nodes are handled through Rancher agents.
If the cattle-cluster-agent cannot connect to the configured `server-url`, the cluster will remain in **Pending** state, showing `Waiting for full cluster configuration`.
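To verify that the configured `server-url` is reachable from the nodes, you can try Rancher's `/ping` endpoint; `rancher.yourdomain.com` below is a placeholder for your actual `server-url`, and a reachable Rancher installation responds with `pong`:
```
curl -k https://rancher.yourdomain.com/ping
```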
#### cattle-node-agent
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
Check logging of a specific cattle-node-agent pod or all cattle-node-agent pods:
```
kubectl -n cattle-system logs -l app=cattle-agent
```
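To view the logs of a single agent, use a pod name from the example output above instead of the label selector, for example:
```
kubectl -n cattle-system logs cattle-node-agent-4gc2p
```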
#### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check logging of cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
@@ -0,0 +1,363 @@
---
title: Kubernetes components
weight: 100
---
The commands/steps listed on this page apply to the core Kubernetes components on [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/) clusters.
## Diagram
![Cluster diagram]({{< baseurl >}}/img/rancher/clusterdiagram.svg)<br/>
<sup>Lines show the traffic flow between components. Colors are used purely for visual aid</sup>
## etcd
This section applies to nodes with the `etcd` role.
### Is the etcd container running
The container for etcd should have status **Up**. The duration shown after **Up** is the time the container has been running.
```
docker ps -a -f=name=etcd$
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
605a124503b9 rancher/coreos-etcd:v3.2.18 "/usr/local/bin/et..." 2 hours ago Up 2 hours etcd
```
### etcd container logging
The container logs can contain information about what the problem could be.
```
docker logs etcd
```
* `health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused`
A connection to the address shown on port 2380 cannot be established. Check if the etcd container is running on the host with the address shown.
* `xxx is starting a new election at term x`
The etcd cluster has lost its quorum and is trying to establish a new leader. This can happen when the majority of the nodes running etcd go down or become unreachable.
* `connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}`
The host firewall is preventing network communication.
* `rafthttp: request cluster ID mismatch`
The node with the etcd instance logging `rafthttp: request cluster ID mismatch` is trying to join a cluster that has already been formed with another peer. The node should be removed from the cluster, and re-added.
* `rafthttp: failed to find member`
The cluster state (`/var/lib/etcd`) on the node contains incorrect information for joining the cluster. The node should be removed from the cluster, the state directory should be cleaned, and the node should be re-added.
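In a Rancher-launched cluster, removing and re-adding the node is normally done through Rancher. If you need to remove the stale member from the etcd member list manually, a sketch using the standard `etcdctl` command is shown below; `MEMBER_ID` is a placeholder for the ID shown in the first column of `etcdctl member list`:
```
docker exec etcd etcdctl member remove MEMBER_ID
```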
### etcd cluster and connectivity checks
If any of the commands respond with `Error: context deadline exceeded`, the etcd instance is unhealthy (either quorum is lost or the instance is not correctly joined in the cluster).
* Check etcd members on all nodes
Output should contain all the nodes with the `etcd` role and the output should be identical on all nodes.
```
docker exec etcd etcdctl member list
```
Example output:
```
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
```
* Check endpoint status
The values for `RAFT TERM` should be equal and the values for `RAFT INDEX` should not be too far apart from each other.
```
docker exec etcd etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 | 3.2.18 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.2.18 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.2.18 | 25 MB | false | 72 | 66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
* Check endpoint health
```
docker exec etcd etcdctl endpoint health --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")
```
Example output:
```
https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
```
* Check connectivity on port TCP/2379
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
echo "Validating connection to ${endpoint}/health";
curl -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health";
done
```
If you are running on an operating system without `curl` (for example, RancherOS), you can use the following command which uses a Docker container to run the `curl` command.
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
echo "Validating connection to ${endpoint}/health";
docker run --net=host -v /opt/rke/etc/kubernetes/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health"
done
```
Example output:
```
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
```
* Check connectivity on port TCP/2380
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f4"); do
echo "Validating connection to ${endpoint}/version";
curl -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version";
done
```
If you are running on an operating system without `curl` (for example, RancherOS), you can use the following command which uses a Docker container to run the `curl` command.
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f4"); do
echo "Validating connection to ${endpoint}/version";
docker run --net=host -v /opt/rke/etc/kubernetes/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version"
done
```
Example output:
```
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
```
### etcd alarms
etcd will trigger alarms, for instance when it runs out of space.
```
docker exec etcd etcdctl alarm list
```
Example output when NOSPACE alarm is triggered:
```
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
```
### etcd space errors
Related error messages are `etcdserver: mvcc: database space exceeded` or `applying raft message exceeded backend quota`. Alarm `NOSPACE` will be triggered.
Resolution:
* Compact the keyspace
```
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
```
Example output:
```
compacted revision xxx
```
* Defrag all etcd members
```
docker exec etcd etcdctl defrag --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")
```
Example output:
```
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
```
* Check endpoint status
```
docker exec etcd etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | e973e4419737125 | 3.2.18 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | 4a509c997b26c206 | 3.2.18 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | b217e736575e9dd3 | 3.2.18 | 553 kB | true | 32 | 2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
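After compacting and defragmenting, the `NOSPACE` alarm usually still has to be cleared before etcd accepts writes again. This uses the standard `etcdctl` alarm command:
```
docker exec etcd etcdctl alarm disarm
```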
## controlplane
This section applies to nodes with the `controlplane` role.
### Are the containers for controlplane running
There are three specific containers launched on nodes with the `controlplane` role:
* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
The containers should have status **Up**. The duration shown after **Up** is the time the container has been running.
```
docker ps -a -f=name='kube-apiserver|kube-controller-manager|kube-scheduler'
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
26c7159abbcc rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-apiserver
f3d287ca4549 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-scheduler
bdf3898b8063 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-controller-manager
```
### controlplane container logging
The container logs can contain information about what the problem could be.
```
docker logs kube-apiserver
docker logs kube-controller-manager
docker logs kube-scheduler
```
## nginx-proxy
The `nginx-proxy` container is deployed on every node that does not have the `controlplane` role. It provides access to all the nodes with the `controlplane` role by dynamically generating the NGINX configuration based on available nodes with the `controlplane` role.
### Is the container running
The container is called `nginx-proxy` and should have status `Up`. The duration shown after `Up` is the time the container has been running.
```
docker ps -a -f=name=nginx-proxy
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c3e933687c0e rancher/rke-tools:v0.1.15 "nginx-proxy CP_HO..." 3 hours ago Up 3 hours nginx-proxy
```
### Check generated NGINX configuration
The generated configuration should include the IP addresses of the nodes with the `controlplane` role. The configuration can be checked using the following command:
```
docker exec nginx-proxy cat /etc/nginx/nginx.conf
```
Example output:
```
error_log stderr notice;
worker_processes auto;
events {
multi_accept on;
use epoll;
worker_connections 1024;
}
stream {
upstream kube_apiserver {
server ip_of_controlplane_node1:6443;
server ip_of_controlplane_node2:6443;
}
server {
listen 6443;
proxy_pass kube_apiserver;
proxy_timeout 30;
proxy_connect_timeout 2s;
}
}
```
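To verify that the proxy actually forwards traffic to a controlplane node, you can probe the local port 6443 on the node. This is only a sketch; any HTTP status code in the response (even `401` or `403` when anonymous access to the API server is disabled) indicates the connection goes through:
```
curl -k -s -o /dev/null -w '%{http_code}\n' https://127.0.0.1:6443/healthz
```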
### nginx-proxy container logging
The container logs can contain information about what the problem could be.
```
docker logs nginx-proxy
```
## worker and generic
This section applies to every node as it includes components that run on nodes with any role.
### Are the containers running
There are two specific containers launched on every node, regardless of role:
* `kubelet`
* `kube-proxy`
The containers should have status `Up`. The duration shown after `Up` is the time the container has been running.
```
docker ps -a -f=name='kubelet|kube-proxy'
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
158d0dcc33a5 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-proxy
a30717ecfb55 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kubelet
```
### container logging
The container logs can contain information about what the problem could be.
```
docker logs kubelet
docker logs kube-proxy
```
@@ -0,0 +1,181 @@
---
title: Kubernetes resources
weight: 101
---
The commands/steps listed on this page can be used to check the most important Kubernetes resources and apply to [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/) clusters.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
### Nodes
#### Get nodes
Run the command below and check the following:
- All nodes in your cluster should be listed; make sure none are missing.
- All nodes should have the **Ready** status (if a node is not in **Ready** state, check the `kubelet` container logs on that node using `docker logs kubelet`).
- Check if all nodes report the correct version.
- Check if the OS/Kernel/Docker values are shown as expected (issues can sometimes be related to an upgraded OS, kernel, or Docker version).
```
kubectl get nodes
```
Example output:
```
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
etcd-0 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
etcd-1 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
etcd-2 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
controlplane-0 Ready controlplane 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
controlplane-1 Ready controlplane 1m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
worker-0 Ready worker 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
worker-1 Ready worker 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
```
#### Get node conditions
Run the command below to list nodes with active [Node Conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition) that could prevent normal operation.
```
kubectl get nodes -o go-template='{{range .items}}{{$node := .}}{{range .status.conditions}}{{if ne .type "Ready"}}{{if eq .status "True"}}{{$node.metadata.name}}{{": "}}{{.type}}{{":"}}{{.status}}{{"\n"}}{{end}}{{end}}{{end}}{{end}}'
```
Example output:
```
worker-0: DiskPressure:True
```
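To investigate a reported condition further, describe the node; the node name below is taken from the example output above:
```
kubectl describe node worker-0
```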
### Ingress Controller
The default Ingress Controller is NGINX and is deployed as a DaemonSet in the `ingress-nginx` namespace. The pods are only scheduled to nodes with the `worker` role.
Check if the pods are running on all nodes:
```
kubectl -n ingress-nginx get pods -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
default-http-backend-797c5bc547-kwwlq 1/1 Running 0 17m x.x.x.x worker-1
nginx-ingress-controller-4qd64 1/1 Running 0 14m x.x.x.x worker-1
nginx-ingress-controller-8wxhm 1/1 Running 0 13m x.x.x.x worker-0
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
#### Pod details
```
kubectl -n ingress-nginx describe pods -l app=ingress-nginx
```
#### Pod container logs
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```
#### Namespace events
```
kubectl -n ingress-nginx get events
```
### Rancher agents
Communication to the cluster (Kubernetes API via cattle-cluster-agent) and communication to the nodes (cluster provisioning via cattle-node-agent) are handled through Rancher agents.
#### cattle-node-agent
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
Check logging of a specific cattle-node-agent pod or all cattle-node-agent pods:
```
kubectl -n cattle-system logs -l app=cattle-agent
```
#### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check logging of cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
### Generic
#### All pods/jobs should have status **Running**/**Completed**
To check, run the command:
```
kubectl get pods --all-namespaces
```
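As a possible shortcut, healthy pods can be filtered out with a field selector on `status.phase` (a **Completed** pod has phase `Succeeded`); this assumes your kubectl version supports field selectors:
```
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```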
If a pod is not in **Running** state, you can dig into the root cause by running:
##### Describe pod
```
kubectl describe pod POD_NAME -n NAMESPACE
```
##### Pod container logs
```
kubectl logs POD_NAME -n NAMESPACE
```
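If the pod has restarted, the logs of the previous container instance are often more revealing; kubectl supports this with the `--previous` flag:
```
kubectl logs POD_NAME -n NAMESPACE --previous
```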
If a job is not in **Completed** state, you can dig into the root cause by running:
##### Describe job
```
kubectl describe job JOB_NAME -n NAMESPACE
```
##### Logs from the containers of pods of the job
```
kubectl logs -l job-name=JOB_NAME -n NAMESPACE
```
@@ -0,0 +1,111 @@
---
title: Networking
weight: 102
---
The commands/steps listed on this page can be used to check networking related issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
### Double check if all the required ports are opened in your (host) firewall
Double-check if all the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) are open in your (host) firewall. The overlay network uses UDP, while all other required ports use TCP.
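For the TCP ports, a quick connectivity check can be done from one node to another with a tool like `netcat`; `NODE_IP` and `6443` below are placeholders for the target node and port you want to test. UDP connectivity for the overlay network is better tested with the procedure in the next section.
```
nc -zv NODE_IP 6443
```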
### Check if overlay network is functioning correctly
A pod can be scheduled to any of the hosts you used for your cluster, which means the NGINX ingress controller must be able to route a request from `NODE_1` to `NODE_2`. This routing happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures because the NGINX ingress controller cannot route to the pod.
To test the overlay network, you can launch the following `DaemonSet` definition. This will run an `alpine` container on every host, which we will use to run a `ping` test between containers on all hosts.
1. Save the following file as `ds-alpine.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: alpine
spec:
selector:
matchLabels:
name: alpine
template:
metadata:
labels:
name: alpine
spec:
tolerations:
- effect: NoExecute
key: "node-role.kubernetes.io/etcd"
value: "true"
- effect: NoSchedule
key: "node-role.kubernetes.io/controlplane"
value: "true"
containers:
- image: alpine
imagePullPolicy: Always
name: alpine
command: ["sh", "-c", "tail -f /dev/null"]
terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-alpine.yml`
3. Wait until `kubectl rollout status ds/alpine -w` returns: `daemon set "alpine" successfully rolled out`.
4. Run the following command to let each container on every host ping each other (it is a single-line command; a functionally equivalent multi-line version is shown at the end of this section).
```
echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start
=> End
```
If you see errors in the output, the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not open between the hosts indicated.
Example error output for a situation where NODE1 has its UDP ports blocked:
```
=> Start
command terminated with exit code 1
NODE2 cannot reach NODE1
command terminated with exit code 1
NODE3 cannot reach NODE1
command terminated with exit code 1
NODE1 cannot reach NODE2
command terminated with exit code 1
NODE1 cannot reach NODE3
=> End
```
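For readability, the same test can also be written as a multi-line script; this is functionally equivalent to the one-liner in step 4:
```
echo "=> Start"
# List every alpine pod together with the node it runs on.
kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' |
while read spod shost; do
  # From each source pod, ping the IP of every alpine pod (one per node).
  kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' |
  while read tip thost; do
    kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"
    if [ $? -ne 0 ]; then echo "$shost cannot reach $thost"; fi
  done
done
echo "=> End"
```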
### Resolved issues
#### Overlay network broken when using Canal/Flannel due to missing node annotations
| | |
|------------|------------|
| GitHub issue | [#13644](https://github.com/rancher/rancher/issues/13644) |
| Resolved in | v2.1.2 |
To check if your cluster is affected, the following command will list nodes that are broken (this command requires `jq` to be installed):
```
kubectl get nodes -o json | jq '.items[].metadata | select(.annotations["flannel.alpha.coreos.com/public-ip"] == null or .annotations["flannel.alpha.coreos.com/kube-subnet-manager"] == null or .annotations["flannel.alpha.coreos.com/backend-type"] == null or .annotations["flannel.alpha.coreos.com/backend-data"] == null) | .name'
```
If there is no output, the cluster is not affected.
#### System namespace pods network connectivity broken
> Note: This applies only to Rancher upgrades from v2.0.6 or earlier to v2.0.7 or later. Upgrades from v2.0.7 to later versions are unaffected.
| | |
|------------|------------|
| GitHub issue | [#15146](https://github.com/rancher/rancher/issues/15146) |
If pods in system namespaces cannot communicate with pods in other system namespaces, you will need to follow the instructions in [Upgrading to v2.0.7+ — Namespace Migration]({{< baseurl >}}/rancher/v2.x/en/upgrades/upgrades/namespace-migration/) to restore connectivity. Symptoms include:
- NGINX ingress controller showing `504 Gateway Time-out` when accessed.
- NGINX ingress controller logging `upstream timed out (110: Connection timed out) while connecting to upstream` when accessed.
@@ -0,0 +1,70 @@
---
title: Rancher HA
weight: 104
---
The commands/steps listed on this page can be used to check your Rancher HA installation.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml`).
### Check Rancher pods
Rancher pods are deployed as a Deployment in the `cattle-system` namespace.
Check if the pods are running on all nodes:
```
kubectl -n cattle-system get pods -l app=rancher -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
rancher-7dbd7875f7-n6t5t 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qbj5k 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qw7wb 1/1 Running 0 8m x.x.x.x x.x.x.x
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
#### Pod details
```
kubectl -n cattle-system describe pods -l app=rancher
```
#### Pod container logs
```
kubectl -n cattle-system logs -l app=rancher
```
#### Namespace events
```
kubectl -n cattle-system get events
```
### Check ingress
The ingress should show the correct `HOSTS` (the configured FQDN) and `ADDRESS` (the host address(es) it will be routed to).
```
kubectl -n cattle-system get ingress
```
Example output:
```
NAME HOSTS ADDRESS PORTS AGE
rancher rancher.yourdomain.com x.x.x.x,x.x.x.x,x.x.x.x 80, 443 2m
```
### Check ingress controller logs
If accessing your configured Rancher FQDN does not show you the UI, check the ingress controller logs to see what happens when you try to access Rancher:
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```