Add troubleshooting section

Sebastiaan van Steenis
2018-11-23 23:48:50 +01:00
committed by Denise
parent 13e6ae822b
commit 2fc866c67f
7 changed files with 804 additions and 0 deletions
@@ -121,3 +121,13 @@ kubectl -n ingress-nginx logs -f nginx-ingress-controller-rfjrq nginx-ingress-co
...
W0705 23:04:58.240571 7 backend_ssl.go:49] error obtaining PEM from secret cattle-system/tls-rancher-ingress: error retrieving secret cattle-system/tls-rancher-ingress: secret cattle-system/tls-rancher-ingress was not found
```
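To confirm whether the secret referenced in the error exists, you can query it directly (the namespace and secret name are taken from the log message above):
```
kubectl -n cattle-system get secret tls-rancher-ingress
```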
### no matches for kind "Issuer"
The [SSL configuration]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#choose-your-ssl-configuration) option you have chosen requires [cert-manager]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#optional-install-cert-manager) to be installed before installing Rancher. If cert-manager is missing, the following error is shown:
```
Error: validation failed: unable to recognize "": no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"
```
Install [cert-manager]({{< baseurl >}}/rancher/v2.x/en/installation/ha/helm-rancher/#optional-install-cert-manager) and try installing Rancher again.
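One way to verify that cert-manager is installed is to check for its CustomResourceDefinitions. The API group below is an assumption based on the `certmanager.k8s.io/v1alpha1` version shown in the error:
```
kubectl get crd | grep certmanager.k8s.io
```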
@@ -0,0 +1,7 @@
---
title: Troubleshooting
weight: 8100
---
Troubleshooting
@@ -0,0 +1,62 @@
---
title: Imported clusters
weight: 105
---
The commands/steps listed on this page can be used to check clusters that you are importing into Rancher or that have already been imported.
Make sure you have configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kubeconfig_from_imported_cluster.yml`).
### Rancher agents
Communication to the cluster (Kubernetes API via cattle-cluster-agent) and communication to the nodes are handled through Rancher agents.
If the cattle-cluster-agent cannot connect to the configured `server-url`, the cluster will remain in **Pending** state, showing `Waiting for full cluster configuration`.
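To verify that the configured `server-url` is reachable from the nodes, you can try Rancher's `/ping` endpoint; `rancher.yourdomain.com` below is a placeholder for your actual `server-url`, and a reachable Rancher installation responds with `pong`:
```
curl -k https://rancher.yourdomain.com/ping
```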
#### cattle-node-agent
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
Check logging of a specific cattle-node-agent pod or all cattle-node-agent pods:
```
kubectl -n cattle-system logs -l app=cattle-agent
```
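To view the logs of a single agent, use a pod name from the example output above instead of the label selector, for example:
```
kubectl -n cattle-system logs cattle-node-agent-4gc2p
```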
#### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check logging of cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
@@ -0,0 +1,363 @@
---
title: Kubernetes components
weight: 100
---
The commands/steps listed on this page apply to the core Kubernetes components on [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/) clusters.
## Diagram
![Cluster diagram]({{< baseurl >}}/img/rancher/clusterdiagram.svg)<br/>
<sup>Lines show the traffic flow between components. Colors are used purely for visual aid</sup>
## etcd
This section applies to nodes with the `etcd` role.
### Is the etcd container running
The container for etcd should have status **Up**. The duration shown after **Up** is the time the container has been running.
```
docker ps -a -f=name=etcd$
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
605a124503b9 rancher/coreos-etcd:v3.2.18 "/usr/local/bin/et..." 2 hours ago Up 2 hours etcd
```
### etcd container logging
The container logs can contain information about what the problem could be.
```
docker logs etcd
```
* `health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused`
A connection to the address shown on port 2380 cannot be established. Check if the etcd container is running on the host with the address shown.
* `xxx is starting a new election at term x`
The etcd cluster has lost its quorum and is trying to establish a new leader. This can happen when the majority of the nodes running etcd go down or become unreachable.
* `connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}`
The host firewall is preventing network communication.
* `rafthttp: request cluster ID mismatch`
The node with the etcd instance logging `rafthttp: request cluster ID mismatch` is trying to join a cluster that has already been formed with another peer. The node should be removed from the cluster, and re-added.
* `rafthttp: failed to find member`
The cluster state (`/var/lib/etcd`) on the node contains incorrect information for joining the cluster. The node should be removed from the cluster, the state directory should be cleaned, and the node should be re-added.
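In a Rancher-launched cluster, removing and re-adding the node is normally done through Rancher. If you need to remove the stale member from the etcd member list manually, a sketch using the standard `etcdctl` command is shown below; `MEMBER_ID` is a placeholder for the ID shown in the first column of `etcdctl member list`:
```
docker exec etcd etcdctl member remove MEMBER_ID
```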
### etcd cluster and connectivity checks
If any of the commands respond with `Error: context deadline exceeded`, the etcd instance is unhealthy (either quorum is lost or the instance is not correctly joined in the cluster).
* Check etcd members on all nodes
Output should contain all the nodes with the `etcd` role and the output should be identical on all nodes.
```
docker exec etcd etcdctl member list
```
Example output:
```
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
xxx, started, etcd-xxx, https://IP:2380, https://IP:2379,https://IP:4001
```
* Check endpoint status
The values for `RAFT TERM` should be equal and the values for `RAFT INDEX` should not be too far apart from each other.
```
docker exec etcd etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 | 3.2.18 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.2.18 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.2.18 | 25 MB | false | 72 | 66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
* Check endpoint health
```
docker exec etcd etcdctl endpoint health --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")
```
Example output:
```
https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
```
* Check connectivity on port TCP/2379
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
echo "Validating connection to ${endpoint}/health";
curl -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health";
done
```
If you are running on an operating system without `curl` (for example, RancherOS), you can use the following command which uses a Docker container to run the `curl` command.
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
echo "Validating connection to ${endpoint}/health";
docker run --net=host -v /opt/rke/etc/kubernetes/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health"
done
```
Example output:
```
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
```
* Check connectivity on port TCP/2380
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f4"); do
echo "Validating connection to ${endpoint}/version";
curl -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version";
done
```
If you are running on an operating system without `curl` (for example, RancherOS), you can use the following command which uses a Docker container to run the `curl` command.
```
for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f4"); do
echo "Validating connection to ${endpoint}/version";
docker run --net=host -v /opt/rke/etc/kubernetes/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/version"
done
```
Example output:
```
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.2.18","etcdcluster":"3.2.0"}
```
### etcd alarms
etcd will trigger alarms, for instance when it runs out of space.
```
docker exec etcd etcdctl alarm list
```
Example output when NOSPACE alarm is triggered:
```
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
```
### etcd space errors
Related error messages are `etcdserver: mvcc: database space exceeded` or `applying raft message exceeded backend quota`. Alarm `NOSPACE` will be triggered.
Resolution:
* Compact the keyspace
```
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
```
Example output:
```
compacted revision xxx
```
* Defrag all etcd members
```
docker exec etcd etcdctl defrag --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','")
```
Example output:
```
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
```
* Check endpoint status
```
docker exec etcd etcdctl endpoint status --endpoints=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | e973e4419737125 | 3.2.18 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | 4a509c997b26c206 | 3.2.18 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | b217e736575e9dd3 | 3.2.18 | 553 kB | true | 32 | 2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
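After compacting and defragmenting, the `NOSPACE` alarm usually still has to be cleared before etcd accepts writes again. This uses the standard `etcdctl` alarm command:
```
docker exec etcd etcdctl alarm disarm
```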
## controlplane
This section applies to nodes with the `controlplane` role.
### Are the containers for controlplane running
There are three specific containers launched on nodes with the `controlplane` role:
* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
The containers should have status **Up**. The duration shown after **Up** is the time the container has been running.
```
docker ps -a -f=name='kube-apiserver|kube-controller-manager|kube-scheduler'
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
26c7159abbcc rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-apiserver
f3d287ca4549 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-scheduler
bdf3898b8063 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-controller-manager
```
### controlplane container logging
The container logs can contain information about what the problem could be.
```
docker logs kube-apiserver
docker logs kube-controller-manager
docker logs kube-scheduler
```
## nginx-proxy
The `nginx-proxy` container is deployed on every node that does not have the `controlplane` role. It provides access to all the nodes with the `controlplane` role by dynamically generating the NGINX configuration based on available nodes with the `controlplane` role.
### Is the container running
The container is called `nginx-proxy` and should have status `Up`. The duration shown after `Up` is the time the container has been running.
```
docker ps -a -f=name=nginx-proxy
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c3e933687c0e rancher/rke-tools:v0.1.15 "nginx-proxy CP_HO..." 3 hours ago Up 3 hours nginx-proxy
```
### Check generated NGINX configuration
The generated configuration should include the IP addresses of the nodes with the `controlplane` role. The configuration can be checked using the following command:
```
docker exec nginx-proxy cat /etc/nginx/nginx.conf
```
Example output:
```
error_log stderr notice;
worker_processes auto;
events {
multi_accept on;
use epoll;
worker_connections 1024;
}
stream {
upstream kube_apiserver {
server ip_of_controlplane_node1:6443;
server ip_of_controlplane_node2:6443;
}
server {
listen 6443;
proxy_pass kube_apiserver;
proxy_timeout 30;
proxy_connect_timeout 2s;
}
}
```
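To verify that the proxy actually forwards traffic to a controlplane node, you can probe the local port 6443 on the node. This is only a sketch; any HTTP status code in the response (even `401` or `403` when anonymous access to the API server is disabled) indicates the connection goes through:
```
curl -k -s -o /dev/null -w '%{http_code}\n' https://127.0.0.1:6443/healthz
```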
### nginx-proxy container logging
The container logs can contain information about what the problem could be.
```
docker logs nginx-proxy
```
## worker and generic
This section applies to every node as it includes components that run on nodes with any role.
### Are the containers running
There are two specific containers launched on every node, regardless of role:
* `kubelet`
* `kube-proxy`
The containers should have status `Up`. The duration shown after `Up` is the time the container has been running.
```
docker ps -a -f=name='kubelet|kube-proxy'
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
158d0dcc33a5 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kube-proxy
a30717ecfb55 rancher/hyperkube:v1.11.5-rancher1 "/opt/rke-tools/en..." 3 hours ago Up 3 hours kubelet
```
### container logging
The container logs can contain information about what the problem could be.
```
docker logs kubelet
docker logs kube-proxy
```
@@ -0,0 +1,181 @@
---
title: Kubernetes resources
weight: 101
---
The commands/steps listed on this page can be used to check the most important Kubernetes resources and apply to [Rancher Launched Kubernetes]({{< baseurl >}}/rancher/v2.x/en/cluster-provisioning/rke-clusters/) clusters.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
### Nodes
#### Get nodes
Run the command below and check the following:
- All nodes in your cluster should be listed; make sure none are missing.
- All nodes should have the **Ready** status (if a node is not in **Ready** state, check the `kubelet` container logs on that node using `docker logs kubelet`).
- Check if all nodes report the correct version.
- Check if the OS/Kernel/Docker values are shown as expected (issues can sometimes be related to an upgraded OS, kernel, or Docker version).
```
kubectl get nodes
```
Example output:
```
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
etcd-0 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
etcd-1 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
etcd-2 Ready etcd 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
controlplane-0 Ready controlplane 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
controlplane-1 Ready controlplane 1m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
worker-0 Ready worker 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
worker-1 Ready worker 2m v1.11.5 <none> Ubuntu 16.04.5 LTS 4.4.0-138-generic docker://17.3.2
```
#### Get node conditions
Run the command below to list nodes with active [Node Conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition) that could prevent normal operation.
```
kubectl get nodes -o go-template='{{range .items}}{{$node := .}}{{range .status.conditions}}{{if ne .type "Ready"}}{{if eq .status "True"}}{{$node.metadata.name}}{{": "}}{{.type}}{{":"}}{{.status}}{{"\n"}}{{end}}{{end}}{{end}}{{end}}'
```
Example output:
```
worker-0: DiskPressure:True
```
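To investigate a reported condition further, describe the node; the node name below is taken from the example output above:
```
kubectl describe node worker-0
```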
### Ingress Controller
The default Ingress Controller is NGINX and is deployed as a DaemonSet in the `ingress-nginx` namespace. The pods are only scheduled to nodes with the `worker` role.
Check if the pods are running on all nodes:
```
kubectl -n ingress-nginx get pods -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
default-http-backend-797c5bc547-kwwlq 1/1 Running 0 17m x.x.x.x worker-1
nginx-ingress-controller-4qd64 1/1 Running 0 14m x.x.x.x worker-1
nginx-ingress-controller-8wxhm 1/1 Running 0 13m x.x.x.x worker-0
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
#### Pod details
```
kubectl -n ingress-nginx describe pods -l app=ingress-nginx
```
#### Pod container logs
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```
#### Namespace events
```
kubectl -n ingress-nginx get events
```
### Rancher agents
Communication to the cluster (Kubernetes API via cattle-cluster-agent) and communication to the nodes (cluster provisioning via cattle-node-agent) are handled through Rancher agents.
#### cattle-node-agent
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
Check logging of a specific cattle-node-agent pod or all cattle-node-agent pods:
```
kubectl -n cattle-system logs -l app=cattle-agent
```
#### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check logging of cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
### Generic
#### All pods/jobs should have status **Running**/**Completed**
To check, run the command:
```
kubectl get pods --all-namespaces
```
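As a possible shortcut, healthy pods can be filtered out with a field selector on `status.phase` (a **Completed** pod has phase `Succeeded`); this assumes your kubectl version supports field selectors:
```
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```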
If a pod is not in **Running** state, you can dig into the root cause by running:
##### Describe pod
```
kubectl describe pod POD_NAME -n NAMESPACE
```
##### Pod container logs
```
kubectl logs POD_NAME -n NAMESPACE
```
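If the pod has restarted, the logs of the previous container instance are often more revealing; kubectl supports this with the `--previous` flag:
```
kubectl logs POD_NAME -n NAMESPACE --previous
```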
If a job is not in **Completed** state, you can dig into the root cause by running:
##### Describe job
```
kubectl describe job JOB_NAME -n NAMESPACE
```
##### Logs from the containers of pods of the job
```
kubectl logs -l job-name=JOB_NAME -n NAMESPACE
```
@@ -0,0 +1,111 @@
---
title: Networking
weight: 102
---
The commands/steps listed on this page can be used to check networking related issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
### Double check if all the required ports are opened in your (host) firewall
Double-check if all the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) are open in your (host) firewall. The overlay network uses UDP, while all other required ports use TCP.
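For the TCP ports, a quick connectivity check can be done from one node to another with a tool like `netcat`; `NODE_IP` and `6443` below are placeholders for the target node and port you want to test. UDP connectivity for the overlay network is better tested with the procedure in the next section.
```
nc -zv NODE_IP 6443
```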
### Check if overlay network is functioning correctly
A pod can be scheduled to any of the hosts you used for your cluster, which means the NGINX ingress controller must be able to route a request from `NODE_1` to `NODE_2`. This routing happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures because the NGINX ingress controller cannot route to the pod.
To test the overlay network, you can launch the following `DaemonSet` definition. This will run an `alpine` container on every host, which we will use to run a `ping` test between containers on all hosts.
1. Save the following file as `ds-alpine.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: alpine
spec:
selector:
matchLabels:
name: alpine
template:
metadata:
labels:
name: alpine
spec:
tolerations:
- effect: NoExecute
key: "node-role.kubernetes.io/etcd"
value: "true"
- effect: NoSchedule
key: "node-role.kubernetes.io/controlplane"
value: "true"
containers:
- image: alpine
imagePullPolicy: Always
name: alpine
command: ["sh", "-c", "tail -f /dev/null"]
terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-alpine.yml`
3. Wait until `kubectl rollout status ds/alpine -w` returns: `daemon set "alpine" successfully rolled out`.
4. Run the following command to let each container on every host ping each other (it is a single-line command; a functionally equivalent multi-line version is shown at the end of this section).
```
echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start
=> End
```
If you see errors in the output, the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not open between the hosts indicated.
Example error output for a situation where NODE1 has its UDP ports blocked:
```
=> Start
command terminated with exit code 1
NODE2 cannot reach NODE1
command terminated with exit code 1
NODE3 cannot reach NODE1
command terminated with exit code 1
NODE1 cannot reach NODE2
command terminated with exit code 1
NODE1 cannot reach NODE3
=> End
```
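For readability, the same test can also be written as a multi-line script; this is functionally equivalent to the one-liner in step 4:
```
echo "=> Start"
# List every alpine pod together with the node it runs on.
kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' |
while read spod shost; do
  # From each source pod, ping the IP of every alpine pod (one per node).
  kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' |
  while read tip thost; do
    kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"
    if [ $? -ne 0 ]; then echo "$shost cannot reach $thost"; fi
  done
done
echo "=> End"
```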
### Resolved issues
#### Overlay network broken when using Canal/Flannel due to missing node annotations
| | |
|------------|------------|
| GitHub issue | [#13644](https://github.com/rancher/rancher/issues/13644) |
| Resolved in | v2.1.2 |
To check if your cluster is affected, the following command will list nodes that are broken (this command requires `jq` to be installed):
```
kubectl get nodes -o json | jq '.items[].metadata | select(.annotations["flannel.alpha.coreos.com/public-ip"] == null or .annotations["flannel.alpha.coreos.com/kube-subnet-manager"] == null or .annotations["flannel.alpha.coreos.com/backend-type"] == null or .annotations["flannel.alpha.coreos.com/backend-data"] == null) | .name'
```
If there is no output, the cluster is not affected.
#### System namespace pods network connectivity broken
> Note: This applies only to Rancher upgrades from v2.0.6 or earlier to v2.0.7 or later. Upgrades from v2.0.7 to later versions are unaffected.
| | |
|------------|------------|
| GitHub issue | [#15146](https://github.com/rancher/rancher/issues/15146) |
If pods in system namespaces cannot communicate with pods in other system namespaces, you will need to follow the instructions in [Upgrading to v2.0.7+ — Namespace Migration]({{< baseurl >}}/rancher/v2.x/en/upgrades/upgrades/namespace-migration/) to restore connectivity. Symptoms include:
- NGINX ingress controller showing `504 Gateway Time-out` when accessed.
- NGINX ingress controller logging `upstream timed out (110: Connection timed out) while connecting to upstream` when accessed.
@@ -0,0 +1,70 @@
---
title: Rancher HA
weight: 104
---
The commands/steps listed on this page can be used to check your Rancher HA installation.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml`).
### Check Rancher pods
Rancher pods are deployed as a Deployment in the `cattle-system` namespace.
Check if the pods are running on all nodes:
```
kubectl -n cattle-system get pods -l app=rancher -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
rancher-7dbd7875f7-n6t5t 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qbj5k 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qw7wb 1/1 Running 0 8m x.x.x.x x.x.x.x
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
#### Pod details
```
kubectl -n cattle-system describe pods -l app=rancher
```
#### Pod container logs
```
kubectl -n cattle-system logs -l app=rancher
```
#### Namespace events
```
kubectl -n cattle-system get events
```
### Check ingress
The ingress should show the correct `HOSTS` (the configured FQDN) and `ADDRESS` (the host address(es) it will be routed to).
```
kubectl -n cattle-system get ingress
```
Example output:
```
NAME HOSTS ADDRESS PORTS AGE
rancher rancher.yourdomain.com x.x.x.x,x.x.x.x,x.x.x.x 80, 443 2m
```
### Check ingress controller logs
If accessing your configured Rancher FQDN does not show you the UI, check the ingress controller logs to see what happens when you try to access Rancher:
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```