Add DNS troubleshooting

Author: Sebastiaan van Steenis
Date: 2019-02-24 21:01:51 +01:00
Committed by: Denise
Parent: b4f4bfb8cd
Commit: 05c0016072
3 changed files with 160 additions and 18 deletions
@@ -24,6 +24,10 @@ This section contains information to help you troubleshoot issues when using Ran
Steps to troubleshoot networking issues can be found here.
- [DNS]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/dns/)
When you experience name resolution issues in your cluster.
- [Rancher HA]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/rancherha/)
If you experience issues with your [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/)
@@ -0,0 +1,141 @@
---
title: DNS
weight: 103
---
The commands/steps listed on this page can be used to check name resolution issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
Before running the DNS checks, make sure that [the overlay network is functioning correctly]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/networking/#check-if-overlay-network-is-functioning-correctly), as a broken overlay network can also be the reason why DNS resolution fails (partially or completely).
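To confirm that `kubectl` is pointed at the right cluster before you start, a quick sanity check (the kubeconfig path is an example; yours may differ):
```
# Verify kubectl talks to the intended cluster (example kubeconfig path)
export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml
kubectl config current-context
kubectl get nodes
```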
### Check if DNS pods are running
```
kubectl -n kube-system get pods -l k8s-app=kube-dns
```
Example output:
```
NAME                        READY   STATUS    RESTARTS   AGE
kube-dns-5fd74c7488-h6f7n   3/3     Running   0          4m13s
```
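If the pod is not in `Running` state, you can inspect it before continuing. A quick sketch, assuming the default kube-dns pod layout with the `kubedns`, `dnsmasq` and `sidecar` containers:
```
# Show scheduling decisions and recent events for the DNS pod
kubectl -n kube-system describe pods -l k8s-app=kube-dns
# Logs of the kubedns container; repeat with -c dnsmasq or -c sidecar if needed
kubectl -n kube-system logs -l k8s-app=kube-dns -c kubedns
```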
### Check if the DNS service is present with the correct cluster-ip
```
kubectl -n kube-system get svc -l k8s-app=kube-dns
```
```
NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
service/kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP   4m13s
```
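You can also verify that the service has endpoints backing it; an empty `ENDPOINTS` column would explain failing lookups even though the service exists:
```
kubectl -n kube-system get endpoints kube-dns
```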
### Check if domain names are resolving
Check if internal cluster names are resolving (in this example, `kubernetes.default`). The IP shown after `Server:` should match the `CLUSTER-IP` of the `kube-dns` service.
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
```
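If the `Server:` address does not match the `CLUSTER-IP`, it can help to look at the resolver configuration that pods actually receive, for example:
```
# Show the nameserver and search domains injected into a test pod
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- cat /etc/resolv.conf
```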
Check if external names are resolving (in this example, `www.google.com`).
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: www.google.com
Address 1: 2a00:1450:4009:80b::2004 lhr35s04-in-x04.1e100.net
Address 2: 216.58.211.100 ams15s32-in-f4.1e100.net
pod "busybox" deleted
```
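If external names fail while internal names resolve, you can query an upstream nameserver directly from inside the cluster to separate cluster DNS problems from upstream or egress problems (`1.1.1.1` is only an example resolver):
```
# Bypass kube-dns by asking an external resolver directly
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com 1.1.1.1
```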
If you want to check name resolution on all of the hosts, execute the following steps:
1. Save the following file as `ds-dnstest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnstest
spec:
  selector:
    matchLabels:
      name: dnstest
  template:
    metadata:
      labels:
        name: dnstest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sh", "-c", "tail -f /dev/null"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-dnstest.yml`
3. Wait until `kubectl rollout status ds/dnstest -w` returns: `daemon set "dnstest" successfully rolled out`.
4. Configure the environment variable `DOMAIN` to a fully qualified domain name (FQDN) that the hosts should be able to resolve (`www.google.com` is used as an example) and run the following command to let each container on every host resolve the configured domain name (it's a single-line command).
```
export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start DNS resolve test
=> End DNS resolve test
```
If you see errors in the output, the listed host(s) are not able to resolve the given FQDN.
Example error output from a situation where the host with IP 209.97.182.150 had its UDP ports blocked.
```
=> Start DNS resolve test
command terminated with exit code 1
209.97.182.150 cannot resolve www.google.com
=> End DNS resolve test
```
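To dig into a failing host before cleaning up, you can exec into the `dnstest` pod running on it; `<pod-name>` below is a placeholder, taken from `kubectl get pods -l name=dnstest -o wide`:
```
# Inspect the resolver configuration inside the pod on the failing host
kubectl exec <pod-name> -- cat /etc/resolv.conf
# Retry the lookup interactively from that host
kubectl exec -it <pod-name> -- nslookup www.google.com
```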
Clean up the dnstest DaemonSet by running `kubectl delete ds/dnstest`.
### Check upstream nameservers in kubedns container
By default, the configured nameservers on the host (in `/etc/resolv.conf`) are used as upstream nameservers for `kube-dns`. Sometimes the host runs a local caching DNS nameserver, in which case the address in `/etc/resolv.conf` points into the loopback range (`127.0.0.0/8`) and is unreachable from the container. On Ubuntu 18.04, for example, this is done by `systemd-resolved`. Since Rancher v2.0.7, we detect if `systemd-resolved` is running and automatically use the `/etc/resolv.conf` file with the correct upstream nameservers (located at `/run/systemd/resolve/resolv.conf`).
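To see this on an affected host, compare the two files directly (assuming a distribution that runs `systemd-resolved`, such as Ubuntu 18.04):
```
# Points at the local stub resolver (127.0.0.53) when systemd-resolved is active
cat /etc/resolv.conf
# Lists the real upstream nameservers that should be used instead
cat /run/systemd/resolve/resolv.conf
```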
Use the following command to check the upstream nameservers used by the kubedns container:
```
kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do echo "Pod ${pod} on host ${host}"; kubectl -n kube-system exec $pod -c kubedns cat /etc/resolv.conf; done
```
Example output:
```
Pod kube-dns-667c7cb9dd-z4dsf on host x.x.x.x
nameserver 1.1.1.1
nameserver 8.8.4.4
```
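If the upstream nameservers shown here are wrong or unreachable, one possible remedy is to set them explicitly through the `kube-dns` ConfigMap. A minimal sketch, assuming your cluster uses kube-dns and that the resolvers listed (`1.1.1.1` and `8.8.8.8` are examples) are reachable from the nodes:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["1.1.1.1", "8.8.8.8"]
```
Apply it with `kubectl apply -f` (the filename is up to you) and give the DNS pods a moment to pick up the change.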
@@ -17,50 +17,45 @@ The pod can be scheduled to any of the hosts you used for your cluster, but that
To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `busybox` container on every host, which we will use to run a `ping` test between containers on all hosts.
1. Save the following file as `ds-overlaytest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: overlaytest
spec:
  selector:
    matchLabels:
      name: overlaytest
  template:
    metadata:
      labels:
        name: overlaytest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sh", "-c", "tail -f /dev/null"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-overlaytest.yml`
3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`.
4. Run the following command to let each container on every host ping each other (it's a single-line command).
```
echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
echo "=> Start network overlay test"; kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start network overlay test
=> End network overlay test
```
If you see errors in the output, the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not opened between the hosts indicated.
Example error output from a situation where NODE1 had its UDP ports blocked.
```
=> Start network overlay test
command terminated with exit code 1
NODE2 cannot reach NODE1
command terminated with exit code 1
NODE3 cannot reach NODE1
command terminated with exit code 1
NODE1 cannot reach NODE2
command terminated with exit code 1
NODE1 cannot reach NODE3
=> End network overlay test
```
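To confirm at the host level whether overlay traffic arrives at all, you can capture packets while the test runs. A sketch, assuming Canal/Flannel with VXLAN on its default UDP port 8472 (the interface name may differ on your hosts):
```
# Run on the receiving host while the overlay test executes elsewhere;
# seeing no packets here points at blocked UDP port 8472 between the hosts
tcpdump -ni eth0 udp port 8472
```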
Clean up the overlaytest DaemonSet by running `kubectl delete ds/overlaytest`.
### Resolved issues
#### Overlay network broken when using Canal/Flannel due to missing node annotations