From 05c0016072136f95348ea88380bc6a8002d8df2a Mon Sep 17 00:00:00 2001
From: Sebastiaan van Steenis
Date: Sun, 24 Feb 2019 21:01:51 +0100
Subject: [PATCH] Add DNS troubleshooting

---
 .../rancher/v2.x/en/troubleshooting/_index.md |   4 +
 .../v2.x/en/troubleshooting/dns/_index.md     | 141 ++++++++++++++++++
 .../en/troubleshooting/networking/_index.md   |  33 ++--
 3 files changed, 160 insertions(+), 18 deletions(-)
 create mode 100644 content/rancher/v2.x/en/troubleshooting/dns/_index.md

diff --git a/content/rancher/v2.x/en/troubleshooting/_index.md b/content/rancher/v2.x/en/troubleshooting/_index.md
index 8640119ce2f..b59f147334f 100644
--- a/content/rancher/v2.x/en/troubleshooting/_index.md
+++ b/content/rancher/v2.x/en/troubleshooting/_index.md
@@ -24,6 +24,10 @@ This section contains information to help you troubleshoot issues when using Ran
   Steps to troubleshoot networking issues can be found here.

+- [DNS]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/dns/)
+
+  When you experience name resolution issues in your cluster.
+
 - [Rancher HA]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/rancherha/)

   If you experience issues with your [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/)
diff --git a/content/rancher/v2.x/en/troubleshooting/dns/_index.md b/content/rancher/v2.x/en/troubleshooting/dns/_index.md
new file mode 100644
index 00000000000..06c298d626c
--- /dev/null
+++ b/content/rancher/v2.x/en/troubleshooting/dns/_index.md
@@ -0,0 +1,141 @@
+---
+title: DNS
+weight: 103
+---
+
+The commands/steps listed on this page can be used to check name resolution issues in your cluster.
+
+Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or that you are using the embedded kubectl via the UI.
+
+Before running the DNS checks, make sure that [the overlay network is functioning correctly]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/networking/#check-if-overlay-network-is-functioning-correctly), as a broken overlay network can also be the reason why DNS resolution fails, either partly or completely.
+
+### Check if DNS pods are running
+
+```
+kubectl -n kube-system get pods -l k8s-app=kube-dns
+```
+
+Example output:
+```
+NAME                        READY     STATUS    RESTARTS   AGE
+kube-dns-5fd74c7488-h6f7n   3/3       Running   0          4m13s
+```
+
+### Check if the DNS service is present with the correct cluster-ip
+
+```
+kubectl -n kube-system get svc -l k8s-app=kube-dns
+```
+
+```
+NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
+service/kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP   4m13s
+```
+
+### Check if domain names are resolving
+
+Check if internal cluster names are resolving (in this example, `kubernetes.default`). The IP shown after `Server:` should be the same as the `CLUSTER-IP` of the `kube-dns` service.
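+
+To have the expected value at hand for that comparison, you can print just the `CLUSTER-IP` of the `kube-dns` service first. This is a small convenience sketch (not required for the check itself); it assumes the service name `kube-dns` shown in the output above and uses a standard `jsonpath` output expression:
+
+```
+kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'
+```
+
+Then run the actual lookup from a pod in the cluster: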
+
+```
+kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
+```
+
+Example output:
+```
+Server:    10.43.0.10
+Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
+
+Name:      kubernetes.default
+Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
+pod "busybox" deleted
+```
+
+Check if external names are resolving (in this example, `www.google.com`):
+
+```
+kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
+```
+
+Example output:
+```
+Server:    10.43.0.10
+Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
+
+Name:      www.google.com
+Address 1: 2a00:1450:4009:80b::2004 lhr35s04-in-x04.1e100.net
+Address 2: 216.58.211.100 ams15s32-in-f4.1e100.net
+pod "busybox" deleted
+```
+
+If you want to check resolution of domain names on all of the hosts, execute the following steps:
+
+1. Save the following file as `ds-dnstest.yml`:
+
+    ```
+    apiVersion: apps/v1
+    kind: DaemonSet
+    metadata:
+      name: dnstest
+    spec:
+      selector:
+        matchLabels:
+          name: dnstest
+      template:
+        metadata:
+          labels:
+            name: dnstest
+        spec:
+          tolerations:
+          - operator: Exists
+          containers:
+          - image: busybox:1.28
+            imagePullPolicy: Always
+            name: alpine
+            command: ["sh", "-c", "tail -f /dev/null"]
+            terminationMessagePath: /dev/termination-log
+    ```
+
+2. Launch it using `kubectl create -f ds-dnstest.yml`.
+3. Wait until `kubectl rollout status ds/dnstest -w` returns: `daemon set "dnstest" successfully rolled out`.
+4. Configure the environment variable `DOMAIN` to a fully qualified domain name (FQDN) that the hosts should be able to resolve (`www.google.com` is used as an example) and run the following command to let each container on every host resolve the configured domain name (it's a single line command):
+
+    ```
+    export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
+    ```
+
+5. When this command has finished running, the output indicating everything is correct is:
+
+    ```
+    => Start DNS resolve test
+    => End DNS resolve test
+    ```
+
+If you see errors in the output, the mentioned host(s) are not able to resolve the given FQDN.
+
+Example error output of a situation where the host with IP 209.97.182.150 had the UDP ports blocked:
+
+```
+=> Start DNS resolve test
+command terminated with exit code 1
+209.97.182.150 cannot resolve www.google.com
+=> End DNS resolve test
+```
+
+Clean up the dnstest DaemonSet by running `kubectl delete ds/dnstest`.
+
+### Check upstream nameservers in kubedns container
+
+By default, the nameservers configured on the host (in `/etc/resolv.conf`) are used as the upstream nameservers for `kube-dns`. Sometimes the host runs a local caching DNS nameserver, which means the address in `/etc/resolv.conf` points to an address in the loopback range (`127.0.0.0/8`) that is unreachable from inside a container. On Ubuntu 18.04, for example, this is done by `systemd-resolved`. Since Rancher v2.0.7, we detect if `systemd-resolved` is running and automatically use the `/etc/resolv.conf` file with the correct upstream nameservers instead (which is located at `/run/systemd/resolve/resolv.conf`).
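+
+If you want to verify this on the host itself, you can compare the stub configuration with the upstream configuration that `systemd-resolved` maintains. This is a minimal sketch to run directly on a node, assuming a `systemd-resolved` setup as described above:
+
+```
+# Check whether systemd-resolved is active on the host
+systemctl is-active systemd-resolved
+# The stub file; with systemd-resolved this usually points to a loopback address
+cat /etc/resolv.conf
+# The real upstream nameservers discovered by systemd-resolved
+cat /run/systemd/resolve/resolv.conf
+```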
+
+Use the following command to check the upstream nameservers used by the kubedns container:
+
+```
+kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do echo "Pod ${pod} on host ${host}"; kubectl -n kube-system exec $pod -c kubedns cat /etc/resolv.conf; done
+```
+
+Example output:
+```
+Pod kube-dns-667c7cb9dd-z4dsf on host x.x.x.x
+nameserver 1.1.1.1
+nameserver 8.8.4.4
+```
diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md
index 9d107228d90..53e0c0737fe 100644
--- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md
+++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md
@@ -17,50 +17,45 @@ The pod can be scheduled to any of the hosts you used for your cluster, but that

 To test the overlay network, you can launch the following `DaemonSet` definition. This will run an `alpine` container on every host, which we will use to run a `ping` test between containers on all hosts.

-1. Save the following file as `ds-alpine.yml`
+1. Save the following file as `ds-overlaytest.yml`

     ```
     apiVersion: apps/v1
     kind: DaemonSet
     metadata:
-      name: alpine
+      name: overlaytest
     spec:
       selector:
         matchLabels:
-          name: alpine
+          name: overlaytest
       template:
         metadata:
           labels:
-            name: alpine
+            name: overlaytest
         spec:
           tolerations:
-          - effect: NoExecute
-            key: "node-role.kubernetes.io/etcd"
-            value: "true"
-          - effect: NoSchedule
-            key: "node-role.kubernetes.io/controlplane"
-            value: "true"
+          - operator: Exists
           containers:
-          - image: alpine
+          - image: busybox:1.28
             imagePullPolicy: Always
             name: alpine
             command: ["sh", "-c", "tail -f /dev/null"]
             terminationMessagePath: /dev/termination-log
     ```

-2. Launch it using `kubectl create -f ds-alpine.yml`
-3. Wait until `kubectl rollout status ds/alpine -w` returns: `daemon set "alpine" successfully rolled out`.
+2. Launch it using `kubectl create -f ds-overlaytest.yml`
+3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`.
 4. Run the following command to let each container on every host ping each other (it's a single line command).

     ```
-    echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
+    echo "=> Start network overlay test"; kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test"
     ```

 5. When this command has finished running, the output indicating everything is correct is:

     ```
-    => Start
-    => End
+    => Start network overlay test
+    => End network overlay test
     ```

 If you see errors in the output, that means that the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not opened between the hosts indicated.
@@ -68,7 +63,7 @@ If you see error in the output, that means that the [required ports]({{< baseurl
 Example error output of a situation where NODE1 had the UDP ports blocked.

 ```
-=> Start
+=> Start network overlay test
 command terminated with exit code 1
 NODE2 cannot reach NODE1
 command terminated with exit code 1
@@ -77,9 +72,11 @@ command terminated with exit code 1
 NODE1 cannot reach NODE2
 command terminated with exit code 1
 NODE1 cannot reach NODE3
-=> End
+=> End network overlay test
 ```

+Clean up the overlaytest DaemonSet by running `kubectl delete ds/overlaytest`.
+
 ### Resolved issues

 #### Overlay network broken when using Canal/Flannel due to missing node annotations
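+
+As a first check for this situation, you can list the annotations each node currently carries and look for the Flannel-related ones. This is a hedged sketch: it assumes a default Canal/Flannel setup, where the relevant annotations are prefixed with `flannel.alpha.coreos.com/` (for example `flannel.alpha.coreos.com/public-ip`):
+
+```
+# Show node names together with any Flannel annotations attached to them
+kubectl describe nodes | grep -E "^Name:|flannel.alpha.coreos.com"
+```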