From 05c0016072136f95348ea88380bc6a8002d8df2a Mon Sep 17 00:00:00 2001
From: Sebastiaan van Steenis
Date: Sun, 24 Feb 2019 21:01:51 +0100
Subject: [PATCH] Add DNS troubleshooting

---
 .../rancher/v2.x/en/troubleshooting/_index.md |   4 +
 .../v2.x/en/troubleshooting/dns/_index.md     | 141 ++++++++++++++++++
 .../en/troubleshooting/networking/_index.md   |  33 ++--
 3 files changed, 160 insertions(+), 18 deletions(-)
 create mode 100644 content/rancher/v2.x/en/troubleshooting/dns/_index.md

diff --git a/content/rancher/v2.x/en/troubleshooting/_index.md b/content/rancher/v2.x/en/troubleshooting/_index.md
index 8640119ce2f..b59f147334f 100644
--- a/content/rancher/v2.x/en/troubleshooting/_index.md
+++ b/content/rancher/v2.x/en/troubleshooting/_index.md
@@ -24,6 +24,10 @@ This section contains information to help you troubleshoot issues when using Ran
   Steps to troubleshoot networking issues can be found here.

+- [DNS]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/dns/)
+
+  When you experience name resolution issues in your cluster.
+
 - [Rancher HA]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/rancherha/)

   If you experience issues with your [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/)
diff --git a/content/rancher/v2.x/en/troubleshooting/dns/_index.md b/content/rancher/v2.x/en/troubleshooting/dns/_index.md
new file mode 100644
index 00000000000..06c298d626c
--- /dev/null
+++ b/content/rancher/v2.x/en/troubleshooting/dns/_index.md
@@ -0,0 +1,141 @@
+---
+title: DNS
+weight: 103
+---
+
+The commands/steps listed on this page can be used to check name resolution issues in your cluster.
+
+Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or that you are using the embedded kubectl via the UI.
+
+Before running the DNS checks, make sure that [the overlay network is functioning correctly]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/networking/#check-if-overlay-network-is-functioning-correctly), as a broken overlay network can also be the reason why DNS resolution fails, either partly or completely.
+
+### Check if DNS pods are running
+
+```
+kubectl -n kube-system get pods -l k8s-app=kube-dns
+```
+
+Example output:
+```
+NAME                        READY     STATUS    RESTARTS   AGE
+kube-dns-5fd74c7488-h6f7n   3/3       Running   0          4m13s
+```
+
+### Check if the DNS service is present with the correct cluster-ip
+
+```
+kubectl -n kube-system get svc -l k8s-app=kube-dns
+```
+
+```
+NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
+service/kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP   4m13s
+```
+
+### Check if domain names are resolving
+
+Check if internal cluster names are resolving (in this example, `kubernetes.default`). The IP shown after `Server:` should be the same as the `CLUSTER-IP` of the `kube-dns` service.
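+
+To have the expected value at hand for that comparison, you can print just the `CLUSTER-IP` of the `kube-dns` service first. This is a small convenience sketch (not required for the check itself); it assumes the service name `kube-dns` shown in the output above and uses a standard `jsonpath` output expression:
+
+```
+kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'
+```
+
+Then run the actual lookup from a pod in the cluster: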
+
+```
+kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
+```
+
+Example output:
+```
+Server:    10.43.0.10
+Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
+
+Name:      kubernetes.default
+Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
+pod "busybox" deleted
+```
+
+Check if external names are resolving (in this example, `www.google.com`):
+
+```
+kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
+```
+
+Example output:
+```
+Server:    10.43.0.10
+Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
+
+Name:      www.google.com
+Address 1: 2a00:1450:4009:80b::2004 lhr35s04-in-x04.1e100.net
+Address 2: 216.58.211.100 ams15s32-in-f4.1e100.net
+pod "busybox" deleted
+```
+
+If you want to check resolution of domain names on all of the hosts, execute the following steps:
+
+1. Save the following file as `ds-dnstest.yml`:
+
+    ```
+    apiVersion: apps/v1
+    kind: DaemonSet
+    metadata:
+      name: dnstest
+    spec:
+      selector:
+        matchLabels:
+          name: dnstest
+      template:
+        metadata:
+          labels:
+            name: dnstest
+        spec:
+          tolerations:
+          - operator: Exists
+          containers:
+          - image: busybox:1.28
+            imagePullPolicy: Always
+            name: alpine
+            command: ["sh", "-c", "tail -f /dev/null"]
+            terminationMessagePath: /dev/termination-log
+    ```
+
+2. Launch it using `kubectl create -f ds-dnstest.yml`.
+3. Wait until `kubectl rollout status ds/dnstest -w` returns: `daemon set "dnstest" successfully rolled out`.
+4. Configure the environment variable `DOMAIN` to a fully qualified domain name (FQDN) that the hosts should be able to resolve (`www.google.com` is used as an example) and run the following command to let each container on every host resolve the configured domain name (it's a single line command):
+
+    ```
+    export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
+    ```
+
+5. When this command has finished running, the output indicating everything is correct is:
+
+    ```
+    => Start DNS resolve test
+    => End DNS resolve test
+    ```
+
+If you see errors in the output, the mentioned host(s) are not able to resolve the given FQDN.
+
+Example error output of a situation where the host with IP 209.97.182.150 had the UDP ports blocked:
+
+```
+=> Start DNS resolve test
+command terminated with exit code 1
+209.97.182.150 cannot resolve www.google.com
+=> End DNS resolve test
+```
+
+Clean up the dnstest DaemonSet by running `kubectl delete ds/dnstest`.
+
+### Check upstream nameservers in kubedns container
+
+By default, the nameservers configured on the host (in `/etc/resolv.conf`) are used as the upstream nameservers for `kube-dns`. Sometimes the host runs a local caching DNS nameserver, which means the address in `/etc/resolv.conf` points to an address in the loopback range (`127.0.0.0/8`) that is unreachable from inside a container. On Ubuntu 18.04, for example, this is done by `systemd-resolved`. Since Rancher v2.0.7, we detect if `systemd-resolved` is running and automatically use the `/etc/resolv.conf` file with the correct upstream nameservers instead (which is located at `/run/systemd/resolve/resolv.conf`).
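+
+If you want to verify this on the host itself, you can compare the stub configuration with the upstream configuration that `systemd-resolved` maintains. This is a minimal sketch to run directly on a node, assuming a `systemd-resolved` setup as described above:
+
+```
+# Check whether systemd-resolved is active on the host
+systemctl is-active systemd-resolved
+# The stub file; with systemd-resolved this usually points to a loopback address
+cat /etc/resolv.conf
+# The real upstream nameservers discovered by systemd-resolved
+cat /run/systemd/resolve/resolv.conf
+```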
+
+Use the following command to check the upstream nameservers used by the kubedns container:
+
+```
+kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do echo "Pod ${pod} on host ${host}"; kubectl -n kube-system exec $pod -c kubedns cat /etc/resolv.conf; done
+```
+
+Example output:
+```
+Pod kube-dns-667c7cb9dd-z4dsf on host x.x.x.x
+nameserver 1.1.1.1
+nameserver 8.8.4.4
+```
diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md
index 9d107228d90..53e0c0737fe 100644
--- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md
+++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md
@@ -17,50 +17,45 @@ The pod can be scheduled to any of the hosts you used for your cluster, but that

 To test the overlay network, you can launch the following `DaemonSet` definition. This will run an `alpine` container on every host, which we will use to run a `ping` test between containers on all hosts.

-1. Save the following file as `ds-alpine.yml`
+1. Save the following file as `ds-overlaytest.yml`

     ```
     apiVersion: apps/v1
     kind: DaemonSet
     metadata:
-      name: alpine
+      name: overlaytest
     spec:
       selector:
         matchLabels:
-          name: alpine
+          name: overlaytest
       template:
         metadata:
           labels:
-            name: alpine
+            name: overlaytest
         spec:
           tolerations:
-          - effect: NoExecute
-            key: "node-role.kubernetes.io/etcd"
-            value: "true"
-          - effect: NoSchedule
-            key: "node-role.kubernetes.io/controlplane"
-            value: "true"
+          - operator: Exists
           containers:
-          - image: alpine
+          - image: busybox:1.28
             imagePullPolicy: Always
             name: alpine
             command: ["sh", "-c", "tail -f /dev/null"]
             terminationMessagePath: /dev/termination-log
     ```

-2. Launch it using `kubectl create -f ds-alpine.yml`
-3. Wait until `kubectl rollout status ds/alpine -w` returns: `daemon set "alpine" successfully rolled out`.
+2. Launch it using `kubectl create -f ds-overlaytest.yml`
+3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`.
 4. Run the following command to let each container on every host ping each other (it's a single line command).

     ```
-    echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
+    echo "=> Start network overlay test"; kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test"
     ```

 5. When this command has finished running, the output indicating everything is correct is:

     ```
-    => Start
-    => End
+    => Start network overlay test
+    => End network overlay test
     ```

 If you see errors in the output, that means that the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not opened between the hosts indicated.
@@ -68,7 +63,7 @@ If you see error in the output, that means that the [required ports]({{< baseurl
 Example error output of a situation where NODE1 had the UDP ports blocked.

 ```
-=> Start
+=> Start network overlay test
 command terminated with exit code 1
 NODE2 cannot reach NODE1
 command terminated with exit code 1
@@ -77,9 +72,11 @@ command terminated with exit code 1
 NODE1 cannot reach NODE2
 command terminated with exit code 1
 NODE1 cannot reach NODE3
-=> End
+=> End network overlay test
 ```

+Clean up the overlaytest DaemonSet by running `kubectl delete ds/overlaytest`.
+
 ### Resolved issues

 #### Overlay network broken when using Canal/Flannel due to missing node annotations
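+
+As a first check for this situation, you can list the annotations each node currently carries and look for the Flannel-related ones. This is a hedged sketch: it assumes a default Canal/Flannel setup, where the relevant annotations are prefixed with `flannel.alpha.coreos.com/` (for example `flannel.alpha.coreos.com/public-ip`):
+
+```
+# Show node names together with any Flannel annotations attached to them
+kubectl describe nodes | grep -E "^Name:|flannel.alpha.coreos.com"
+```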