Add DNS troubleshooting

Author: Sebastiaan van Steenis
Date: 2019-02-24 21:01:51 +01:00
Committed by: Denise
Parent: b4f4bfb8cd
Commit: 05c0016072
3 changed files with 160 additions and 18 deletions
@@ -24,6 +24,10 @@ This section contains information to help you troubleshoot issues when using Ran
Steps to troubleshoot networking issues can be found here.
- [DNS]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/dns/)
When you experience name resolution issues in your cluster.
- [Rancher HA]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/rancherha/)
If you experience issues with your [High Availability (HA) Install]({{< baseurl >}}/rancher/v2.x/en/installation/ha/)
@@ -0,0 +1,141 @@
---
title: DNS
weight: 103
---
The commands/steps listed on this page can be used to check name resolution issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
Before running the DNS checks, make sure that [the overlay network is functioning correctly]({{< baseurl >}}/rancher/v2.x/en/troubleshooting/networking/#check-if-overlay-network-is-functioning-correctly), as a broken overlay network can also be the reason why DNS resolution fails (partially or completely).
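To confirm that `kubectl` is pointed at the right cluster before you start, a quick sanity check (the kubeconfig path is an example; yours may differ):
```
# Verify kubectl talks to the intended cluster (example kubeconfig path)
export KUBECONFIG=$PWD/kube_config_rancher-cluster.yml
kubectl config current-context
kubectl get nodes
```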
### Check if DNS pods are running
```
kubectl -n kube-system get pods -l k8s-app=kube-dns
```
Example output:
```
NAME                        READY   STATUS    RESTARTS   AGE
kube-dns-5fd74c7488-h6f7n   3/3     Running   0          4m13s
```
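If the pod is not in `Running` state, you can inspect it before continuing. A quick sketch, assuming the default kube-dns pod layout with the `kubedns`, `dnsmasq` and `sidecar` containers:
```
# Show scheduling decisions and recent events for the DNS pod
kubectl -n kube-system describe pods -l k8s-app=kube-dns
# Logs of the kubedns container; repeat with -c dnsmasq or -c sidecar if needed
kubectl -n kube-system logs -l k8s-app=kube-dns -c kubedns
```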
### Check if the DNS service is present with the correct cluster-ip
```
kubectl -n kube-system get svc -l k8s-app=kube-dns
```
```
NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
service/kube-dns   ClusterIP   10.43.0.10   <none>        53/UDP,53/TCP   4m13s
```
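You can also verify that the service has endpoints backing it; an empty `ENDPOINTS` column would explain failing lookups even though the service exists:
```
kubectl -n kube-system get endpoints kube-dns
```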
### Check if domain names are resolving
Check if internal cluster names are resolving (in this example, `kubernetes.default`). The IP shown after `Server:` should match the `CLUSTER-IP` of the `kube-dns` service.
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
```
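If the `Server:` address does not match the `CLUSTER-IP`, it can help to look at the resolver configuration that pods actually receive, for example:
```
# Show the nameserver and search domains injected into a test pod
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- cat /etc/resolv.conf
```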
Check if external names are resolving (in this example, `www.google.com`).
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: www.google.com
Address 1: 2a00:1450:4009:80b::2004 lhr35s04-in-x04.1e100.net
Address 2: 216.58.211.100 ams15s32-in-f4.1e100.net
pod "busybox" deleted
```
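If external names fail while internal names resolve, you can query an upstream nameserver directly from inside the cluster to separate cluster DNS problems from upstream or egress problems (`1.1.1.1` is only an example resolver):
```
# Bypass kube-dns by asking an external resolver directly
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com 1.1.1.1
```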
If you want to check name resolution on all of the hosts, execute the following steps:
1. Save the following file as `ds-dnstest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnstest
spec:
  selector:
    matchLabels:
      name: dnstest
  template:
    metadata:
      labels:
        name: dnstest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sh", "-c", "tail -f /dev/null"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-dnstest.yml`
3. Wait until `kubectl rollout status ds/dnstest -w` returns: `daemon set "dnstest" successfully rolled out`.
4. Configure the environment variable `DOMAIN` to a fully qualified domain name (FQDN) that the hosts should be able to resolve (`www.google.com` is used as an example) and run the following command to let each container on every host resolve the configured domain name (it's a single-line command).
```
export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start DNS resolve test
=> End DNS resolve test
```
If you see errors in the output, the listed host(s) are not able to resolve the given FQDN.
Example error output from a situation where the host with IP 209.97.182.150 had its UDP ports blocked.
```
=> Start DNS resolve test
command terminated with exit code 1
209.97.182.150 cannot resolve www.google.com
=> End DNS resolve test
```
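To dig into a failing host before cleaning up, you can exec into the `dnstest` pod running on it; `<pod-name>` below is a placeholder, taken from `kubectl get pods -l name=dnstest -o wide`:
```
# Inspect the resolver configuration inside the pod on the failing host
kubectl exec <pod-name> -- cat /etc/resolv.conf
# Retry the lookup interactively from that host
kubectl exec -it <pod-name> -- nslookup www.google.com
```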
Clean up the dnstest DaemonSet by running `kubectl delete ds/dnstest`.
### Check upstream nameservers in kubedns container
By default, the configured nameservers on the host (in `/etc/resolv.conf`) are used as upstream nameservers for `kube-dns`. Sometimes the host runs a local caching DNS nameserver, in which case the address in `/etc/resolv.conf` points into the loopback range (`127.0.0.0/8`) and is unreachable from the container. On Ubuntu 18.04, for example, this is done by `systemd-resolved`. Since Rancher v2.0.7, we detect if `systemd-resolved` is running and automatically use the `/etc/resolv.conf` file with the correct upstream nameservers (located at `/run/systemd/resolve/resolv.conf`).
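To see this on an affected host, compare the two files directly (assuming a distribution that runs `systemd-resolved`, such as Ubuntu 18.04):
```
# Points at the local stub resolver (127.0.0.53) when systemd-resolved is active
cat /etc/resolv.conf
# Lists the real upstream nameservers that should be used instead
cat /run/systemd/resolve/resolv.conf
```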
Use the following command to check the upstream nameservers used by the kubedns container:
```
kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do echo "Pod ${pod} on host ${host}"; kubectl -n kube-system exec $pod -c kubedns cat /etc/resolv.conf; done
```
Example output:
```
Pod kube-dns-667c7cb9dd-z4dsf on host x.x.x.x
nameserver 1.1.1.1
nameserver 8.8.4.4
```
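If the upstream nameservers shown here are wrong or unreachable, one possible remedy is to set them explicitly through the `kube-dns` ConfigMap. A minimal sketch, assuming your cluster uses kube-dns and that the resolvers listed (`1.1.1.1` and `8.8.8.8` are examples) are reachable from the nodes:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["1.1.1.1", "8.8.8.8"]
```
Apply it with `kubectl apply -f` (the filename is up to you) and give the DNS pods a moment to pick up the change.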
@@ -17,50 +17,45 @@ The pod can be scheduled to any of the hosts you used for your cluster, but that
To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `busybox` container on every host, which we will use to run a `ping` test between containers on all hosts.
1. Save the following file as `ds-overlaytest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: overlaytest
spec:
  selector:
    matchLabels:
      name: overlaytest
  template:
    metadata:
      labels:
        name: overlaytest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sh", "-c", "tail -f /dev/null"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-overlaytest.yml`
3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`.
4. Run the following command to let each container on every host ping each other (it's a single-line command).
```
echo "=> Start"; kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=alpine -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End"
echo "=> Start network overlay test"; kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start network overlay test
=> End network overlay test
```
If you see errors in the output, the [required ports]({{< baseurl >}}/rancher/v2.x/en/installation/references/) for overlay networking are not opened between the hosts indicated.
Example error output from a situation where NODE1 had its UDP ports blocked.
```
=> Start network overlay test
command terminated with exit code 1
NODE2 cannot reach NODE1
command terminated with exit code 1
NODE3 cannot reach NODE1
command terminated with exit code 1
NODE1 cannot reach NODE2
command terminated with exit code 1
NODE1 cannot reach NODE3
=> End network overlay test
```
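To confirm at the host level whether overlay traffic arrives at all, you can capture packets while the test runs. A sketch, assuming Canal/Flannel with VXLAN on its default UDP port 8472 (the interface name may differ on your hosts):
```
# Run on the receiving host while the overlay test executes elsewhere;
# seeing no packets here points at blocked UDP port 8472 between the hosts
tcpdump -ni eth0 udp port 8472
```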
Clean up the overlaytest DaemonSet by running `kubectl delete ds/overlaytest`.
### Resolved issues
#### Overlay network broken when using Canal/Flannel due to missing node annotations