From e6fe8a24ee7bad6d9252bba0f14ee73e935e3921 Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 28 Jul 2020 14:58:57 +0100 Subject: [PATCH 1/6] synchronised and updated overlay test RKE and Rancher had different versions of the test. I synced them and changed them so that they utilize Murali's swiss-army-knife which allows much more freedom for troubleshooting. --- .../v2.x/en/troubleshooting/networking/_index.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index 7259b61a3e0..23e6e6bed02 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -15,7 +15,7 @@ Double check if all the [required ports]({{}}/rancher/v2.x/en/cluster-p The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod. -To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `busybox` container on every host, which we will use to run a `ping` test between containers on all hosts. +To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers), which we will use to run a `ping` test between containers on all hosts. 1. 
Save the following file as `ds-overlaytest.yml` @@ -34,11 +34,16 @@ To test the overlay network, you can launch the following `DaemonSet` definition name: overlaytest spec: tolerations: - - operator: Exists + - effect: NoExecute + key: "node-role.kubernetes.io/etcd" + value: "true" + - effect: NoSchedule + key: "node-role.kubernetes.io/controlplane" + value: "true" containers: - - image: busybox:1.28 + - image: leodotcloud/swiss-army-knife imagePullPolicy: Always - name: busybox + name: swiss-army-knife command: ["sh", "-c", "tail -f /dev/null"] terminationMessagePath: /dev/termination-log ``` From 6d966738cf9359aa049113e85739d0d720d748d4 Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 28 Jul 2020 15:35:28 +0100 Subject: [PATCH 2/6] Update content/rancher/v2.x/en/troubleshooting/networking/_index.md Co-authored-by: Murali Paluru --- content/rancher/v2.x/en/troubleshooting/networking/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index 23e6e6bed02..dd14ebd0a49 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -31,7 +31,7 @@ To test the overlay network, you can launch the following `DaemonSet` definition template: metadata: labels: - name: overlaytest + name: ds-overlaytest spec: tolerations: - effect: NoExecute From 07b8d7073da842ec93bdb2d0d3d3c13f09db37fd Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 28 Jul 2020 15:35:44 +0100 Subject: [PATCH 3/6] Update content/rancher/v2.x/en/troubleshooting/networking/_index.md Co-authored-by: Murali Paluru --- content/rancher/v2.x/en/troubleshooting/networking/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index 
dd14ebd0a49..fc2fb59144f 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -43,7 +43,7 @@ To test the overlay network, you can launch the following `DaemonSet` definition containers: - image: leodotcloud/swiss-army-knife imagePullPolicy: Always - name: swiss-army-knife + name: overlaytest command: ["sh", "-c", "tail -f /dev/null"] terminationMessagePath: /dev/termination-log ``` From 050b03b36582e73cddeec12a7d9e3dc3129b427b Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 28 Jul 2020 16:02:38 +0100 Subject: [PATCH 4/6] Fixed the command to run with new name and added debug --- .../en/troubleshooting/networking/_index.md | 78 ++++++++++++++++++- 1 file changed, 75 insertions(+), 3 deletions(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index fc2fb59144f..0115725e433 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -10,7 +10,6 @@ Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG ### Double check if all the required ports are opened in your (host) firewall Double check if all the [required ports]({{}}/rancher/v2.x/en/cluster-provisioning/node-requirements/#networking-requirements/) are opened in your (host) firewall. The overlay network uses UDP in comparison to all other required ports which are TCP. - ### Check if overlay network is functioning correctly The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod. 
@@ -53,7 +52,79 @@ To test the overlay network, you can launch the following `DaemonSet` definition 4. Run the following command, from the same location, to let each container on every host ping each other (it's a single line bash command). ``` - echo "=> Start network overlay test"; kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test" + ### Check if overlay network is functioning correctly + +The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod. + +To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers), which we will use to run a `ping` test between containers on all hosts. + +1. 
Save the following file as `ds-overlaytest.yml` + + ``` + apiVersion: apps/v1 + kind: DaemonSet + metadata: + name: overlaytest + spec: + selector: + matchLabels: + name: overlaytest + template: + metadata: + labels: + name: ds-overlaytest + spec: + tolerations: + - effect: NoExecute + key: "node-role.kubernetes.io/etcd" + value: "true" + - effect: NoSchedule + key: "node-role.kubernetes.io/controlplane" + value: "true" + containers: + - image: leodotcloud/swiss-army-knife + imagePullPolicy: Always + name: overlaytest + command: ["sh", "-c", "tail -f /dev/null"] + terminationMessagePath: /dev/termination-log + ``` + +2. Launch it using `kubectl create -f ds-overlaytest.yml` +3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`. +4. Run the following command, from the same location, to let each container on every host ping each other (it's a single line bash command). + + ``` + echo "=> Start network overlay test"; kubectl get pods -l name=ds-overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=ds-overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test" + ``` + +5. 
When this command has finished running, the output indicating everything is correct is: + + ``` + => Start network overlay test + Pinging from pod overlaytest-4cpx5 on host NODE1 + to pod ip 10.42.1.29 on host NODE1 + to pod ip 10.42.2.19 on host NODE2 + to pod ip 10.42.4.12 on host NODE3 + Pinging from pod overlaytest-vms6w on host NODE2 + to pod ip 10.42.1.29 on host NODE1 + to pod ip 10.42.2.19 on host NODE2 + to pod ip 10.42.4.12 on host NODE3 + => End network overlay test + ``` + +If you see error in the output, that means that the [required ports]({{}}/rancher/v2.x/en/cluster-provisioning/node-requirements/#networking-requirements/) for overlay networking are not opened between the hosts indicated. + +If a path fails the overlay test, you will see errors like the following: + +``` +command terminated with exit code 1 +NODE2 cannot reach NODE1 +command terminated with exit code 1 +NODE3 cannot reach NODE1 +``` + +Cleanup the DaemonSet by running `kubectl delete ds/overlaytest`. + ``` 5. When this command has finished running, the output indicating everything is correct is: @@ -80,7 +151,8 @@ NODE1 cannot reach NODE3 => End network overlay test ``` -Cleanup the busybox DaemonSet by running `kubectl delete ds/overlaytest`. +Cleanup the DaemonSet by running `kubectl delete ds/overlaytest`. 
+ ### Check if MTU is correctly configured on hosts and on peering/tunnel appliances/devices From 43f4eadad3d25b53823d5b2b9d34f301ebdc8118 Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 28 Jul 2020 16:05:43 +0100 Subject: [PATCH 5/6] Update _index.md --- content/rancher/v2.x/en/troubleshooting/networking/_index.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index 0115725e433..76dd5b7cd71 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -119,8 +119,6 @@ If a path fails the overlay test, you will see errors like the following: ``` command terminated with exit code 1 NODE2 cannot reach NODE1 -command terminated with exit code 1 -NODE3 cannot reach NODE1 ``` Cleanup the DaemonSet by running `kubectl delete ds/overlaytest`. From e19ee79d9636ce80aabcdcf202bb29705443c086 Mon Sep 17 00:00:00 2001 From: Tejeev Date: Tue, 17 Nov 2020 18:24:10 +0000 Subject: [PATCH 6/6] Overlay test updated Standardized names and etc. Used `operator: Exists` toleration from the rancher overlay test rather than the no schedule and execute from the RKE. Added some verbosity --- .../en/troubleshooting/networking/_index.md | 142 +++++------------- 1 file changed, 39 insertions(+), 103 deletions(-) diff --git a/content/rancher/v2.x/en/troubleshooting/networking/_index.md b/content/rancher/v2.x/en/troubleshooting/networking/_index.md index 76dd5b7cd71..d476c1695ee 100644 --- a/content/rancher/v2.x/en/troubleshooting/networking/_index.md +++ b/content/rancher/v2.x/en/troubleshooting/networking/_index.md @@ -14,9 +14,9 @@ Double check if all the [required ports]({{}}/rancher/v2.x/en/cluster-p The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. 
This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod. -To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers), which we will use to run a `ping` test between containers on all hosts. +To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers and can be found here: https://github.com/leodotcloud/swiss-army-knife), which we will use to run a `ping` test between containers on all hosts. -1. Save the following file as `ds-overlaytest.yml` +1. Save the following file as `overlaytest.yml` ``` apiVersion: apps/v1 @@ -30,126 +30,62 @@ To test the overlay network, you can launch the following `DaemonSet` definition template: metadata: labels: - name: ds-overlaytest + name: overlaytest spec: tolerations: - - effect: NoExecute - key: "node-role.kubernetes.io/etcd" - value: "true" - - effect: NoSchedule - key: "node-role.kubernetes.io/controlplane" - value: "true" + - operator: Exists containers: - image: leodotcloud/swiss-army-knife imagePullPolicy: Always name: overlaytest command: ["sh", "-c", "tail -f /dev/null"] terminationMessagePath: /dev/termination-log + ``` -2. Launch it using `kubectl create -f ds-overlaytest.yml` +2. Launch it using `kubectl create -f overlaytest.yml` 3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`. -4. Run the following command, from the same location, to let each container on every host ping each other (it's a single line bash command). - +4. Run the following script, from the same location. 
It will have each `overlaytest` container on every host ping each other: ``` - ### Check if overlay network is functioning correctly - -The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod. - -To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers), which we will use to run a `ping` test between containers on all hosts. - -1. Save the following file as `ds-overlaytest.yml` - - ``` - apiVersion: apps/v1 - kind: DaemonSet - metadata: - name: overlaytest - spec: - selector: - matchLabels: - name: overlaytest - template: - metadata: - labels: - name: ds-overlaytest - spec: - tolerations: - - effect: NoExecute - key: "node-role.kubernetes.io/etcd" - value: "true" - - effect: NoSchedule - key: "node-role.kubernetes.io/controlplane" - value: "true" - containers: - - image: leodotcloud/swiss-army-knife - imagePullPolicy: Always - name: overlaytest - command: ["sh", "-c", "tail -f /dev/null"] - terminationMessagePath: /dev/termination-log + #!/bin/bash + echo "=> Start network overlay test" + kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | + while read spod shost + do kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | + while read tip thost + do kubectl --request-timeout='10s' exec $spod -c overlaytest -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1" + RC=$? 
+ if [ $RC -ne 0 ] + then echo FAIL: $spod on $shost cannot reach pod IP $tip on $thost + else echo $shost can reach $thost + fi + done + done + echo "=> End network overlay test" ``` -2. Launch it using `kubectl create -f ds-overlaytest.yml` -3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`. -4. Run the following command, from the same location, to let each container on every host ping each other (it's a single line bash command). - - ``` - echo "=> Start network overlay test"; kubectl get pods -l name=ds-overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | while read spod shost; do kubectl get pods -l name=ds-overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | while read tip thost; do kubectl --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $shost cannot reach $thost; fi; done; done; echo "=> End network overlay test" - ``` - -5. When this command has finished running, the output indicating everything is correct is: +5. 
When the script has finished running, it prints the state of each route: ``` => Start network overlay test - Pinging from pod overlaytest-4cpx5 on host NODE1 - to pod ip 10.42.1.29 on host NODE1 - to pod ip 10.42.2.19 on host NODE2 - to pod ip 10.42.4.12 on host NODE3 - Pinging from pod overlaytest-vms6w on host NODE2 - to pod ip 10.42.1.29 on host NODE1 - to pod ip 10.42.2.19 on host NODE2 - to pod ip 10.42.4.12 on host NODE3 + Error from server (NotFound): pods "wk2" not found + FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.7.3 on wk2 + Error from server (NotFound): pods "wk2" not found + FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.0.5 on cp1 + Error from server (NotFound): pods "wk2" not found + FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.2.12 on wk1 + command terminated with exit code 1 + FAIL: overlaytest-v4qkl on cp1 cannot reach pod IP 10.42.7.3 on wk2 + cp1 can reach cp1 + cp1 can reach wk1 + command terminated with exit code 1 + FAIL: overlaytest-xpxwp on wk1 cannot reach pod IP 10.42.7.3 on wk2 + wk1 can reach cp1 + wk1 can reach wk1 => End network overlay test ``` - -If you see error in the output, that means that the [required ports]({{}}/rancher/v2.x/en/cluster-provisioning/node-requirements/#networking-requirements/) for overlay networking are not opened between the hosts indicated. 
- -Example error output of a situation where NODE1 had the UDP ports blocked. - -``` -=> Start network overlay test -command terminated with exit code 1 -NODE2 cannot reach NODE1 -command terminated with exit code 1 -NODE3 cannot reach NODE1 -command terminated with exit code 1 -NODE1 cannot reach NODE2 -command terminated with exit code 1 -NODE1 cannot reach NODE3 -=> End network overlay test -``` - -Cleanup the DaemonSet by running `kubectl delete ds/overlaytest`. + If you see an error in the output, there is a problem with the route between the pods on the two hosts. In the output above, the node `wk2` has no connectivity over the overlay network. This could be because the [required ports]({{}}/rancher/v2.x/en/cluster-provisioning/node-requirements/#networking-requirements/) for overlay networking are not open for `wk2`. +6. You can now clean up the DaemonSet by running `kubectl delete ds/overlaytest`. ### Check if MTU is correctly configured on hosts and on peering/tunnel appliances/devices