Add v2.14 preview docs (#2212)

This commit is contained in:
Lucas Saintarbor
2026-03-05 12:30:57 -08:00
committed by GitHub
parent 4a0d71b3f3
commit 2dcfa6f6b8
874 changed files with 92618 additions and 0 deletions
@@ -0,0 +1,224 @@
---
title: DNS
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/dns"/>
</head>
The commands/steps listed on this page can be used to check name resolution issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
Before running the DNS checks, check the [default DNS provider](../../reference-guides/cluster-configuration/rancher-server-configuration/rke1-cluster-configuration.md#default-dns-provider) for your cluster and make sure that [the overlay network is functioning correctly](networking.md#check-if-overlay-network-is-functioning-correctly), because a broken overlay network can also cause DNS resolution to fail, fully or partly.
## Check if DNS pods are running
```
kubectl -n kube-system get pods -l k8s-app=kube-dns
```
Example output when using CoreDNS:
```
NAME READY STATUS RESTARTS AGE
coredns-799dffd9c4-6jhlz 1/1 Running 0 76m
```
Example output when using kube-dns:
```
NAME READY STATUS RESTARTS AGE
kube-dns-5fd74c7488-h6f7n 3/3 Running 0 4m13s
```
## Check if the DNS service is present with the correct cluster-ip
```
kubectl -n kube-system get svc -l k8s-app=kube-dns
```
```
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP 4m13s
```
## Check if domain names are resolving
Check if internal cluster names are resolving (in this example, `kubernetes.default`). The IP shown after `Server:` should be the same as the `CLUSTER-IP` from the `kube-dns` service.
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
```
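The comparison between the `Server:` line and the service's `CLUSTER-IP` can also be scripted. The following is a minimal sketch, not part of Rancher: `dns_server_matches` is a hypothetical helper name, and it assumes the busybox `nslookup` output format shown above.

```shell
# Hypothetical helper (not a Rancher or kubectl command).
# Reads nslookup output on stdin and succeeds only when the "Server:" line
# names the expected DNS service IP passed as the first argument.
dns_server_matches() {
  expected_ip="$1"
  awk -v ip="$expected_ip" '
    $1 == "Server:" { found = 1; ok = ($2 == ip) }
    END { exit !(found && ok) }   # exit 0 only if a matching Server line was seen
  '
}
```

For example, the `nslookup` output could be piped into `dns_server_matches "$(kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')"`. A non-zero exit status would then suggest that the pod is not using the cluster DNS service.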
Check if external names are resolving (in this example, `www.google.com`):
```
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup www.google.com
```
Example output:
```
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: www.google.com
Address 1: 2a00:1450:4009:80b::2004 lhr35s04-in-x04.1e100.net
Address 2: 216.58.211.100 ams15s32-in-f4.1e100.net
pod "busybox" deleted
```
To check name resolution on every host, follow these steps:
1. Save the following file as `ds-dnstest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnstest
spec:
  selector:
    matchLabels:
      name: dnstest
  template:
    metadata:
      labels:
        name: dnstest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: busybox:1.28
        imagePullPolicy: Always
        name: alpine
        command: ["sleep", "infinity"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f ds-dnstest.yml`
3. Wait until `kubectl rollout status ds/dnstest -w` returns: `daemon set "dnstest" successfully rolled out`.
4. Configure the environment variable `DOMAIN` to a fully qualified domain name (FQDN) that the host should be able to resolve (`www.google.com` is used as an example) and run the following command to let each container on every host resolve the configured domain name (it's a single line command).
```
export DOMAIN=www.google.com; echo "=> Start DNS resolve test"; kubectl get pods -l name=dnstest --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do kubectl exec $pod -- /bin/sh -c "nslookup $DOMAIN > /dev/null 2>&1"; RC=$?; if [ $RC -ne 0 ]; then echo $host cannot resolve $DOMAIN; fi; done; echo "=> End DNS resolve test"
```
5. When this command has finished running, the output indicating everything is correct is:
```
=> Start DNS resolve test
=> End DNS resolve test
```
If you see an error in the output, the mentioned hosts are not able to resolve the given FQDN.
Example error output for a situation where the host with IP 209.97.182.150 had its UDP ports blocked:
```
=> Start DNS resolve test
command terminated with exit code 1
209.97.182.150 cannot resolve www.google.com
=> End DNS resolve test
```
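On clusters with many nodes, the failing hosts can be pulled out of the test output instead of being read by eye. This is a small sketch under the assumption that the output lines keep the `<host> cannot resolve <domain>` shape produced by the command above; `failed_dns_hosts` is a hypothetical name.

```shell
# Hypothetical helper: reads the DNS resolve test output on stdin and prints
# each host IP that reported "cannot resolve", one per line, without duplicates.
failed_dns_hosts() {
  awk '$2 == "cannot" && $3 == "resolve" { print $1 }' | sort -u
}
```

One way to use it would be to save the test output to a file and run `failed_dns_hosts < dnstest.log`, where `dnstest.log` is an illustrative file name.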
Clean up the `dnstest` DaemonSet by running `kubectl delete ds/dnstest`.
## CoreDNS specific
### Check CoreDNS logging
```
kubectl -n kube-system logs -l k8s-app=kube-dns
```
### Check configuration
CoreDNS configuration is stored in the configmap `coredns` in the `kube-system` namespace.
```
kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}
```
### Check upstream nameservers in resolv.conf
By default, the configured nameservers on the host (in `/etc/resolv.conf`) will be used as upstream nameservers for CoreDNS. You can check this file on the host or run the following Pod with `dnsPolicy` set to `Default`, which will inherit the `/etc/resolv.conf` from the host it is running on.
```
kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'
```
### Enable query logging
Enabling query logging can be done by enabling the [log plugin](https://coredns.io/plugins/log/) in the Corefile configuration in the configmap `coredns`. You can do so by using `kubectl -n kube-system edit configmap coredns` or use the command below to replace the configuration in place:
```
kubectl get configmap -n kube-system coredns -o json | sed -e 's_loadbalance_log\\n loadbalance_g' | kubectl apply -f -
```
All queries will now be logged and can be checked using the command in [Check CoreDNS logging](#check-coredns-logging).
## kube-dns specific
### Check upstream nameservers in kubedns container
By default, the configured nameservers on the host (in `/etc/resolv.conf`) will be used as upstream nameservers for kube-dns. Sometimes the host will run a local caching DNS nameserver, which means the address in `/etc/resolv.conf` will point to an address in the loopback range (`127.0.0.0/8`) which will be unreachable by the container. In case of Ubuntu 18.04, this is done by `systemd-resolved`. We detect if `systemd-resolved` is running, and will automatically use the `/etc/resolv.conf` file with the correct upstream nameservers (which is located at `/run/systemd/resolve/resolv.conf`).
Use the following command to check the upstream nameservers used by the kubedns container:
```
kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP | while read pod host; do echo "Pod ${pod} on host ${host}"; kubectl -n kube-system exec $pod -c kubedns -- cat /etc/resolv.conf; done
```
Example output:
```
Pod kube-dns-667c7cb9dd-z4dsf on host x.x.x.x
nameserver 1.1.1.1
nameserver 8.8.4.4
```
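Loopback upstreams in output like the above can also be detected with a small filter. This is an illustrative sketch only: `loopback_nameservers` is a hypothetical helper, and it checks just the IPv4 loopback range (`127.0.0.0/8`), not the IPv6 address `::1`.

```shell
# Hypothetical helper: reads resolv.conf-style text on stdin and prints every
# nameserver address in the IPv4 loopback range (127.0.0.0/8).
# Exit status is 0 when at least one loopback nameserver was found.
loopback_nameservers() {
  awk '
    $1 == "nameserver" && $2 ~ /^127\./ { print $2; hit = 1 }
    END { exit !hit }
  '
}
```

Lines that do not start with `nameserver`, such as the `Pod ... on host ...` headers printed by the command above, are ignored, so the whole output can be piped into the helper.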
If the output shows an address in the loopback range (`127.0.0.0/8`), you can correct this in two ways:
* Make sure the correct nameservers are listed in `/etc/resolv.conf` on the nodes in your cluster. Consult your operating system documentation on how to do this. Make the change before provisioning a cluster, or reboot the nodes after making the modification.
* Configure the `kubelet` to use a different file for resolving names, by using `extra_args` as shown below (where `/run/resolvconf/resolv.conf` is the file with the correct nameservers):
```
services:
  kubelet:
    extra_args:
      resolv-conf: "/run/resolvconf/resolv.conf"
```
:::note
As the `kubelet` is running inside a container, files located in `/etc` and `/usr` on the host are found under `/host/etc` and `/host/usr` inside the `kubelet` container.
:::
See [Editing Cluster as YAML](../../reference-guides/cluster-configuration/rancher-server-configuration/rke1-cluster-configuration.md#editing-clusters-with-yaml) for how to apply this change. When the provisioning of the cluster has finished, you have to remove the kube-dns pod to activate the new setting in the pod:
```
kubectl delete pods -n kube-system -l k8s-app=kube-dns
pod "kube-dns-5fd74c7488-6pwsf" deleted
```
Try to resolve the name again using [Check if domain names are resolving](#check-if-domain-names-are-resolving).
If you want to check the kube-dns configuration in your cluster (for example, to check if there are different upstream nameservers configured), you can run the following command to list the kube-dns configuration:
```
kubectl -n kube-system get configmap kube-dns -o go-template='{{range $key, $value := .data}}{{ $key }}{{":"}}{{ $value }}{{"\n"}}{{end}}'
```
Example output:
```
upstreamNameservers:["1.1.1.1"]
```
@@ -0,0 +1,33 @@
---
title: Rotation of Expired Webhook Certificates
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/expired-webhook-certificate-rotation"/>
</head>
Certain Rancher versions that have `rancher-webhook` installed created certificates that expire after one year. If the certificate did not renew, you need to rotate your webhook certificate.
In Rancher v2.6.3 and later, `rancher-webhook` deployments automatically renew their TLS certificate when it is within 30 days of its expiration date. If you are using v2.6.2 or earlier, there are two methods to work around this issue:
## 1. Users with Cluster Access, Run the Following Commands:
```
kubectl delete secret -n cattle-system cattle-webhook-tls
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io --ignore-not-found=true rancher.cattle.io
kubectl delete pod -n cattle-system -l app=rancher-webhook
```
## 2. Users with No Cluster Access Via `kubectl`:
1. Delete the `cattle-webhook-tls` secret in the `cattle-system` namespace in the local cluster.
2. Delete the `rancher.cattle.io` mutating webhook.
3. Delete the `rancher-webhook` pod in the `cattle-system` namespace in the local cluster.
:::note
The webhook certificate expiration issue is not specific to `cattle-webhook-tls` as listed in the examples. Substitute the name of your expired certificate secret accordingly.
:::
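Before rotating, it can help to confirm that the certificate really is expired or about to expire. The check below is a sketch rather than a Rancher tool: `expires_within_days` is a hypothetical helper that works on Unix epoch seconds, and the 30-day window simply mirrors the renewal window described above.

```shell
# Hypothetical helper: succeeds when a certificate that expires at NOT_AFTER
# (Unix epoch seconds) is already expired or within DAYS days of expiring.
# The third argument (current time) defaults to now and exists for testing.
expires_within_days() {
  not_after="$1"
  days="$2"
  now="${3:-$(date +%s)}"
  [ $(( not_after - now )) -le $(( days * 86400 )) ]
}
```

One way to read the expiry date, assuming the secret stores the certificate under the `tls.crt` key and `openssl` is available locally, is `kubectl -n cattle-system get secret cattle-webhook-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate`. Converting that date to epoch seconds depends on the `date` implementation on your system.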
@@ -0,0 +1,259 @@
---
title: Kubernetes Resources
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/kubernetes-resources"/>
</head>
The commands/steps listed on this page can be used to check the most important Kubernetes resources and apply to [Rancher Launched Kubernetes](../../how-to-guides/new-user-guides/launch-kubernetes-with-rancher/launch-kubernetes-with-rancher.md) clusters.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
## Nodes
### Get nodes
Run the command below and check the following:
- All nodes in your cluster should be listed. Make sure that none are missing.
- All nodes should have the **Ready** status. If a node is not in the **Ready** state, check the `kubelet` container logs on that node using `docker logs kubelet`.
- All nodes should report the correct version.
- The OS, kernel, and Docker values should be shown as expected. Issues can sometimes be traced back to an upgraded OS, kernel, or Docker version.
```
kubectl get nodes -o wide
```
Example output:
```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
controlplane-0 Ready controlplane 31m v1.13.5 138.68.188.91 <none> Ubuntu 18.04.2 LTS 4.15.0-47-generic docker://18.9.5
etcd-0 Ready etcd 31m v1.13.5 138.68.180.33 <none> Ubuntu 18.04.2 LTS 4.15.0-47-generic docker://18.9.5
worker-0 Ready worker 30m v1.13.5 139.59.179.88 <none> Ubuntu 18.04.2 LTS 4.15.0-47-generic docker://18.9.5
```
### Get node conditions
Run the command below to list nodes with their [Node Conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition):
```
kubectl get nodes -o go-template='{{range .items}}{{$node := .}}{{range .status.conditions}}{{$node.metadata.name}}{{": "}}{{.type}}{{":"}}{{.status}}{{"\n"}}{{end}}{{end}}'
```
Run the command below to list only the active [Node Conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition) that could prevent normal operation:
```
kubectl get nodes -o go-template='{{range .items}}{{$node := .}}{{range .status.conditions}}{{if ne .type "Ready"}}{{if eq .status "True"}}{{$node.metadata.name}}{{": "}}{{.type}}{{":"}}{{.status}}{{"\n"}}{{end}}{{else}}{{if ne .status "True"}}{{$node.metadata.name}}{{": "}}{{.type}}{{": "}}{{.status}}{{"\n"}}{{end}}{{end}}{{end}}{{end}}'
```
Example output:
```
worker-0: DiskPressure:True
```
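The filtering rule inside the second go-template is easier to follow when written out separately: a condition is a problem when `Ready` is anything other than `True`, or when any other condition type is `True`. The sketch below applies the same rule to the `node: Type:Status` lines produced by the first command; `problem_conditions` is a hypothetical helper name.

```shell
# Hypothetical helper: reads lines such as "worker-0: DiskPressure:False" on
# stdin (the output format of the first command above) and prints only the
# conditions that indicate a problem.
problem_conditions() {
  awk '{
    split($2, c, ":")   # c[1] = condition type, c[2] = condition status
    if ((c[1] == "Ready" && c[2] != "True") || (c[1] != "Ready" && c[2] == "True"))
      print
  }'
}
```

This assumes node names contain no spaces, which holds for valid Kubernetes node names.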
## Kubernetes leader election
### Kubernetes Controller Manager leader
The leader is determined by a leader election process. After the leader has been determined, the leader (`holderIdentity`) is saved in the `kube-controller-manager` endpoint (in this example, `controlplane-0`).
```
kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
{"holderIdentity":"controlplane-0_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","leaseDurationSeconds":15,"acquireTime":"2018-12-27T08:59:45Z","renewTime":"2018-12-27T09:44:57Z","leaderTransitions":0}
```
### Kubernetes Scheduler leader
The leader is determined by a leader election process. After the leader has been determined, the leader (`holderIdentity`) is saved in the `kube-scheduler` endpoint (in this example, `controlplane-0`).
```
kubectl -n kube-system get endpoints kube-scheduler -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
{"holderIdentity":"controlplane-0_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","leaseDurationSeconds":15,"acquireTime":"2018-12-27T08:59:45Z","renewTime":"2018-12-27T09:44:57Z","leaderTransitions":0}
```
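To get just the node name of the current leader out of the annotation, the JSON can be trimmed with `sed`. This is a sketch that assumes the `holderIdentity` value has the `<node>_<id>` shape seen in the example output; `leader_node` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the leader annotation JSON on stdin and prints the
# part of holderIdentity before the first underscore (the node name in the
# example output above). Prints nothing when no holderIdentity is present.
leader_node() {
  sed -n 's/.*"holderIdentity":"\([^"_]*\).*/\1/p'
}
```

Piping the output of either `kubectl ... -o jsonpath=...` command above into `leader_node` would print `controlplane-0` for the example annotation.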
## Ingress Controller
The default Ingress Controller is NGINX and is deployed as a DaemonSet in the `ingress-nginx` namespace. The pods are only scheduled to nodes with the `worker` role.
Check if the pods are running on all nodes:
```
kubectl -n ingress-nginx get pods -o wide
```
Example output:
```
kubectl -n ingress-nginx get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
default-http-backend-797c5bc547-kwwlq 1/1 Running 0 17m x.x.x.x worker-1
nginx-ingress-controller-4qd64 1/1 Running 0 14m x.x.x.x worker-1
nginx-ingress-controller-8wxhm 1/1 Running 0 13m x.x.x.x worker-0
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
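The three checks in the previous paragraph (status, ready count, restarts) can be applied mechanically to `kubectl get pods --no-headers` output. The following sketch does that; `unhealthy_pods` is a hypothetical helper, and the default restart threshold of 5 is an arbitrary assumption, not a Rancher recommendation.

```shell
# Hypothetical helper: reads "kubectl get pods --no-headers" output on stdin
# (columns: NAME READY STATUS RESTARTS ...) and prints the name of each pod
# that is not Running, is not fully ready, or has restarted more than the
# given number of times (first argument, default 5).
unhealthy_pods() {
  max_restarts="${1:-5}"
  awk -v max="$max_restarts" '{
    split($2, r, "/")   # r[1] = ready containers, r[2] = total containers
    if ($3 != "Running" || r[1] != r[2] || $4 + 0 > max)
      print $1
  }'
}
```

For example: `kubectl -n ingress-nginx get pods --no-headers | unhealthy_pods 3`. Note that pods of finished Jobs (status `Completed`) would also be listed by this sketch.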
### Pod details
```
kubectl -n ingress-nginx describe pods -l app=ingress-nginx
```
### Pod container logs
The command below shows the logs of all pods labeled `app=ingress-nginx`. Because a label selector is used, `kubectl logs` prints only the last 10 lines per pod by default. Refer to the `--tail` option in `kubectl logs -h` for more information.
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```
If the full log is needed, specify the pod name in the following command:
```
kubectl -n ingress-nginx logs <pod name>
```
### Namespace events
```
kubectl -n ingress-nginx get events
```
### Debug logging
To enable debug logging:
```
kubectl -n ingress-nginx patch ds nginx-ingress-controller --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--v=5"}]'
```
### Check configuration
Retrieve generated configuration in each pod:
```
kubectl -n ingress-nginx get pods -l app=ingress-nginx --no-headers -o custom-columns=.NAME:.metadata.name | while read pod; do kubectl -n ingress-nginx exec $pod -- cat /etc/nginx/nginx.conf; done
```
## Rancher agents
Communication to the cluster (Kubernetes API via `cattle-cluster-agent`) and communication to the nodes (cluster provisioning via `cattle-node-agent`) is done through Rancher agents.
### cattle-node-agent
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
Check logging of a specific cattle-node-agent pod or all cattle-node-agent pods:
```
kubectl -n cattle-system logs -l app=cattle-agent
```
### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check logging of cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
## Jobs and Pods
### Check that pods or jobs have status **Running**/**Completed**
To check, run the command:
```
kubectl get pods --all-namespaces
```
If a pod is not in **Running** state, you can dig into the root cause by running:
### Describe pod
```
kubectl describe pod POD_NAME -n NAMESPACE
```
### Pod container logs
```
kubectl logs POD_NAME -n NAMESPACE
```
If a job is not in **Completed** state, you can dig into the root cause by running:
### Describe job
```
kubectl describe job JOB_NAME -n NAMESPACE
```
### Logs from the containers of pods of the job
```
kubectl logs -l job-name=JOB_NAME -n NAMESPACE
```
### Evicted pods
Pods can be evicted based on [eviction signals](https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy).
Retrieve a list of evicted pods (podname and namespace):
```
kubectl get pods --all-namespaces -o go-template='{{range .items}}{{if eq .status.phase "Failed"}}{{if eq .status.reason "Evicted"}}{{.metadata.name}}{{" "}}{{.metadata.namespace}}{{"\n"}}{{end}}{{end}}{{end}}'
```
To delete all evicted pods:
```
kubectl get pods --all-namespaces -o go-template='{{range .items}}{{if eq .status.phase "Failed"}}{{if eq .status.reason "Evicted"}}{{.metadata.name}}{{" "}}{{.metadata.namespace}}{{"\n"}}{{end}}{{end}}{{end}}' | while read epod enamespace; do kubectl -n $enamespace delete pod $epod; done
```
Retrieve a list of evicted pods, scheduled node and the reason:
```
kubectl get pods --all-namespaces -o go-template='{{range .items}}{{if eq .status.phase "Failed"}}{{if eq .status.reason "Evicted"}}{{.metadata.name}}{{" "}}{{.metadata.namespace}}{{"\n"}}{{end}}{{end}}{{end}}' | while read epod enamespace; do kubectl -n $enamespace get pod $epod -o=custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,MSG:.status.message; done
```
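When many pods have been evicted, a per-namespace count can show where the pressure is concentrated. The sketch below summarizes the `podname namespace` lines printed by the first command in this section; `evicted_per_namespace` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "podname namespace" lines on stdin and prints
# "namespace count" for each namespace, sorted by namespace name.
evicted_per_namespace() {
  awk '{ count[$2]++ } END { for (ns in count) print ns, count[ns] }' | sort
}
```

It can be appended to the list command with a pipe, in the same way as the `while read` loops above.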
### Job does not complete
If you have enabled Istio and a Job you deployed does not complete, you will need to add an annotation to your pod using [these steps](../../how-to-guides/advanced-user-guides/istio-setup-guide/enable-istio-in-namespace.md).
Since Istio Sidecars run indefinitely, a Job cannot be considered complete even after its task has completed. This is a temporary workaround and will disable Istio for any traffic to/from the annotated Pod. Keep in mind this may not allow you to continue to use a Job for integration testing, as the Job will not have access to the service mesh.
@@ -0,0 +1,96 @@
---
title: Logging
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/logging"/>
</head>
## Log Levels
The following log levels are used in Rancher:
| Name | Description |
|---------|-------------|
| `info` | Logs informational messages. This is the default log level. |
| `debug` | Logs more detailed messages that can be used to debug. |
| `trace` | Logs very detailed messages on internal functions. This is very verbose and can contain sensitive information. |
### How to Configure a Log Level
#### Kubernetes Install
* Configure debug log level
```
$ export KUBECONFIG=./kube_config_cluster.yml
$ kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name | while read rancherpod; do kubectl -n cattle-system exec $rancherpod -c rancher -- loglevel --set debug; done
OK
OK
OK
$ kubectl -n cattle-system logs -l app=rancher -c rancher
```
* Configure info log level
```
$ export KUBECONFIG=./kube_config_cluster.yml
$ kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name | while read rancherpod; do kubectl -n cattle-system exec $rancherpod -c rancher -- loglevel --set info; done
OK
OK
OK
```
#### Docker Install
* Configure debug log level
```
$ docker exec -ti <container_id> loglevel --set debug
OK
$ docker logs -f <container_id>
```
* Configure info log level
```
$ docker exec -ti <container_id> loglevel --set info
OK
```
## Rancher Machine Debug Logs
If you need to troubleshoot the creation of objects in your infrastructure provider of choice, `rancher-machine`
debug logs might be helpful to you.
It's possible to enable debug logs for `rancher-machine` by setting environment variables when launching Rancher.
The `CATTLE_WHITELIST_ENVVARS` environment variable allows users to whitelist specific environment variables to be
passed down to `rancher-machine` during provisioning.
The `MACHINE_DEBUG` variable enables debug logs in `rancher-machine`.
Thus, by setting `MACHINE_DEBUG=true` and adding `MACHINE_DEBUG` to the default list of variables in
`CATTLE_WHITELIST_ENVVARS` (e.g. `CATTLE_WHITELIST_ENVVARS=HTTP_PROXY,HTTPS_PROXY,NO_PROXY,MACHINE_DEBUG`) it is
possible to enable debug logs in `rancher-machine` when provisioning RKE1, RKE2 and k3s clusters.
:::caution
Just like the `trace` log level above, `rancher-machine` debug logs can contain sensitive information.
:::
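Building the `CATTLE_WHITELIST_ENVVARS` value amounts to appending `MACHINE_DEBUG` to the existing comma-separated list without duplicating it. The sketch below shows that step in isolation; `append_envvar` is a hypothetical helper, and how the resulting value is passed to Rancher depends on your installation method.

```shell
# Hypothetical helper: prints the comma-separated LIST (first argument) with
# NAME (second argument) appended, unless NAME is already in the list.
append_envvar() {
  list="$1"
  name="$2"
  case ",$list," in
    *",$name,"*) printf '%s\n' "$list" ;;        # already present: unchanged
    ",,")        printf '%s\n' "$name" ;;        # empty list: just the name
    *)           printf '%s\n' "$list,$name" ;;  # otherwise append
  esac
}
```

With the default list from the example above, `append_envvar "HTTP_PROXY,HTTPS_PROXY,NO_PROXY" MACHINE_DEBUG` yields the value shown for `CATTLE_WHITELIST_ENVVARS`.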
## Cattle-cluster-agent Debug Logs
The `cattle-cluster-agent` log levels can be set when you initialize downstream clusters.
When you create a cluster, you can set variables under **Cluster Configuration > Agent Environment Vars** to define the log level.
- Trace-level logging: Set `CATTLE_TRACE` or `RANCHER_TRACE` to `true`
- Debug-level logging: Set `CATTLE_DEBUG` or `RANCHER_DEBUG` to `true`
:::caution
The `cattle-cluster-agent` debug logs may contain sensitive information.
:::
@@ -0,0 +1,110 @@
---
title: Networking
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/networking"/>
</head>
The commands/steps listed on this page can be used to check networking related issues in your cluster.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_cluster.yml` for Rancher HA) or are using the embedded kubectl via the UI.
## Double Check if All the Required Ports are Opened in Your (Host) Firewall
Double check if all the [required ports](../../how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/node-requirements-for-rancher-managed-clusters.md#networking-requirements) are opened in your (host) firewall. The overlay network uses UDP, whereas all other required ports use TCP.
## Check if Overlay Network is Functioning Correctly
The pod can be scheduled to any of the hosts you used for your cluster, but that means that the NGINX ingress controller needs to be able to route the request from `NODE_1` to `NODE_2`. This happens over the overlay network. If the overlay network is not functioning, you will experience intermittent TCP/HTTP connection failures due to the NGINX ingress controller not being able to route to the pod.
To test the overlay network, you can launch the following `DaemonSet` definition. This will run a `swiss-army-knife` container on every host (image was developed by Rancher engineers and can be found here: https://github.com/rancherlabs/swiss-army-knife), which we will use to run a `ping` test between containers on all hosts.
:::caution
The `swiss-army-knife` container does not support Windows nodes. It also [does not support ARM nodes](https://github.com/leodotcloud/swiss-army-knife/issues/18), such as a Raspberry Pi. When the test encounters incompatible nodes, this is recorded in the pod logs as an error message, such as `exec user process caused: exec format error` for ARM nodes, or `ImagePullBackOff (Back-off pulling image "rancherlabs/swiss-army-knife")` for Windows nodes.
:::
1. Save the following file as `overlaytest.yml`
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: overlaytest
spec:
  selector:
    matchLabels:
      name: overlaytest
  template:
    metadata:
      labels:
        name: overlaytest
    spec:
      tolerations:
      - operator: Exists
      containers:
      - image: rancherlabs/swiss-army-knife
        imagePullPolicy: Always
        name: overlaytest
        command: ["sleep", "infinity"]
        terminationMessagePath: /dev/termination-log
```
2. Launch it using `kubectl create -f overlaytest.yml`
3. Wait until `kubectl rollout status ds/overlaytest -w` returns: `daemon set "overlaytest" successfully rolled out`.
4. Run the following script, from the same location. It will have each `overlaytest` container on every host ping each other:
```
#!/bin/bash
echo "=> Start network overlay test"
kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' |
while read spod shost
do
  kubectl get pods -l name=overlaytest -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' |
  while read tip thost
  do
    kubectl --request-timeout='10s' exec $spod -c overlaytest -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"
    RC=$?
    if [ $RC -ne 0 ]
    then echo FAIL: $spod on $shost cannot reach pod IP $tip on $thost
    else echo $shost can reach $thost
    fi
  done
done
echo "=> End network overlay test"
```
5. When this command has finished running, it will output the state of each route:
```
=> Start network overlay test
Error from server (NotFound): pods "wk2" not found
FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.7.3 on wk2
Error from server (NotFound): pods "wk2" not found
FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.0.5 on cp1
Error from server (NotFound): pods "wk2" not found
FAIL: overlaytest-5bglp on wk2 cannot reach pod IP 10.42.2.12 on wk1
command terminated with exit code 1
FAIL: overlaytest-v4qkl on cp1 cannot reach pod IP 10.42.7.3 on wk2
cp1 can reach cp1
cp1 can reach wk1
command terminated with exit code 1
FAIL: overlaytest-xpxwp on wk1 cannot reach pod IP 10.42.7.3 on wk2
wk1 can reach cp1
wk1 can reach wk1
=> End network overlay test
```
If you see an error in the output, there is an issue with the route between the pods on the two hosts. In the above output, the node `wk2` has no connectivity over the overlay network. This could be because the [required ports](../../how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/node-requirements-for-rancher-managed-clusters.md#networking-requirements) for overlay networking are not opened for `wk2`.
6. You can now clean up the DaemonSet by running `kubectl delete ds/overlaytest`.
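On larger clusters the overlay test output gets long, so it can help to reduce it to the failing routes. The following sketch assumes the `FAIL: <pod> on <host> cannot reach pod IP <ip> on <host>` line format printed by the script above; `failed_routes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the overlay test output on stdin and prints one
# "source-host target-host" pair per FAIL line, ignoring all other lines.
failed_routes() {
  awk '$1 == "FAIL:" { print $4, $NF }'
}
```

A host that appears in every printed pair, such as `wk2` in the example output, is a likely place to start checking firewall rules.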
### Check if MTU is Correctly Configured on Hosts and on Peering/Tunnel Appliances/Devices
When the MTU is incorrectly configured (either on hosts running Rancher, nodes in created/imported clusters or on appliances/devices in between), error messages will be logged in Rancher and in the agents, similar to:
* `websocket: bad handshake`
* `Failed to connect to proxy`
* `read tcp: i/o timeout`
See [Google Cloud VPN: MTU Considerations](https://cloud.google.com/vpn/docs/concepts/mtu-considerations#gateway_mtu_vs_system_mtu) for an example of how to configure MTU correctly when using Google Cloud VPN between Rancher and cluster nodes.
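A common way to probe the usable MTU on a path is to send pings that may not be fragmented and see which payload size still gets through. For IPv4, the ICMP payload is the MTU minus 28 bytes (20 bytes of IP header plus 8 bytes of ICMP header). The helper below only does that arithmetic; `icmp_payload_for_mtu` is a hypothetical name, and the approach is a general networking technique rather than something the Rancher documentation prescribes.

```shell
# Hypothetical helper: prints the largest ICMP echo payload (in bytes) that
# fits into a single IPv4 packet for the given MTU.
# 28 = 20-byte IPv4 header + 8-byte ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}
```

On Linux hosts with the iputils `ping`, a test could look like `ping -M do -c 2 -s "$(icmp_payload_for_mtu 1500)" <node address>`, where `-M do` forbids fragmentation. If this fails while smaller payloads succeed, the path MTU is probably lower than assumed.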
@@ -0,0 +1,112 @@
---
title: Rancher HA
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/rancher-ha"/>
</head>
The commands/steps listed on this page can be used to check your Rancher Kubernetes Installation.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kube_config_cluster.yml`).
## Check Rancher Pods
Rancher pods are deployed as a Deployment in the `cattle-system` namespace.
Check if the pods are running on all nodes:
```
kubectl -n cattle-system get pods -l app=rancher -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
rancher-7dbd7875f7-n6t5t 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qbj5k 1/1 Running 0 8m x.x.x.x x.x.x.x
rancher-7dbd7875f7-qw7wb 1/1 Running 0 8m x.x.x.x x.x.x.x
```
If a pod is unable to run (Status is not **Running**, Ready status is not showing `1/1` or you see a high count of Restarts), check the pod details, logs and namespace events.
### Pod Details
```
kubectl -n cattle-system describe pods -l app=rancher
```
### Pod Container Logs
```
kubectl -n cattle-system logs -l app=rancher
```
### Namespace Events
```
kubectl -n cattle-system get events
```
## Check Ingress
Ingress should have the correct `HOSTS` (showing the configured FQDN) and `ADDRESS` (host address(es) it will be routed to).
```
kubectl -n cattle-system get ingress
```
Example output:
```
NAME HOSTS ADDRESS PORTS AGE
rancher rancher.yourdomain.com x.x.x.x,x.x.x.x,x.x.x.x 80, 443 2m
```
## Check Ingress Controller Logs
If accessing your configured Rancher FQDN does not show you the UI, check the ingress controller logging to see what happens when you try to access Rancher:
```
kubectl -n ingress-nginx logs -l app=ingress-nginx
```
## Leader Election
The leader is determined by a leader election process. After the leader has been determined, the leader (`holderIdentity`) is saved in the `cattle-controllers` Lease in the `kube-system` namespace (in this example, `rancher-dbc7ff869-gvg6k`).
```
kubectl -n kube-system get lease cattle-controllers
```
Example output:
```
NAME HOLDER AGE
cattle-controllers rancher-dbc7ff869-gvg6k 6h10m
```
### Configuration
_Available as of Rancher 2.8.3_
If the Kubernetes API experiences latency, the Rancher replica holding the leader lock may not be able to renew the lease before the lease becomes invalid, which can be observed in the Rancher logs:
```
E0629 04:13:07.293461 34 leaderelection.go:364] Failed to update lock: Put "https://172.17.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cattle-controllers?timeout=15m0s": context deadline exceeded
I0629 04:13:07.293594 34 leaderelection.go:280] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
...
2024/06/29 04:13:10 [FATAL] leaderelection lost for cattle-controllers
```
To mitigate this, you can set environment variables in the `rancher` Deployment to modify the default parameters for leader election:
- `CATTLE_ELECTION_LEASE_DURATION`: The [lease duration](https://pkg.go.dev/k8s.io/client-go/tools/leaderelection#LeaderElectionConfig.LeaseDuration). The default value is 45s.
- `CATTLE_ELECTION_RENEW_DEADLINE`: The [renew deadline](https://pkg.go.dev/k8s.io/client-go/tools/leaderelection#LeaderElectionConfig.RenewDeadline). The default value is 30s.
- `CATTLE_ELECTION_RETRY_PERIOD`: The [retry period](https://pkg.go.dev/k8s.io/client-go/tools/leaderelection#LeaderElectionConfig.RetryPeriod). The default value is 2s.
Example:
```
kubectl -n cattle-system set env deploy/rancher CATTLE_ELECTION_LEASE_DURATION=2m CATTLE_ELECTION_RENEW_DEADLINE=90s CATTLE_ELECTION_RETRY_PERIOD=10s
```
This will temporarily increase the lease duration, renew deadline and retry period to 120, 90 and 10 seconds respectively.
Alternatively, in order to make such changes permanent, these environment variables can be set by [using Helm values](../../getting-started/installation-and-upgrade/installation-references/helm-chart-options.md#setting-extra-environment-variables) instead.
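When choosing new values, keep in mind that the three parameters constrain each other. To my understanding of the client-go leader election package linked above, the lease duration must be greater than the renew deadline, and the renew deadline must be greater than 1.2 times the retry period. The sketch below checks a candidate combination before it is applied; `to_seconds` and `election_settings_ok` are hypothetical helpers that only understand whole-number values with a single `s`, `m`, or `h` suffix.

```shell
# Hypothetical helper: converts a simple duration such as 45s, 2m or 1h to
# seconds. Returns non-zero for any other suffix.
to_seconds() {
  case "$1" in
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo $(( ${1%s} )) ;;
    *)  return 1 ;;
  esac
}

# Hypothetical helper: succeeds when lease duration > renew deadline and
# renew deadline > 1.2 * retry period (compared as integers scaled by 10).
election_settings_ok() {
  lease=$(to_seconds "$1") || return 2
  renew=$(to_seconds "$2") || return 2
  retry=$(to_seconds "$3") || return 2
  [ "$lease" -gt "$renew" ] && [ $(( renew * 10 )) -gt $(( retry * 12 )) ]
}
```

Both the defaults (`election_settings_ok 45s 30s 2s`) and the example values above (`election_settings_ok 2m 90s 10s`) pass this check, whereas swapping the lease duration and renew deadline would not.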
@@ -0,0 +1,71 @@
---
title: Registered Clusters
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/registered-clusters"/>
</head>
The commands/steps listed on this page can be used to check clusters that you are registering or that are registered in Rancher.
Make sure you configured the correct kubeconfig (for example, `export KUBECONFIG=$PWD/kubeconfig_from_imported_cluster.yml`).
## Rancher Agents
Communication to the cluster (Kubernetes API via cattle-cluster-agent) and communication to the nodes is done through Rancher agents.
If the cattle-cluster-agent cannot connect to the configured `server-url`, the cluster will remain in **Pending** state, showing `Waiting for full cluster configuration`.
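One quick way to test reachability of the `server-url` is to request Rancher's `/ping` endpoint, which answers `pong`, from a node or pod in the cluster being registered. The helper below is a minimal sketch. `check_server_url` is a hypothetical name and `rancher.example.com` is a placeholder.

```shell
# Hypothetical helper: succeeds when SERVER_URL/ping answers "pong".
# -k skips TLS verification, so this tests reachability only, not certificate trust.
check_server_url() {
  [ "$(curl -sk --max-time 10 "$1/ping")" = "pong" ]
}

# Example call; replace the placeholder with your configured server-url:
# check_server_url https://rancher.example.com && echo reachable || echo unreachable
```

Because certificate verification is skipped, a successful result does not rule out TLS trust problems between the agent and Rancher.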
### cattle-node-agent
:::note
cattle-node-agents are only present in clusters created in Rancher with RKE.
:::
Check if the cattle-node-agent pods are present on each node, have status **Running** and don't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-node-agent-4gc2p 1/1 Running 0 2h x.x.x.x worker-1
cattle-node-agent-8cxkk 1/1 Running 0 2h x.x.x.x etcd-1
cattle-node-agent-kzrlg 1/1 Running 0 2h x.x.x.x etcd-0
cattle-node-agent-nclz9 1/1 Running 0 2h x.x.x.x controlplane-0
cattle-node-agent-pwxp7 1/1 Running 0 2h x.x.x.x worker-0
cattle-node-agent-t5484 1/1 Running 0 2h x.x.x.x controlplane-1
cattle-node-agent-t8mtz 1/1 Running 0 2h x.x.x.x etcd-2
```
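On clusters with many nodes, scanning this table by eye is error-prone. The sketch below filters the table down to pods that are not **Running** or that exceed a restart threshold. `flag_unhealthy_pods` is a hypothetical helper, and the default threshold of 5 restarts is an arbitrary assumption, not a Rancher recommendation.

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and print
# every pod that is not Running or has more than MAX restarts (default 5).
flag_unhealthy_pods() {
  awk -v max="${1:-5}" \
    'NR > 1 && ($3 != "Running" || $4 + 0 > max) { print $1, $3, $4 }'
}

# The first pod row comes from the example output above; the second pod row
# is invented to show what a failing pod looks like.
printf '%s\n' \
  'NAME                      READY   STATUS             RESTARTS   AGE   IP        NODE' \
  'cattle-node-agent-4gc2p   1/1     Running            0          2h    x.x.x.x   worker-1' \
  'cattle-node-agent-zzzzz   0/1     CrashLoopBackOff   12         2h    x.x.x.x   worker-9' |
  flag_unhealthy_pods
```

On a live cluster, pipe `kubectl -n cattle-system get pods -l app=cattle-agent -o wide` into the helper instead. The same filter works for the cattle-cluster-agent output.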
Check the logs of all cattle-node-agent pods (to inspect a single pod, pass its name instead of the `-l` selector):
```
kubectl -n cattle-system logs -l app=cattle-agent
```
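When a label selector is used, `kubectl logs` returns only the last 10 lines per pod by default, which is often too little for troubleshooting. The wrappers below are a sketch of two common variations. The function names and the default of 100 lines are assumptions made for illustration.

```shell
# Hypothetical helper: print the last N log lines (default 100) of one pod.
agent_pod_logs() {
  kubectl -n cattle-system logs "$1" --tail="${2:-100}"
}

# Hypothetical helper: print the last N log lines (default 100) of every
# cattle-node-agent pod, prefixing each line with the pod it came from.
all_node_agent_logs() {
  kubectl -n cattle-system logs -l app=cattle-agent --prefix --tail="${1:-100}"
}

# Example calls (the pod name is taken from the example output above):
# agent_pod_logs cattle-node-agent-4gc2p
# all_node_agent_logs 200
```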
### cattle-cluster-agent
Check if the cattle-cluster-agent pod is present in the cluster, has status **Running** and doesn't have a high count of Restarts:
```
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
```
Example output:
```
NAME READY STATUS RESTARTS AGE IP NODE
cattle-cluster-agent-54d7c6c54d-ht9h4 1/1 Running 0 2h x.x.x.x worker-1
```
Check the logs of the cattle-cluster-agent pod:
```
kubectl -n cattle-system logs -l app=cattle-cluster-agent
```
@@ -0,0 +1,26 @@
---
title: User ID Tracking in Audit Logs
---
<head>
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/user-id-tracking-in-audit-logs"/>
</head>
The following audit logs are used in Rancher to track events occurring on the local and downstream clusters:
* [Kubernetes Audit Logs](https://rancher.com/docs/rke/latest/en/config-options/audit-log/)
* [Rancher API Audit Logs](../../how-to-guides/advanced-user-guides/enable-api-audit-log.md)
Audit logs in Rancher v2.6 have been enhanced to include the external Identity Provider name (common name of the user in the external Auth provider) in both the Rancher and downstream Kubernetes audit logs.
Before v2.6, a Rancher Admin could not trace an event from the Rancher audit logs into the Kubernetes audit logs without knowing the mapping of the external Identity Provider username to the userId (`u-xXXX`) used in Rancher.
To know this mapping, the cluster admins needed to have access to the Rancher API, the UI, and the local management cluster.
Now with this feature, a downstream cluster admin should be able to look at the Kubernetes audit logs and know which specific external Identity Provider (IDP) user performed an action without needing to view anything in Rancher.
If the audit logs are shipped off of the cluster, a user of the logging system should be able to identify the user in the external Identity Provider system.
A Rancher Admin should now be able to view Rancher audit logs and follow through to the Kubernetes audit log by using the external Identity Provider username.
## Feature Description
- When Kubernetes Audit logs are enabled on the downstream cluster, the external Identity Provider's username is now logged for each request at the "metadata" level.
- When you enable Rancher API Audit logs for a Rancher installation, the external Identity Provider's username is also logged now at `auditLog.level=1` for each request that hits the Rancher API server, including login requests.