Docs: Migrate etcd troubleshooting guide from RKE/Docker to RKE2/K3s/containerd

- Updates `troubleshooting-etcd-nodes.md` to replace Docker-based commands with `crictl` and `etcdctl` for RKE2 and K3s.
- Replaces `curl` connectivity checks with `openssl s_client` to support etcd 3.5+ gRPC requirements and isolate transport layer testing.
- Adds prerequisites section with necessary environment exports.
- Updates all `etcdctl` commands to use explicit inline certificate paths for RKE2 and K3s.
- Replaces shell-dependent container commands with host-side processing to support distroless images.
- Updates log level configuration instructions for RKE2/K3s config files.
Manuel Simón Nóvoa
2026-02-25 15:22:55 +01:00
parent a7a0a05827
commit 64269bea22
3 changed files with 627 additions and 306 deletions
@@ -6,29 +6,63 @@ title: Troubleshooting etcd Nodes
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes"/>
</head>
This section contains commands and tips for troubleshooting nodes with the `etcd` role.
This section contains commands and tips for troubleshooting nodes with the `etcd` role in RKE2 and K3s clusters.
## Prerequisites
As RKE2 and K3s rely on `containerd` as the container runtime, `crictl` replaces Docker for container management. Before proceeding with the troubleshooting commands, configure your environment by exporting the following variables:
### RKE2
```bash
export PATH=$PATH:/var/lib/rancher/rke2/bin/
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(crictl ps --name etcd --quiet)
```
### K3s
> ### ⚠️ **Warning**
> K3s does not include `etcdctl` in the system PATH. If you need to perform etcd troubleshooting on a K3s cluster, you may need to install it or locate it within the K3s data directory.
```bash
export PATH=$PATH:/usr/local/bin
export CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml
```
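Every `etcdctl` command in this guide repeats the same three certificate flags. As a convenience, the flags could be kept in a shell variable. The following is a minimal sketch, assuming the certificate paths used throughout this guide; `etcd_tls_flags` and `ETCD_FLAGS` are hypothetical names that are not part of RKE2 or K3s.

```bash
# Print the etcdctl TLS flags for a distribution ("rke2" or "k3s").
# Hypothetical helper; the paths are the ones used by the commands in this guide.
etcd_tls_flags() {
  local distro="$1"
  case "$distro" in
    rke2|k3s) ;;
    *) echo "usage: etcd_tls_flags rke2|k3s" >&2; return 1 ;;
  esac
  local tls="/var/lib/rancher/${distro}/server/tls/etcd"
  echo "--cert ${tls}/server-client.crt --key ${tls}/server-client.key --cacert ${tls}/server-ca.crt"
}

# Example (K3s): ETCD_FLAGS=$(etcd_tls_flags k3s); etcdctl member list $ETCD_FLAGS
```

The variable is left unquoted in the example on purpose, so that the shell splits it into separate arguments.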
## Checking if the etcd Container is Running
The container for etcd should have status **Up**. The duration shown after **Up** is the time the container has been running.
**RKE2**: The container for etcd should be in the **Running** state.
```
docker ps -a -f=name=etcd$
```bash
crictl ps --name etcd
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d26adbd23643 rancher/mirrored-coreos-etcd:v3.5.7 "/usr/local/bin/etcd…" 30 minutes ago Up 30 minutes etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
f1e289d202ed0 11ad16872a9cf 58 minutes ago Running etcd 0 7b56aab8204ea etcd-cluster1 kube-system
```
## etcd Container Logging
The logging of the container can contain information on what the problem could be.
**K3s**: etcd runs as an embedded process inside the K3s service, so there is no separate container to inspect. Check the service status instead:
```bash
systemctl status k3s
```
docker logs etcd
## etcd Logging
The logs can contain information on what the problem could be.
**RKE2**:
```bash
crictl logs $etcdcontainer
```
**K3s**:
```bash
journalctl -u k3s | grep -i etcd
```
| Log | Explanation |
|-----|------------------|
@@ -46,18 +80,43 @@ The address where etcd is listening depends on the address configuration of the
Output should contain all the nodes with the `etcd` role and the output should be identical on all nodes.
Command:
**RKE2**:
Run the command inside the etcd container.
```bash
crictl exec $etcdcontainer etcdctl member list \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl member list
**K3s**:
```bash
etcdctl member list \
--cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
1c424074df86e854, started, cluster-node1-f289ac71, https://IP:2380, https://IP:2379, false
45c68c44c5a792ff, started, cluster-node2-67e3cf6f, https://IP:2380, https://IP:2379, false
7c584f77c5180258, started, cluster-node3-e976bc00, https://IP:2380, https://IP:2379, false
```
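In the output above, the fourth comma-separated field is the peer URL (port `2380`) and the fifth is the client URL (port `2379`); later commands in this guide rely on that layout. The following is a minimal sketch of pulling out one column from saved `member list` output on standard input. `member_urls` is a hypothetical helper name.

```bash
# Print one URL column from `etcdctl member list` output read on stdin.
# Argument: "peer" (field 4) or "client" (field 5). Hypothetical helper name.
member_urls() {
  local field
  case "$1" in
    peer) field=4 ;;
    client) field=5 ;;
    *) echo "usage: member_urls peer|client" >&2; return 1 ;;
  esac
  # cut keeps the space that follows each comma, so strip spaces afterwards.
  cut -d, -f"$field" | sed -e 's/ //g'
}
```

Piping the result through `paste -sd ',' -` joins the URLs into the comma-separated list that `--endpoints` expects.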
### Check Endpoint Status
The values for `RAFT TERM` should be equal across all members, and the `RAFT INDEX` values should not be too far apart from each other.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint status --write-out table --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
**K3s**:
```bash
etcdctl endpoint status --write-out table --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -65,17 +124,22 @@ Example output:
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 | 3.5.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.5.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.5.7 | 25 MB | false | 72 | 66887 |
| https://IP:2379 | 333ef673fc4add56 | 3.6.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.6.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.6.7 | 25 MB | false | 72 | 66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
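The `RAFT TERM` comparison can also be done mechanically on saved table output. The following is a minimal sketch, assuming the table layout shown above; it locates the column by its header name because the set of columns may differ between etcdctl versions. `raft_terms_match` is a hypothetical helper name.

```bash
# Read `etcdctl endpoint status --write-out table` output on stdin and report
# whether every member shows the same RAFT TERM. Hypothetical helper name.
raft_terms_match() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Header row: remember which column holds RAFT TERM.
    /RAFT TERM/ { for (i = 1; i <= NF; i++) if (trim($i) == "RAFT TERM") col = i; next }
    # Data rows start with "|"; separator rows start with "+" and are skipped.
    /^\|/ && col { terms[trim($col)]++ }
    END {
      n = 0
      for (t in terms) n++
      if (n == 1) { print "RAFT TERM matches on all members"; exit 0 }
      print "RAFT TERM differs or no rows found"; exit 1
    }'
}
```

The function exits with a non-zero status when the terms differ, so it can be used in a shell conditional.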
### Check Endpoint Health
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint health --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health
**K3s**:
```bash
etcdctl endpoint health --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -84,54 +148,104 @@ https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
```
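On larger clusters it can help to show only the endpoints that did not report as healthy. The following is a minimal sketch that filters saved health output on standard input; it only assumes the `is healthy` wording shown above. `unhealthy_endpoints` is a hypothetical helper name.

```bash
# Print every line of `etcdctl endpoint health` output (read on stdin) that
# does not contain "is healthy". Returns 0 only when no such line exists.
# Hypothetical helper name.
unhealthy_endpoints() {
  local bad
  bad=$(grep -v ' is healthy' || true)
  [ -z "$bad" ] && return 0
  printf '%s\n' "$bad"
  return 1
}

# Example: <endpoint health command from above> 2>&1 | unhealthy_endpoints
```

The `2>&1` in the example is a precaution so that lines written to standard error are filtered as well.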
### Check Connectivity on etcd Ports
### Check Connectivity on Port TCP/2379
> etcd 3.5 and later changed how the primary client port handles network traffic. Earlier versions permitted standard HTTP REST requests on the client port (`2379`). To enhance performance and security, etcd 3.5+ strictly enforces the gRPC protocol on this port.<br />
If you test connectivity on port `2379` with a standard HTTP tool such as `curl`, the etcd server terminates the connection or returns an error. Administrators often misinterpret this result as a closed port or a node failure.
Command:
Because standard HTTP clients can no longer probe the primary etcd ports, network troubleshooting must happen at the transport layer. `openssl s_client` does not speak gRPC, so using it instead of `curl` tests only the raw TCP connection and the TLS handshake.
These scripts isolate the network and security infrastructure from the database application. A result of `Verify return code: 0 (ok)` confirms four critical infrastructure components:
* **Network Path:** Routing is functional, and firewalls permit traffic on TCP port `2379` or `2380`.
* **Process Availability:** The etcd service is running and actively listening on the designated port.
* **Certificate Validity:** The TLS certificates are active, correctly formatted, and have not expired.
* **Mutual Authentication (mTLS):** The node successfully authenticates against the cluster's specific Certificate Authority (CA).
**How these tests differ from the `etcdctl endpoint health` test**:
If the `etcdctl endpoint health` test fails, run the port connectivity scripts in this section. If the scripts succeed, your network and certificates are intact, and the issue is likely confined to the etcd database itself. If the scripts fail, the issue is likely a firewall or network restriction, or an expired certificate.
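The reasoning above can be written down as a small decision function. This is only an illustrative sketch: `triage_etcd` is a hypothetical helper, and the messages it prints paraphrase this section rather than any etcd output.

```bash
# Map the two test results to the likely problem area described above.
# Arguments: health result and TLS result, each "ok" or "fail".
# Hypothetical helper; the message wording is illustrative.
triage_etcd() {
  local health="$1" tls="$2"
  if [ "$health" = "ok" ]; then
    echo "endpoint healthy"
  elif [ "$tls" = "ok" ]; then
    echo "network and certificates intact; investigate etcd itself"
  else
    echo "check firewall/network path and certificate expiration"
  fi
}
```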
#### Port TCP/2379
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
echo "Validating connection to ${endpoint}/health"
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
```
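Two details of the loops above are easy to miss. `openssl s_client -connect` expects `host:port`, so the `${endpoint#https://}` parameter expansion strips the URL scheme from each member URL. The leading `echo |` closes standard input so that `s_client` exits after the handshake instead of waiting for input. The following is a minimal sketch of the first step in isolation; `host_port` is a hypothetical helper name.

```bash
# Strip the URL scheme so the value can be passed to `openssl s_client -connect`,
# which expects host:port. Hypothetical helper name.
host_port() {
  local endpoint="$1"
  # "#https://" removes the shortest match of that prefix, if present.
  echo "${endpoint#https://}"
}
```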
### Check Connectivity on Port TCP/2380
#### Port TCP/2380
Command:
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
echo "Validating connection to ${endpoint}/version";
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
```
## etcd Alarms
etcd will trigger alarms, for instance when it runs out of space.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output when NOSPACE alarm is triggered:
@@ -154,10 +268,16 @@ Resolutions:
### Compact the Keyspace
Command:
**RKE2**:
```bash
rev=$(crictl exec $etcdcontainer etcdctl endpoint status --write-out json --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
crictl exec $etcdcontainer etcdctl compact "$rev" --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
**K3s**:
```bash
rev=$(etcdctl endpoint status --write-out json --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
etcdctl compact "$rev" --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -167,55 +287,39 @@ compacted revision xxx
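The compaction commands take the current revision from the JSON form of `endpoint status` by matching the `"revision":` key and keeping only the digits. The following is a minimal sketch of that extraction on saved JSON from standard input; `first_revision` is a hypothetical helper name, and the sample JSON in the comment is abbreviated for illustration.

```bash
# Extract the first "revision" number from `etcdctl endpoint status --write-out json`
# output read on stdin. Hypothetical helper name.
first_revision() {
  # First grep isolates the key/value pair, second keeps the digits,
  # head -1 keeps a single value when several endpoints are listed.
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -1
}

# Example input (abbreviated): [{"Status":{"header":{"revision":66887,"raft_term":72}}}]
```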
### Defrag All etcd Members
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl defrag --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag
**K3s**:
```bash
etcdctl defrag --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
```
### Check Endpoint Status
Command:
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | e973e4419737125 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | 4a509c997b26c206 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | b217e736575e9dd3 | 3.5.7 | 553 kB | true | 32 | 2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
```
### Disarm Alarm
After verifying that the DB size went down after compaction and defragmenting, the alarm needs to be disarmed for etcd to allow writes again.
Command:
```
docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm disarm --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
Example output:
```
docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm disarm --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
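The second `alarm list` call in the sequence above is there to confirm that the alarm is gone. The following is a minimal sketch of checking saved `alarm list` output for a remaining `NOSPACE` entry, assuming the `memberID:x alarm:NOSPACE` line format; `has_nospace_alarm` is a hypothetical helper name.

```bash
# Succeed when `etcdctl alarm list` output (read on stdin) still contains a
# NOSPACE alarm. Hypothetical helper name.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Example: <alarm list command from above> | has_nospace_alarm && echo "alarm still active"
```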
## Configure Log Level
@@ -228,7 +332,7 @@ You can no longer dynamically change the log level in etcd v3.5 or later.
### etcd v3.5 And Later
To configure the log level for etcd, edit the cluster YAML:
To configure the log level for etcd, edit the cluster configuration YAML:
```
services:
@@ -237,20 +341,7 @@ services:
log-level: "debug"
```
### etcd v3.4 And Earlier
In earlier etcd versions, you can use the API to dynamically change the log level. Configure debug logging using the commands below:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
To reset the log level back to the default (`INFO`), you can use the following command.
Command:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
After modifying the configuration, restart the service (`systemctl restart rke2-server` or `systemctl restart k3s`) if you are configuring a stand-alone cluster.
## etcd Content
@@ -258,24 +349,40 @@ If you want to investigate the contents of your etcd, you can either watch strea
### Watch Streaming Events
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl watch --prefix /registry --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl watch --prefix /registry
**K3s**:
```bash
etcdctl watch --prefix /registry --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
If you only want to see the affected keys (and not the binary data), you can append `| grep -a ^/registry` to the command to filter for keys only.
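The `-a` option matters here because the watch stream mixes key names with binary values, and without it `grep` may report a binary match instead of printing the lines. The following is a minimal sketch of the filter on its own; `keys_only` is a hypothetical helper name.

```bash
# Keep only lines that start with /registry. The -a option makes grep treat
# binary input as text. Hypothetical helper name.
keys_only() {
  grep -a '^/registry'
}
```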
### Query etcd Directly
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
You can process the data to get a count of keys per resource type, using the command below:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
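The `awk` program in the commands above splits each key on `/`, so for `/registry/pods/default/a` the third field is `pods`. Keys whose third field contains `cattle.io` are grouped by the third and fourth fields together, and everything else is grouped by the third field alone. The following is a minimal sketch of the same processing applied to saved key output on standard input; `count_by_resource` is a hypothetical helper name.

```bash
# Count keys per resource type from `etcdctl get /registry --prefix --keys-only`
# output read on stdin. Hypothetical helper name.
count_by_resource() {
  # Drop blank separator lines, count per group, then sort by count, descending.
  grep -v '^$' |
    awk -F'/' '{ if ($3 ~ /cattle.io/) { h[$3"/"$4]++ } else { h[$3]++ } }
               END { for (k in h) print h[k], k }' |
    sort -nr
}
```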
## Replacing Unhealthy etcd Nodes
@@ -6,29 +6,63 @@ title: Troubleshooting etcd Nodes
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes"/>
</head>
This section contains commands and tips for troubleshooting nodes with the `etcd` role.
This section contains commands and tips for troubleshooting nodes with the `etcd` role in RKE2 and K3s clusters.
## Prerequisites
As RKE2 and K3s rely on `containerd` as the container runtime, `crictl` replaces Docker for container management. Before proceeding with the troubleshooting commands, configure your environment by exporting the following variables:
### RKE2
```bash
export PATH=$PATH:/var/lib/rancher/rke2/bin/
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(crictl ps --name etcd --quiet)
```
### K3s
> ### ⚠️ **Warning**
> K3s does not include `etcdctl` in the system PATH. If you need to perform etcd troubleshooting on a K3s cluster, you may need to install it or locate it within the K3s data directory.
```bash
export PATH=$PATH:/usr/local/bin
export CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml
```
## Checking if the etcd Container is Running
The container for etcd should have status **Up**. The duration shown after **Up** is the time the container has been running.
**RKE2**: The container for etcd should be in the **Running** state.
```
docker ps -a -f=name=etcd$
```bash
crictl ps --name etcd
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d26adbd23643 rancher/mirrored-coreos-etcd:v3.5.7 "/usr/local/bin/etcd…" 30 minutes ago Up 30 minutes etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
f1e289d202ed0 11ad16872a9cf 58 minutes ago Running etcd 0 7b56aab8204ea etcd-cluster1 kube-system
```
## etcd Container Logging
The logging of the container can contain information on what the problem could be.
**K3s**: Etcd runs as an embedded process in the K3s service. Check the service status:
```bash
systemctl status k3s
```
docker logs etcd
## etcd Logging
The logs can contain information on what the problem could be.
**RKE2**:
```bash
crictl logs $etcdcontainer
```
**K3s**:
```bash
journalctl -u k3s | grep -i etcd
```
| Log | Explanation |
|-----|------------------|
@@ -46,18 +80,43 @@ The address where etcd is listening depends on the address configuration of the
Output should contain all the nodes with the `etcd` role and the output should be identical on all nodes.
Command:
**RKE2**:
Run the command inside the etcd container.
```bash
crictl exec $etcdcontainer etcdctl member list \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl member list
**K3s**:
```bash
etcdctl member list \
--cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
1c424074df86e854, started, cluster-node1-f289ac71, https://IP:2380, https://IP:2379, false
45c68c44c5a792ff, started, cluster-node2-67e3cf6f, https://IP:2380, https://IP:2379, false
7c584f77c5180258, started, cluster-node3-e976bc00, https://IP:2380, https://IP:2379, false
```
### Check Endpoint Status
The values for `RAFT TERM` should be equal and `RAFT INDEX` should be not be too far apart from each other.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint status --write-out table --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
**K3s**:
```bash
etcdctl endpoint status --write-out table --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -65,17 +124,22 @@ Example output:
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 | 3.5.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.5.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.5.7 | 25 MB | false | 72 | 66887 |
| https://IP:2379 | 333ef673fc4add56 | 3.6.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.6.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.6.7 | 25 MB | false | 72 | 66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
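The consistency rule for `RAFT TERM` and `RAFT INDEX` can also be checked mechanically. The following is a minimal sketch, not part of the official tooling: `check_raft_consistency` is a hypothetical helper name, and it assumes the table layout shown in the example output, locating the `RAFT TERM` and `RAFT INDEX` columns by their header text. It reads the table on standard input and prints the number of members, the number of distinct Raft terms, and the gap between the highest and lowest Raft index.

```bash
# Hypothetical helper (not shipped with RKE2, K3s, or etcd).
# Reads the output of `etcdctl endpoint status --write-out table` on stdin.
check_raft_consistency() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Only table rows start with "|"; border lines start with "+".
    substr($0, 1, 1) == "|" {
      if (!term_col) {
        # Header row: find the columns by name instead of by position.
        for (i = 1; i <= NF; i++) {
          name = trim($i)
          if (name == "RAFT TERM")  term_col = i
          if (name == "RAFT INDEX") index_col = i
        }
        next
      }
      idx = trim($index_col) + 0
      terms[trim($term_col)] = 1
      if (rows == 0 || idx < min) min = idx
      if (rows == 0 || idx > max) max = idx
      rows++
    }
    END {
      n = 0
      for (t in terms) n++
      printf "members=%d distinct_terms=%d index_spread=%d\n", rows, n, max - min
    }
  '
}
```

To use it, pipe the output of the **RKE2** or **K3s** endpoint status command into `check_raft_consistency`. A healthy cluster should report `distinct_terms=1` and a small `index_spread`.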
### Check Endpoint Health
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint health --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health
**K3s**:
```bash
etcdctl endpoint health --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -84,54 +148,104 @@ https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
```
### Check Connectivity on etcd Ports
### Check Connectivity on Port TCP/2379
> etcd 3.5 and newer, which ship with recent Kubernetes versions, changed how network traffic is handled. Previously, etcd permitted standard HTTP REST requests on its primary client port (`2379`). To enhance performance and security, etcd 3.5+ strictly enforces the gRPC protocol on this port.<br />
If you use standard HTTP tools like `curl` to test connectivity on port `2379`, the etcd server terminates the connection or returns an error. Administrators often misinterpret this result as a closed port or a node failure.
Command:
Since standard HTTP clients can no longer probe the primary etcd ports, test connectivity at the transport layer instead. Using `openssl s_client` instead of `curl` bypasses the gRPC application requirement, so the raw TCP connection and TLS handshake can be tested directly.
These scripts isolate the network and security infrastructure from the database application. A successful `Verify return code: 0 (ok)` confirms four critical infrastructure components:
* **Network Path:** Routing is functional, and firewalls permit traffic on TCP port `2379` or `2380`.
* **Process Availability:** The etcd service is running and actively listening on the designated port.
* **Certificate Validity:** The TLS certificates are active, correctly formatted, and have not expired.
* **Mutual Authentication (mTLS):** The node successfully authenticates against the cluster's specific Certificate Authority (CA).
**How these tests differ from the `etcdctl endpoint health` test**:
If the `etcdctl endpoint health` test is failing, run these port connectivity test scripts. If the scripts succeed, your network and certificates are intact, and the issue is likely confined to the etcd database itself. If the scripts fail, the issue is likely related to a firewall or network restriction, or to a certificate problem such as expiration.
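Note that the connectivity scripts print the `Verify return code` line even when the code is not `0`, so a printed line is not by itself a pass. As a rough sketch of how the result could be interpreted, the hypothetical helper `classify_tls_check` (a name introduced here for illustration) reads `openssl s_client` output on standard input and reduces it to one of three outcomes.

```bash
# Hypothetical helper for illustration; not part of RKE2, K3s, or OpenSSL.
# Reads `openssl s_client` output on stdin and prints one outcome:
#   ok                       a verification result was printed and it was 0
#   tls-verify-failed:<why>  a verification result was printed, but it was not 0
#   no-handshake             no verification result was printed at all
classify_tls_check() {
  # Keep only the first verification line; some OpenSSL versions print it more than once.
  verify_line=$(grep -m1 -E 'Verify return code: [0-9]+' || true)
  case "$verify_line" in
    "") echo "no-handshake" ;;
    *"Verify return code: 0 "*) echo "ok" ;;
    *) echo "tls-verify-failed:${verify_line#*Verify return code: }" ;;
  esac
}
```

In the loops that follow, the `grep -E 'Verify return code' || echo "Connection Failed/Timeout"` stage could be swapped for `classify_tls_check`. A `no-handshake` result points at the network path or the etcd process, while `tls-verify-failed` points at the certificates.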
#### Port TCP/2379
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
echo "Validating connection to ${endpoint}/health"
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
```
### Check Connectivity on Port TCP/2380
#### Port TCP/2380
Command:
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
echo "Validating connection to ${endpoint}/version";
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
```
## etcd Alarms
etcd will trigger alarms, for instance when it runs out of space.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output when NOSPACE alarm is triggered:
@@ -154,10 +268,16 @@ Resolutions:
### Compact the Keyspace
Command:
**RKE2**:
```bash
rev=$(crictl exec $etcdcontainer etcdctl endpoint status --write-out json --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
crictl exec $etcdcontainer etcdctl compact "$rev" --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
**K3s**:
```bash
rev=$(etcdctl endpoint status --write-out json --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
etcdctl compact "$rev" --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -167,55 +287,39 @@ compacted revision xxx
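To see what the `rev=` pipeline in the compaction commands extracts, the following sketch runs the same kind of `grep` chain against a hand-written JSON sample. The helper name `extract_revision` and the sample values are assumptions made for illustration; real `endpoint status` output contains more fields.

```bash
# Hypothetical helper mirroring the pipeline used in the compaction commands.
# Reads `etcdctl endpoint status --write-out json` output on stdin.
extract_revision() {
  # 1. isolate the "revision":<number> pair, 2. keep the digits, 3. take the first match
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -1
}

# Hand-written sample for illustration only; real output contains more fields.
sample='[{"Endpoint":"https://IP:2379","Status":{"header":{"revision":12345,"raft_term":72}}}]'
printf '%s\n' "$sample" | extract_revision
```

With this sample the function prints `12345`. If it prints nothing, the status call did not return a revision, and `etcdctl compact` should not be run with the empty value.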
### Defrag All etcd Members
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl defrag --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag
**K3s**:
```bash
etcdctl defrag --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
```
### Check Endpoint Status
Command:
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | e973e4419737125 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | 4a509c997b26c206 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | b217e736575e9dd3 | 3.5.7 | 553 kB | true | 32 | 2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
```
### Disarm Alarm
After verifying that the DB size went down after compaction and defragmenting, the alarm needs to be disarmed for etcd to allow writes again.
Command:
```
docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm disarm --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
Example output:
```
docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm disarm --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
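The second `alarm list` call in the commands above is what confirms that the alarm was cleared. As a minimal sketch, the hypothetical helper `count_alarms` (a name introduced here for illustration) counts the alarm lines in that output, assuming the `memberID:<id> alarm:<TYPE>` line format that `etcdctl alarm list` prints.

```bash
# Hypothetical helper for illustration; not part of etcdctl.
# Reads `etcdctl alarm list` output on stdin and prints the number of active alarms.
count_alarms() {
  # grep -c prints the number of matching lines; `|| true` keeps a zero count
  # from being treated as a failure when the shell runs with `set -e`.
  grep -c 'alarm:' || true
}
```

Piping the final `alarm list` output into `count_alarms` should print `0` once the alarm has been disarmed on all members.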
## Configure Log Level
@@ -228,7 +332,7 @@ You can no longer dynamically change the log level in etcd v3.5 or later.
### etcd v3.5 And Later
To configure the log level for etcd, edit the cluster YAML:
To configure the log level for etcd, edit the cluster configuration YAML:
```
services:
@@ -237,20 +341,7 @@ services:
log-level: "debug"
```
### etcd v3.4 And Earlier
In earlier etcd versions, you can use the API to dynamically change the log level. Configure debug logging using the commands below:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
To reset the log level back to the default (`INFO`), you can use the following command.
Command:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
After modifying the configuration, restart the service (`systemctl restart rke2-server` or `systemctl restart k3s`) if you are configuring a stand-alone cluster.
## etcd Content
@@ -258,24 +349,40 @@ If you want to investigate the contents of your etcd, you can either watch strea
### Watch Streaming Events
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl watch --prefix /registry --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl watch --prefix /registry
**K3s**:
```bash
etcdctl watch --prefix /registry --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
If you only want to see the affected keys (and not the binary data), you can append `| grep -a ^/registry` to the command to filter for keys only.
### Query etcd Directly
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
You can process the data to get a count of keys per resource type, using the command below:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
## Replacing Unhealthy etcd Nodes
@@ -6,29 +6,63 @@ title: Troubleshooting etcd Nodes
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes"/>
</head>
This section contains commands and tips for troubleshooting nodes with the `etcd` role.
This section contains commands and tips for troubleshooting nodes with the `etcd` role in RKE2 and K3s clusters.
## Prerequisites
As RKE2 and K3s rely on `containerd` as the container runtime, `crictl` replaces Docker for container management. Before proceeding with the troubleshooting commands, configure your environment by exporting the following variables:
### RKE2
```bash
export PATH=$PATH:/var/lib/rancher/rke2/bin/
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(crictl ps --name etcd --quiet)
```
### K3s
> ### ⚠️ **Warning**
> K3s does not include `etcdctl` in the system PATH. If you need to perform etcd troubleshooting on a K3s cluster, you may need to install it or locate it within the K3s data directory.
```bash
export PATH=$PATH:/usr/local/bin
export CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml
```
## Checking if the etcd Container is Running
The container for etcd should have status **Up**. The duration shown after **Up** is the time the container has been running.
**RKE2**: The container for etcd should be in the **Running** state.
```
docker ps -a -f=name=etcd$
```bash
crictl ps --name etcd
```
Example output:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d26adbd23643 rancher/mirrored-coreos-etcd:v3.5.7 "/usr/local/bin/etcd…" 30 minutes ago Up 30 minutes etcd
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
f1e289d202ed0 11ad16872a9cf 58 minutes ago Running etcd 0 7b56aab8204ea etcd-cluster1 kube-system
```
## etcd Container Logging
The logging of the container can contain information on what the problem could be.
**K3s**: etcd runs as an embedded process in the K3s service. Check the service status:
```bash
systemctl status k3s
```
docker logs etcd
## etcd Logging
The logs can contain information on what the problem could be.
**RKE2**:
```bash
crictl logs $etcdcontainer
```
**K3s**:
```bash
journalctl -u k3s | grep -i etcd
```
| Log | Explanation |
|-----|------------------|
@@ -46,18 +80,43 @@ The address where etcd is listening depends on the address configuration of the
Output should contain all the nodes with the `etcd` role and the output should be identical on all nodes.
Command:
**RKE2**:
Run the command inside the etcd container.
```bash
crictl exec $etcdcontainer etcdctl member list \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl member list
**K3s**:
```bash
etcdctl member list \
--cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
1c424074df86e854, started, cluster-node1-f289ac71, https://IP:2380, https://IP:2379, false
45c68c44c5a792ff, started, cluster-node2-67e3cf6f, https://IP:2380, https://IP:2379, false
7c584f77c5180258, started, cluster-node3-e976bc00, https://IP:2380, https://IP:2379, false
```
### Check Endpoint Status
The values for `RAFT TERM` should be equal across all members, and the values for `RAFT INDEX` should not be too far apart from each other.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint status --write-out table --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
**K3s**:
```bash
etcdctl endpoint status --write-out table --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -65,17 +124,22 @@ Example output:
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 | 3.5.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.5.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.5.7 | 25 MB | false | 72 | 66887 |
| https://IP:2379 | 333ef673fc4add56 | 3.6.7 | 24 MB | false | 72 | 66887 |
| https://IP:2379 | 5feed52d940ce4cf | 3.6.7 | 24 MB | true | 72 | 66887 |
| https://IP:2379 | db6b3bdb559a848d | 3.6.7 | 25 MB | false | 72 | 66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
```
### Check Endpoint Health
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl endpoint health --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health
**K3s**:
```bash
etcdctl endpoint health --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -84,54 +148,104 @@ https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
```
### Check Connectivity on etcd Ports
### Check Connectivity on Port TCP/2379
> etcd 3.5 and newer, which ship with recent Kubernetes versions, changed how network traffic is handled. Previously, etcd permitted standard HTTP REST requests on its primary client port (`2379`). To enhance performance and security, etcd 3.5+ strictly enforces the gRPC protocol on this port.<br />
If you use standard HTTP tools like `curl` to test connectivity on port `2379`, the etcd server terminates the connection or returns an error. Administrators often misinterpret this result as a closed port or a node failure.
Command:
Since standard HTTP clients can no longer probe the primary etcd ports, test connectivity at the transport layer instead. Using `openssl s_client` instead of `curl` bypasses the gRPC application requirement, so the raw TCP connection and TLS handshake can be tested directly.
These scripts isolate the network and security infrastructure from the database application. A successful `Verify return code: 0 (ok)` confirms four critical infrastructure components:
* **Network Path:** Routing is functional, and firewalls permit traffic on TCP port `2379` or `2380`.
* **Process Availability:** The etcd service is running and actively listening on the designated port.
* **Certificate Validity:** The TLS certificates are active, correctly formatted, and have not expired.
* **Mutual Authentication (mTLS):** The node successfully authenticates against the cluster's specific Certificate Authority (CA).
**How these tests differ from the `etcdctl endpoint health` test**:
If the `etcdctl endpoint health` test is failing, run these port connectivity test scripts. If the scripts succeed, your network and certificates are intact, and the issue is likely confined to the etcd database itself. If the scripts fail, the issue is likely related to a firewall or network restriction, or to a certificate problem such as expiration.
#### Port TCP/2379
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
echo "Validating connection to ${endpoint}/health"
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5); do
echo "Validating connection to ${endpoint} (Client)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
Validating connection to https://IP:2379 (Client)
Verify return code: 0 (ok)
```
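To see what the loops above hand to `openssl s_client`, the parsing steps can be tried offline. The following is a minimal sketch that applies the same `cut` and `${endpoint#https://}` steps to a made-up `etcdctl member list` line. The member ID, node name, and IP address are placeholders, and the helper names `client_endpoints` and `to_hostport` are hypothetical, not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical sample of `etcdctl member list` output; ID, name and IPs are placeholders.
sample_members='b217e736575e9dd3, started, etcd-node1, https://10.0.0.1:2380, https://10.0.0.1:2379, false'

# Field 5 holds the client URL (field 4 holds the peer URL).
client_endpoints() {
  cut -d, -f5 | sed -e 's/ //g'
}

# `openssl s_client -connect` expects host:port, so drop the URL scheme.
to_hostport() {
  local endpoint=$1
  printf '%s\n' "${endpoint#https://}"
}

for endpoint in $(printf '%s\n' "$sample_members" | client_endpoints); do
  to_hostport "$endpoint"
done
```

Switching `-f5` to `-f4` in the sketch selects the peer URL instead, which is the only difference between the TCP/2379 and TCP/2380 loops.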
### Check Connectivity on Port TCP/2380
#### Port TCP/2380
Command:
**RKE2**:
```bash
for endpoint in $(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
echo "Validating connection to ${endpoint}/version";
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
**K3s**:
```bash
for endpoint in $(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f4); do
echo "Validating connection to ${endpoint} (Peer)";
echo | openssl s_client -connect ${endpoint#https://} \
-CAfile /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt \
-cert /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt \
-key /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key 2>/dev/null | grep -E 'Verify return code' || echo "Connection Failed/Timeout"
done
```
Example output:
```
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
Validating connection to https://IP:2380 (Peer)
Verify return code: 0 (ok)
```
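The `grep -E 'Verify return code'` filter prints the verification line whatever its code is, so a non-zero code still needs to be read by eye. If a scripted pass/fail result is preferred, a check along the following lines may help. This is only a sketch: `verify_ok` is a hypothetical helper, and the two sample strings stand in for real `openssl s_client` output.

```bash
#!/usr/bin/env bash
# Succeeds only when the s_client output on stdin reports a clean verification.
verify_ok() {
  grep -q 'Verify return code: 0 (ok)'
}

# Placeholder strings standing in for real `openssl s_client` output.
good='Verify return code: 0 (ok)'
bad='Verify return code: 21 (unable to verify the first certificate)'

for sample in "$good" "$bad"; do
  if printf '%s\n' "$sample" | verify_ok; then
    echo "TLS verification passed"
  else
    echo "TLS verification failed: ${sample}"
  fi
done
```

In practice, the output of the `openssl s_client` command would be piped into `verify_ok` in place of the sample strings.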
## etcd Alarms
etcd raises alarms when it detects a problem, for instance when the database runs out of space.
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output when NOSPACE alarm is triggered:
@@ -154,10 +268,16 @@ Resolutions:
### Compact the Keyspace
Command:
**RKE2**:
```bash
rev=$(crictl exec $etcdcontainer etcdctl endpoint status --write-out json --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
crictl exec $etcdcontainer etcdctl compact "$rev" --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"
**K3s**:
```bash
rev=$(etcdctl endpoint status --write-out json --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9]*' | head -1)
etcdctl compact "$rev" --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
@@ -167,55 +287,39 @@ compacted revision xxx
### Defrag All etcd Members
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl defrag --endpoints=$(crictl exec $etcdcontainer etcdctl member list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag
**K3s**:
```bash
etcdctl defrag --endpoints=$(etcdctl member list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
Example output:
```
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
Finished defragmenting etcd member[https://IP:2379]
```
### Check Endpoint Status
Command:
```
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
```
Example output:
```
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | e973e4419737125 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | 4a509c997b26c206 | 3.5.7 | 553 kB | false | 32 | 2449410 |
| https://IP:2379 | b217e736575e9dd3 | 3.5.7 | 553 kB | true | 32 | 2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
Finished defragmenting etcd member[https://IP:2379]. took xx.xxxxxxms
```
### Disarm Alarm
After verifying that the DB size decreased following compaction and defragmentation, disarm the alarm so that etcd allows writes again.
Command:
```
docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm disarm --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
crictl exec $etcdcontainer etcdctl alarm list --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
Example output:
```
docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list
**K3s**:
```bash
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm disarm --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
etcdctl alarm list --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
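The second `alarm list` call is there to confirm that nothing is still raised after disarming. One way to script that confirmation is sketched below, assuming the `memberID:x alarm:NOSPACE` output format shown in this guide. The `alarm_types` helper is hypothetical, and the sample member IDs are made up.

```bash
#!/usr/bin/env bash
# Print the distinct alarm types found in `etcdctl alarm list` output read from stdin.
alarm_types() {
  sed -n 's/.*alarm:\([A-Z]*\).*/\1/p' | sort -u
}

# Placeholder output; the member IDs are made up.
sample_alarms='memberID:1 alarm:NOSPACE
memberID:2 alarm:NOSPACE
memberID:3 alarm:NOSPACE'

remaining=$(printf '%s\n' "$sample_alarms" | alarm_types)
if [ -n "$remaining" ]; then
  echo "Active alarms: ${remaining}"
else
  echo "No active alarms"
fi
```

Against a real cluster, the RKE2 or K3s `alarm list` command would be piped into `alarm_types` instead of the sample text. Empty output then means no alarms remain.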
## Configure Log Level
@@ -228,7 +332,7 @@ You can no longer dynamically change the log level in etcd v3.5 or later.
### etcd v3.5 And Later
To configure the log level for etcd, edit the cluster YAML:
To configure the log level for etcd, edit the cluster configuration YAML:
```
services:
@@ -237,20 +341,7 @@ services:
log-level: "debug"
```
### etcd v3.4 And Earlier
In earlier etcd versions, you can use the API to dynamically change the log level. Configure debug logging using the commands below:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
To reset the log level back to the default (`INFO`), you can use the following command.
Command:
```
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
```
If you are configuring a standalone cluster, restart the service after modifying the configuration (`systemctl restart rke2-server` or `systemctl restart k3s`).
## etcd Content
@@ -258,24 +349,40 @@ If you want to investigate the contents of your etcd, you can either watch strea
### Watch Streaming Events
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl watch --prefix /registry --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl watch --prefix /registry
**K3s**:
```bash
etcdctl watch --prefix /registry --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
If you only want to see the affected keys (and not the binary data), append `| grep -a ^/registry` to the command.
### Query etcd Directly
Command:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
```
To get a count of keys per resource type, process the output with the command below:
**RKE2**:
```bash
crictl exec $etcdcontainer etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
**K3s**:
```bash
etcdctl get /registry --prefix=true --keys-only --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
```
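To check how the `awk` summary groups keys without querying a live cluster, the same pipeline can be run over a few sample keys. The sketch below wraps it in a hypothetical `summarize_keys` function. The key names are invented for illustration, and the blank lines mimic the spacing that `--keys-only` output contains.

```bash
#!/usr/bin/env bash
# Same filter as above: drop blank lines, count keys per resource type,
# and keep one extra path segment for cattle.io API groups.
summarize_keys() {
  grep -v '^$' \
    | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' \
    | sort -nr
}

# Invented sample keys; a real cluster would list many more.
sample_keys='/registry/pods/default/nginx

/registry/pods/kube-system/coredns

/registry/management.cattle.io/clusters/local

/registry/secrets/default/token
'

printf '%s' "$sample_keys" | summarize_keys
```

With this sample, `pods` is counted twice, while `management.cattle.io/clusters` and `secrets` are counted once each. The resource types with the most keys sort to the top, which is usually where to look first when the database is growing.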
## Replacing Unhealthy etcd Nodes