mirror of
https://github.com/rancher/rancher-docs.git
synced 2026-05-05 12:43:16 +00:00
284 lines
12 KiB
Markdown
284 lines
12 KiB
Markdown
---
|
|
title: etcd 节点故障排除
|
|
---
|
|
|
|
<head>
|
|
<link rel="canonical" href="https://ranchermanager.docs.rancher.com/zh/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes"/>
|
|
</head>
|
|
|
|
本文介绍了对具有 `etcd` 角色的节点进行故障排除的命令和提示。
|
|
|
|
|
|
## 检查 etcd 容器是否正在运行
|
|
|
|
etcd 容器的状态应该是 **Up**。**Up** 后面显示的时间指的是容器运行的时间。
|
|
|
|
```
|
|
docker ps -a -f=name=etcd$
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
|
|
d26adbd23643 rancher/mirrored-coreos-etcd:v3.5.7 "/usr/local/bin/etcd…" 30 minutes ago Up 30 minutes etcd
|
|
```
|
|
|
|
## etcd 容器日志记录
|
|
|
|
容器的日志记录可能包含问题的信息。
|
|
|
|
```
|
|
docker logs etcd
|
|
```
|
|
| 日志 | 解释 |
|
|
|-----|------------------|
|
|
| `health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused` | 无法连接到端口 2380 上显示的地址。检查 etcd 容器是否在显示地址的主机上运行。 |
|
|
| `xxx is starting a new election at term x` | etcd 集群失去了集群仲裁数量,并正在尝试建立一个新的 leader。运行 etcd 的大多数节点关闭/无法访问时,可能会发生这种情况。 |
|
|
| `connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}` | 主机防火墙正在阻止网络通信。 |
|
|
| `rafthttp: request cluster ID mismatch` | 具有 etcd 实例日志 `rafthttp: request cluster ID mismatch` 的节点正在尝试加入已经添加另一个对等节点(peer)的集群。你需要从集群中删除该节点,然后再重新添加。 |
|
|
| `rafthttp: failed to find member` | 集群状态(`/var/lib/etcd`)包含加入集群的错误信息。你需要从集群中删除该节点,清理状态目录,然后再重新添加。 |
|
|
|
|
## etcd 集群和连接检查
|
|
|
|
运行 etcd 的主机的地址配置决定了 etcd 监听的地址。如果为运行 etcd 的主机配置了内部地址,则需要显式指定 `etcdctl` 的端点。如果任何命令的响应是 `Error: context deadline exceeded`,则 etcd 实例不健康(仲裁丢失或实例未正确加入集群)。
|
|
|
|
### 检查所有节点上的 etcd 成员
|
|
|
|
输出应包含具有 `etcd` 角色的所有节点,而且所有节点上的输出应该是相同的。
|
|
|
|
命令:
|
|
```
|
|
docker exec etcd etcdctl member list
|
|
```
|
|
|
|
### 检查端点状态
|
|
|
|
`RAFT TERM` 的值应该是相等的,而且 `RAFT INDEX` 相差不能太大。
|
|
|
|
命令:
|
|
```
|
|
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
| https://IP:2379 | 333ef673fc4add56 | 3.5.7 | 24 MB | false | 72 | 66887 |
|
|
| https://IP:2379 | 5feed52d940ce4cf | 3.5.7 | 24 MB | true | 72 | 66887 |
|
|
| https://IP:2379 | db6b3bdb559a848d | 3.5.7 | 25 MB | false | 72 | 66887 |
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
```
|
|
|
|
### 检查端点健康
|
|
|
|
命令:
|
|
```
|
|
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
|
|
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
|
|
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms
|
|
```
|
|
|
|
### 检查端口 TCP/2379 上的连接
|
|
|
|
命令:
|
|
```
|
|
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
|
|
echo "Validating connection to ${endpoint}/health"
|
|
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
|
|
done
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
Validating connection to https://IP:2379/health
|
|
{"health": "true"}
|
|
Validating connection to https://IP:2379/health
|
|
{"health": "true"}
|
|
Validating connection to https://IP:2379/health
|
|
{"health": "true"}
|
|
```
|
|
|
|
### 检查端口 TCP/2380 上的连接
|
|
|
|
命令:
|
|
```
|
|
for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
|
|
echo "Validating connection to ${endpoint}/version";
|
|
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
|
|
done
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
Validating connection to https://IP:2380/version
|
|
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
|
|
Validating connection to https://IP:2380/version
|
|
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
|
|
Validating connection to https://IP:2380/version
|
|
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
|
|
```
|
|
|
|
## etcd 告警
|
|
|
|
etcd 会触发告警(例如空间不足时)。
|
|
|
|
命令:
|
|
```
|
|
docker exec etcd etcdctl alarm list
|
|
```
|
|
|
|
触发 NOSPACE 告警的输出示例:
|
|
```
|
|
memberID:x alarm:NOSPACE
|
|
memberID:x alarm:NOSPACE
|
|
memberID:x alarm:NOSPACE
|
|
```
|
|
|
|
## etcd 空间错误
|
|
|
|
相关的错误消息是 `etcdserver: mvcc: database space exceeded` 或 `applying raft message exceeded backend quota`。告警 `NOSPACE` 会被触发。
|
|
|
|
解决:
|
|
|
|
- [压缩键空间](#压缩键空间)
|
|
- [对所有 etcd 成员进行碎片整理](#对所有-etcd-成员进行碎片整理)
|
|
- [检查端点状态](#检查端点状态)
|
|
- [解除告警](#解除告警)
|
|
|
|
### 压缩键空间
|
|
|
|
命令:
|
|
```
|
|
rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
|
|
docker exec etcd etcdctl compact "$rev"
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
compacted revision xxx
|
|
```
|
|
|
|
### 对所有 etcd 成员进行碎片整理
|
|
|
|
命令:
|
|
```
|
|
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
Finished defragmenting etcd member[https://IP:2379]
|
|
Finished defragmenting etcd member[https://IP:2379]
|
|
Finished defragmenting etcd member[https://IP:2379]
|
|
```
|
|
|
|
### 检查端点状态
|
|
|
|
命令:
|
|
```
|
|
docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
| https://IP:2379 | e973e4419737125 | 3.5.7 | 553 kB | false | 32 | 2449410 |
|
|
| https://IP:2379 | 4a509c997b26c206 | 3.5.7 | 553 kB | false | 32 | 2449410 |
|
|
| https://IP:2379 | b217e736575e9dd3 | 3.5.7 | 553 kB | true | 32 | 2449410 |
|
|
+-----------------+------------------+---------+---------+-----------+-----------+------------+
|
|
```
|
|
|
|
### 解除告警
|
|
|
|
如果压缩和整理碎片后确定数据库大小下降了,则需要解除告警来允许 etcd 再次写入。
|
|
|
|
命令:
|
|
```
|
|
docker exec etcd etcdctl alarm list
|
|
docker exec etcd etcdctl alarm disarm
|
|
docker exec etcd etcdctl alarm list
|
|
```
|
|
|
|
输出示例:
|
|
```
|
|
docker exec etcd etcdctl alarm list
|
|
memberID:x alarm:NOSPACE
|
|
memberID:x alarm:NOSPACE
|
|
memberID:x alarm:NOSPACE
|
|
docker exec etcd etcdctl alarm disarm
|
|
docker exec etcd etcdctl alarm list
|
|
```
|
|
|
|
## 配置日志级别
|
|
|
|
:::note
|
|
|
|
你无法再动态更改 etcd v3.5 或更高版本中的日志级别。
|
|
|
|
:::
|
|
|
|
### etcd v3.5 及更高版本
|
|
|
|
要配置 etcd 的日志级别,请编辑集群 YAML:
|
|
|
|
```
|
|
services:
|
|
etcd:
|
|
extra_args:
|
|
log-level: "debug"
|
|
```
|
|
|
|
### etcd v3.4 及更早版本
|
|
|
|
在早期的 etcd 版本中,你可以使用 API 动态更改日志级别。使用以下命令来配置调试日志:
|
|
|
|
```
|
|
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
|
|
```
|
|
|
|
要将日志级别重置回默认值 (`INFO`),你可以使用以下命令。
|
|
|
|
命令:
|
|
```
|
|
docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log
|
|
```
|
|
|
|
## etcd 内容
|
|
|
|
如果要查看 etcd 的内容,你可以查看流事件,也可以直接查询 etcd。详情请参阅以下示例。
|
|
|
|
### 查看流事件
|
|
|
|
命令:
|
|
```
|
|
docker exec etcd etcdctl watch --prefix /registry
|
|
```
|
|
|
|
如果你只想查看受影响的键(而不是二进制数据),你可以将 `| grep -a ^/registry` 尾附到该命令来过滤键。
|
|
|
|
### 直接查询 etcd
|
|
|
|
命令:
|
|
```
|
|
docker exec etcd etcdctl get /registry --prefix=true --keys-only
|
|
```
|
|
|
|
你可以使用以下命令来处理数据,从而获取每个键的计数摘要:
|
|
|
|
```
|
|
docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr
|
|
```
|
|
|
|
## 更换不健康的 etcd 节点
|
|
|
|
如果你 etcd 集群中的某个节点变得不健康,在将新的 etcd 节点添加到集群之前,我们建议你修复或删除故障/不健康的节点。
|