121. Troubleshooting etcd Time Synchronisation Issues

张开发
2026/4/8 23:19:08, 15-minute read


Situation

etcd is a critical component in Kubernetes environments, serving as the distributed, reliable key-value store that holds the cluster's state. Its proper functioning depends heavily on accurate time synchronisation across all etcd nodes. Even a small clock drift can lead to significant issues, impacting cluster stability and performance.

Time synchronisation issues in etcd can manifest in various ways, often leading to a cascade of errors throughout the cluster. Common symptoms include:

- High clock drift errors in etcd logs: you'll see warnings like "prober found high clock drift".
- Slow etcd requests: messages such as "apply request took too long" or "waiting for ReadIndex response took too long" indicate that etcd operations are being delayed, often due to inconsistencies between nodes. These messages can also indicate other etcd issues, but they commonly appear when a time synchronisation error is present.
- Kubernetes API server timeouts: kube-apiserver might struggle to connect to etcd, resulting in "http: Handler timeout" errors.
- Unhealthy etcd nodes: etcd nodes appear unhealthy or flapping (losing and regaining connection).
- Raft misalignment: inconsistent Raft indexes among etcd members.

Resolution

To resolve a time synchronisation issue, follow these steps:

1. Check the current time and synchronisation status on all etcd nodes.

   Check the current time:

       date +%T.%N

   Execute this command simultaneously on all etcd nodes to quickly spot any significant differences. Even differences of a few seconds can be critical for etcd.

   Check the status of your NTP client. E.g., for chrony:

       timedatectl
       chronyc sources
       chronyc tracking

2. Based on your verification, identify potential clock misalignments and fix the issue:

   - Review your NTP configuration: examine /etc/chrony.conf (for chrony) or /etc/ntp.conf (for ntpd) on each etcd node. Ensure that the configured NTP servers are correct and reachable. It's recommended to use reliable and accessible NTP sources.
   - Check firewall rules: verify that UDP port 123 is open in the firewall configuration on all etcd nodes, so that NTP traffic can flow to and from your configured time servers.

3. Once you've corrected the configuration, force the NTP client to resynchronise the time:

       sudo systemctl restart chronyd

   Important: when restarting the time synchronisation service, it's generally safer to do it one etcd node at a time. A temporary clock adjustment might occur on the restarted node, and restarting the service on all nodes simultaneously could introduce further instability if a significant time jump occurs across the entire etcd cluster at once.

4. After ensuring that time synchronisation is correct on all etcd nodes, verify the health of your etcd cluster and monitor the etcd logs for any recurring "high clock drift" or "took too long" errors. These should disappear once time synchronisation is stable.

Cause

The primary cause of time synchronisation issues on etcd nodes is often a misconfigured or non-functional NTP (Network Time Protocol) client, such as chrony or ntpd. Specific problems can include:

- Incorrect NTP server configuration: the chrony.conf or ntp.conf file might point to incorrect or unreachable NTP servers.
- Firewall rules: necessary firewall rules (e.g., UDP port 123 for NTP) might be missing, preventing the nodes from reaching the configured time servers.
- Network connectivity issues: general network problems can also prevent NTP synchronisation.

Additional Information

Environment

An RKE or RKE2 cluster with multiple etcd nodes.

Visit the Rancher-K8S solutions blog (enterprise partner): https://blog.csdn.net/lidw2009
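The first check in the Resolution section (running date on every etcd node at roughly the same moment and comparing the results) can be sketched as a small script. This is only an illustration under stated assumptions, not part of the original procedure: it assumes key-based SSH access to each node, GNU date on the nodes (for %N), and awk on the machine running the script. The function names clock_spread and collect_times are made up for this example.

```shell
#!/bin/sh
# Sketch: estimate the clock spread across etcd nodes.
# Assumptions (not from the article): key-based SSH access to every node,
# GNU date on the nodes (for %N), and awk on the machine running this script.
set -eu

# clock_spread reads "<node> <epoch-seconds>" lines on stdin and prints the
# difference, in seconds, between the latest and the earliest timestamp.
clock_spread() {
  LC_ALL=C awk '
    NF < 2  { next }                 # skip malformed lines
            { t = $2 + 0 }           # force a numeric value
    n == 0  { min = t; max = t }     # first valid sample
    t < min { min = t }
    t > max { max = t }
            { n++ }
    END     { if (n > 0) printf "%.3f\n", max - min }
  '
}

# collect_times queries every node given as an argument in parallel, so the
# samples are taken at nearly the same moment, and prints "<node> <epoch>".
collect_times() {
  for node in "$@"; do
    (
      if ts=$(ssh -n -o ConnectTimeout=5 "$node" date +%s.%N); then
        printf '%s %s\n' "$node" "$ts"
      else
        echo "warning: could not read time from $node" >&2
      fi
    ) &
  done
  wait
}

# Only contact nodes when host names are passed on the command line.
if [ "$#" -gt 0 ]; then
  collect_times "$@" | clock_spread
fi
```

Saved under a hypothetical name such as etcd-clock-spread.sh, it could be run as: sh etcd-clock-spread.sh etcd-1 etcd-2 etcd-3 (the host names are placeholders). SSH connection latency adds noise, so a result of a few hundred milliseconds is only indicative; a spread of a second or more is worth following up with chronyc tracking on each node.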
