kubelet fails to get node status after kube-controller-manager and kube-scheduler restart

Problem description

My k8s 1.12.8 cluster (created with kops) has been running for more than 6 months. Recently, something caused both kube-scheduler and kube-controller-manager on the master to die and restart:

SyncLoop (PLEG): "kube-controller-manager-ip-x-x-x-x.z.compute.internal_kube-system(abc123)",event: &pleg.PodLifecycleEvent{ID:"abc123",Type:"ContainerDied",Data:"def456"}
hostname for pod:"kube-controller-manager-ip-x-x-x-x.z.compute.internal" was longer than 63. Truncated hostname to :"kube-controller-manager-ip-x-x-x-x.z.compute.inter"
SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(hij678)",event: &pleg.PodLifecycleEvent{ID:"hij678",Data:"890klm"}
SyncLoop (PLEG): "kube-controller-manager-ip-x-x-x-x.eu-west-2.compute.internal_kube-system(abc123)",Type:"ContainerStarted",Data:"def345"}
SyncLoop (container unhealthy): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(hjk678)"
SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z.compute.internal_kube-system(ghj567)",event: &pleg.PodLifecycleEvent{ID:"ghj567",Data:"hjk768"}

kube-schedulerkube-controller-manager重新启动以来,kubelet完全无法获取或更新任何节点状态:

Error updating node status,will retry: failed to patch status "{"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"PIDPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"DiskPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"PIDPressure"},{"lastHeartbeatTime":"2020-08-12T09:22:08Z","type":"Ready"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal/status?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Error updating node status,will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Error updating node status,will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: context deadline exceeded
Error updating node status,will retry: error getting node "ip-x-x-x-x.z.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Unable to update node status: update node status exceeds retry count

In this state, the cluster cannot perform any updates at all.
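
All of these errors are timeouts against the local apiserver endpoint (https://127.0.0.1). A minimal first check, assuming shell access to the master (the kubeconfig path below is a kops default and may differ), is to replay the failing request by hand; even an Unauthorized reply proves the apiserver is reachable, while a hang reproduces the kubelet's timeout:

# Is the apiserver answering at all on the loopback endpoint?
curl -k --max-time 10 https://127.0.0.1/healthz

# Replay the kubelet's failing GET with the kubelet's own credentials;
# -v=6 prints each HTTP request and the response code
kubectl --kubeconfig /var/lib/kubelet/kubeconfig get node ip-x-x-x-x.z.compute.internal -v=6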

  • What could cause the master to lose connectivity like this?
  • Could the second line of the first log output (the truncated hostname) be the root of the problem?
  • How can I diagnose further what is actually causing the get/update node operations to fail? (A diagnostic sketch follows this list.)
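
For the third question, a rough diagnosis path on the master itself, assuming SSH access and kops's conventional log layout (the /var/log/*.log paths are kops conventions, not guaranteed):

# Are the control-plane static pods actually running?
docker ps | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler'

# The kubelet's own account of the failures
journalctl -u kubelet --since "1 hour ago" | grep -iE 'node status|timeout'

# apiserver-side symptoms around the time of the restarts
tail -n 200 /var/log/kube-apiserver.log

# etcd trouble is a common cause of apiserver request timeouts
tail -n 200 /var/log/etcd.log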

Solution

I recall that Kubernetes limits hostnames to fewer than 64 characters. Did a hostname update happen here? If so, rebuilding the kubelet configuration with this documentation would be a good start: https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/
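
If the 63-character limit is the suspect, the length is quick to verify, and the kubelet's --hostname-override flag registers the node under an explicit name (the value shown is illustrative):

# The kubelet truncates pod hostnames longer than 63 characters; check the FQDN length
hostname -f | awk '{ print length($0) }'

# Illustrative: run the kubelet with an explicit, shorter node name
# (set this in the kubelet's systemd unit or the kops cluster spec, then restart the kubelet)
kubelet --hostname-override=ip-x-x-x-x.z.compute.internal ...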
