Problem Description
The Kubernetes kube-controller-manager and kube-scheduler keep restarting. The pod logs are below.
```
~$ kubectl logs -n kube-system kube-scheduler-node1 -p
I1228 16:59:26.709076 1 serving.go:319] Generated self-signed cert in-memory
I1228 16:59:27.072726 1 server.go:143] Version: v1.16.0
I1228 16:59:27.072806 1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1228 16:59:27.075087 1 authorization.go:47] Authorization is disabled
W1228 16:59:27.075103 1 authentication.go:79] Authentication is disabled
I1228 16:59:27.075117 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I1228 16:59:27.075623 1 secure_serving.go:123] Serving securely on [::]:10259
I1228 16:59:28.077293 1 leaderelection.go:241] attempting to acquire leader lease kube-system/kube-scheduler...
E1228 16:59:45.353862 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get https://IPaddress/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1228 16:59:47.969930 1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler
I1228 17:00:42.008006 1 leaderelection.go:287] Failed to renew lease kube-system/kube-scheduler: Failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:42.008059 1 server.go:264] leaderelection lost
```
```
~$ kubectl logs -n kube-system kube-controller-manager-node1 -p
W1228 17:00:04.721378 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="node4" does not exist
I1228 17:00:04.726825 1 shared_informer.go:204] Caches are synced for certificate
I1228 17:00:04.732538 1 shared_informer.go:204] Caches are synced for TTL
I1228 17:00:04.739613 1 shared_informer.go:204] Caches are synced for ClusterRoleAggregator
I1228 17:00:04.754683 1 shared_informer.go:204] Caches are synced for certificate
I1228 17:00:04.760101 1 shared_informer.go:204] Caches are synced for stateful set
I1228 17:00:04.768974 1 shared_informer.go:204] Caches are synced for namespace
I1228 17:00:04.769914 1 shared_informer.go:204] Caches are synced for deployment
I1228 17:00:04.790541 1 shared_informer.go:204] Caches are synced for daemon sets
I1228 17:00:04.790710 1 shared_informer.go:204] Caches are synced for ReplicationController
I1228 17:00:04.796386 1 shared_informer.go:204] Caches are synced for disruption
I1228 17:00:04.796403 1 disruption.go:341] Sending events to api server.
I1228 17:00:04.804131 1 shared_informer.go:204] Caches are synced for replicaset
I1228 17:00:04.806910 1 shared_informer.go:204] Caches are synced for GC
I1228 17:00:04.809821 1 shared_informer.go:204] Caches are synced for taint
I1228 17:00:04.809909 1 node_lifecycle_controller.go:1208] Initializing eviction metric for zone:
W1228 17:00:04.809999 1 node_lifecycle_controller.go:903] Missing timestamp for Node node3. Assuming now as a timestamp.
W1228 17:00:04.810038 1 node_lifecycle_controller.go:903] Missing timestamp for Node node4. Assuming now as a timestamp.
W1228 17:00:04.810065 1 node_lifecycle_controller.go:903] Missing timestamp for Node node1. Assuming now as a timestamp.
W1228 17:00:04.810086 1 node_lifecycle_controller.go:903] Missing timestamp for Node node2. Assuming now as a timestamp.
I1228 17:00:04.810101 1 node_lifecycle_controller.go:1108] Controller detected that zone is now in state Normal.
I1228 17:00:04.810145 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node2", UID:"68d34fcf-fd86-42a5-9833-57108c93baee", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node2 event: Registered Node node2 in Controller
I1228 17:00:04.810164 1 taint_manager.go:186] Starting NoExecuteTaintManager
I1228 17:00:04.810224 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Name:"node3", UID:"dc80b75f-ce55-4247-84e3-bf0474ac1057", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node3 event: Registered Node node3 in Controller
I1228 17:00:04.810233 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Name:"node4", UID:"c9d859df-795e-4b2a-9def-08efc67ba4e3", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node4 event: Registered Node node4 in Controller
I1228 17:00:04.810242 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Name:"node1", UID:"8bfe45c3-2ce7-4013-a11f-c1ac052e9e00", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node1 event: Registered Node node1 in Controller
I1228 17:00:04.811241 1 shared_informer.go:204] Caches are synced for node
I1228 17:00:04.811367 1 range_allocator.go:172] Starting range CIDR allocator
I1228 17:00:04.811381 1 shared_informer.go:197] Waiting for caches to sync for cidrallocator
I1228 17:00:04.859423 1 shared_informer.go:204] Caches are synced for HPA
I1228 17:00:04.911545 1 shared_informer.go:204] Caches are synced for cidrallocator
I1228 17:00:04.997853 1 shared_informer.go:204] Caches are synced for bootstrap_signer
I1228 17:00:05.023218 1 shared_informer.go:204] Caches are synced for expand
I1228 17:00:05.030277 1 shared_informer.go:204] Caches are synced for PV protection
I1228 17:00:05.059763 1 shared_informer.go:204] Caches are synced for endpoint
I1228 17:00:05.060705 1 shared_informer.go:204] Caches are synced for persistent volume
I1228 17:00:05.118184 1 shared_informer.go:204] Caches are synced for attach detach
I1228 17:00:05.246897 1 shared_informer.go:204] Caches are synced for job
I1228 17:00:05.248850 1 shared_informer.go:204] Caches are synced for resource quota
I1228 17:00:05.257547 1 shared_informer.go:204] Caches are synced for garbage collector
I1228 17:00:05.257566 1 garbagecollector.go:139] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I1228 17:00:05.260287 1 shared_informer.go:204] Caches are synced for resource quota
I1228 17:00:05.305093 1 shared_informer.go:204] Caches are synced for garbage collector
I1228 17:00:44.906594 1 leaderelection.go:287] Failed to renew lease kube-system/kube-controller-manager: Failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:44.906687 1 controllermanager.go:279] leaderelection lost
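Note that both components end with a fatal `F...` "leaderelection lost" line rather than crashing: the process deliberately exits when it cannot renew its leader lease, and the kubelet then restarts it. As a quick sanity check (a minimal sketch; in a live cluster the input would come from `kubectl logs ... -p` as above), the previous container log can be grepped for that signature to rule out other restart causes such as an OOM kill:

```shell
# Sample lines copied from the scheduler log above; in practice, pipe in
# the output of: kubectl logs -n kube-system kube-scheduler-node1 -p
log='I1228 17:00:42.008006 1 leaderelection.go:287] Failed to renew lease kube-system/kube-scheduler: Failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:42.008059 1 server.go:264] leaderelection lost'

# A fatal (F...) "leaderelection lost" line means the process aborted itself
# after missing its lease renew deadline, rather than being killed externally.
printf '%s\n' "$log" | grep -c '^F.*leaderelection lost'
```

A count of one or more confirms the restarts are self-inflicted by the leader-election watchdog.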
Solution
The problem was resolved after increasing the node's CPU and memory.
This problem appears under resource pressure or network trouble. In my case, the leader-election API calls timed out because the kube-apiserver was starved for resources, which drove up API call latency.
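The timeline in the logs matches the default leader-election settings for these components (lease duration 15 s, renew deadline 10 s, retry period 2 s): if the apiserver cannot answer a renew request within the 10 s renew deadline, the component exits by design, which is exactly what the fatal "leaderelection lost" line records. A rough sketch of the renewal budget:

```shell
# Default leader-election settings for kube-scheduler and
# kube-controller-manager (tunable via the --leader-elect-* flags).
lease_duration=15   # seconds a standby must wait before taking over the lease
renew_deadline=10   # seconds the holder has to renew before giving up
retry_period=2      # seconds between renewal attempts

# Renewal attempts that fit inside the renew deadline; each attempt is an
# apiserver call, so sustained per-call latency near the retry period
# quickly exhausts the budget and the component exits.
echo $(( renew_deadline / retry_period ))
```

With only a handful of attempts available, even a brief apiserver slowdown under resource pressure is enough to trip the deadline.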
Kubernetes API server logs:
```
apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
```
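If the node cannot be resized right away, a stopgap is to relax the leader-election timeouts so that slow apiserver responses do not trip the renew deadline. The sketch below assumes a kubeadm cluster, where static pod manifests live under /etc/kubernetes/manifests/ and the kubelet restarts the pod when the file changes; the `--leader-elect-*` flag names are real kube-scheduler flags, but the values here are illustrative (roughly double the 15 s/10 s/2 s defaults):

```yaml
# Excerpt of /etc/kubernetes/manifests/kube-scheduler.yaml (kubeadm layout).
spec:
  containers:
  - command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    # Illustrative values; renew-deadline must stay below lease-duration.
    - --leader-elect-lease-duration=30s
    - --leader-elect-renew-deadline=20s
    - --leader-elect-retry-period=4s
```

The same flags exist on kube-controller-manager. Longer timeouts only buy headroom, though; the underlying apiserver resource shortage still needs to be fixed.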