Kubernetes kube-controller-manager and kube-scheduler: leaderelection lost, CrashLoopBackOff

gijlo24d posted on 2023-03-01 in Kubernetes

The Kubernetes kube-controller-manager and kube-scheduler pods keep restarting. Here are the pod logs.

~$ kubectl logs -n kube-system kube-scheduler-node1 -p
I1228 16:59:26.709076       1 serving.go:319] Generated self-signed cert in-memory
I1228 16:59:27.072726       1 server.go:143] Version: v1.16.0
I1228 16:59:27.072806       1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1228 16:59:27.075087       1 authorization.go:47] Authorization is disabled
W1228 16:59:27.075103       1 authentication.go:79] Authentication is disabled
I1228 16:59:27.075117       1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I1228 16:59:27.075623       1 secure_serving.go:123] Serving securely on [::]:10259
I1228 16:59:28.077293       1 leaderelection.go:241] attempting to acquire leader lease  kube-system/kube-scheduler...
E1228 16:59:45.353862       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get https://IPaddress/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1228 16:59:47.969930       1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler
I1228 17:00:42.008006       1 leaderelection.go:287] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:42.008059       1 server.go:264] leaderelection lost
~$ kubectl logs -n kube-system kube-controller-manager-node1 -p
W1228 17:00:04.721378       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="node4" does not exist
I1228 17:00:04.726825       1 shared_informer.go:204] Caches are synced for certificate 
I1228 17:00:04.732538       1 shared_informer.go:204] Caches are synced for TTL 
I1228 17:00:04.739613       1 shared_informer.go:204] Caches are synced for ClusterRoleAggregator 
I1228 17:00:04.754683       1 shared_informer.go:204] Caches are synced for certificate 
I1228 17:00:04.760101       1 shared_informer.go:204] Caches are synced for stateful set 
I1228 17:00:04.768974       1 shared_informer.go:204] Caches are synced for namespace 
I1228 17:00:04.769914       1 shared_informer.go:204] Caches are synced for deployment 
I1228 17:00:04.790541       1 shared_informer.go:204] Caches are synced for daemon sets 
I1228 17:00:04.790710       1 shared_informer.go:204] Caches are synced for ReplicationController 
I1228 17:00:04.796386       1 shared_informer.go:204] Caches are synced for disruption 
I1228 17:00:04.796403       1 disruption.go:341] Sending events to api server.
I1228 17:00:04.804131       1 shared_informer.go:204] Caches are synced for ReplicaSet 
I1228 17:00:04.806910       1 shared_informer.go:204] Caches are synced for GC 
I1228 17:00:04.809821       1 shared_informer.go:204] Caches are synced for taint 
I1228 17:00:04.809909       1 node_lifecycle_controller.go:1208] Initializing eviction metric for zone: 
W1228 17:00:04.809999       1 node_lifecycle_controller.go:903] Missing timestamp for Node node3. Assuming now as a timestamp.
W1228 17:00:04.810038       1 node_lifecycle_controller.go:903] Missing timestamp for Node node4. Assuming now as a timestamp.
W1228 17:00:04.810065       1 node_lifecycle_controller.go:903] Missing timestamp for Node node1. Assuming now as a timestamp.
W1228 17:00:04.810086       1 node_lifecycle_controller.go:903] Missing timestamp for Node node2. Assuming now as a timestamp.
I1228 17:00:04.810101       1 node_lifecycle_controller.go:1108] Controller detected that zone  is now in state Normal.
I1228 17:00:04.810145       1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node2", UID:"68d34fcf-fd86-42a5-9833-57108c93baee", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node2 event: Registered Node node2 in Controller
I1228 17:00:04.810164       1 taint_manager.go:186] Starting NoExecuteTaintManager
I1228 17:00:04.810224       1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node3", UID:"dc80b75f-ce55-4247-84e3-bf0474ac1057", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node3 event: Registered Node node3 in Controller
I1228 17:00:04.810233       1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node4", UID:"c9d859df-795e-4b2a-9def-08efc67ba4e3", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node4 event: Registered Node node4 in Controller
I1228 17:00:04.810242       1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node1", UID:"8bfe45c3-2ce7-4013-a11f-c1ac052e9e00", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node1 event: Registered Node node1 in Controller
I1228 17:00:04.811241       1 shared_informer.go:204] Caches are synced for node 
I1228 17:00:04.811367       1 range_allocator.go:172] Starting range CIDR allocator
I1228 17:00:04.811381       1 shared_informer.go:197] Waiting for caches to sync for cidrallocator
I1228 17:00:04.859423       1 shared_informer.go:204] Caches are synced for HPA 
I1228 17:00:04.911545       1 shared_informer.go:204] Caches are synced for cidrallocator 
I1228 17:00:04.997853       1 shared_informer.go:204] Caches are synced for bootstrap_signer 
I1228 17:00:05.023218       1 shared_informer.go:204] Caches are synced for expand 
I1228 17:00:05.030277       1 shared_informer.go:204] Caches are synced for PV protection 
I1228 17:00:05.059763       1 shared_informer.go:204] Caches are synced for endpoint 
I1228 17:00:05.060705       1 shared_informer.go:204] Caches are synced for persistent volume 
I1228 17:00:05.118184       1 shared_informer.go:204] Caches are synced for attach detach 
I1228 17:00:05.246897       1 shared_informer.go:204] Caches are synced for job 
I1228 17:00:05.248850       1 shared_informer.go:204] Caches are synced for resource quota 
I1228 17:00:05.257547       1 shared_informer.go:204] Caches are synced for garbage collector 
I1228 17:00:05.257566       1 garbagecollector.go:139] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I1228 17:00:05.260287       1 shared_informer.go:204] Caches are synced for resource quota 
I1228 17:00:05.305093       1 shared_informer.go:204] Caches are synced for garbage collector 
I1228 17:00:44.906594       1 leaderelection.go:287] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:44.906687       1 controllermanager.go:279] leaderelection lost

Answer #1 (6rqinv9w)

    • The problem was resolved after increasing the nodes' CPU and memory.

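This issue appears when the cluster runs into resource pressure or network problems. In my case the kube-apiserver was short on resources, which increased the latency of API calls until the leader-election calls timed out.
K8s API server logs: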

apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
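
A minimal Python sketch of why added API latency ends in "leaderelection lost": it models a single lease-renewal cycle, loosely following the retry loop in client-go's leader elector. The leader keeps retrying the renew call; if no call succeeds before the renew deadline, it gives up, which is when the scheduler and controller manager log "failed to renew lease ... context deadline exceeded" followed by the fatal "leaderelection lost", and the pod restarts. The helper name renew_cycle_succeeds and its (latency, ok) input are hypothetical, and the timing is simplified (each retry starts one retry period after the previous call returns). The 10s and 2s defaults correspond to --leader-elect-renew-deadline and --leader-elect-retry-period.

```python
from typing import Iterable, Tuple


def renew_cycle_succeeds(
    attempts: Iterable[Tuple[float, bool]],
    renew_deadline: float = 10.0,
    retry_period: float = 2.0,
) -> bool:
    """Simplified model of one leader-lease renewal cycle (hypothetical helper).

    attempts: (latency_seconds, ok) for each successive renew call the leader
    makes against the API server. Returns True if a call succeeds before
    renew_deadline; False means the leader gives up leadership.
    """
    t = 0.0  # seconds elapsed since this renewal cycle started
    for latency, ok in attempts:
        if t >= renew_deadline:
            return False  # deadline passed while waiting for the next retry
        t += latency
        if t >= renew_deadline:
            return False  # deadline passed while the call was still in flight
        if ok:
            return True  # lease renewed in time
        t += retry_period  # failed call: wait one retry period, then retry
    return False  # no successful renewal among the given attempts


# A healthy API server answers quickly.
print(renew_cycle_succeeds([(0.1, True)]))  # True
# An overloaded API server: two slow failures use up the whole 10s budget.
slow = [(3.0, False), (3.0, False), (0.5, True)]
print(renew_cycle_succeeds(slow))  # False
# The same latencies survive a longer renew deadline (e.g. 60s).
print(renew_cycle_succeeds(slow, renew_deadline=60.0))  # True
```

In this model both fixes from the answers help: a less loaded API server shortens each call, and a longer renew deadline gives slow calls more time.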

Answer #2 (cqoc49vn)

In my case it was a network issue. The fix was to increase --leader-elect-lease-duration and --leader-elect-renew-deadline in the kube-controller-manager.yaml manifest.

--leader-elect-lease-duration duration     Default: 15s
--leader-elect-renew-deadline duration     Default: 10s

I increased them to 120s and 60s respectively to see whether that helped.

# grep leader-elect /etc/kubernetes/manifests/kube-controller-manager.yaml
    - --leader-elect=true
    - --leader-elect-lease-duration=120s
    - --leader-elect-renew-deadline=60s

Make sure the lease duration is greater than the renew deadline.
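
The rule above exists because, roughly, the other candidates wait a full lease duration before taking over, while the current leader stops retrying after the renew deadline; with lease > renew, the old leader gives up before a new one can take the lease. A minimal Python sketch for checking an edited manifest: it scans the text for unquoted "- --flag=value" args (a plain line scan, not a YAML parser), falls back to the 15s/10s defaults when a flag is missing, parses a subset of Go duration strings (h, m, s, ms), and raises if the lease duration is not greater than the renew deadline. The function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# Documented defaults for the two flags.
DEFAULTS: Dict[str, str] = {
    "--leader-elect-lease-duration": "15s",
    "--leader-elect-renew-deadline": "10s",
}

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" must come before "m", otherwise "500ms" would match "500m" and leave
# an unparseable "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> float:
    """Parse a subset of Go duration strings ("15s", "2m", "1m30s") into seconds."""
    pos, total = 0, 0.0
    for match in _PART.finditer(text):
        if match.start() != pos:  # reject gaps such as "1x5s"
            raise ValueError(f"bad duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"bad duration: {text!r}")
    return total


def leader_elect_durations(manifest_text: str) -> Tuple[float, float]:
    """Return (lease_duration, renew_deadline) in seconds from manifest text."""
    values = dict(DEFAULTS)
    for line in manifest_text.splitlines():
        arg = line.strip()
        if arg.startswith("- "):  # YAML list item holding a container arg
            arg = arg[2:].strip()
        flag, sep, value = arg.partition("=")
        if sep and flag in values:
            values[flag] = value.strip()
    lease = parse_duration(values["--leader-elect-lease-duration"])
    renew = parse_duration(values["--leader-elect-renew-deadline"])
    if lease <= renew:
        raise ValueError(
            f"lease duration ({lease}s) must be greater than renew deadline ({renew}s)"
        )
    return lease, renew


manifest = """
    - --leader-elect=true
    - --leader-elect-lease-duration=120s
    - --leader-elect-renew-deadline=60s
"""
print(leader_elect_durations(manifest))  # (120.0, 60.0)
```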
