What happened?
While initializing a cluster with kubeadm init --pod-network-cidr 10.112.0.0/12 --service-cidr 10.16.0.0/12 --apiserver-advertise-address 172.X.X.X --v=5, the kubelet is started during the wait-control-plane phase and is expected to bring up the control-plane pods as static Pods. However, kubeadm times out while waiting for the kubelet to become healthy:
[kubelet-start] Starting the kubelet
I0419 13:28:21.800518 83681 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
...
Looking at the kubelet logs (the logs below are from after the node registered):
Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.049050 83794 kubelet_node_status.go:76] "Successfully registered node" node="avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.267975 83794 apiserver.go:52] "Watching apiserver"
Apr 19 13:28:28 avije-master kubelet[83794]: I0419 13:28:28.290394 83794 desired_state_of_world_populator.go:159] "Finished populating initial desired state of world"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.400654 83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-apiserver-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-apiserver-avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.527008 83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-controller-manager-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-controller-manager-avijeh-master"
Apr 19 13:28:28 avije-master kubelet[83794]: E0419 13:28:28.551525 83794 kubelet.go:1921] "Failed creating a mirror pod for" err="pods \"kube-scheduler-avijeh-master\" is forbidden: no PriorityClass with name system-node-critical was found" pod="kube-system/kube-scheduler-avijeh-master"
Apr 19 13:28:36 avije-master kubelet[83794]: I0419 13:28:36.666828 83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/etcd-avijeh-master" podStartSLOduration=2.6666824719999997 podStartE2EDuration="2.666682472s" podCreationTimestamp="2024-04-19 13:28:34 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:34.456833141 +0000 UTC m=+12.642373986" watchObservedRunningTime="2024-04-19 13:28:36.666682472 +0000 UTC m=+14.852223314"
Apr 19 13:28:38 avije-master kubelet[83794]: I0419 13:28:38.545904 83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/kube-apiserver-avijeh-master" podStartSLOduration=2.545820659 podStartE2EDuration="2.545820659s" podCreationTimestamp="2024-04-19 13:28:36 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:36.694198023 +0000 UTC m=+14.879738885" watchObservedRunningTime="2024-04-19 13:28:38.545820659 +0000 UTC m=+16.731361485"
Apr 19 13:28:43 avije-master kubelet[83794]: I0419 13:28:43.359153 83794 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/kube-controller-manager-avijeh-master" podStartSLOduration=5.358999714 podStartE2EDuration="5.358999714s" podCreationTimestamp="2024-04-19 13:28:38 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-04-19 13:28:43.358794059 +0000 UTC m=+21.544334918" watchObservedRunningTime="2024-04-19 13:28:43.358999714 +0000 UTC m=+21.544540543"
Apr 19 13:29:23 avije-master kubelet[83794]: E0419 13:29:23.304870 83794 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found" containerID="95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2"
Apr 19 13:29:23 avije-master kubelet[83794]: I0419 13:29:23.304969 83794 kuberuntime_gc.go:360] "Error getting ContainerStatus for containerID" containerID="95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2" err="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found"
Apr 19 13:29:23 avije-master kubelet[83794]: E0419 13:29:23.306097 83794 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found" containerID="eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5"
Apr 19 13:29:23 avije-master kubelet[83794]: I0419 13:29:23.306185 83794 kuberuntime_gc.go:360] "Error getting ContainerStatus for containerID" containerID="eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5" err="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found"
Apr 19 13:30:23 avije-master kubelet[83794]: E0419 13:30:23.361448 83794 kubelet_node_status.go:456] "Node not becoming ready in time after startup"
Apr 19 13:30:23 avije-master kubelet[83794]: E0419 13:30:23.420288 83794 kubelet.go:2892] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Apr 19 13:30:28 avije-master kubelet[83794]: E0419 13:30:28.422517 83794 kubelet.go:2892] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
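As a side note, the early "no PriorityClass with name system-node-critical was found" errors can be cross-checked once the API server starts answering; a minimal sketch, assuming the default kubeadm admin kubeconfig path:
$ kubectl --kubeconfig /etc/kubernetes/admin.conf get priorityclass system-node-critical
...
These mirror-pod errors are typically transient while the API server is still creating its default priority classes, so they may not be the actual blocker here.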
At this point, according to containerd, the required containers have been started:
$ crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
cd75ded425c7e 3861cfcd7c04c About a minute ago Running etcd 82 5d5eff6d06dac etcd-master
251638dc7ae54 e444022412717 About a minute ago Running kube-scheduler 0 463a0e09c6d7d kube-scheduler-master
c232a00ad060c 48ad18e13fb4f About a minute ago Running kube-controller-manager 0 1da3b8f00b55b kube-controller-manager-master
de2982e4c6a16 7ae3494460614 About a minute ago Running kube-apiserver 1 42fbbede44751 kube-apiserver-master
However, the containers the kubelet is trying to query are not among the containers created by containerd!
The containerd logs confirm this as well:
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.205106219Z" level=info msg="CreateContainer within sandbox \"5d5eff6d06dacfd8b07d5f441cf5e76697a896f8d9f111ff07e716ffe1179c5d\" for container &ContainerMetadata{Name:etcd,Attempt:82,}"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.242328418Z" level=info msg="CreateContainer within sandbox \"5d5eff6d06dacfd8b07d5f441cf5e76697a896f8d9f111ff07e716ffe1179c5d\" for &ContainerMetadata{Name:etcd,Attempt:82,} returns container id \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\""
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.243302417Z" level=info msg="StartContainer for \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\""
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.395299613Z" level=info msg="StartContainer for \"de2982e4c6a161b4328e1f8545b51a9e4ee8b194a88bc8854beb4d1f21ff1ff5\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.404490886Z" level=info msg="StartContainer for \"c232a00ad060cc4ba8837a51b0a3e70bdbbcc7e8066caaba01e1943204666e47\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.459529956Z" level=info msg="StartContainer for \"251638dc7ae54cf85a671493cfb9f28eeb724c86a44480e1da8032295e8547fc\" returns successfully"
Apr 19 13:28:24 avije-master containerd[80846]: time="2024-04-19T13:28:24.522611185Z" level=info msg="StartContainer for \"cd75ded425c7e18e40958266e8e0abb551a8a9864f2d4c57b3a4547c26255ef1\" returns successfully"
Apr 19 13:29:23 avije-master containerd[80846]: time="2024-04-19T13:29:23.304289064Z" level=error msg="ContainerStatus for \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"95685c1b9abef2e5cb5d7fc82874b1f5879244d16f12d4773ddc63b7d9e57dc2\": not found"
Apr 19 13:29:23 avije-master containerd[80846]: time="2024-04-19T13:29:23.305689987Z" level=error msg="ContainerStatus for \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"eb5090a7b6984cea307e3b89205512fc4df427d459ba77d3f4ad3c9609f710c5\": not found"
This behavior causes the cluster initialization to fail.
What did you expect to happen?
The kubelet correctly verifies the readiness of the created containers, kubeadm verifies the kubelet's health, and the rest of the initialization process proceeds.
How can we reproduce it (as minimally and precisely as possible)?
Run kubeadm init with the appropriate standard flags (the exact invocation used here is shown below) and wait for it to reach the wait-control-plane phase.
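For reference, the invocation from this report (the advertise address is redacted as in the original):
$ kubeadm init --pod-network-cidr 10.112.0.0/12 --service-cidr 10.16.0.0/12 --apiserver-advertise-address 172.X.X.X --v=5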
Tail the kubelet & containerd logs:
$ journalctl -xeu kubelet -f
...
$ journalctl -xeu containerd -f
...
Also keep track of the containerd containers:
$ crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a
...
Anything else we need to know?
In case of any doubts (verification commands are sketched after this list):
- Swap is off.
- SystemdCgroup is set to true in /etc/containerd/config.toml.
- SELinux is turned off using setenforce 0.
- runc version: 1.1.12
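For reference, these points can be verified with something like the following (a sketch; it assumes util-linux and the SELinux utilities are installed, and that containerd uses the stock config.toml layout):
$ swapon --show      # no output means swap is off
$ getenforce         # reports Permissive after setenforce 0
$ grep SystemdCgroup /etc/containerd/config.toml
            SystemdCgroup = true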
Kubernetes version
$ kubectl version
Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Cloud provider
Self-hosted
OS version
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux master 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
containerd
$ containerd -v
containerd github.com/containerd/containerd v1.7.15 926c9586fe4a6236699318391cd44976a98e31f1
Related plugins (CNI, CSI, ...) and versions (if applicable)
No CSI/CNI installed.
5 answers
aor9mmx11#
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
jei2mxaa2#
/sig node
pbwdgjma3#
/sig cluster-lifecycle
/remove-sig node
Someone from sig-cluster-lifecycle can triage this first
r8xiu3jd4#
However, the containers the kubelet is trying to query are not among the containers created by containerd!
The containerd logs confirm this as well:
It seems like something went wrong in the kubelet/containerd.
At that point, kubeadm simply waits for the kubelet to report 200 on its /healthz endpoint. If it cannot report 200, kubeadm exits with an error.
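For reference, that endpoint can be checked by hand; a minimal sketch, assuming the default kubelet healthz address and port of 127.0.0.1:10248:
$ curl -sS http://127.0.0.1:10248/healthz
A healthy kubelet answers with HTTP 200 and the body "ok", which is what kubeadm waits for here.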
/remove-sig cluster-lifecycle
/sig node
/kind support
iq0todco5#
I am using crio 1.26.1 with k8s 1.26.4 and see the same behavior, so it does not seem to depend on the k8s version; something else is causing this issue.