Which jobs are failing?
master-blocking:
- gce-cos-master-default
Which tests are failing?
Kubernetes e2e suite.[It] [sig-storage] CSI Mock volume fsgroup policies CSI FSGroupPolicy Update [LinuxOnly] should update fsGroup if update from File to default.
Kubernetes e2e suite.[It] [sig-network] Networking Granular Checks: Services should update endpoints: http
Since when has it been failing?
These tests have failed only once recently, on May 31, 2024.
Testgrid link
https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd
Reason for failure (if possible)
Two tests appear to have failed for reasons related to the containerd socket.
{ failed [FAILED] failed: writing the contents: unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory": unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
In [It] at: k8s.io/kubernetes/test/e2e/storage/csimock/csi_fsgroup_policy.go:175 @ 05/31/24 05:40:11.726
}
This is related to the CSI mock volume's behavior when the fsGroup policy is updated. It looks like a storage issue (?)
[FAIL] [sig-storage] CSI Mock volume fsgroup policies CSI FSGroupPolicy Update [LinuxOnly] [It] should update fsGroup if update from File to default [sig-storage]
k8s.io/kubernetes/test/e2e/storage/csimock/csi_fsgroup_policy.go:175
2024/05/31 06:04:10 main.go:326: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.skip=\[Driver:.gcepd\]|\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\] --minStartupPods=8 --report-dir=/logs/artifacts --disable-log-dump=true --cluster-ip-range=10.64.0.0/14: exit status 1]
subprocess.CalledProcessError: Command '('kubetest', '--dump=/logs/artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--check-leaked-resources', '--extract=ci/fast/latest-fast', '--extract-ci-bucket=k8s-release-dev', '--gcp-master-image=ubuntu', '--gcp-node-image=ubuntu', '--gcp-nodes=4', '--gcp-zone=us-west1-b', '--ginkgo-parallel=30', '--test_args=--ginkgo.skip=\\[Driver:.gcepd\\]|\\[Slow\\]|\\[Serial\\]|\\[Disruptive\\]|\\[Flaky\\]|\\[Feature:.+\\] --minStartupPods=8', '--timeout=50m')' returned non-zero exit status 1.
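The `--ginkgo.skip` regex in the failing invocation above is an ordinary ERE over spec names, so its effect can be checked locally. A small sketch, using `grep -Ev` as a stand-in for Ginkgo's filtering (the two sample spec names are illustrative):

```shell
#!/bin/sh
# Sketch: how the --ginkgo.skip regex from the job filters spec names.
# grep -Ev stands in for Ginkgo; the sample names are illustrative.
skip='\[Driver:.gcepd\]|\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]'
printf '%s\n' \
  '[sig-storage] CSI Mock volume fsgroup policies [LinuxOnly] should update fsGroup' \
  '[sig-scalability] load test [Slow]' \
  | grep -Ev "$skip"
# only the first name survives; [Slow] is skipped
```

Note that neither failing test above carries a skipped tag, which is why both ran in this job.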
Anything else we need to know?
If you want to review the historical performance of the
Networking Granular Checks: Services should update endpoints: http
test, see this issue: #123760
Relevant SIG(s)
/sig cloud-provider
/sig storage
@kubernetes/release-team-release-signal
9 answers

8e2ybdfx1#
/remove-sig storage
bwntbbo32#
Analyzing https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112
The test fails with
GetResponseFromContainer: failed to execute "curl -g -q -s ' http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1 '": unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory", stdout: "", stderr: ""
I0531 05:39:28.731652 10367 utils.go:383] encountered error during dial (did not find expected responses...
The test Pod is running on node bootstrap-e2e-minion-group-79vz:
I0531 05:39:28.904639 10367 dump.go:53] At 2024-05-31 05:36:51 +0000 UTC - event for test-container-pod: {kubelet bootstrap-e2e-minion-group-79vz} Started: Started container webserver
The containerd log on that node, https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112/artifacts/bootstrap-e2e-minion-group-79vz/containerd.log, shows that it received a SIGTERM:
May 31 05:39:25.552612 bootstrap-e2e-minion-group-79vz systemd[1]: containerd.service: Sent signal SIGTERM to main process 8648 (containerd) on client request.
and containerd is restarted
May 31 05:39:26.010977 bootstrap-e2e-minion-group-79vz containerd[8648]: time="2024-05-31T05:39:25.823748625Z" level=info msg="Stop CRI service"
May 31 05:39:34.091375 bootstrap-e2e-minion-group-79vz containerd[41887]: time="2024-05-31T05:39:34.090214404Z" level=info msg="containerd successfully booted in 2.142999s"
impacting all the tests that are running at that time.
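The dial errors in the failing tests line up with the restart window above: while containerd is down, its unix socket is unlinked, so anything dialing it in that window fails exactly this way. A minimal sketch of the failure mode, using a throwaway path rather than the real /run/containerd/containerd.sock:

```shell
#!/bin/sh
# Sketch of the failure window: while containerd restarts, its unix socket
# is gone, so any dial fails. Demo path is a stand-in, not the real socket.
SOCK=/tmp/demo-containerd.sock
rm -f "$SOCK"                      # simulate the restart window
if [ -S "$SOCK" ]; then
  echo "dial ok"
else
  echo "dial unix $SOCK: connect: no such file or directory"
fi
# → dial unix /tmp/demo-containerd.sock: connect: no such file or directory
```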
Is it the node-problem-detector that restarts containerd?
I0531 05:39:31.630170 1 log_monitor.go:159] New status generated: &{Source:systemd-monitor Events:[{Severity:warn Timestamp:2024-05-31 05:39:31.531417 +0000 UTC Reason:ContainerdStart Message:Starting containerd container runtime...}] Conditions:[]}
/sig node
/cc @SergeyKanzhelev
kulphzqa3#
Who is restarting containerd?
NPD only monitors the logs. Or do you mean that some NPD test restarts it?
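What NPD's systemd-monitor does with the journal line quoted above is pure pattern matching: it matches a rule and emits a `ContainerdStart` event, but never restarts anything itself. A sketch of that behavior (the sample line is taken from the logs above; the event format here is illustrative, not NPD's actual output):

```shell
#!/bin/sh
# Sketch of NPD's systemd-monitor behavior: pattern-match a journal line
# and emit an event. It does not restart anything. Sample line copied from
# the logs above; the event format is illustrative.
line='May 31 05:39:31 node systemd[1]: Starting containerd container runtime...'
if echo "$line" | grep -q 'Starting containerd container runtime'; then
  echo 'event: reason=ContainerdStart severity=warn'
fi
```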
zynd9foi4#
If containerd does not respond within 60 seconds, the
health-monitor.sh
script will restart it - https://github.com/kubernetes/kubernetes/blob/5bf1e95541d90e37f6c6637b5b45d8783e7907aa/cluster/gce/gci/health-monitor.sh#L46C1-L47C1

ahy6op9u5#
Hi AnishShah and SergeyKanzhelev, regarding this issue: do you think this is release-blocking? We are planning to cut v1.31.0-alpha.1 today.
qcuzuvrc6#
health-monitor.sh
calls crictl pods
to health-check containerd. That call timed out, and containerd was restarted. I don't see any sign of unhealthiness in the containerd logs. @SergeyKanzhelev mentioned that the crictl pods
command is quite memory-hungry, so this may just have been the latency of loading crictl into memory. I think it is safe not to treat this test failure as release-blocking.
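The health-monitor.sh behavior described here boils down to running a CRI command under a timeout and killing the runtime when it does not answer. A runnable sketch, with the unresponsive check simulated by `sleep` and the restart reduced to an echo (the real script runs `crictl pods` under a 60s timeout and uses `systemctl kill`):

```shell
#!/bin/sh
# Sketch of the health-monitor.sh check discussed above: run a CRI command
# under a timeout and restart the runtime if it does not answer in time.
# `sleep 5` simulates an unresponsive `crictl pods`; the restart is an echo.
check_cmd='sleep 5'                # stands in for: crictl pods
if ! timeout 1 $check_cmd >/dev/null 2>&1; then
  echo 'container runtime is unhealthy, restarting it'
  # real script (roughly): systemctl kill --kill-who=main containerd
fi
```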
We should modify
health-monitor.sh
to use a less memory-hungry health-check command.

xriantvc7#
We reviewed this at today's SIG Cloud Provider office hours, and we don't see anything outstanding related to the cloud controllers. Is there something we are missing?
r7s23pms8#
Notes from the sig-node CI meeting:
50pmv0ei9#