kubernetes [Flaking Test] gce-cos-master-default (containerd socket errors)

piok6c0g posted 5 months ago in Kubernetes

Which jobs are flaking?

master-blocking:

  • gce-cos-master-default

Which tests are flaking?

  1. Kubernetes e2e suite.[It] [sig-storage] CSI Mock volume fsgroup policies CSI FSGroupPolicy Update [LinuxOnly] should update fsGroup if update from File to default.
  2. Kubernetes e2e suite.[It] [sig-network] Networking Granular Checks: Services should update endpoints: http

Since when has it been flaking?

The test has recently failed only once, on May 31, 2024.

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd

Reason for failure (if possible)

Two tests appear to be failing for reasons related to the containerd socket.

{ failed [FAILED] failed: writing the contents: unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory": unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
In [It] at: k8s.io/kubernetes/test/e2e/storage/csimock/csi_fsgroup_policy.go:175 @ 05/31/24 05:40:11.726
}

This is related to the CSI Mock volume's behavior when updating the fsGroup policy. It looks like a storage issue (?)

[FAIL] [sig-storage] CSI Mock volume fsgroup policies CSI FSGroupPolicy Update [LinuxOnly] [It] should update fsGroup if update from File to default [sig-storage]
  k8s.io/kubernetes/test/e2e/storage/csimock/csi_fsgroup_policy.go:175
2024/05/31 06:04:10 main.go:326: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.skip=\[Driver:.gcepd\]|\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\] --minStartupPods=8 --report-dir=/logs/artifacts --disable-log-dump=true --cluster-ip-range=10.64.0.0/14: exit status 1]
subprocess.CalledProcessError: Command '('kubetest', '--dump=/logs/artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--check-leaked-resources', '--extract=ci/fast/latest-fast', '--extract-ci-bucket=k8s-release-dev', '--gcp-master-image=ubuntu', '--gcp-node-image=ubuntu', '--gcp-nodes=4', '--gcp-zone=us-west1-b', '--ginkgo-parallel=30', '--test_args=--ginkgo.skip=\\[Driver:.gcepd\\]|\\[Slow\\]|\\[Serial\\]|\\[Disruptive\\]|\\[Flaky\\]|\\[Feature:.+\\] --minStartupPods=8', '--timeout=50m')' returned non-zero exit status 1.
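Both failures are exec-over-CRI errors: kubelet tries to exec into the test pod and cannot dial /run/containerd/containerd.sock. A quick way to confirm the socket and runtime state on the affected node (assuming SSH access; these commands are illustrative and not from the original report):

# Check whether the containerd socket exists and the service is up on the node.
ls -l /run/containerd/containerd.sock
systemctl status containerd --no-pager
# Ask the runtime for its status over the CRI socket (standard crictl usage).
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info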

Anything else we need to know?

If you want to review the historical performance of the Networking Granular Checks: Services should update endpoints: http test, see this issue: #123760

Relevant SIG(s)

/sig cloud-provider
/sig storage
@kubernetes/release-team-release-signal

bwntbbo3 2#

Analyzing https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112
The test fails with
GetResponseFromContainer: failed to execute "curl -g -q -s ' http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1 '": unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory", stdout: "", stderr: ""
I0531 05:39:28.731652 10367 utils.go:383] encountered error during dial (did not find expected responses...
The Pod is running on node bootstrap-e2e-minion-group-79vz:
I0531 05:39:28.904639 10367 dump.go:53] At 2024-05-31 05:36:51 +0000 UTC - event for test-container-pod: {kubelet bootstrap-e2e-minion-group-79vz} Started: Started container webserver
The containerd log on that node (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112/artifacts/bootstrap-e2e-minion-group-79vz/containerd.log) shows it received a SIGTERM:
May 31 05:39:25.552612 bootstrap-e2e-minion-group-79vz systemd[1]: containerd.service: Sent signal SIGTERM to main process 8648 (containerd) on client request.
and containerd was restarted:
May 31 05:39:26.010977 bootstrap-e2e-minion-group-79vz containerd[8648]: time="2024-05-31T05:39:25.823748625Z" level=info msg="Stop CRI service"
May 31 05:39:34.091375 bootstrap-e2e-minion-group-79vz containerd[41887]: time="2024-05-31T05:39:34.090214404Z" level=info msg="containerd successfully booted in 2.142999s"
impacting all the tests that were running at that time.
Is it the node-problem-detector that restarts containerd?
I0531 05:39:31.630170 1 log_monitor.go:159] New status generated: &{Source:systemd-monitor Events:[{Severity:warn Timestamp:2024-05-31 05:39:31.531417 +0000 UTC Reason:ContainerdStart Message:Starting containerd container runtime...}] Conditions:[]}
/sig node
/cc @SergeyKanzhelev
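For anyone re-checking the timeline, a small sketch (assuming the artifact URL above is still reachable) that pulls the node's containerd log and greps for the restart markers quoted above:

LOG_URL="https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112/artifacts/bootstrap-e2e-minion-group-79vz/containerd.log"
curl -s "$LOG_URL" -o containerd.log
# The SIGTERM at 05:39:25 and the reboot at 05:39:34 bracket the failed execs.
grep -nE "SIGTERM|Stop CRI service|successfully booted" containerd.log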

kulphzqa 3#

Who is restarting containerd?
NPD only monitors logs. Or do you mean that some NPD test restarts it?

zynd9foi 4#

The health-monitor.sh script restarts containerd if containerd does not respond within 60 seconds - https://github.com/kubernetes/kubernetes/blob/5bf1e95541d90e37f6c6637b5b45d8783e7907aa/cluster/gce/gci/health-monitor.sh#L46C1-L47C1
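For reference, a paraphrased sketch of that restart logic (simplified, not the verbatim script; see the link above for the exact code):

# health-monitor.sh (paraphrased): loop forever, health-check the runtime with a
# 60s timeout, and ask systemd to kill containerd's main process when it fails.
while true; do
  if ! timeout 60 "${CRICTL}" pods > /dev/null; then
    echo "Container runtime containerd failed!"
    systemctl kill --kill-who=main containerd
    sleep 120   # give the runtime time to come back before checking again
  else
    sleep "${SLEEP_SECONDS}"
  fi
done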

ahy6op9u 5#

Hi AnishShah and SergeyKanzhelev. Regarding this issue, do you think this would be release-blocking? We are planning to cut v1.31.0-alpha.1 today.

qcuzuvrc 6#

health-monitor.sh health-checks containerd by invoking crictl pods. That call timed out and containerd was restarted. I don't see any reason for unhealthiness in the containerd logs. @SergeyKanzhelev mentioned that the crictl pods command is quite memory-intensive, so this could just be the latency of loading crictl into memory.
I think it is safe to not treat this test failure as release-blocking.
We should modify health-monitor.sh to use a less memory-intensive health-check command.
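One illustrative direction (not an agreed fix; the exact replacement command is an assumption): swap the sandbox-listing call for a cheaper status RPC while keeping the same 60s timeout.

# Illustrative only: these RPCs only ask the runtime for its version/status and
# avoid listing every pod sandbox, but they still pay the cost of loading the
# client binary, so whether they address the memory concern is an open question.
timeout 60 crictl --runtime-endpoint unix:///run/containerd/containerd.sock version > /dev/null
# or, using containerd's own CLI:
timeout 60 ctr --address /run/containerd/containerd.sock version > /dev/null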

xriantvc 7#

We reviewed this at today's SIG Cloud Provider office hours and did not see anything that stands out related to the cloud controllers. Is there something we are missing?

r7s23pms 8#

Here are the notes from the sig-node CI meeting:

  • This is not release-blocking.

50pmv0ei 9#

/assign @cheftako
/triage accepted
