kubernetes [Flaking Test] [sig-node] [NodeFeature:NodeProblemDetector] should run without error

epfja78i posted 6 months ago in Kubernetes

Which jobs are failing?

ci-kubernetes-e2e-gci-gce-alpha-enabled-default

Which tests are failing?

NodeProblemDetector [NodeFeature:NodeProblemDetector] should run without error

Since when has it been failing?

For a long time.

Testgrid link

https://testgrid.k8s.io/google-gce#gci-gce-alpha-enabled-default

Reason for failure (if possible)

{ failed [FAILED] an error on the server ("Internal Error: failed to list pod stats: rpc error: code = Unknown desc = 1 error occurred:
	* failed to decode sandbox container metrics for sandbox "2dda500c81c49c73488ad52cb6a19563ac33ccd365b01ea328a1d6381c226398": ttrpc: closed: unknown") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-vgh1:10250)
In [It] at: k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:381 @ 07/16/24 21:09:33.122
}

STEP: Gather node-problem-detector cpu and memory stats - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:164 @ 07/16/24 21:06:26.828
I0716 21:09:33.122599 10712 node_problem_detector.go:381] Unexpected error: 
    <*errors.StatusError | 0xc0003c1e00>: 
    an error on the server ("Internal Error: failed to list pod stats: rpc error: code = Unknown desc = 1 error occurred:\n\t* failed to decode sandbox container metrics for sandbox \"2dda500c81c49c73488ad52cb6a19563ac33ccd365b01ea328a1d6381c226398\": ttrpc: closed: unknown") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-vgh1:10250)
    {
        ErrStatus: 
            code: 500
            details:
              causes:
              - message: "Internal Error: failed to list pod stats: rpc error: code = Unknown
                  desc = 1 error occurred:\n\t* failed to decode sandbox container metrics for
                  sandbox \"2dda500c81c49c73488ad52cb6a19563ac33ccd365b01ea328a1d6381c226398\":
                  ttrpc: closed: unknown"
                reason: UnexpectedServerResponse
              kind: nodes
              name: bootstrap-e2e-minion-group-vgh1:10250
            message: 'an error on the server ("Internal Error: failed to list pod stats: rpc error:
              code = Unknown desc = 1 error occurred:\n\t* failed to decode sandbox container
              metrics for sandbox \"2dda500c81c49c73488ad52cb6a19563ac33ccd365b01ea328a1d6381c226398\":
              ttrpc: closed: unknown") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-vgh1:10250)'
            metadata: {}
            reason: InternalError
            status: Failure,
    }
[FAILED] an error on the server ("Internal Error: failed to list pod stats: rpc error: code = Unknown desc = 1 error occurred:\n\t* failed to decode sandbox container metrics for sandbox \"2dda500c81c49c73488ad52cb6a19563ac33ccd365b01ea328a1d6381c226398\": ttrpc: closed: unknown") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-vgh1:10250)
In [It] at: k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:381 @ 07/16/24 21:09:33.122
< Exit [It] should run without error - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:63 @ 07/16/24 21:09:33.122 (3m10.139s)
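
The log shows the stats request failing with a 500 whose cause is "ttrpc: closed" while decoding sandbox metrics, which looks like a transient runtime error rather than an NPD problem. As a rough sketch only, here is one way a caller could retry on that specific error. Everything here is hypothetical: `isTransientStatsError`, `retryStats`, and `flakyFetch` are illustrative names, matching on message text is an assumption, and this is not how the e2e test currently behaves.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isTransientStatsError reports whether err looks like the kubelet failure in
// the log above. Matching on message text is an assumption for this sketch.
func isTransientStatsError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "failed to list pod stats") &&
		strings.Contains(msg, "ttrpc: closed")
}

// retryStats calls fetch up to attempts times (at least once), sleeping delay
// between tries. It retries only while the error is classified as transient.
func retryStats(attempts int, delay time.Duration, fetch func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = fetch()
		if err == nil || !isTransientStatsError(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("stats still failing after %d attempts: %w", attempts, err)
}

// flakyFetch is a hypothetical stand-in for the real stats request. It fails
// with the transient error for the first `failures` calls, then succeeds, and
// counts every call in *calls.
func flakyFetch(failures int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls <= failures {
			return errors.New("Internal Error: failed to list pod stats: " +
				"failed to decode sandbox container metrics: ttrpc: closed: unknown")
		}
		return nil
	}
}

func main() {
	calls := 0
	err := retryStats(5, 10*time.Millisecond, flakyFetch(2, &calls))
	fmt.Println("calls:", calls, "err:", err)
}
```

Retrying only on the matched message keeps other 500s visible as real failures. Whether masking this error in the test is acceptable is a separate question for the SIG.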

Anything else we need to know?

Similar issue: #122118

Relevant SIG(s)

/sig node

ozxc1zmp #1

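This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Org members can add the triage/accepted label by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.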

cetgtptt #2

/cc @humblec @SergeyKanzhelev @wangzhen127

mbzjlibv #3

The NPD-specific CI test is running fine: https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-kubernetes-gce-gci . That job runs NPD as a system daemon.
Does the gci-gce-alpha-enabled-default test run NPD as a DaemonSet? @hakman, could you help take a look?
/cc @AnishShah @DigitalVeer

aydmsdu9 #4

The Node Problem Detector (NPD) configuration for the cluster is in kubernetes/cluster/addons/node-problem-detector/npd.yaml, lines 26 to 48. It defines a DaemonSet that detects node problems in a Kubernetes cluster.

Here is that part of the file:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app.kubernetes.io/name: node-problem-detector
    app.kubernetes.io/version: v0.8.19
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-problem-detector
      app.kubernetes.io/version: v0.8.19
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-problem-detector
        app.kubernetes.io/version: v0.8.19
    spec:
      containers:
        - name: node-problem-detector
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19

According to this configuration, the DaemonSet runs the registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19 image. NPD detects problems on each node and reports them to the API server as node conditions and events.
