ArangoDB pods keep switching back and forth between Running and CrashLoopBackOff

j1dl9f46  posted on 2023-11-15

Cluster information:

  Kubernetes version:
  root@k8s-eu-1-master:~# kubectl version
  Client Version: v1.28.2
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.28.2

Cloud being used: Contabo Cloud (bare metal)
Installation method: followed the steps in https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?utm_content=cmp-true
Host OS: Ubuntu 22.04
CNI and version:

  root@k8s-eu-1-master:~# ls /etc/cni/net.d/
  10-flannel.conflist
  root@k8s-eu-1-master:~# cat /etc/cni/net.d/10-flannel.conflist
  {
    "name": "cbr0",
    "cniVersion": "0.3.1",
    "plugins": [
      {
        "type": "flannel",
        "delegate": {
          "hairpinMode": true,
          "isDefaultGateway": true
        }
      },
      {
        "type": "portmap",
        "capabilities": {
          "portMappings": true
        }
      }
    ]
  }


CRI and version:

  Container Runtime: containerd
  root@k8s-eu-1-master:~# cat /etc/containerd/config.toml | grep version
  version = 2


The pods move back and forth between the Running state and the CrashLoopBackOff state:

  root@k8s-eu-1-master:~# kubectl get pods -n kube-system
  NAME READY STATUS RESTARTS AGE
  coredns-5dd5756b68-g2bkc 1/1 Running 0 2d4h
  coredns-5dd5756b68-gt7xt 1/1 Running 0 2d4h
  etcd-k8s-eu-1-master 1/1 Running 1 (2d2h ago) 2d4h
  kube-apiserver-k8s-eu-1-master 1/1 Running 1 (2d2h ago) 2d4h
  kube-controller-manager-k8s-eu-1-master 1/1 Running 1 (2d2h ago) 2d4h
  kube-proxy-7mj86 1/1 Running 1 (2d2h ago) 2d4h
  kube-proxy-7nvv5 1/1 Running 1 (2d2h ago) 2d3h
  kube-proxy-fq6vz 1/1 Running 1 (2d2h ago) 2d4h
  kube-proxy-n2nm5 1/1 Running 1 (2d2h ago) 2d3h
  kube-proxy-qhvrn 1/1 Running 1 (2d2h ago) 2d4h
  kube-proxy-tbrn4 1/1 Running 1 (2d2h ago) 2d3h
  kube-scheduler-k8s-eu-1-master 1/1 Running 1 (2d2h ago) 2d4h
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-7f59876f78-7djdr 0/1 CrashLoopBackOff 87 (11s ago) 4h58m
  arango-storage-operator-6c7fdf5586-gjcrp 0/1 CrashLoopBackOff 83 (98s ago) 4h44m
  root@k8s-eu-1-master:~# kubectl describe pod arango-deployment-operator-7f59876f78-7djdr
  Name: arango-deployment-operator-7f59876f78-7djdr
  Namespace: default
  Priority: 0
  Service Account: arango-deployment-operator
  Node: k8s-eu-1-worker-2/xx.xxx.xxx.xxx
  Start Time: Thu, 19 Oct 2023 12:56:41 +0200
  Labels: app.kubernetes.io/instance=deployment
          app.kubernetes.io/managed-by=Tiller
          app.kubernetes.io/name=kube-arangodb
          helm.sh/chart=kube-arangodb-1.2.34
          pod-template-hash=7f59876f78
          release=deployment
  Annotations: <none>
  Status: Running
  IP: 10.244.0.6
  IPs:
    IP: 10.244.0.6
  Controlled By: ReplicaSet/arango-deployment-operator-7f59876f78
  Containers:
    operator:
      Container ID: containerd://344e2967054112557a9333332f99a8ca1dc3312285c808c727de6468f8c73381
      Image: arangodb/kube-arangodb:1.2.34
      Image ID: docker.io/arangodb/kube-arangodb@sha256:a25d031e87ba5b0f3038ce9f346553b69760a3a065fe608727cde188602b59e8
      Port: 8528/TCP
      Host Port: 0/TCP
      Args:
        --scope=legacy
        --operator.deployment
        --mode.single
        --chaos.allowed=false
        --log.level=debug
      State: Waiting
        Reason: CrashLoopBackOff
      Last State: Terminated
        Reason: Error
        Exit Code: 137
        Started: Thu, 19 Oct 2023 17:39:23 +0200
        Finished: Thu, 19 Oct 2023 17:40:22 +0200
      Ready: False
      Restart Count: 83
      Liveness: http-get https://:8528/health delay=5s timeout=1s period=10s #success=1 #failure=3
      Readiness: http-get https://:8528/ready delay=5s timeout=1s period=10s #success=1 #failure=3
      Environment:
        MY_POD_NAMESPACE: default (v1:metadata.namespace)
        MY_POD_NAME: arango-deployment-operator-7f59876f78-7djdr (v1:metadata.name)
        MY_POD_IP: (v1:status.podIP)
      Mounts:
        /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g4fbd (ro)
  Conditions:
    Type Status
    Initialized True
    Ready False
    ContainersReady False
    PodScheduled True
  Volumes:
    kube-api-access-g4fbd:
      Type: Projected (a volume that contains injected data from multiple sources)
      TokenExpirationSeconds: 3607
      ConfigMapName: kube-root-ca.crt
      ConfigMapOptional: <nil>
      DownwardAPI: true
  QoS Class: BestEffort
  Node-Selectors: <none>
  Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 5s
               node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
  Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning Unhealthy 48m (x215 over 4h48m) kubelet Liveness probe failed: Get "https://10.244.0.6:8528/health": dial tcp 10.244.0.6:8528: connect: connection refused
    Normal Pulling 28m (x77 over 4h48m) kubelet Pulling image "arangodb/kube-arangodb:1.2.34"
    Warning Unhealthy 13m (x565 over 4h48m) kubelet Readiness probe failed: Get "https://10.244.0.6:8528/ready": dial tcp 10.244.0.6:8528: connect: connection refused
    Warning BackOff 3m28s (x968 over 4h42m) kubelet Back-off restarting failed container operator in pod arango-deployment-operator-7f59876f78-7djdr_default(d1d6ec8e-b413-4ab8-84d7-8f6686cd3a8a)
  root@k8s-eu-1-master:~# kubectl logs arango-deployment-operator-7f59876f78-7djdr
  2023-10-19T15:45:24Z INF nice to meet you operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature agency-poll (deployment.feature.agency-poll) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature deployment-spec-defaults-restore (deployment.feature.deployment-spec-defaults-restore) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature encryption-rotation (deployment.feature.encryption-rotation) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature enforced-resign-leadership (deployment.feature.enforced-resign-leadership) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature ephemeral-volumes (deployment.feature.ephemeral-volumes) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature failover-leadership (deployment.feature.failover-leadership) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature force-rebuild-out-synced-shards (deployment.feature.force-rebuild-out-synced-shards) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature graceful-shutdown (deployment.feature.graceful-shutdown) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature init-containers-copy-resources (deployment.feature.init-containers-copy-resources) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature jwt-rotation (deployment.feature.jwt-rotation) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature local-storage.pass-reclaim-policy (deployment.feature.local-storage.pass-reclaim-policy) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature local-volume-replacement-check (deployment.feature.local-volume-replacement-check) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature maintenance (deployment.feature.maintenance) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature metrics-exporter (deployment.feature.metrics-exporter) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature optional-graceful-shutdown (deployment.feature.optional-graceful-shutdown) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature random-pod-names (deployment.feature.random-pod-names) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature rebalancer-v2 (deployment.feature.rebalancer-v2) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature restart-policy-always (deployment.feature.restart-policy-always) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature secured-containers (deployment.feature.secured-containers) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature sensitive-information-protection (deployment.feature.sensitive-information-protection) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature short-pod-names (deployment.feature.short-pod-names) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature timezone-management (deployment.feature.timezone-management) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature tls-rotation (deployment.feature.tls-rotation) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature tls-sni (deployment.feature.tls-sni) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature upgrade-version-check (deployment.feature.upgrade-version-check) is enabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature upgrade-version-check-v2 (deployment.feature.upgrade-version-check-v2) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Operator Feature version.3-10 (deployment.feature.version.3-10) is disabled. operator-id=7djdr
  2023-10-19T15:45:24Z INF Starting arangodb-operator (Community), version 1.2.34 build 05e58812 operator-id=7djdr pod-name=arango-deployment-operator-7f59876f78-7djdr pod-namespace=default
  2023-10-19T15:45:54Z INF Get Operations is not allowed. Continue crd=arangojobs.apps.arangodb.com operator-id=7djdr
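
For what it's worth, the Started/Finished timestamps above show the container living for about 59 seconds each time, which roughly matches the liveness probe settings (delay=5s, period=10s, #failure=3) plus the default 30s termination grace period; the exit code 137 itself only says the process was SIGKILLed, not who sent the signal:

```shell
# A container exit code above 128 encodes 128 + signal number.
# 137 therefore means SIGKILL (9); both the kernel OOM killer and the
# kubelet (after repeated liveness-probe failures) produce it.
code=137
echo "exit $code => signal $((code - 128)) (SIGKILL)"
```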


Why did both Arango pods suddenly go from Running to CrashLoopBackOff?

  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-7f59876f78-7djdr 0/1 CrashLoopBackOff 87 (100s ago) 4h59m
  arango-storage-operator-6c7fdf5586-gjcrp 0/1 CrashLoopBackOff 83 (3m7s ago) 4h45m
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-7f59876f78-7djdr 0/1 CrashLoopBackOff 89 (4m47s ago) 5h9m
  arango-storage-operator-6c7fdf5586-gjcrp 0/1 Running 86 (6m4s ago) 4h55m
  root@k8s-eu-1-master:~#


Update 1):
As suggested by @Sat21343, I defined resource (memory + CPU) requests and limits (https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1):

  containers:
    - name: operator
      # https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "526Mi"
          cpu: "500m"


I did the same in arango-storage.yaml.
But the pods still move back and forth between the Running and CrashLoopBackOff states:

  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-65cd58968f-xmz5w 0/1 Running 3 (7s ago) 3m8s
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 Running 2 (60s ago) 3m
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-65cd58968f-xmz5w 0/1 CrashLoopBackOff 5 (29s ago) 6m30s
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 CrashLoopBackOff 5 (22s ago) 6m22s
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-65cd58968f-xmz5w 0/1 Running 9 (31s ago) 19m
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 Running 9 (24s ago) 18m
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-65cd58968f-xmz5w 0/1 CrashLoopBackOff 9 (57s ago) 20m
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 CrashLoopBackOff 9 (50s ago) 20m


Update 2):
I increased the resource requests and limits to:

  containers:
    - name: operator
      # https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1
      resources:
        requests:
          memory: "1024Mi"
          cpu: "500m"
        limits:
          memory: "2048Mi"
          cpu: "1000m"


But the pods still move back and forth between the Running state and the CrashLoopBackOff state:

  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-65cd58968f-xmz5w 0/1 Terminating 294 (6m ago) 17h
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 Terminating 294 (5m53s ago) 17h
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-5bd68475b-cdr9z 0/1 Running 0 7s
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 Terminating 294 (6m ago) 17h
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-5bd68475b-cdr9z 0/1 Running 0 9s
  arango-storage-operator-58b8cb7c78-8dlb7 0/1 Terminating 294 (6m2s ago) 17h
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-5bd68475b-cdr9z 0/1 Running 5 (58s ago) 5m59s
  arango-storage-operator-5bd4546bb8-g4zr5 0/1 Running 5 (45s ago) 5m45s
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-5bd68475b-cdr9z 0/1 CrashLoopBackOff 5 (0s ago) 6m1s
  arango-storage-operator-5bd4546bb8-g4zr5 0/1 Running 5 (47s ago) 5m47s
  root@k8s-eu-1-master:~#
  root@k8s-eu-1-master:~# kubectl get pods
  NAME READY STATUS RESTARTS AGE
  arango-deployment-operator-5bd68475b-cdr9z 0/1 CrashLoopBackOff 5 (6s ago) 6m7s
  arango-storage-operator-5bd4546bb8-g4zr5 0/1 Running 5 (53s ago) 5m53s


Here is the arango-deployment.yaml file: https://drive.google.com/file/d/1VfCjQih5aJUEA4HD9ddsQDrZbLmquWIQ/view?usp=share_link
Here is the arango-storage.yaml file: https://drive.google.com/file/d/1hqHU_H2Wr5VFrJLwM9GDUHF17b7_CYIG/view?usp=sharing
I had to put the kubectl describe pod output for both pods in a txt file on Google Drive, because SOF does not accept such a long text: https://drive.google.com/file/d/1kZsYeKxOa5aSppV3IdS6c7-e8dnoLiiB/view?usp=share_link
Both pods run on the same node, k8s-eu-1-worker-1, which apparently has no memory problems: https://drive.google.com/file/d/1cjBqezlnJ9vEEnqDlM4NVh8IgcfV2v8T/view?usp=sharing
Update 3):
Thanks to @Sat21343's suggestion, I looked at the node's syslog right after the pods on that node went from Running to CrashLoopBackOff. These are the last lines of the syslog:

  Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:10.594513 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
  Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:10.594988 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
  Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:21.594619 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
  Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:21.595036 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"

The last lines of the node's syslog: https://drive.google.com/file/d/1Ov_vrjsRWrLl2er_QB3yDqkZ7yN19hc-/view?usp=sharing. About those last lines: I had deleted all the ArangoDB deployments just to clean everything up.
What am I doing wrong? How can I keep the pods in the Running state?


yduiuuwa1#

From the pod's description I can see that the pod is terminated with exit code 137, which means you have not configured the memory your container requires.
A 137 code is issued when a process is terminated externally because of its memory consumption: the operating system's out-of-memory (OOM) killer intervenes to stop the program before it destabilizes the host. Pods running in Kubernetes show a status of OOMKilled when they hit a 137 exit code.
To resolve this, I suggest you configure resource requests and limits for your containers:
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#example-1
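
One quick way to confirm whether the kill really came from the OOM killer is to check the terminated reason on the container status (the jsonpath below uses standard containerStatuses fields; the pod name is the one from your question):

```shell
# Against the live cluster you would run:
#   kubectl get pod arango-deployment-operator-7f59876f78-7djdr \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# A genuine OOM kill reports "OOMKilled"; any other value means the SIGKILL
# came from somewhere else (e.g. the kubelet). Simulated here with the value
# shown in the question's describe output:
reason="Error"
if [ "$reason" = "OOMKilled" ]; then
  echo "memory limit too low: raise resources.limits.memory"
else
  echo "reason=$reason: not reported as an OOM kill"
fi
```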


7rfyedvj2#

After a lot of log analysis, I opened an issue in the kube-arangodb GitHub repo: https://github.com/arangodb/kube-arangodb/issues/1456
But, as you can see here: https://github.com/arangodb/kube-arangodb/issues/1456#issuecomment-1779310532, the ArangoDB folks decided this is not a problem of the ArangoDB Kubernetes operator and closed my issue in the GitHub repo.
Lesson learned: the best way to solve a problem is to decide it is not a problem that needs solving... funny, isn't it?
