What happened?
When using the ResourceQuota admission controller with extended resources (nvidia.com/gpu), the reported used resources are inconsistent and wrong.
What did you expect to happen?
The reported used extended resources should reflect the current state of the namespace.
How can we reproduce it (as minimally and precisely as possible)?
Apply the following ResourceQuota:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: capsule-skg00000-2
spec:
  hard:
    requests.nvidia.com/gpu: "10"
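For reference, a minimal sketch of how the quota can be applied and read back; the namespace skg00000-gpu is taken from the output below, and the file name resourcequota.yaml is an assumption.

$ kubectl apply -f resourcequota.yaml -n skg00000-gpu
$ kubectl get resourcequota capsule-skg00000-2 -n skg00000-gpu -o yaml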
Check that the ResourceQuota is applied correctly and reports the expected status:
apiVersion: v1
kind: ResourceQuota
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ResourceQuota","metadata":{"annotations":{},"name":"capsule-skg00000-2","namespace":"skg00000-gpu"},"spec":{"hard":{"requests.nvidia.com/gpu":"10"}}}
  creationTimestamp: "2024-05-26T17:00:28Z"
  labels:
    capsule.clastix.io/managed-by: skg00000
  name: capsule-skg00000-2
  namespace: skg00000-gpu
  resourceVersion: "8225530"
  uid: 3badf2c2-9959-4882-9315-7ffff127e08e
spec:
  hard:
    requests.nvidia.com/gpu: "10"
status:
  hard:
    requests.nvidia.com/gpu: "10"
  used:
    requests.nvidia.com/gpu: "0"
Create a Deployment whose pod template requests the extended resource (GPU).
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - sleep 3600
    command:
    - /bin/bash
    - -c
    - --
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    imagePullPolicy: Always
    name: nvidia
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
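For completeness, a sketch of a full Deployment wrapping the pod spec above; the name gpu-workload comes from the scale commands further down, while the labels and replica count are assumptions.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
  namespace: skg00000-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-workload          # label is an assumption
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      automountServiceAccountToken: false
      containers:
      - name: nvidia
        image: nvidia/cuda:11.0.3-base-ubuntu20.04
        imagePullPolicy: Always
        command: ["/bin/bash", "-c", "--"]
        args: ["sleep 3600"]
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"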
Once the Pod has been deployed, the ResourceQuota reports a wrong amount of used resources:
apiVersion: v1
kind: ResourceQuota
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ResourceQuota","metadata":{"annotations":{},"name":"capsule-skg00000-2","namespace":"skg00000-gpu"},"spec":{"hard":{"requests.nvidia.com/gpu":"10"}}}
  creationTimestamp: "2024-05-26T17:00:28Z"
  labels:
    capsule.clastix.io/managed-by: skg00000
  name: capsule-skg00000-2
  namespace: skg00000-gpu
  resourceVersion: "8225807"
  uid: 3badf2c2-9959-4882-9315-7ffff127e08e
spec:
  hard:
    requests.nvidia.com/gpu: "10"
status:
  hard:
    requests.nvidia.com/gpu: "10"
  used:
    requests.nvidia.com/gpu: "2"
Inconsistent reporting also occurs when scaling the Deployment up.
$ kubectl scale deployment gpu-workload --replicas=2
# RQ reports 4 used instances
$ kubectl scale deployment gpu-workload --replicas=3
# RQ reports 6 used instances
$ kubectl scale deployment gpu-workload --replicas=4
# RQ reports 8 used instances
When scaling down, the reported used instances are still inconsistent from time to time.
$ kubectl scale deployment gpu-workload --replicas=3
# RQ reports 3 used instances
$ kubectl scale deployment gpu-workload --replicas=2
# RQ reports 3 used instances
$ kubectl scale deployment gpu-workload --replicas=1
# RQ reports 1 used instance
$ kubectl scale deployment gpu-workload --replicas=0
# RQ reports 0 used instances
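A sketch of how the reported usage can be sampled while scaling, assuming the namespace and object names from the steps above; the 5-second settle time is an arbitrary choice.

# scale step by step and print the quota's reported GPU usage after each step
for replicas in 2 3 4 3 2 1 0; do
  kubectl scale deployment gpu-workload -n skg00000-gpu --replicas="$replicas"
  sleep 5   # arbitrary settle time; the wrong counts show up within this window
  kubectl get resourcequota capsule-skg00000-2 -n skg00000-gpu \
    -o jsonpath='{.status.used.requests\.nvidia\.com/gpu}{"\n"}'
done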
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
$ uname -a
Linux REDACTED 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Install tools
kubeadm
Container runtime (CRI) and version (if applicable)
N.R.
Related plugins (CNI, CSI, ...) and versions (if applicable)
N.R.
7 Answers
Answer 1
/sig api-machinery
Answer 2
I suspect something is wrong with the cache the local controller consumes here.
When scaling the Deployment to 2 pods, it reported 3 instances.
A few minutes later, the correct number of pods was reported. If my scripting is right, it seems to take about 5 minutes, which is the default value of the kube-controller-manager --resource-quota-sync-period flag.
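For reference, a sketch of where that flag lives on a kubeadm control plane; the manifest path is the kubeadm default, and the 1m value is only an illustration for testing, not a recommendation.

# excerpt from /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm default path)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --resource-quota-sync-period=1m   # default is 5m0s; a shorter period only re-syncs sooner
    # (remaining flags omitted)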
Answer 3
It may be related to $x_{1e0f1}$; no new flaky failures have occurred since $x_{1e1f1}$ was merged. $x_{1e2f1}$
Answer 4
@prometherion, could you take a look at the comment above (thank you, Carlory)? Is your issue still happening? Let us know.
Answer 5
I haven't had time to test this yet; it happened on a production cluster with a large amount of GPU resources. If I understand correctly, this should also happen with regular resources, so I should be able to test it with regular pods.
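A sketch of what such a GPU-free check could look like, assuming regular requests are accounted the same way; every name below is hypothetical.

# hypothetical quota limiting plain CPU requests instead of nvidia.com/gpu
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-quota-test
spec:
  hard:
    requests.cpu: "10"
---
# hypothetical deployment requesting only CPU and memory; scale it up and down
# and watch status.used on the quota, as in the original reproduction
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-workload
  template:
    metadata:
      labels:
        app: cpu-workload
    spec:
      containers:
      - name: sleep
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
            memory: 128Mi
          limits:
            cpu: "1"
            memory: 128Mi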
Answer 6
/triage accepted
/help
Can anybody come up with a different / better explanation?
Answer 7
@fedebongio:
This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/triage accepted
/help
Can anybody come up with a different / better explanation?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.