Kubernetes ResourceQuota reports wrong used values for extended resources

kq4fsx7k · posted 4 months ago in Kubernetes

What happened?

When the ResourceQuota admission controller is used with extended resources (nvidia.com/gpu), the reported used resources are inconsistent and wrong.

What did you expect to happen?

The reported used extended resources should reflect the current state of the namespace.

How can we reproduce it (as minimally and precisely as possible)?

Apply the following ResourceQuota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: capsule-skg00000-2
spec:
  hard:
    requests.nvidia.com/gpu: "10"
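
Assuming the manifest above is saved as quota.yaml and that the target namespace is skg00000-gpu (the namespace that appears in the status output below), applying it is just:

$ kubectl apply -f quota.yaml -n skg00000-gpu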

Check that the ResourceQuota is applied correctly and reports the expected status:

apiVersion: v1
kind: ResourceQuota
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"ResourceQuota","metadata":{"annotations":{},"name":"capsule-skg00000-2","namespace":"skg00000-gpu"},"spec":{"hard":{"requests.nvidia.com/gpu":"10"}}}
  creationTimestamp: "2024-05-26T17:00:28Z"
  labels:
    capsule.clastix.io/managed-by: skg00000
  name: capsule-skg00000-2
  namespace: skg00000-gpu
  resourceVersion: "8225530"
  uid: 3badf2c2-9959-4882-9315-7ffff127e08e
spec:
  hard:
    requests.nvidia.com/gpu: "10"
status:
  hard:
    requests.nvidia.com/gpu: "10"
  used:
    requests.nvidia.com/gpu: "0"
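
The status block above is just the quota read back from the API server; assuming the names from the manifest, something like:

$ kubectl get resourcequota capsule-skg00000-2 -n skg00000-gpu -o yaml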

Create a Deployment whose pod template uses the extended resource (GPU); a full Deployment sketch follows the container spec below.

spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - sleep 3600
    command:
    - /bin/bash
    - -c
    - --
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    imagePullPolicy: Always
    name: nvidia
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "1"
        memory: 1Gi
        nvidia.com/gpu: "1"
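
The snippet above is only the containers section of the pod template. A minimal Deployment wrapping it could look like the sketch below; the name gpu-workload matches the scale commands further down, while the selector, labels and replica count are assumed for illustration:

# sketch: a complete Deployment around the container spec shown above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
  namespace: skg00000-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      automountServiceAccountToken: false
      containers:
      - name: nvidia
        image: nvidia/cuda:11.0.3-base-ubuntu20.04
        imagePullPolicy: Always
        command: ["/bin/bash", "-c", "--"]
        args: ["sleep 3600"]
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"

Each replica of this Deployment requests exactly one nvidia.com/gpu, so the quota's used value should track the replica count one to one.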

After the Pod is deployed, the ResourceQuota reports a wrong amount of used resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"ResourceQuota","metadata":{"annotations":{},"name":"capsule-skg00000-2","namespace":"skg00000-gpu"},"spec":{"hard":{"requests.nvidia.com/gpu":"10"}}}
  creationTimestamp: "2024-05-26T17:00:28Z"
  labels:
    capsule.clastix.io/managed-by: skg00000
  name: capsule-skg00000-2
  namespace: skg00000-gpu
  resourceVersion: "8225807"
  uid: 3badf2c2-9959-4882-9315-7ffff127e08e
spec:
  hard:
    requests.nvidia.com/gpu: "10"
status:
  hard:
    requests.nvidia.com/gpu: "10"
  used:
    requests.nvidia.com/gpu: "2"
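
To see the discrepancy side by side, the pods that actually exist in the namespace can be compared with what the quota reports (standard kubectl; the object names are the ones used above):

$ kubectl get pods -n skg00000-gpu
# one pod requesting one GPU is running, yet:
$ kubectl describe resourcequota capsule-skg00000-2 -n skg00000-gpu
# Used shows requests.nvidia.com/gpu: 2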

Inconsistent reporting also shows up when scaling the Deployment up:

$ kubectl scale deployment gpu-workload --replicas=2
# RQ reports 4 used instances
$ kubectl scale deployment gpu-workload --replicas=3
# RQ reports 6 used instances
$ kubectl scale deployment gpu-workload --replicas=4
# RQ reports 8 used instances

When scaling down, the reported used instances are still inconsistent from time to time:

$ kubectl scale deployment gpu-workload --replicas=3
# RQ reports 3 used instances
$ kubectl scale deployment gpu-workload --replicas=2
# RQ reports 3 used instances
$ kubectl scale deployment gpu-workload --replicas=1
# RQ reports 1 used instance
$ kubectl scale deployment gpu-workload --replicas=0
# RQ reports 0 used instances
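
One way to observe the transient over-counting while issuing the scale commands is to watch the quota object in a second terminal (plain kubectl watch, names as above):

$ kubectl get resourcequota capsule-skg00000-2 -n skg00000-gpu -w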

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
$ uname -a
Linux REDACTED 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

N.R.

Related plugins (CNI, CSI, ...) and versions (if applicable)

N.R.

bjp0bcyl #2

I suspect something is wrong here with the local cache the controller consumes.
When scaling the Deployment to 2 pods, it reported 3 instances.
After a few minutes the correct number of pods was reported. If my script is right, it seems to take about 5 minutes, which is the default value of the kube-controller-manager --resource-quota-sync-period setting.
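
If the eventual correction really comes from that periodic full resync, one hypothetical way to confirm it on a test cluster is to shorten the sync period. On a kubeadm-installed cluster (as reported above) that would mean editing the controller-manager static pod manifest, roughly:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, sketch)
spec:
  containers:
  - command:
    - kube-controller-manager
    # ...existing flags...
    - --resource-quota-sync-period=30s   # default is 5m0s

With a shorter period, the wrong used value should be corrected within seconds instead of roughly five minutes if the resync is indeed what fixes it.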

58wvjzkj #3

It may be related to [linked issue]; since [linked PR] was merged, no new flaky tests have appeared.

[link]

xzlaal3s #4

@prometherion, can you see the comment above? (Thank you, Carlory.) Is your issue still happening? Let us know.

8ulbf1ek #5

I haven't had time to test this yet; it happened on a production cluster with a large amount of GPU resources. If I understand correctly, this should also happen with regular resources, so I should be able to test it with regular pods.
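
A rough sketch of such a test with regular resources, using made-up names (cpu-quota-test, cpu-workload) and a plain CPU/memory request quota:

# quota on a regular resource instead of nvidia.com/gpu
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-quota-test
spec:
  hard:
    requests.cpu: "10"
---
# Deployment whose pods only request regular resources
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-workload
  template:
    metadata:
      labels:
        app: cpu-workload
    spec:
      containers:
      - name: sleeper
        image: busybox:1.36
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "1"
            memory: 128Mi

Scaling this Deployment up and down while watching the quota's used requests.cpu should show whether the miscounting is GPU-specific or not.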

bvjveswy #6

/triage accepted
/help
Can anybody come up with a different / better explanation?

4zcjmb1e #7

@fedebongio:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/triage accepted
/help
Can anybody come up with a different / better explanation?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
