vllm [Performance]: Qwen 7B chat model: at 128 concurrency, CPU utilization reaches 100% while GPU SM utilization is only 60%-75%. Is this a CPU bottleneck?

mfuanj7w · asked 2 months ago

Proposal to improve performance

  • No response*

Report of performance regression

  • No response*

Misc discussion on performance

I am serving the Qwen 7B chat model with vLLM. Under very high concurrency, for example 128 concurrent requests, I see CPU utilization reach 100% while GPU SM utilization sits at only around 60%-75%.
My question: since much of vLLM's scheduling and bookkeeping logic runs in Python coroutines, it can only use a single CPU core. In a scenario like this with 128 concurrent requests, does the CPU become the compute bottleneck and prevent the CUDA kernels on the GPU from reaching higher utilization?
Model download: https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main
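To check whether a single pegged CPU core is starving the GPU, per-core CPU load and GPU SM utilization can be sampled side by side while the benchmark runs. The sketch below is only an illustration: it assumes psutil and pynvml (nvidia-ml-py) are installed, and is not part of vLLM.

import psutil
import pynvml

# Minimal utilization monitor: prints the busiest CPU core and the GPU SM /
# memory utilization once per second. Stop with Ctrl+C.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

try:
    while True:
        per_core = psutil.cpu_percent(interval=1.0, percpu=True)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"busiest CPU core: {max(per_core):5.1f}%  "
              f"GPU SM: {util.gpu:3d}%  GPU mem: {util.memory:3d}%")
except KeyboardInterrupt:
    pynvml.nvmlShutdown()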

  1. For the online API server scenario (a reconstructed load-test sketch is shown below)
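A typical setup for this scenario (a sketch, not necessarily the exact commands used here): start the OpenAI-compatible server with `python -m vllm.entrypoints.openai.api_server --model /models/models--Qwen--Qwen-7B-Chat-new --trust-remote-code`, then drive it at 128-way concurrency with a client such as the aiohttp example below. The port, endpoint path, and served model name are assumptions.

import asyncio
import aiohttp

CONCURRENCY = 128
URL = "http://localhost:8000/v1/completions"      # assumed default port
MODEL = "/models/models--Qwen--Qwen-7B-Chat-new"  # assumed served model name

async def one_request(session: aiohttp.ClientSession, prompt: str) -> dict:
    # One completion request, mirroring the sampling settings used offline.
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0.8,
        "top_p": 0.95,
    }
    async with session.post(URL, json=payload) as resp:
        return await resp.json()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = [one_request(session, f"test prompt {i}") for i in range(CONCURRENCY)]
        results = await asyncio.gather(*tasks)
        print(f"completed {len(results)} requests")

asyncio.run(main())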

  2. For the offline batch inference scenario:

import random
import json
from vllm import LLM, SamplingParams

conc = 128
jsonl_path = "xxx.jsonl"

# Read all prompts from the JSONL file; each line is a list of chat messages
# and we take the content of the last message as the prompt.
all_prompts = []
with open(jsonl_path, "r") as f:
    for line in f:
        line_obj = json.loads(line)
        print("line_obj as: ", line_obj)
        try:
            prompt = line_obj[-1]["content"]
        except KeyError:
            # Some lines capitalize the key differently.
            prompt = line_obj[-1]["Content"]

        all_prompts.append(prompt)

# Sample prompts.
if len(all_prompts) > conc:
    prompts = all_prompts[:conc]
else:
    prompts = random.choices(all_prompts, k=conc)

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)

# Create an LLM.
#llm = LLM(model="facebook/opt-125m")
# Qwen 7B Chat
llm = LLM(model="/models/models--Qwen--Qwen-7B-Chat-new", trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Your current environment (if you think it is necessary)

Collecting environment information...
PyTorch version: 2.2.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27

Python version: 3.9.16 (main, May 15 2023, 23:46:34)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              14
On-line CPU(s) list: 0-13
Thread(s) per core:  2
Core(s) per socket:  7
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping:            6
CPU MHz:             2593.904
BogoMIPS:            5187.80
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-13
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.2+cu118
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu11          2.19.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.19.3                   pypi_0    pypi
[conda] torch                     2.2.2+cu118              pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi
[conda] vllm-nccl-cu11            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	0-13	0		N/A
NIC0	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
9gm1akwq, Answer #1:

You can start the vLLM API server; its logs periodically report both GPU and CPU KV cache usage, for example:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
The KV cache fills GPU memory first and only then spills to CPU memory; using an FP8 E4M3 KV cache can reduce KV cache usage.
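For the FP8 KV cache suggestion, a minimal offline sketch is below. It assumes this vLLM version accepts kv_cache_dtype="fp8" on the LLM constructor (older releases used "fp8_e5m2", and accurate E4M3 use may additionally require calibrated scaling factors), so treat it as an illustration rather than a drop-in fix.

from vllm import LLM, SamplingParams

# Same model as above, but with the KV cache stored in 8-bit floating point
# to lower KV cache memory pressure. The exact dtype string is version-dependent.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)
llm = LLM(
    model="/models/models--Qwen--Qwen-7B-Chat-new",
    trust_remote_code=True,
    kv_cache_dtype="fp8",  # assumption: may be "fp8_e5m2" on older versions
)
outputs = llm.generate(["Hello, who are you?"], sampling_params)
print(outputs[0].outputs[0].text)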
