Proposal to improve performance
- No response
Report of performance regression
- No response
Misc discussion on performance
I am using vLLM to serve the Qwen 7B Chat model. Under very high concurrency, e.g. 128 concurrent requests, CPU utilization reaches 100% while GPU utilization stays below 60%.
My question: since much of vLLM's scheduling and compute logic runs in Python coroutines, it can only use a single CPU core. In a scenario with 128 concurrent requests, does the CPU become the compute bottleneck and prevent the GPU (CUDA) from reaching higher performance?
Model download link: https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main
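To see whether a single Python process is the limiting factor, one way is to sample per-core CPU usage while the 128-concurrency load is running. A minimal sketch, assuming psutil is installed; vllm_pid is a hypothetical placeholder for the actual PID of the vLLM process and is not part of the original report:

import psutil

vllm_pid = 12345  # hypothetical: replace with the real PID of the vLLM process
proc = psutil.Process(vllm_pid)

# Total CPU% of the vLLM process (it can exceed 100% only if it really uses several cores).
print("vLLM process CPU %:", proc.cpu_percent(interval=1.0))

# Per-core utilization of the whole machine; one core pinned near 100% while the
# rest stay idle is consistent with a single-threaded Python scheduling bottleneck.
print("per-core CPU %:", psutil.cpu_percent(interval=1.0, percpu=True))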
- For the server scenario
- For the offline batch inference scenario:
import random
import json

from vllm import LLM, SamplingParams

conc = 128
jsonl_path = "xxx.jsonl"

# Read prompts from the JSONL file; each line holds a list of chat messages and
# the last message carries the user prompt.
all_prompts = []
with open(jsonl_path, "r") as f:
    for line in f:
        line_obj = json.loads(line)
        print("line_obj is:", line_obj)
        try:
            prompt = line_obj[-1]["content"]
        except KeyError:
            prompt = line_obj[-1]["Content"]
        all_prompts.append(prompt)

# Sample prompts.
if len(all_prompts) > conc:
    prompts = all_prompts[:conc]
else:
    prompts = random.choices(all_prompts, k=conc)

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)

# Create an LLM.
# llm = LLM(model="facebook/opt-125m")
# Qwen 7B Chat
llm = LLM(model="/models/models--Qwen--Qwen-7B-Chat-new", trust_remote_code=True)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Your current environment (if you think it is necessary)
Collecting environment information...
PyTorch version: 2.2.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27
Python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 14
On-line CPU(s) list: 0-13
Thread(s) per core: 2
Core(s) per socket: 7
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping: 6
CPU MHz: 2593.904
BogoMIPS: 5187.80
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 49152K
NUMA node0 CPU(s): 0-13
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.2+cu118
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.19.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.19.3 pypi_0 pypi
[conda] torch 2.2.2+cu118 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
[conda] vllm-nccl-cu11 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 0-13 0 N/A
NIC0 SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
1 answer
You can start the vLLM API server; it logs CPU and GPU (KV cache) utilization, for example:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
The KV cache fills GPU memory first and then spills to CPU memory; using an FP8 E4M3 KV cache can reduce KV cache usage.
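As a concrete illustration of that suggestion, an offline engine can be created with an FP8 KV cache through the kv_cache_dtype engine argument. A minimal sketch; the exact accepted values ("fp8", "fp8_e4m3", "fp8_e5m2") and hardware support vary with the installed vLLM version, so treat the value below as an assumption to check against your version's documentation:

from vllm import LLM

# Assumption: this vLLM build accepts kv_cache_dtype for an FP8 KV cache.
llm = LLM(
    model="/models/models--Qwen--Qwen-7B-Chat-new",
    trust_remote_code=True,
    kv_cache_dtype="fp8",  # or "fp8_e4m3" on versions/GPUs that support it
)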