Your current environment
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models:
A100s
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
I am trying to understand how vLLM's distributed serving works with multiprocessing. The original setup serves a model with tensor parallel size 2 through the Triton Inference Server, using distributed_executor_backend: mp. Inference runs fine, but when the server is shut down, the two pt_main_thread processes are not killed; their status is State: S (sleeping).
The closest reproduction outside of Triton is the following:
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid
import asyncio

SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}

VLLM_ENGINE_CONFIG = {
    "model": "facebook/opt-125m",
    # Flags passed as strings, as they appear in the Triton config;
    # any non-empty string is truthy, so they behave like True here.
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "enforce_eager": "true",
    "tensor_parallel_size": 2,
}

PROMPTS = [
    "The most dangerous animal is",
    "The capital of France is",
    "The future of AI is",
]


async def generate_python_vllm_output(prompt, llm_engine):
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    python_vllm_output = None
    last_output = None

    # Stream the outputs for this request and keep only the final one.
    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output

    if last_output:
        python_vllm_output = [
            (prompt + output.text).encode("utf-8") for output in last_output.outputs
        ]
    return python_vllm_output


llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))

python_vllm_output = []
for i in range(len(PROMPTS) * 1000):
    # Cycle through the prompts; indexing with i alone would raise
    # IndexError once i reaches len(PROMPTS).
    python_vllm_output.extend(
        asyncio.run(generate_python_vllm_output(PROMPTS[i % len(PROMPTS)], llm_engine))
    )
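One thing worth noting before the workflow below: pkill -9 sends SIGKILL, which a process cannot catch, so none of the parent's Python-level cleanup (atexit handlers, multiprocessing's own shutdown logic) ever runs, and the worker children get reparented to init and keep sleeping. The standalone sketch below reproduces the same orphaning without vLLM; the worker function is a made-up stand-in for a vLLM worker process:

import atexit
import multiprocessing
import signal
import sys
import time


def worker():
    # Stand-in for a vLLM worker: blocks forever, like the sleeping
    # pt_main_thread processes in the ps output below.
    while True:
        time.sleep(1)


def cleanup(procs):
    # Terminate and reap the children; this only runs on a normal
    # interpreter exit, which SIGKILL never produces.
    for p in procs:
        p.terminate()
        p.join()


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker) for _ in range(2)]
    for p in procs:
        p.start()
    atexit.register(cleanup, procs)
    # Turn SIGTERM into a normal exit so the atexit cleanup still runs;
    # SIGKILL can never be redirected this way.
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    time.sleep(600)  # `kill <pid>` reaps the children; `kill -9 <pid>` orphans them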
The workflow is as follows:
# ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
21346 pts/0 00:00:00 top
21927 pts/0 00:00:00 top
22463 pts/0 00:00:00 ps
# python3 vllm_reproducer.py &
...
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.38it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.37it/s]
INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
(VllmWorkerProcess pid=22534) INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
INFO 07-25 00:18:58 distributed_gpu_executor.py:56] # GPU blocks: 68037, # CPU blocks: 14563
# pkill -9 python3
# ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
21346 pts/0 00:00:00 top
21927 pts/0 00:00:00 top
22465 pts/0 00:00:22 pt_main_thread
22534 pts/0 00:00:14 pt_main_thread
22576 pts/0 00:00:00 python3 <defunct>
22745 pts/0 00:00:00 ps
As before, both of the processes above are sleeping according to cat /proc/_PID_/status. Any insight into vLLM's distributed serving and multiprocessing would be much appreciated.
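For reference, the check can be automated with a small stdlib-only script that scans /proc for processes named pt_main_thread (the thread name PyTorch assigns to its main thread) and prints their State line, mirroring the manual cat /proc/_PID_/status step above:

import os

# List leftover pt_main_thread processes and their scheduler state.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "pt_main_thread":
                continue
        with open(f"/proc/{pid}/status") as f:
            state = next(line for line in f if line.startswith("State:"))
        print(pid, state.strip())
    except FileNotFoundError:
        continue  # the process exited while we were scanning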
2 Answers
oo7oh9g9 #1
I have observed something similar... My current workaround, after terminating the vLLM server, is to run:
pkill -f pt_main_thread

qfe3c7zg #2
Killing the pt_main_thread processes with the same pkill command after terminating the vLLM server does work. However, that is not a viable solution for me.
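A narrower alternative to a global pkill -f pt_main_thread (which would also hit unrelated PyTorch processes on the same machine) is to reap only the children of the serving process before it exits. A minimal sketch using psutil; the helper name kill_engine_children is made up, and where to hook it into a given server's shutdown path is left open:

import psutil


def kill_engine_children(timeout: float = 5.0) -> None:
    """Hypothetical helper: terminate every child of the current process
    (the vLLM mp workers), escalating to SIGKILL after `timeout` seconds."""
    children = psutil.Process().children(recursive=True)
    for child in children:
        child.terminate()  # SIGTERM first, giving workers a chance to exit cleanly
    _, alive = psutil.wait_procs(children, timeout=timeout)
    for child in alive:
        child.kill()  # SIGKILL anything still hanging around


# e.g. call from the server's shutdown hook, before the process exits:
# kill_engine_children()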