[Bug]: Qwen2-72b-instruct-awq deployed with vllm-0.5.3.post1 serves normally at first, but errors out under high concurrency

Posted by wwwo4jvm 2 months ago in Other


Error description

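The NCCL process group watchdog thread terminated with a CUDA error: an illegal memory access was encountered. CUDA kernel errors can be reported asynchronously at a later API call, so the stack trace in the log may not point at the kernel that actually failed. For debugging, the message suggests setting the environment variable CUDA_LAUNCH_BLOCKING=1, and notes that device-side assertions are only available when PyTorch is compiled with TORCH_USE_CUDA_DSA.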
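The debugging hint in the error message can be applied when starting the server. The following is a minimal sketch, not the poster's actual setup: the issue does not show the launch command, so the model path and tensor-parallel size passed in are assumptions (the `[rank1]` prefix in the log only implies at least two ranks), and the helper names `build_debug_env`, `build_server_command`, and `launch_debug_server` are hypothetical.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Makes kernel launches blocking so the failing call is reported where it
    # happens. This slows the server down considerably; debugging only.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_server_command(model, tensor_parallel_size):
    """Build a command line for vLLM's OpenAI-compatible API server.

    The model path and tensor-parallel size are assumptions supplied by the
    caller; the original report does not state them.
    """
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", "awq",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def launch_debug_server(model, tensor_parallel_size):
    """Start the server as a child process with the debug environment."""
    return subprocess.Popen(
        build_server_command(model, tensor_parallel_size),
        env=build_debug_env(),
    )
```

Calling `launch_debug_server("Qwen/Qwen2-72B-Instruct-AWQ", 2)` would start the server with blocking launches. With that setting the stack trace is more likely to point at the kernel that performed the illegal access, at the cost of much lower throughput.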

There is also a warning: the resource tracker reports 1 leaked shared memory object to clean up at shutdown.

Error log

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin
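Because the failure only shows up under high concurrency, a small load generator can help reproduce it on demand. The sketch below uses only the Python standard library and assumes the model is served through vLLM's OpenAI-compatible `/v1/completions` endpoint at `http://localhost:8000`; the URL, served model name, and `max_tokens` value are assumptions, and `send_completion` and `run_load` are hypothetical helper names.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_completion(prompt,
                    base_url="http://localhost:8000",    # assumed address
                    model="Qwen2-72B-Instruct-AWQ",      # assumed served name
                    timeout=120):
    """POST one prompt to an OpenAI-compatible completions endpoint."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": 64}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_load(send, prompts, concurrency):
    """Call send(prompt) for every prompt using `concurrency` worker threads.

    Returns (successes, failures): successes is a list of (prompt, result),
    failures is a list of (prompt, exception).
    """
    def attempt(prompt):
        try:
            return prompt, send(prompt), None
        except Exception as exc:  # record the failure, keep the run going
            return prompt, None, exc

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, prompts))

    successes = [(p, r) for p, r, e in results if e is None]
    failures = [(p, e) for p, r, e in results if e is not None]
    return successes, failures
```

For example, `run_load(send_completion, ["Hello"] * 200, concurrency=64)` sends 200 requests with up to 64 in flight. Raising `concurrency` step by step and noting where failures first appear gives a rough threshold to include in the bug report.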


Source: https://github.com/vllm-project/vllm/issues/6734