[Bug]: Qwen2-72b-instruct-awq deployed with vllm-0.5.3.post1 serves normally at first, but errors out under high concurrency

wwwo4jvm  posted 2 months ago  in  Other

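Error description

When the process group watchdog thread terminated, it reported a CUDA error: an illegal memory access was encountered. This may be a CUDA kernel error that was reported asynchronously at some other API call, so the stack trace may not point at the real cause. For debugging, the message suggests setting the environment variable CUDA_LAUNCH_BLOCKING=1, and compiling with TORCH_USE_CUDA_DSA to enable device-side assertions.

In addition, there is a warning: the resource tracker reports 1 leaked shared memory object that needs to be cleaned up at shutdown.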
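The error message itself recommends rerunning with `CUDA_LAUNCH_BLOCKING=1` so that the failing kernel is reported synchronously. A minimal Python sketch of one way to do that from a wrapper script follows. The helper names `build_debug_env` and `run_with_debug_env` are hypothetical and not part of the original report, and the command passed in is whatever launch command is already in use.

```python
import os
import subprocess


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the error message: kernel launches become blocking, so the
    # reported stack trace is more likely to point at the kernel that faulted.
    # Expect a large slowdown; use this for debugging only.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_debug_env(cmd):
    """Run `cmd` (a list of arguments) with the debug environment applied."""
    return subprocess.run(cmd, env=build_debug_env(), check=False)
```

Note that `TORCH_USE_CUDA_DSA` is a build-time option for PyTorch, so it cannot be switched on through the environment in the same way.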

Code block

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
 frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
 frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
 frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
 frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
 frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
 frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
 frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin

oymdgrw7  #1

The machine uses L20 GPUs with 48 GB per card, and the vLLM launch script specifies --tensor-parallel 2 --quantization awq.
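Because the failure only shows up under load, a small load generator can help reproduce it on demand. The sketch below rests on several assumptions that are not in the original report: that the server exposes vLLM's OpenAI-compatible `/v1/completions` endpoint at `http://localhost:8000`, and that the model name, prompt text, and `max_tokens` value shown here are acceptable placeholders. The function names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values, not taken from the original report; adjust to the deployment.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "Qwen2-72b-instruct-awq"


def post_completion(prompt, url=BASE_URL, model=MODEL_NAME, timeout=120):
    """Send one completion request and return the HTTP status code."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 64}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def run_load(num_requests, concurrency, send=post_completion):
    """Issue num_requests prompts using `concurrency` worker threads."""

    def safe_send(i):
        try:
            return send(f"Request {i}: say hello.")
        except Exception as exc:
            # Record the failure instead of aborting the whole run.
            return repr(exc)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(safe_send, range(num_requests)))
    ok = sum(1 for r in results if r == 200)
    return {"ok": ok, "failed": num_requests - ok, "results": results}
```

Raising `concurrency` step by step, for example `run_load(200, 8)` and then `run_load(200, 32)`, while watching the server log may show roughly where the illegal memory access starts to appear.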
