[Bug]: Deploying Qwen2-72b-instruct-awq with vllm-0.5.3.post1: the service works at first, but fails with errors under high concurrency

wwwo4jvm posted 10 months ago in Other

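Error description

The process group watchdog thread terminated with a CUDA error: an illegal memory access was encountered. CUDA kernel errors may be reported asynchronously at some other API call, so the stack trace may not point at the real cause. For debugging, the message suggests passing CUDA_LAUNCH_BLOCKING=1, and compiling with TORCH_USE_CUDA_DSA to enable device-side assertions.

There is also a warning: the resource tracker reports 1 leaked shared-memory object to clean up at shutdown.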
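Since the title says the failure only appears under high concurrency, a small load generator may help reproduce it on demand. The following is a minimal sketch, not the reporter's setup: it assumes the server exposes the OpenAI-compatible `/v1/completions` endpoint on `localhost:8000`, and the model name, prompts, and the helper names `build_request`, `send`, and `run_load` are hypothetical placeholders.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: address and model name are placeholders, not taken from the post.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "Qwen2-72B-Instruct-AWQ"


def build_request(prompt, max_tokens=64):
    """Build a POST request carrying one completion payload."""
    body = json.dumps(
        {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(prompt):
    """Send one request; return the HTTP status, or the error as a string."""
    try:
        with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
            return resp.status
    except Exception as exc:  # keep going so every failure is recorded
        return repr(exc)


def run_load(concurrency, total):
    """Send `total` requests with at most `concurrency` in flight at once."""
    prompts = [f"Request {i}: say hello." for i in range(total)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, prompts))
```

Calling `run_load(64, 512)` and raising `concurrency` step by step would show roughly at which load the server starts returning errors.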

Code block

    [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
    frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
    frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin
oymdgrw7 #1

The machine uses L20 GPUs with 48 GB per card, and the vllm launch script specifies --tensor-parallel 2 --quantization awq.
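One way to act on the hint in the error message is to rerun the same configuration with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the reported stack trace is more likely to point at the failing kernel (at the cost of throughput). The sketch below is only an illustration under stated assumptions: the post does not show the launch script, so the entrypoint module `vllm.entrypoints.openai.api_server`, the full flag spelling `--tensor-parallel-size`, the model path, and the helper name `build_debug_launch` are all assumptions.

```python
import os
import shlex
import sys


def build_debug_launch(model_path, tensor_parallel_size=2, quantization="awq"):
    """Return (cmd, env) for starting the server with synchronous CUDA launches."""
    env = dict(os.environ)  # copy, so the parent environment stays untouched
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # suggested by the error message itself
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",  # assumed entrypoint
        "--model", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--quantization", quantization,
    ]
    return cmd, env


# Placeholder path; replace with the real model directory.
cmd, env = build_debug_launch("/path/to/Qwen2-72B-Instruct-AWQ")
print("CUDA_LAUNCH_BLOCKING=1 " + " ".join(shlex.quote(part) for part in cmd))
# To actually start the server: subprocess.run(cmd, env=env, check=True)
```

The command is only printed here so it can be inspected first; the actual launch is left as a commented `subprocess.run` call.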
