Error description
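The process group watchdog thread terminated with a CUDA error: an illegal memory access was encountered. CUDA kernel errors can be reported asynchronously at some other API call, so the stack trace below may not point to the real cause. For debugging, the message suggests setting CUDA_LAUNCH_BLOCKING=1, and compiling with TORCH_USE_CUDA_DSA to enable device-side assertions.
There is also a warning: the resource tracker reports 1 leaked shared memory object to clean up at shutdown.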
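To act on the CUDA_LAUNCH_BLOCKING=1 hint, the variable has to be in the environment before the process initializes CUDA. The following is a minimal sketch, not part of the original report: the helper name `enable_cuda_debug_env` is hypothetical, and it only prepares an environment dictionary that could then be passed to whatever launches the server.

```python
import os


def enable_cuda_debug_env(env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `enable_cuda_debug_env` is a hypothetical helper name used for illustration.
    """
    debug_env = dict(os.environ if env is None else env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an error is
    # reported at the call that triggered it instead of at a later API call.
    debug_env["CUDA_LAUNCH_BLOCKING"] = "1"
    return debug_env


if __name__ == "__main__":
    env = enable_cuda_debug_env()
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` when starting the failing job. Synchronous launches slow execution down considerably, so this setting is meant for reproducing the error, not for normal serving.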
Code block
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin
1 answer
oymdgrw71#
The machine uses L20 GPUs with 48 GB of memory per card, and the vLLM launch script specifies --tensor-parallel 2 --quantization awq.
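For anyone trying to reproduce this setup, the launch options above can be written out as a full command. This is only a sketch under assumptions: the model path is a placeholder (the post does not name the model), the OpenAI-compatible server entry point is assumed because the post does not say which entry point was used, and the flag is spelled in its full form `--tensor-parallel-size`.

```python
import shlex


def build_vllm_command(model_path):
    """Build an argv list matching the configuration described in the answer.

    `model_path` is a placeholder; the original post does not say which model
    was served, and the entry point below is an assumption.
    """
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", "2",   # two GPUs, as in the post
        "--quantization", "awq",         # AWQ-quantized weights, as in the post
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_vllm_command("/path/to/awq-model")))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` together with a modified environment when testing the debugging hints from the error message.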