vllm [Bug]: WSL2 NCCL problem related to 2 GPUs?

h9vpoimq  posted 6 months ago  in  Other

I hit a problem when running on WSL2, possibly NCCL-related? (line misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)

  • The problem appears with --tensor-parallel-size 2; vllm basically only works with 1 GPU.

Most of the error output:

INFO 04-27 01:36:40 utils.py:608] Found nccl from library /home/ch/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 utils.py:608] Found nccl from library /home/ch/.config/vllm/nccl/cu12/libnccl.so.2.18.1
WARNING 04-27 01:36:40 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=212395) WARNING 04-27 01:36:40 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-27 01:36:40 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-27 01:36:40 selector.py:33] Using XFormers backend.
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 selector.py:33] Using XFormers backend.
INFO 04-27 01:36:41 pynccl_utils.py:43] vLLM is using nccl==2.18.1
00127-desktop:212030:212030 [0] NCCL INFO Bootstrap : Using eth0:172.18.78.11<0>
00127-desktop:212030:212030 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
00127-desktop:212030:212030 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation

00127-desktop:212030:212030 [0] misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory
NCCL version 2.18.1+cuda12.0
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:41 pynccl_utils.py:43] vLLM is using nccl==2.18.1
00127-desktop:212030:212030 [0] NCCL INFO NET/IB : No device found.
00127-desktop:212030:212030 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.78.11<0>
00127-desktop:212030:212030 [0] NCCL INFO Using network Socket
00127-desktop:212030:212030 [0] NCCL INFO Channel 00/02 :    0   1
00127-desktop:212030:212030 [0] NCCL INFO Channel 01/02 :    0   1
00127-desktop:212030:212030 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
00127-desktop:212030:212030 [0] NCCL INFO P2P Chunksize set to 131072
00127-desktop:212030:212030 [0] NCCL INFO Channel 00 : 0[1000] -> 1[5000] via SHM/direct/direct
00127-desktop:212030:212030 [0] NCCL INFO Channel 01 : 0[1000] -> 1[5000] via SHM/direct/direct

00127-desktop:212030:212030 [0] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1032 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1309 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1549 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1587 -> 1
ERROR 04-27 01:36:41 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 04-27 01:36:41 worker_base.py:157] Traceback (most recent call last):
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
ERROR 04-27 01:36:41 worker_base.py:157]     return executor(*args, **kwargs)
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
ERROR 04-27 01:36:41 worker_base.py:157]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
ERROR 04-27 01:36:41 worker_base.py:157]     pynccl_utils.init_process_group()
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
ERROR 04-27 01:36:41 worker_base.py:157]     comm = NCCLCommunicator(group=group)
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
ERROR 04-27 01:36:41 worker_base.py:157]     NCCL_CHECK(
ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
ERROR 04-27 01:36:41 worker_base.py:157]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 04-27 01:36:41 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     pynccl_utils.init_process_group()
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     comm = NCCLCommunicator(group=group)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     NCCL_CHECK(
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]   File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
    engine = cls(
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 437, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
    self._run_workers("init_device")
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 158, in execute_method
    raise e
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
    pynccl_utils.init_process_group()
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
    comm = NCCLCommunicator(group=group)
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
    NCCL_CHECK(
  File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
*** SIGSEGV received at time=1714207001 on cpu 0 ***
PC: @     0x7f633b07e905  (unknown)  ncclProxyService()
    @     0x7f6520405520  (unknown)  (unknown)
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361: *** SIGSEGV received at time=1714207001 on cpu 0 ***
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361: PC: @     0x7f633b07e905  (unknown)  ncclProxyService()
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361:     @     0x7f6520405520  (unknown)  (unknown)
Fatal Python error: Segmentation fault
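
In a log this long, the two NCCL WARN lines (the missing libcuda.so and the later Cuda failure 'invalid argument') are easy to lose among the repeated tracebacks. Below is a minimal sketch for pulling out just those lines. The function name nccl_warnings and the regular expression are illustrative and not part of vLLM or NCCL; the pattern assumes the "file.cc:line NCCL WARN message" layout seen in the output above.

```python
import re

# Matches lines such as:
#   host:pid:tid [0] misc/cudawrap.cc:179 NCCL WARN Failed to find ...
# Assumption: warnings follow the "<file>.cc:<line> NCCL WARN <message>" layout.
NCCL_WARN = re.compile(r"(\S+\.cc:\d+) NCCL WARN (.*)")


def nccl_warnings(log_text):
    """Return unique (source_location, message) pairs for NCCL WARN lines, in order."""
    seen = []
    for line in log_text.splitlines():
        match = NCCL_WARN.search(line)
        if match and match.groups() not in seen:
            seen.append(match.groups())
    return seen


if __name__ == "__main__":
    import sys

    # Usage: python nccl_warns.py < vllm.log
    for location, message in nccl_warnings(sys.stdin.read()):
        print(f"{location}: {message}")
```

Run against the output above, this should surface the libcuda.so warning first, which suggests checking that before looking into the later CUDA failure.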

wgeznvg7  #1

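> I hit a problem when running on WSL2, possibly NCCL-related? (misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)

It looks like NCCL cannot find libcuda.so. Could you try following the guide and manually pointing NCCL_CUDA_PATH at its location?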
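
One way to act on this suggestion is to check where libcuda.so actually lives before setting the variable. The sketch below is only a starting point: the helper find_libcuda_dirs and the candidate directory list are assumptions made for illustration. /usr/lib/wsl/lib is where WSL2 normally exposes the Windows driver's CUDA libraries, but your system may differ. How NCCL combines NCCL_CUDA_PATH with the library name is also not confirmed here, so treat the printed export line as something to try, not a guaranteed fix.

```python
import glob
import os

# Assumed candidate locations; adjust for your system.
# /usr/lib/wsl/lib is the usual WSL2 location for the driver's libcuda.so.
CANDIDATE_DIRS = [
    "/usr/lib/wsl/lib",
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib64",
]


def find_libcuda_dirs(candidate_dirs):
    """Return {directory: [file names]} for directories containing libcuda.so*."""
    found = {}
    for directory in candidate_dirs:
        matches = sorted(glob.glob(os.path.join(directory, "libcuda.so*")))
        if matches:
            found[directory] = [os.path.basename(m) for m in matches]
    return found


if __name__ == "__main__":
    hits = find_libcuda_dirs(CANDIDATE_DIRS)
    if not hits:
        print("No libcuda.so* found in the candidate directories.")
    for directory, names in hits.items():
        print(f"{directory}: {', '.join(names)}")
        if "libcuda.so" not in names:
            # The NCCL warning asks for the unversioned name specifically.
            print("  note: only versioned names here; NCCL asked for 'libcuda.so'")
        print(f"  to try: export NCCL_CUDA_PATH={directory}")
```

If a directory is reported, exporting the variable in the same shell before starting vllm with --tensor-parallel-size 2 is the quickest way to see whether the libcuda.so warning goes away.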

nuypyhwy  #2

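> I hit a problem when running on WSL2, possibly NCCL-related? (misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)
> It looks like NCCL cannot find libcuda.so. Could you try following the guide and manually pointing NCCL_CUDA_PATH at its location?

That makes sense. I'm willing to try this and take some steps toward a fix.
That said, I believe what I'm running on WSL2 is a fairly straightforward stock install.
If this isn't user error, I'd like to know whether WSL2 is discouraged or effectively unsupported (that would help me judge how much effort to invest).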
