Docker Paddle cannot pass the run_check() test

ia2d9nvy · posted 5 months ago in Docker


The error output is as follows:
[2024-07-12 08:34:51,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 8 GPUs. This may be caused by:

  1. There is not enough GPUs visible on your system
  2. Some GPUs are occupied by other process now
  3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
     to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

[2024-07-12 08:34:51,881] [ WARNING] install_check.py:297 -
Original Error is: Process 6 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "/home/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
    self._throw_exception(error_index)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
    raise Exception(
Exception: Process 6 terminated with exit code 1.
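
Since the warning points at NCCL, one way to get more detail is to rerun the check with NCCL's debug logging switched on (a minimal sketch; NCCL_DEBUG=INFO is a standard NCCL environment variable, and setting it before the check spawns its workers is an assumption about process startup order):

    import os
    os.environ.setdefault("NCCL_DEBUG", "INFO")  # NCCL prints init/transport details per rank
    import paddle

    # run_check() spawns one worker per visible GPU (see the traceback above),
    # so the per-rank NCCL logs should show where process 6 dies.
    paddle.utils.run_check()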

Local environment:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800-SXM... On | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 68W / 400W | 6553MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800-SXM... On | 00000000:42:00.0 Off | 0 |
| N/A 30C P0 62W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800-SXM... On | 00000000:61:00.0 Off | 0 |
| N/A 30C P0 60W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800-SXM... On | 00000000:67:00.0 Off | 0 |
| N/A 35C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A800-SXM... On | 00000000:AD:00.0 Off | 0 |
| N/A 34C P0 60W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A800-SXM... On | 00000000:B1:00.0 Off | 0 |
| N/A 30C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A800-SXM... On | 00000000:D0:00.0 Off | 0 |
| N/A 30C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800-SXM... On | 00000000:D3:00.0 Off | 0 |
| N/A 34C P0 65W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

Repository: registry.baidubce.com/paddlepaddle/paddle
paddlepaddle/paddle
I tried both, and the error is the same.

qc6wkl3g #1

At first I tried registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5.
8 cards would not run on it either, so I switched to the two cuda=12.0 versions, and they still fail.
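
One way to narrow this down is to bisect the visible devices before running the check (a sketch using the standard CUDA_VISIBLE_DEVICES variable; the pair "0,1" is just an example, and the other pairs should be tried as well):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose only two cards to Paddle
    import paddle
    paddle.utils.run_check()  # if every pair passes, suspect topology/NCCL at 8 cards, not the install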

gxwragnw #2

Which whl package are you using?

xxhby3vn #3

Try this: python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
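
If the wheel gets swapped, a quick post-install sanity check could look like this (a minimal sketch; paddle.device.cuda.device_count() and paddle.utils.run_check() are existing Paddle APIs, and the expected count of 8 is taken from the nvidia-smi output above):

    import paddle
    print(paddle.device.cuda.device_count())  # should report 8 on this machine
    paddle.utils.run_check()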

jexiocij #4

Try this: python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

I'm using Docker directly, because I don't want to set up a local environment all over again.

a0zr77ik #5

The Docker images I used were the official 3.0 one and the two 2.6 ones. At first I used 3.0 and it would not run; suspecting a CUDA version mismatch, I switched to the two 2.6 images, but they still cannot pass run_check.

beq87vna #6

Now I want to run multi-card SFT, but it stops responding after trying to launch distributed training at:
I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265
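
Before the full SFT job, a smaller multi-card smoke test can separate a stuck NCCL communicator from a problem in the training script (a minimal sketch built on public paddle.distributed APIs; nprocs=2 is a deliberately small starting point, not a requirement):

    import paddle
    import paddle.distributed as dist

    def worker():
        dist.init_parallel_env()                        # sets up the NCCL communicator
        x = paddle.to_tensor([float(dist.get_rank())])  # each rank contributes its rank id
        dist.all_reduce(x)                              # sum across ranks; a hang here means NCCL is stuck
        print(f"rank {dist.get_rank()}: all_reduce -> {x.numpy()}")

    if __name__ == "__main__":
        dist.spawn(worker, nprocs=2)  # start with 2 cards; scale toward 8 once it passes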

7bsow1i6 #7

You can uninstall this one inside your Docker container and install the one I sent you.

0qx6xfy6 #8

Now I want to run multi-card SFT, but it stops responding after trying to launch distributed training at I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265

This one probably needs an RD from the distributed team to take a look.
