The error output is as follows:
[2024-07-12 08:34:51,881] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
- There is not enough GPUs visible on your system
- Some GPUs are occupied by other process now
- NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2024-07-12 08:34:51,881] [ WARNING] install_check.py:297 -
Original Error is: Process 6 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "/home/test.py", line 2, in <module>
    paddle.utils.run_check()
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 302, in run_check
    raise e
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 283, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.10/dist-packages/paddle/utils/install_check.py", line 210, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 614, in spawn
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 423, in join
    self._throw_exception(error_index)
  File "/usr/local/lib/python3.10/dist-packages/paddle/distributed/spawn.py", line 435, in _throw_exception
    raise Exception(
Exception: Process 6 terminated with exit code 1.
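The first bullet in the warning ("not enough GPUs visible") can be ruled out with a quick stdlib-only check of how CUDA_VISIBLE_DEVICES is set, since run_check() spawns one worker per visible device. This is just a diagnostic sketch; visible_gpu_count is a hypothetical helper, not part of Paddle:

```python
import os

def visible_gpu_count(env_value, total_gpus):
    """Number of CUDA devices a process will see.

    Per CUDA semantics: unset -> all devices visible; empty string -> none;
    otherwise only the comma-separated indices listed.
    """
    if env_value is None:
        return total_gpus
    return len([d for d in env_value.split(",") if d.strip()])

# On the machine above, anything other than 8 here would explain the failure.
print(visible_gpu_count(os.environ.get("CUDA_VISIBLE_DEVICES"), 8))
```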
Local environment:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800-SXM... On | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 68W / 400W | 6553MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800-SXM... On | 00000000:42:00.0 Off | 0 |
| N/A 30C P0 62W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800-SXM... On | 00000000:61:00.0 Off | 0 |
| N/A 30C P0 60W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800-SXM... On | 00000000:67:00.0 Off | 0 |
| N/A 35C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A800-SXM... On | 00000000:AD:00.0 Off | 0 |
| N/A 34C P0 60W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A800-SXM... On | 00000000:B1:00.0 Off | 0 |
| N/A 30C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A800-SXM... On | 00000000:D0:00.0 Off | 0 |
| N/A 30C P0 61W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800-SXM... On | 00000000:D3:00.0 Off | 0 |
| N/A 34C P0 65W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Repositories tried: registry.baidubce.com/paddlepaddle/paddle and paddlepaddle/paddle.
I've tried both, and the error is identical.
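The warning above points at NCCL as a likely culprit, so one way to isolate it is to run NVIDIA's nccl-tests across all 8 GPUs outside of PaddlePaddle, as the message itself suggests. A sketch, assuming CUDA lives at /usr/local/cuda (adjust CUDA_HOME otherwise):

```shell
# Build and run nccl-tests to check 8-GPU NCCL communication directly.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda

# all_reduce_perf: -b/-e set the message size range, -f the step factor,
# -g 8 uses all 8 GPUs in one process. If this hangs or errors, the
# problem is in NCCL/the fabric, not in Paddle.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```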
8 Answers

Answer 1 (qc6wkl3g):
At first I tried registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5. The 8-GPU check failed there as well, so I switched to the two CUDA 12.0 images, and it still fails.
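Not raised in the thread, but a frequent cause of multi-GPU spawn failures inside containers is an undersized /dev/shm, since NCCL and multiprocess workers allocate shared memory. A hedged launch sketch that rules this out (the image tag is the one tried above; the shm size is an assumption, size it to your host):

```shell
# Expose all GPUs, use host networking, and enlarge /dev/shm before
# re-running paddle.utils.run_check() inside the container.
docker run --gpus all --network=host --shm-size=32g -it \
    registry.baidubce.com/paddlepaddle/paddle:3.0.0b0-gpu-cuda11.8-cudnn8.6-trt8.5 \
    /bin/bash
```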
Answer 2 (gxwragnw):
Which whl package are you using?
Answer 3 (xxhby3vn):
Try this: python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
Answer 4 (jexiocij):
> Try this: python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

I'm using Docker directly, because I don't want to set up the local environment all over again.
Answer 5 (a0zr77ik):
The Docker images I used are the official 3.0 image and the two 2.6 versions. I started with 3.0 and it didn't work; suspecting a CUDA version mismatch, I switched to the two 2.6 images, but they still fail run_check.
Answer 6 (beq87vna):
Now I'm trying to run multi-GPU SFT, but after
I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265
the distributed launch hangs with no further output.
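When a launch stalls right after the TCP store starts listening, more logging usually narrows down which rank or transport is stuck. A sketch of re-running the SFT launch with verbose NCCL and glog output (train_sft.py is a placeholder for the actual training script; the interface name is an assumption for multi-NIC hosts):

```shell
# NCCL_DEBUG=INFO prints each rank's NCCL init steps; GLOG_v raises
# Paddle's own log verbosity.
export NCCL_DEBUG=INFO
export GLOG_v=3
# Assumption: pin NCCL to a known interface if the host has several NICs.
export NCCL_SOCKET_IFNAME=eth0

python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train_sft.py
```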
Answer 7 (7bsow1i6):
You can uninstall that package inside your Docker container and then install the one I sent you.
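Spelling out the swap this answer suggests, using the cu123 wheel from answer 3 (a sketch, assuming the preinstalled package in the image is named paddlepaddle-gpu):

```shell
# Remove the build shipped in the image, install the suggested wheel,
# then re-run the multi-GPU check.
python -m pip uninstall -y paddlepaddle-gpu
python -m pip install paddlepaddle-gpu==3.0.0b1 \
    -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
python -c "import paddle; paddle.utils.run_check()"
```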
Answer 8 (0qx6xfy6):
> Now I'm trying to run multi-GPU SFT, but after I0715 07:28:13.692878 490 tcp_utils.cc:181] The server starts to listen on IP_ANY:58265 the distributed launch hangs with no further output.

This will probably need an RD from the distributed-training team to take a look.