Describe the Bug
Pulled the 22.05 PaddlePaddle image from the NGC catalog: docker run --gpus all -it --rm nvcr.io/nvidia/paddlepaddle:22.05-py3
Ran the Paddle example from https://github.com/PaddlePaddle/models/blob/release/1.8/dygraph/mnist/train.py
The run fails with the error below. Launch command: python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog mnist_distribution_v1.py
WARNING 2022-06-15 14:58:25,811 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-06-15 14:58:25,812 launch_utils.py:525] Local start 4 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:59873 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:55859,127.0.0.1:49121,127.0.0.1:38457|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,2,3 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2022-06-15 14:58:25,812 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./mylog/endpoints.log, and detail running logs maybe found in ./mylog/workerlog.0
launch proc_id:5212 idx:0
launch proc_id:5231 idx:1
launch proc_id:5250 idx:2
launch proc_id:5270 idx:3
I0615 14:58:27.576098 5212 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
W0615 14:58:29.451130 5212 device_context.cc:451] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.7, Runtime API Version: 11.7
W0615 14:58:29.460047 5212 device_context.cc:469] device: 0, cuDNN Version: 8.4.
loading mnist dataset from ./work/mnist.json.gz ...
Traceback (most recent call last):
File "mnist_distribution_v1.py", line 107, in <module>
train_multi_gpu()
File "mnist_distribution_v1.py", line 84, in train_multi_gpu
train_loader = fluid.contrib.reader.distributed_batch_reader(train_loader)
AttributeError: module 'paddle.fluid.contrib' has no attribute 'reader'
INFO 2022-06-15 14:58:46,935 launch_utils.py:320] terminate process group gid:5231
INFO 2022-06-15 14:58:46,935 launch_utils.py:320] terminate process group gid:5250
INFO 2022-06-15 14:58:46,936 launch_utils.py:320] terminate process group gid:5270
INFO 2022-06-15 14:58:50,940 launch_utils.py:341] terminate all the procs
ERROR 2022-06-15 14:58:50,940 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-06-15 14:58:54,945 launch_utils.py:341] terminate all the procs
INFO 2022-06-15 14:58:54,945 launch.py:311] Local processes completed.
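For context on the AttributeError: fluid.contrib.reader.distributed_batch_reader no longer exists in the Paddle 2.x wheel shipped in this container. Below is a minimal sketch of what I understand the 2.x-style replacement looks like, using paddle.io.DistributedBatchSampler; it uses the built-in paddle.vision.datasets.MNIST only as a stand-in for the script's own ./work/mnist.json.gz loader, so the names here are placeholders rather than code from the original script:

```python
import paddle
from paddle.io import DataLoader, DistributedBatchSampler
from paddle.vision.datasets import MNIST

# Stand-in dataset; the original script loads ./work/mnist.json.gz instead.
train_dataset = MNIST(mode='train')

# DistributedBatchSampler shards batches across the trainers started by
# paddle.distributed.launch, which is roughly what the removed
# fluid.contrib.reader.distributed_batch_reader did for a reader.
sampler = DistributedBatchSampler(train_dataset, batch_size=64, shuffle=True)
train_loader = DataLoader(train_dataset, batch_sampler=sampler)

for batch_id, (images, labels) in enumerate(train_loader):
    ...  # training step
```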
My guess was that the example targets Paddle 1.8 while the Docker image ships Paddle 2.2.2, so the APIs differ. I therefore ran the script through the Paddle v1-to-v2 converter and launched the converted script with the same command for multi-GPU training. This time the error is as follows:
INFO 2022-06-15 14:48:53,563 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./mylog/endpoints.log, and detail running logs maybe found in ./mylog/workerlog.0
launch proc_id:4501 idx:0
launch proc_id:4520 idx:1
launch proc_id:4539 idx:2
launch proc_id:4559 idx:3
I0615 14:48:55.291877 4501 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
Traceback (most recent call last):
File "mnist_distribution.py", line 109, in <module>
train_multi_gpu()
File "mnist_distribution.py", line 76, in train_multi_gpu
strategy = paddle.fluid.dygraph.parallel.prepare_context()
File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/parallel.py", line 68, in prepare_context
parallel_helper._init_parallel_ctx()
File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
__parallel_ctx__clz__.init()
OSError: (External) NCCL error(5), invalid usage. Detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.
[Hint: 'ncclInvalidUsage'. The call to NCCL is incorrect. This is usually reflecting a programming error.] (at /opt/paddle/paddle/paddle/fluid/platform/collective_helper.cc:99)
INFO 2022-06-15 14:49:03,684 launch_utils.py:341] terminate all the procs
ERROR 2022-06-15 14:49:03,684 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2022-06-15 14:49:07,689 launch_utils.py:341] terminate all the procs
INFO 2022-06-15 14:49:07,689 launch.py:311] Local processes completed.
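Note that the converted script still calls the legacy fluid.dygraph.parallel.prepare_context(). For reference, the dygraph data-parallel setup documented for Paddle 2.x is paddle.distributed.init_parallel_env() plus paddle.DataParallel; whether switching to it avoids the ncclInvalidUsage error on this machine is not something I have verified. A minimal sketch under that assumption (the model below is a hypothetical placeholder, not the MNIST network from the script):

```python
import paddle
import paddle.distributed as dist

def train_multi_gpu():
    # Initializes NCCL communication for the current trainer; the 2.x
    # counterpart of the legacy fluid.dygraph.parallel.prepare_context().
    dist.init_parallel_env()

    # Hypothetical stand-in for the MNIST network defined in the script.
    model = paddle.nn.Sequential(
        paddle.nn.Flatten(),
        paddle.nn.Linear(784, 10),
    )
    model = paddle.DataParallel(model)  # wraps the model for gradient all-reduce

    opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    loss_fn = paddle.nn.CrossEntropyLoss()

    # ... build the DataLoader with DistributedBatchSampler as sketched above,
    # then run the usual forward / backward / step loop.

if __name__ == "__main__":
    # Launched the same way as before:
    # python -m paddle.distributed.launch --selected_gpus=0,1,2,3 script.py
    train_multi_gpu()
```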
Following the hints above, I set both environment variables and also increased the Docker shm-size, but the error stays exactly the same. In addition, I checked the machine with run_check and found that P2P access cannot be enabled between some GPU pairs, yet fluid still passes the multi-GPU check:
>>> fluid.install_check.run_check()
Running Verify Fluid Program ...
W0615 14:36:55.247263 3184 device_context.cc:451] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.7, Runtime API Version: 11.7
W0615 14:36:55.254676 3184 device_context.cc:469] device: 0, cuDNN Version: 8.4.
Your Paddle Fluid works well on SINGLE GPU or CPU.
W0615 14:36:59.100487 3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 5
W0615 14:36:59.100513 3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 6
W0615 14:36:59.100518 3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 7
W0615 14:36:59.797154 3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 4
W0615 14:37:00.129410 3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 6
W0615 14:37:00.129437 3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 7
W0615 14:37:01.045341 3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 4
W0615 14:37:01.045370 3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 5
W0615 14:37:01.379971 3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 7
W0615 14:37:02.027123 3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 4
W0615 14:37:02.027153 3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 5
W0615 14:37:02.027158 3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 6
W0615 14:37:03.341859 3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 1
W0615 14:37:03.341889 3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 2
W0615 14:37:03.341893 3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 3
W0615 14:37:03.343364 3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 0
W0615 14:37:04.248374 3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 2
W0615 14:37:04.248404 3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 3
W0615 14:37:04.250039 3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 0
W0615 14:37:04.250051 3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 1
W0615 14:37:04.873052 3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 3
W0615 14:37:04.874171 3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 0
W0615 14:37:04.874182 3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 1
W0615 14:37:04.874188 3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 2
W0615 14:37:17.662714 3184 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now
Additional Supplementary Information
The machine has 8 x V100 16GB GPUs.
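The run_check warnings suggest peer access is only available inside two groups of four GPUs (0-3 and 4-7). As a diagnostic, not part of the original report, here is a small sketch that queries peer capability directly through the CUDA runtime via ctypes; the library name libcudart.so and the device count of 8 are assumptions about this container:

```python
import ctypes

# Assumes the CUDA runtime shared library is on the loader path inside the container.
cudart = ctypes.CDLL("libcudart.so")

def can_access_peer(dev: int, peer: int) -> bool:
    flag = ctypes.c_int(0)
    # cudaDeviceCanAccessPeer(int* canAccessPeer, int device, int peerDevice)
    err = cudart.cudaDeviceCanAccessPeer(ctypes.byref(flag), dev, peer)
    if err != 0:
        raise RuntimeError(f"cudaDeviceCanAccessPeer failed with error {err}")
    return bool(flag.value)

num_gpus = 8  # 8 x V100 16GB in this report
for i in range(num_gpus):
    peers = [j for j in range(num_gpus) if j != i and can_access_peer(i, j)]
    print(f"GPU {i} can access peers: {peers}")
```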
3 Answers
368yc8dk1#
Hi! We've received your issue and please be patient to get responded. We will arrange technicians to answer your questions as soon as possible. Please make sure that you have posted enough message to demo your request. You may also check out the API, FAQ, Github Issue and AI community to get the answer. Have a nice day!
2w2cym1i2#
Hi, what exactly do you mean by the v1-to-v2 converter?
bkkx9g8r3#
Hi, what exactly do you mean by the v1-to-v2 converter?
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/model_convert/migration_cn.html