- Title: [Paper Reproduction] Error in single-machine multi-GPU training
- Version and environment information:
1) PaddlePaddle version: 2.1.2
2) CPU: Intel Xeon × 32
3) GPU: Tesla V100 × 4, CUDA version: 10.1.243, cuDNN version: None.None.None, Nvidia driver version: 418.67
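4) System environment: CentOS 6.10, Python 3.7.0
- Training information
1) Single machine, multiple GPUs
2) GPU memory: 32480 MiB
3) Operator information
- Reproduction information:
1. Single-machine single-GPU training works normally
2. Single-machine multi-GPU training fails: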
python -m paddle.distributed.launch main_multi_gpu.py
- Problem description: Please describe your problem in detail, and paste the error message, logs, and a reproducible code snippet
merging config from configs/swinv2_tiny_patch4_window7_224.yaml
----- Imagenet2012 image train list len = 40000
----- Imagenet2012 image val list len = 10000
1203 10:24:33 AM
AMP: False
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: None
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RE_COUNT: 1
  RE_MODE: pixel
  RE_PROB: 0.25
BASE: ['']
DATA:
  BATCH_SIZE: 64
  BATCH_SIZE_EVAL: 8
  CROP_PCT: 0.9
  DATASET: imagenet2012
  DATA_PATH: ILSVRC2012mini
  IMAGE_SIZE: 224
  NUM_WORKERS: 8
EVAL: False
LOCAL_RANK: 0
MODEL:
  ATTENTION_DROPOUT: 0.0
  DROPOUT: 0.0
  DROP_PATH: 0.2
  NAME: swin_tiny_patch4_window7_224
  NUM_CLASSES: 1000
  PRETRAINED: None
  RESUME: None
  TRANS:
    APE: False
    EMBED_DIM: 96
    EXTRA_NORM: False
    IN_CHANNELS: 3
    MLP_RATIO: 4.0
    NUM_HEADS: [3, 6, 12, 24]
    PATCH_NORM: True
    PATCH_SIZE: 4
    QKV_BIAS: True
    QK_SCALE: None
    STAGE_DEPTHS: [2, 2, 6, 2]
    WINDOW_SIZE: 7
  TYPE: swin
NGPUS: 1
REPORT_FREQ: 50
SAVE: /root/paddlejob/workspace/output//train-20211203-10-24-26
SAVE_FREQ: 5
SEED: 42
TAG: default
TRAIN:
  ACCUM_ITER: 1
  AUTO_AUGMENT: True
  BASE_LR: 0.0005
  COLOR_JITTER: 0.4
  CUTMIX_ALPHA: 1.0
  CUTMIX_MINMAX: None
  END_LR: 5e-06
  GRAD_CLIP: 5.0
  LAST_EPOCH: 0
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    MILESTONES: 30, 60, 90
    NAME: warmupcosine
  MIXUP_ALPHA: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  NUM_EPOCHS: 300
  OPTIMIZER:
    BETAS: (0.9, 0.999)
    EPS: 1e-08
    MOMENTUM: 0.9
    NAME: AdamW
  RANDOM_ERASE_COUNT: 1
  RANDOM_ERASE_MODE: pixel
  RANDOM_ERASE_PROB: 0.25
  RANDOM_ERASE_SPLIT: False
  SMOOTHING: 0.1
  WARMUP_EPOCHS: 20
  WARMUP_START_LR: 5e-07
  WEIGHT_DECAY: 0.05
VALIDATE_FREQ: 10
1203 10:24:33 AM ----- world_size = 1, local_rank = 0
1203 10:24:33 AM ----- world_size = 1, local_rank = 0
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/parallel.py:120: UserWarning: Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything.
"Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything."
Traceback (most recent call last):
  File "main_multi_gpu.py", line 574, in <module>
    main()
  File "main_multi_gpu.py", line 570, in main
    dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 501, in spawn
    while not context.join():
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 312, in join
    self._throw_exception(error_index)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception
    raise Exception(msg)
Exception:
----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper
    result = func(*args)
  File "/root/paddlejob/workspace/code/main_multi_gpu.py", line 318, in main_worker
    model = build_model(config)
  File "/mnt/code_20211203102210/swin_transformer.py", line 772, in build_swin
    extra_norm=config.MODEL.TRANS.EXTRA_NORM)
  File "/mnt/code_20211203102210/swin_transformer.py", line 674, in __init__
    embed_dim=embed_dim)
  File "/mnt/code_20211203102210/swin_transformer.py", line 65, in __init__
    stride=patch_size)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 646, in __init__
    data_format=data_format)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 135, in __init__
    default_initializer=_get_default_param_initializer())
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 412, in create_parameter
    default_initializer)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 374, in create_parameter
    **attr._to_kwargs(with_initializer=True))
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2895, in create_parameter
    initializer(param, self)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 366, in __call__
    stop_gradient=True)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2925, in append_op
    kwargs.get("stop_gradient", False))
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 45, in trace_op
    not stop_gradient)
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88)
  [operator < gaussian_random > error]
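The error message asks whether the installed Paddle was compiled with GPU support. A minimal sketch of such a check follows. The helper name `describe_paddle_build` is made up for illustration. It assumes the script runs in the same Python environment as the training job (ideally inside the worker process), and it only reports what Paddle says about itself.

```python
def describe_paddle_build():
    """Collect basic facts about the installed Paddle build, if any."""
    try:
        import paddle
    except ImportError:
        # Paddle is not importable in this environment at all.
        return {"installed": False}
    return {
        "installed": True,
        "version": paddle.__version__,
        # False means a CPU-only wheel, which cannot use CUDAPlace.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        # Device string Paddle currently targets, e.g. a CPU or GPU place.
        "current_device": paddle.device.get_device(),
    }


print(describe_paddle_build())
```

If `compiled_with_cuda` comes back `True` here, the build itself is probably not the cause, and how the worker processes are started is the next thing to examine.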
2 answers
q5lcpyga1#
Hi! We have received your issue; please be patient while waiting for a response. We will arrange for technical staff to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version information, and the error message. You can also consult the official API documentation, FAQ, past GitHub issues, and the AI community for answers. Have a nice day!
pxq42qpu2#
If you start training with distributed.launch, the program no longer needs to call spawn. See the explanation here: https://github.com/PaddlePaddle/models/blob/tipc/docs/lwfx/ArticleReproduction_CV.md#3.12
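One way to follow this advice while keeping a single entry script is sketched below. It is not the actual code of `main_multi_gpu.py`. `started_by_launch` is a made-up helper, and it assumes the launcher exports `PADDLE_TRAINER_ID` and `PADDLE_TRAINERS_NUM` to each worker process it starts. `main_worker` stands in for the training function seen in the traceback.

```python
import os


def started_by_launch(env=None):
    """Guess whether this process was created by paddle.distributed.launch.

    Assumption: the launcher exports PADDLE_TRAINER_ID and
    PADDLE_TRAINERS_NUM to every worker process it starts.
    """
    env = os.environ if env is None else env
    return "PADDLE_TRAINER_ID" in env and "PADDLE_TRAINERS_NUM" in env


def main_worker(*args):
    """Stand-in for the per-process training function."""
    # Imported here so the helper above can be used without Paddle installed.
    import paddle.distributed as dist

    dist.init_parallel_env()
    # ... build the model, wrap it in paddle.DataParallel, train ...


def main(worker_args=(), nprocs=1):
    if started_by_launch():
        # The launcher already created one process per GPU,
        # so run the worker directly instead of spawning again.
        main_worker(*worker_args)
    else:
        # Plain `python main_multi_gpu.py`: create the workers ourselves.
        import paddle.distributed as dist

        dist.spawn(main_worker, args=worker_args, nprocs=nprocs)
```

With this layout, `python -m paddle.distributed.launch main_multi_gpu.py` and `python main_multi_gpu.py` each create the worker processes exactly once. The simpler alternative is to drop the `dist.spawn` call entirely and always start through the launcher.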