Paddle distributed training

xqk2d5yq posted 5 months ago in Other

Please ask your question

Environment: Ubuntu 16.04, CUDA 10.2, paddlepaddle-gpu 2.4.2, paddlenlp 2.5.2, Python 3.7
Command: python -m paddle.distributed.launch --gpus 1,3 train.py
Error:

LAUNCH INFO 2023-04-07 11:21:26,453 -----------  Configuration  ----------------------
LAUNCH INFO 2023-04-07 11:21:26,454 devices: 1,3
LAUNCH INFO 2023-04-07 11:21:26,454 elastic_level: -1
LAUNCH INFO 2023-04-07 11:21:26,454 elastic_timeout: 30
LAUNCH INFO 2023-04-07 11:21:26,454 gloo_port: 6767
LAUNCH INFO 2023-04-07 11:21:26,454 host: None
LAUNCH INFO 2023-04-07 11:21:26,454 ips: None
LAUNCH INFO 2023-04-07 11:21:26,454 job_id: default
LAUNCH INFO 2023-04-07 11:21:26,454 legacy: False
LAUNCH INFO 2023-04-07 11:21:26,454 log_dir: log
LAUNCH INFO 2023-04-07 11:21:26,454 log_level: INFO
LAUNCH INFO 2023-04-07 11:21:26,454 master: None
LAUNCH INFO 2023-04-07 11:21:26,454 max_restart: 3
LAUNCH INFO 2023-04-07 11:21:26,454 nnodes: 1
LAUNCH INFO 2023-04-07 11:21:26,454 nproc_per_node: None
LAUNCH INFO 2023-04-07 11:21:26,454 rank: -1
LAUNCH INFO 2023-04-07 11:21:26,454 run_mode: collective
LAUNCH INFO 2023-04-07 11:21:26,454 server_num: None
LAUNCH INFO 2023-04-07 11:21:26,454 servers: 
LAUNCH INFO 2023-04-07 11:21:26,454 start_port: 6070
LAUNCH INFO 2023-04-07 11:21:26,454 trainer_num: None
LAUNCH INFO 2023-04-07 11:21:26,454 trainers: 
LAUNCH INFO 2023-04-07 11:21:26,454 training_script: train.py
LAUNCH INFO 2023-04-07 11:21:26,454 training_script_args: []
LAUNCH INFO 2023-04-07 11:21:26,454 with_gloo: 1
LAUNCH INFO 2023-04-07 11:21:26,454 --------------------------------------------------
LAUNCH INFO 2023-04-07 11:21:26,455 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-04-07 11:21:26,458 Run Pod: kimyen, replicas 2, status ready
LAUNCH INFO 2023-04-07 11:21:26,470 Watching Pod: kimyen, replicas 2, status running
/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
I0407 11:21:29.226342 20581 tcp_utils.cc:181] The server starts to listen on IP_ANY:56184
I0407 11:21:29.226483 20581 tcp_utils.cc:130] Successfully connected to 192.168.0.113:56184
LAUNCH INFO 2023-04-07 11:21:31,476 Pod failed
LAUNCH ERROR 2023-04-07 11:21:31,476 Container failed !!!
Container rank 0 status failed cmd ['/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin/python', '-u', 'train.py'] code -11 log log/workerlog.0 
env {'LC_PAPER': 'zh_CN.UTF-8', 'LC_ADDRESS': 'zh_CN.UTF-8', 'XDG_SESSION_ID': '603', 'LC_MONETARY': 'zh_CN.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'SSH_CLIENT': '114.213.210.253 61148 22', 'CONDA_SHLVL': '2', 'CONDA_PROMPT_MODIFIER': '(ViSTA) ', 'LC_NUMERIC': 'zh_CN.UTF-8', 'OLDPWD': '/mnt/hd/1/buchaofei/ViSTA/log', 'GTK_MODULES': 'gail:atk-bridge', 'SSH_TTY': '/dev/pts/8', 'CUDA_HOME': ':/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda', 'USER': 'buchaofei', 'LD_LIBRARY_PATH': ':/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd
=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'CONDA_EXE': '/mnt/hd/1/buchaofei/anaconda3/bin/conda', '_CE_CONDA': '', 'CONDA_PREFIX_1': '/mnt/hd/1/buchaofei/anaconda3', 'MAIL': '/var/mail/buchaofei', 'PATH': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin:/mnt/hd/1/buchaofei/anaconda3/condabin:/home/buchaofei/bin:/home/buchaofei/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/mnt/hd/1/buchaofei/cuda/bin', 'QT_QPA_PLATFORMTHEME': 'appmenu-qt5', 'CONDA_PREFIX': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', 'PWD': '/mnt/hd/1/buchaofei/ViSTA', 'LANG': 'en_US.UTF-8', 'LC_MEASUREMENT': 'zh_CN.UTF-8', '_CE_M': '', 'SHLVL': '1', 'HOME': '/home/buchaofei', 'CONDA_PYTHON_EXE': '/mnt/hd/1/buchaofei/anaconda3/bin/python', 'LOGNAME': 'buchaofei', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '114.213.210.253 61148 192.168.0.113 22', 'CONDA_DEFAULT_ENV': 'ViSTA', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'XDG_RUNTIME_DIR': '/run/user/1212', 'DISPLAY': 'localhost:11.0', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'LC_TIME': 'zh_CN.UTF-8', 'LC_NAME': 'zh_CN.UTF-8', '_': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'kimyen', 'PADDLE_MASTER': '192.168.0.113:56184', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '192.168.0.113:56185,192.168.0.113:56186', 'PADDLE_CURRENT_ENDPOINT': '192.168.0.113:56185', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '1'}
LAUNCH INFO 2023-04-07 11:21:31,476 ------------------------- ERROR LOG DETAIL -------------------------
/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
I0407 11:21:29.226342 20581 tcp_utils.cc:181] The server starts to listen on IP_ANY:56184
I0407 11:21:29.226483 20581 tcp_utils.cc:130] Successfully connected to 192.168.0.113:56184
LAUNCH INFO 2023-04-07 11:21:31,677 Exit code -11

atmip9wb 1#

Could you check whether log/workerlog.0 or log/workerlog.1 contains a more detailed error?
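
The launcher output above reports `code -11` for rank 0. If the launcher passes through the child's return code the way Python's `subprocess` module does (an assumption, not confirmed in this thread), a negative value means the process was killed by that signal number, and signal 11 is SIGSEGV. The sketch below decodes such a code and prints the tail of each worker log under `log/`; the helper names `describe_exit_code` and `tail` are made up for illustration.

```python
import signal
from pathlib import Path


def describe_exit_code(code: int) -> str:
    """Describe a subprocess-style return code (negative means killed by a signal)."""
    if code < 0:
        try:
            return f"killed by signal {-code} ({signal.Signals(-code).name})"
        except ValueError:
            # The number does not map to a known signal on this platform.
            return f"killed by signal {-code}"
    return f"exited with status {code}"


def tail(path: Path, n: int = 20) -> list:
    """Return the last n lines of a text file, or an empty list if it is missing."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-n:]


if __name__ == "__main__":
    print(describe_exit_code(-11))
    # log/ is the log_dir shown in the launcher configuration.
    for log in sorted(Path("log").glob("workerlog.*")):
        print(f"==== {log} ====")
        print("\n".join(tail(log)))
```

A segmentation fault in one rank would also be consistent with the other rank seeing a broken TCP connection, but the worker logs are what can confirm that.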

k7fdbhmy 2#

"Could you check whether log/workerlog.0 or log/workerlog.1 contains a more detailed error?"

This is what workerlog.1 contains:

/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
I0407 11:21:29.237982 20583 tcp_utils.cc:130] Successfully connected to 192.168.0.113:56184
Traceback (most recent call last):
  File "train.py", line 23, in <module>
    fleet.init(is_collective=True)
  File "/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/paddle/distributed/fleet/fleet.py", line 311, in init
    paddle.distributed.init_parallel_env()
  File "/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/paddle/distributed/parallel.py", line 297, in init_parallel_env
    paddle.distributed.barrier(group=group)
  File "/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/paddle/distributed/collective.py", line 280, in barrier
    task = group.process_group.barrier()
ValueError: (InvalidArgument) TCP send error. Details: Broken pipe.
  [Hint: Expected byte_sent > 0, but received byte_sent:-1 <= 0:0.] (at /paddle/paddle/fluid/distributed/store/tcp_utils.h:86)

2g32fytz 3#

Try switching to a different port.
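
As a sketch of how one might act on this: check that a port is actually free before relaunching. The launcher configuration above lists a `master` field, and the printed command assumes it can be set with a `--master` option; confirm this with `python -m paddle.distributed.launch --help` for your version. The helper names `port_is_free` and `pick_free_port` are hypothetical.

```python
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port (another process may take it later)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = pick_free_port()
    print(f"free port: {port}")
    # Assumed option: --master mirrors the `master` field in the launch configuration.
    print(f"python -m paddle.distributed.launch --master 127.0.0.1:{port} --gpus 1,3 train.py")
```

Note that the failing run had already connected to the rendezvous port successfully, so a port conflict is only one possible cause.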

gv8xihay 4#

Hello, has this issue been resolved?

2eafrhcq 5#

"Hello, has this issue been resolved?"

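Adjust your distributed code to follow the guide on the official Paddle website (in case the code contains mistakes), and don't forget to install NCCL. Finally, run Paddle's distributed test code to check whether it can run on multiple GPUs.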
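
A minimal sketch of such a check, under two assumptions: that NCCL is visible to the system's library lookup (a "not found" result is not conclusive, since the library may live in a directory the lookup does not search), and that `paddle.utils.run_check()` is an acceptable stand-in for the "distributed test code" mentioned here. The helper names `nccl_library` and `paddle_installed` are hypothetical.

```python
import ctypes.util
import importlib.util


def nccl_library():
    """Return the NCCL library name found by the system library lookup, or None."""
    return ctypes.util.find_library("nccl")


def paddle_installed() -> bool:
    """Return True if the paddle package can be imported in this environment."""
    return importlib.util.find_spec("paddle") is not None


if __name__ == "__main__":
    print("NCCL library:", nccl_library() or "not found by the library lookup")
    if paddle_installed():
        import paddle

        # Paddle's built-in installation check; it reports on the GPUs it can use.
        paddle.utils.run_check()
    else:
        print("paddle is not installed in this environment")
```

If the check passes on a single GPU but fails on several, that points toward the multi-GPU communication setup rather than `train.py` itself.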
