Please ask your question
Environment: Ubuntu 16.04, CUDA 10.2, paddlepaddle-gpu 2.4.2, paddlenlp 2.5.2, Python 3.7
Command: python -m paddle.distributed.launch --gpus 1,3 train.py
Error:
LAUNCH INFO 2023-04-07 11:21:26,453 ----------- Configuration ----------------------
LAUNCH INFO 2023-04-07 11:21:26,454 devices: 1,3
LAUNCH INFO 2023-04-07 11:21:26,454 elastic_level: -1
LAUNCH INFO 2023-04-07 11:21:26,454 elastic_timeout: 30
LAUNCH INFO 2023-04-07 11:21:26,454 gloo_port: 6767
LAUNCH INFO 2023-04-07 11:21:26,454 host: None
LAUNCH INFO 2023-04-07 11:21:26,454 ips: None
LAUNCH INFO 2023-04-07 11:21:26,454 job_id: default
LAUNCH INFO 2023-04-07 11:21:26,454 legacy: False
LAUNCH INFO 2023-04-07 11:21:26,454 log_dir: log
LAUNCH INFO 2023-04-07 11:21:26,454 log_level: INFO
LAUNCH INFO 2023-04-07 11:21:26,454 master: None
LAUNCH INFO 2023-04-07 11:21:26,454 max_restart: 3
LAUNCH INFO 2023-04-07 11:21:26,454 nnodes: 1
LAUNCH INFO 2023-04-07 11:21:26,454 nproc_per_node: None
LAUNCH INFO 2023-04-07 11:21:26,454 rank: -1
LAUNCH INFO 2023-04-07 11:21:26,454 run_mode: collective
LAUNCH INFO 2023-04-07 11:21:26,454 server_num: None
LAUNCH INFO 2023-04-07 11:21:26,454 servers:
LAUNCH INFO 2023-04-07 11:21:26,454 start_port: 6070
LAUNCH INFO 2023-04-07 11:21:26,454 trainer_num: None
LAUNCH INFO 2023-04-07 11:21:26,454 trainers:
LAUNCH INFO 2023-04-07 11:21:26,454 training_script: train.py
LAUNCH INFO 2023-04-07 11:21:26,454 training_script_args: []
LAUNCH INFO 2023-04-07 11:21:26,454 with_gloo: 1
LAUNCH INFO 2023-04-07 11:21:26,454 --------------------------------------------------
LAUNCH INFO 2023-04-07 11:21:26,455 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-04-07 11:21:26,458 Run Pod: kimyen, replicas 2, status ready
LAUNCH INFO 2023-04-07 11:21:26,470 Watching Pod: kimyen, replicas 2, status running
/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
I0407 11:21:29.226342 20581 tcp_utils.cc:181] The server starts to listen on IP_ANY:56184
I0407 11:21:29.226483 20581 tcp_utils.cc:130] Successfully connected to 192.168.0.113:56184
LAUNCH INFO 2023-04-07 11:21:31,476 Pod failed
LAUNCH ERROR 2023-04-07 11:21:31,476 Container failed !!!
Container rank 0 status failed cmd ['/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin/python', '-u', 'train.py'] code -11 log log/workerlog.0
env {'LC_PAPER': 'zh_CN.UTF-8', 'LC_ADDRESS': 'zh_CN.UTF-8', 'XDG_SESSION_ID': '603', 'LC_MONETARY': 'zh_CN.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'SSH_CLIENT': '114.213.210.253 61148 22', 'CONDA_SHLVL': '2', 'CONDA_PROMPT_MODIFIER': '(ViSTA) ', 'LC_NUMERIC': 'zh_CN.UTF-8', 'OLDPWD': '/mnt/hd/1/buchaofei/ViSTA/log', 'GTK_MODULES': 'gail:atk-bridge', 'SSH_TTY': '/dev/pts/8', 'CUDA_HOME': ':/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda:/mnt/hd/1/buchaofei/cuda/bin:/mnt/hd/1/buchaofei/cuda', 'USER': 'buchaofei', 'LD_LIBRARY_PATH': ':/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64:/mnt/hd/1/buchaofei/cuda/lib64', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd
=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'CONDA_EXE': '/mnt/hd/1/buchaofei/anaconda3/bin/conda', '_CE_CONDA': '', 'CONDA_PREFIX_1': '/mnt/hd/1/buchaofei/anaconda3', 'MAIL': '/var/mail/buchaofei', 'PATH': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin:/mnt/hd/1/buchaofei/anaconda3/condabin:/home/buchaofei/bin:/home/buchaofei/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/mnt/hd/1/buchaofei/cuda/bin', 'QT_QPA_PLATFORMTHEME': 'appmenu-qt5', 'CONDA_PREFIX': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', 'PWD': '/mnt/hd/1/buchaofei/ViSTA', 'LANG': 'en_US.UTF-8', 'LC_MEASUREMENT': 'zh_CN.UTF-8', '_CE_M': '', 'SHLVL': '1', 'HOME': '/home/buchaofei', 'CONDA_PYTHON_EXE': '/mnt/hd/1/buchaofei/anaconda3/bin/python', 'LOGNAME': 'buchaofei', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '114.213.210.253 61148 192.168.0.113 22', 'CONDA_DEFAULT_ENV': 'ViSTA', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'XDG_RUNTIME_DIR': '/run/user/1212', 'DISPLAY': 'localhost:11.0', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'LC_TIME': 'zh_CN.UTF-8', 'LC_NAME': 'zh_CN.UTF-8', '_': '/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'kimyen', 'PADDLE_MASTER': '192.168.0.113:56184', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '192.168.0.113:56185,192.168.0.113:56186', 'PADDLE_CURRENT_ENDPOINT': '192.168.0.113:56185', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '1'}
LAUNCH INFO 2023-04-07 11:21:31,476 ------------------------- ERROR LOG DETAIL -------------------------
/mnt/hd/1/buchaofei/anaconda3/envs/ViSTA/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
I0407 11:21:29.226342 20581 tcp_utils.cc:181] The server starts to listen on IP_ANY:56184
I0407 11:21:29.226483 20581 tcp_utils.cc:130] Successfully connected to 192.168.0.113:56184
LAUNCH INFO 2023-04-07 11:21:31,677 Exit code -11
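A note on the `Exit code -11` line: Python's `subprocess` module reports a child that was killed by signal N as return code -N. Assuming the launcher passes that value through unchanged, -11 corresponds to signal 11, which is SIGSEGV on Linux. That would point to a crash in native code rather than a Python exception, which may be why no traceback appears in the output. A minimal sketch for decoding such codes (`describe_exit_code` is a name made up for this example, not part of Paddle):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        # Negative values follow the subprocess convention:
        # -N means the process was killed by signal N.
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by unknown signal {-code}"
    return "exited normally" if code == 0 else f"exited with status {code}"


print(describe_exit_code(-11))
```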
5 answers
atmip9wb #1
Check log/workerlog.0 or log/workerlog.1. Do they contain a detailed error message?
k7fdbhmy #2
"Check log/workerlog.0 or log/workerlog.1. Do they contain a detailed error message?"
This is what workerlog.1 looks like.
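For readers who want to run the log check suggested in this thread, here is a minimal sketch that collects the last lines of every worker log. It assumes the logs live in the `log` directory named by `log_dir` in the launcher output and are called `workerlog.N`; `tail_worker_logs` is a hypothetical helper, not a Paddle function.

```python
from collections import deque
from pathlib import Path


def tail_worker_logs(log_dir: str = "log", lines: int = 40) -> dict:
    """Return the last `lines` lines of every workerlog.* file in log_dir."""
    tails = {}
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        with path.open(errors="replace") as handle:
            # A deque with maxlen keeps only the final lines of the file.
            tails[path.name] = list(deque(handle, maxlen=lines))
    return tails


for name, tail in tail_worker_logs().items():
    print(f"===== {name} =====")
    print("".join(tail), end="")
```

With a segmentation fault the Python traceback is often missing, so the final lines before the crash are usually the most informative part of each file.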
2g32fytz #3
Try changing the port.
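One way to follow this suggestion is to ask the operating system for a port that is currently free, as sketched below. How the port is then handed to the launcher is an assumption to verify: the configuration dump in the question lists `master`, `start_port` and `gloo_port` fields, and `python -m paddle.distributed.launch --help` should show which command-line options set them in your version. `find_free_port` is a hypothetical helper.

```python
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a TCP port that is unused at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


print(find_free_port())
```

The port is only guaranteed to be free at the time of the call; another process could claim it before the training job starts, so treat the result as a good candidate rather than a reservation.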
gv8xihay #4
Hello, has this problem been solved?
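2eafrhcq #5
"Hello, has this problem been solved?"
Follow the official Paddle documentation to adjust your distributed code (in case it contains errors), make sure NCCL is installed, and then run Paddle's distributed test code to check whether training works on multiple GPUs.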
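As a rough illustration of such a test, the sketch below performs a single `all_reduce` across the launched processes. It is not the official Paddle test script. It assumes a GPU build of Paddle with working NCCL, and the file name `check_allreduce.py` and the helper `expected_all_reduce_sum` are made up for this example.

```python
import os


def expected_all_reduce_sum(world_size: int) -> float:
    """Sum of 1 + 2 + ... + world_size, the value every rank should report."""
    return world_size * (world_size + 1) / 2.0


def main() -> None:
    # Imported lazily so the file can be loaded on machines without Paddle.
    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()  # initialise the collective environment
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes rank + 1; all_reduce sums the tensors in place.
    value = paddle.to_tensor([float(rank + 1)])
    dist.all_reduce(value)
    result = float(value.numpy()[0])
    print(f"rank {rank}: got {result}, expected {expected_all_reduce_sum(world_size)}")


# PADDLE_TRAINER_ID appears in the launcher's environment dump in the question,
# so it is used here as a hint that the script was started by the launcher.
if __name__ == "__main__" and "PADDLE_TRAINER_ID" in os.environ:
    main()
```

Saved as `check_allreduce.py`, it can be started with the same command as in the question, replacing `train.py` with `check_allreduce.py`. If this small script also exits with code -11, the problem most likely lies in the environment (driver, CUDA, NCCL) rather than in `train.py`. Paddle's built-in installation check, `paddle.utils.run_check()`, is another quick test to try first.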