Paddle distributed training: IP and port are set in the script, but the log shows the IP 127.0.1.1

eoxn13cs · posted 2022-10-20 in Other

Follow (0) | Answers (7) | Views (201)


Two machines are deployed locally for distributed training; they are connected directly with a network cable and can reach each other. The example being run is the one in wide_and_deep_dataset.
In the train.py script, the IP and port are set as shown in the screenshot below (image not preserved):

The log shows endpoint: 127.0.1.1, as in the screenshot below (image not preserved):

Two questions:
1. Are my IP and port settings correct?
2. The log still shows 127.0.1.1; how do I fix this?

ss2ws0br

ss2ws0br1#

Hi! We've received your issue; please be patient while it gets a response. We will arrange technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to describe your request. You may also check out the API docs, FAQ, GitHub Issues, and the AI community to get an answer. Have a nice day!

t1qtbnec

t1qtbnec2#

For the IP/port setup, you can refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html
Also, newer versions do not require setting the port number; it is detected automatically.
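In PS mode the same launch command is run on every node with the full --servers and --workers lists, and each node identifies its own role by matching its local IP against those lists. A two-node sketch using the IP addresses from this thread (the port is a placeholder):

```shell
# Run on the server machine (169.254.60.61):
python -m paddle.distributed.launch \
    --servers="169.254.60.61:36011" \
    --workers="169.254.94.75:36011" \
    train.py --lr=0.01

# Run the identical command on the worker machine (169.254.94.75);
# the launcher matches the local IP against the lists to pick the role.
python -m paddle.distributed.launch \
    --servers="169.254.60.61:36011" \
    --workers="169.254.94.75:36011" \
    train.py --lr=0.01
```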

zyfwsgd6

zyfwsgd63#

After reading that, I changed the launch command, but now I get the problem below. Where did this go wrong?

$ python -m paddle.distributed.launch --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" train.py --lr=0.01
LAUNCH INFO 2022-10-19 15:12:10,743 ----------- Configuration ----------------------
LAUNCH INFO 2022-10-19 15:12:10,744 devices: None
LAUNCH INFO 2022-10-19 15:12:10,744 elastic_level: -1
LAUNCH INFO 2022-10-19 15:12:10,744 elastic_timeout: 30
LAUNCH INFO 2022-10-19 15:12:10,744 gloo_port: 6767
LAUNCH INFO 2022-10-19 15:12:10,744 host: None
LAUNCH INFO 2022-10-19 15:12:10,744 job_id: default
LAUNCH INFO 2022-10-19 15:12:10,744 legacy: False
LAUNCH INFO 2022-10-19 15:12:10,744 log_dir: log
LAUNCH INFO 2022-10-19 15:12:10,744 log_level: INFO
LAUNCH INFO 2022-10-19 15:12:10,744 master: None
LAUNCH INFO 2022-10-19 15:12:10,744 max_restart: 3
LAUNCH INFO 2022-10-19 15:12:10,744 nnodes: 1
LAUNCH INFO 2022-10-19 15:12:10,744 nproc_per_node: None
LAUNCH INFO 2022-10-19 15:12:10,744 rank: -1
LAUNCH INFO 2022-10-19 15:12:10,744 run_mode: collective
LAUNCH INFO 2022-10-19 15:12:10,744 server_num: None
LAUNCH INFO 2022-10-19 15:12:10,744 servers: 169.254.60.61:36011
LAUNCH INFO 2022-10-19 15:12:10,744 trainer_num: None
LAUNCH INFO 2022-10-19 15:12:10,744 trainers:
LAUNCH INFO 2022-10-19 15:12:10,744 training_script: train.py
LAUNCH INFO 2022-10-19 15:12:10,744 training_script_args: ['--lr=0.01']
LAUNCH INFO 2022-10-19 15:12:10,744 with_gloo: 0
LAUNCH INFO 2022-10-19 15:12:10,744 --------------------------------------------------
LAUNCH WARNING 2022-10-19 15:12:10,744 Compatible mode enable with args ['--workers=169.254.94.75:36011']
----------- Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers: 169.254.60.61:36011
training_script: train.py
training_script_args: ['--lr=0.01']
worker_num: None
workers: 169.254.94.75:36011

INFO 2022-10-19 15:12:10,745 launch.py:494] Run parameter-sever mode. pserver arguments:['--servers', '--workers'], accelerators count:0
INFO 2022-10-19 15:12:10,745 launch.py:494] Run parameter-sever mode. pserver arguments:['--servers', '--workers'], accelerators count:0
WARNING 2022-10-19 15:12:10,745 launch.py:714] launch start with CPUONLY mode
WARNING 2022-10-19 15:12:10,745 launch.py:714] launch start with CPUONLY mode
/usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import fnmatch, glob, traceback, errno, sys, atexit, locale, imp, stat
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/launch/main.py", line 17, in <module>
launch()
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/launch/main.py", line 241, in launch
launch.launch()
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch.py", line 726, in launch
launch_ps(args, distribute_mode)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch.py", line 430, in launch_ps
ps_launcher = ParameterServerLauncher(args, distribute_mode)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch_utils.py", line 1184, in __init__
self.get_role_endpoints(args)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch_utils.py", line 1421, in get_role_endpoints
assert self.current_node_ip in self.node_ips, "Can't find your local ip {%s} in args.servers and args.workers ips: {%s}"
AssertionError: Can't find your local ip {127.0.1.1} in args.servers and args.workers ips: {['169.254.60.61', '169.254.94.75']}
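The 127.0.1.1 in the assertion comes from Debian/Ubuntu's default /etc/hosts, which maps the machine's hostname to 127.0.1.1. When the launcher resolves the local IP from the hostname, it gets that loopback alias instead of the LAN address, so it cannot find itself in the --servers/--workers lists. A minimal sketch of the failure mode and the POD_IP override (an illustration, not Paddle's exact code):

```python
import os
import socket

def detect_local_ip():
    """Mimic hostname-based local-IP detection with a POD_IP override.

    Illustrative sketch only; the real launcher honors the POD_IP
    environment variable in a similar way.
    """
    pod_ip = os.environ.get("POD_IP")
    if pod_ip:
        return pod_ip
    # On Debian/Ubuntu this typically resolves through the
    # "127.0.1.1  <hostname>" line in /etc/hosts.
    return socket.gethostbyname(socket.gethostname())

# Overriding with the real LAN address (the server IP from this thread)
# makes the detected IP match the --servers list again.
os.environ["POD_IP"] = "169.254.60.61"
print(detect_local_ip())  # -> 169.254.60.61
```

Alternatively, editing /etc/hosts so the hostname maps to the machine's LAN IP fixes the resolution at the source.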

3b6akqbq

3b6akqbq4#

The current setup: two machines deployed locally, one acting as the server and one as the worker, for PS-mode distributed training.

lp0sw83n

lp0sw83n5#

You can refer to this ISSUE.

vc6uscn9

vc6uscn96#

Hello, I followed the reference and tried it.

Running with option 1 from the screenshot there, I still get the same error.
With option 2, export POD_IP=169.254.60.61 (my server's IP), the server side runs normally but stays stuck at the step in the screenshot below (image not preserved); I ended it with Ctrl+C.

I did the same on the worker side, export POD_IP=169.254.94.75, see the screenshot below (image not preserved).

The worker side hits a problem. How can this be fixed?
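For reference, the full per-node sequence described above would be (a sketch; IPs and ports taken from this thread):

```shell
# Server machine (169.254.60.61): pin the local IP, then launch.
export POD_IP=169.254.60.61
python -m paddle.distributed.launch \
    --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" \
    train.py --lr=0.01

# Worker machine (169.254.94.75): same launch command, its own POD_IP.
export POD_IP=169.254.94.75
python -m paddle.distributed.launch \
    --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" \
    train.py --lr=0.01
```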

vlf7wbxs

vlf7wbxs7#

Adding one more screenshot: the output after pressing Ctrl+C on the stuck server side (image not preserved).

One question: does this mean the server is waiting for the worker?
