7 answers
ss2ws0br1#
Hi! We've received your issue and will arrange for engineers to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version info, and the error message. You can also consult the official API docs, FAQ, past GitHub issues, and the AI community for an answer. Have a nice day!
t1qtbnec2#
For the IP and port settings, you can refer to https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html
Also, with the new version you don't need to set port numbers; they are detected automatically.
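For example, a minimal sketch of a two-node parameter-server launch (assuming 169.254.60.61 is the server machine and 169.254.94.75 is the worker machine; port 36011 is only an example, and the same command is run on both nodes so the launcher can match each local IP against the lists):

# run this same command on both the server and the worker machine
$ python -m paddle.distributed.launch \
      --servers="169.254.60.61:36011" \
      --workers="169.254.94.75:36011" \
      train.py --lr=0.01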
zyfwsgd63#
After reading it I changed the launch command, but now I get the problem below. Where did it go wrong?
$ python -m paddle.distributed.launch --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" train.py --lr=0.01
LAUNCH INFO 2022-10-19 15:12:10,743 ----------- Configuration ----------------------
LAUNCH INFO 2022-10-19 15:12:10,744 devices: None
LAUNCH INFO 2022-10-19 15:12:10,744 elastic_level: -1
LAUNCH INFO 2022-10-19 15:12:10,744 elastic_timeout: 30
LAUNCH INFO 2022-10-19 15:12:10,744 gloo_port: 6767
LAUNCH INFO 2022-10-19 15:12:10,744 host: None
LAUNCH INFO 2022-10-19 15:12:10,744 job_id: default
LAUNCH INFO 2022-10-19 15:12:10,744 legacy: False
LAUNCH INFO 2022-10-19 15:12:10,744 log_dir: log
LAUNCH INFO 2022-10-19 15:12:10,744 log_level: INFO
LAUNCH INFO 2022-10-19 15:12:10,744 master: None
LAUNCH INFO 2022-10-19 15:12:10,744 max_restart: 3
LAUNCH INFO 2022-10-19 15:12:10,744 nnodes: 1
LAUNCH INFO 2022-10-19 15:12:10,744 nproc_per_node: None
LAUNCH INFO 2022-10-19 15:12:10,744 rank: -1
LAUNCH INFO 2022-10-19 15:12:10,744 run_mode: collective
LAUNCH INFO 2022-10-19 15:12:10,744 server_num: None
LAUNCH INFO 2022-10-19 15:12:10,744 servers: 169.254.60.61:36011
LAUNCH INFO 2022-10-19 15:12:10,744 trainer_num: None
LAUNCH INFO 2022-10-19 15:12:10,744 trainers:
LAUNCH INFO 2022-10-19 15:12:10,744 training_script: train.py
LAUNCH INFO 2022-10-19 15:12:10,744 training_script_args: ['--lr=0.01']
LAUNCH INFO 2022-10-19 15:12:10,744 with_gloo: 0
LAUNCH INFO 2022-10-19 15:12:10,744 --------------------------------------------------
LAUNCH WARNING 2022-10-19 15:12:10,744 Compatible mode enable with args ['--workers=169.254.94.75:36011']
----------- Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers: 169.254.60.61:36011
training_script: train.py
training_script_args: ['--lr=0.01']
worker_num: None
workers: 169.254.94.75:36011
INFO 2022-10-19 15:12:10,745 launch.py:494] Run parameter-sever mode. pserver arguments:['--servers', '--workers'], accelerators count:0
INFO 2022-10-19 15:12:10,745 launch.py:494] Run parameter-sever mode. pserver arguments:['--servers', '--workers'], accelerators count:0
WARNING 2022-10-19 15:12:10,745 launch.py:714] launch start with CPUONLY mode
WARNING 2022-10-19 15:12:10,745 launch.py:714] launch start with CPUONLY mode
/usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import fnmatch, glob, traceback, errno, sys, atexit, locale, imp, stat
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/launch/main.py", line 17, in
launch()
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/launch/main.py", line 241, in launch
launch.launch()
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch.py", line 726, in launch
launch_ps(args, distribute_mode)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch.py", line 430, in launch_ps
ps_launcher = ParameterServerLauncher(args, distribute_mode)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch_utils.py", line 1184, ininit
self.get_role_endpoints(args)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/launch_utils.py", line 1421, in get_role_endpoints
assert self.current_node_ip in self.node_ips, "Can't find your local ip {%s} in args.servers and args.workers ips: {%s}"
AssertionError: Can't find your local ip {127.0.1.1} in args.servers and args.workers ips: {['169.254.60.61', '169.254.94.75']}
3b6akqbq4#
The current setup: two machines deployed locally, one as the server and one as the worker, for PS (parameter-server) distributed training.
lp0sw83n5#
You can take a look at this ISSUE ~
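For context, my reading of the error (not official guidance): the launcher resolved the local IP as 127.0.1.1, which is what Ubuntu's default /etc/hosts maps the hostname to, and that address is not in the --servers/--workers lists, so the assert fires. You can check what each machine actually resolves to, for example:

$ hostname -I                  # the machine's real addresses, e.g. 169.254.60.61
$ getent hosts "$(hostname)"   # on stock Ubuntu this often shows 127.0.1.1

One workaround discussed there is to export POD_IP with the machine's real address before launching.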
vc6uscn96#
Hi, I followed the reference and gave it a try.
Running it with option 1 from the picture, I still get the same error.
Following option 2, with export POD_IP=169.254.60.61 (this is my server's IP), the server side runs normally but keeps hanging at the step shown in the picture below; I ended it with Ctrl+C.
I did the same on the worker side with export POD_IP=169.254.94.75, see the picture below.
The worker side ran into a problem. How should I solve this?
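In command form, roughly what I ran on each machine (same launch command and port as in my earlier attempt):

# on the server machine (169.254.60.61)
$ export POD_IP=169.254.60.61
$ python -m paddle.distributed.launch --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" train.py --lr=0.01

# on the worker machine (169.254.94.75)
$ export POD_IP=169.254.94.75
$ python -m paddle.distributed.launch --servers="169.254.60.61:36011" --workers="169.254.94.75:36011" train.py --lr=0.01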
vlf7wbxs7#
Adding one more screenshot: this is the output after the server side hung and I ended it with Ctrl+C.
One question: does this mean the server was just waiting for the worker?