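Using fluid 1.6, I am running distributed CPU training on an MPI cluster (not the PaddleCloud MPI): 50 nodes, two MPI processes per node, with the Adam optimizer.
Abbreviated configuration: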
import paddle
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler import DistributeTranspilerConfig
from paddle.fluid.incubate.fleet.base import role_maker
role = role_maker.MPISymetricRoleMaker()
fleet.init(role)
config = DistributeTranspilerConfig()
config.sync_mode = True
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(subnet.cost)
MPI launch script:
mpirun -wdir 1/2/3 -npernode 2 -timestamp-output -tag-output -machinefile "${PBS_NODEFILE}" python/bin/python train.py
Relevant log details (excerpts):
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
/home/disk1/normandy/maybach/app-user-20191106110848-7309/workspace/1/2/3/python/lib/python2.7/site-packages/paddle/fluid/executor.py:790: UserWarning: The current program is empty.
grpc_server.cc:472] Server listening on 10.182.14.140:61991 selected port: 61991
Wed Nov 6 11:22:42 2019[1,79]:I1106 11:22:42.877466 2336 rpc_client.h:106] init rpc client with trainer_id 39
Wed Nov 6 11:23:00 2019[1,0]:E1106 11:23:00.299126 18303 variable_response.cc:100] recved var should not on current server: fc_0.b_0@GRAD.trainer_13
Wed Nov 6 11:23:00 2019[1,27]:F1106 11:23:00.287508 25203 grpc_client.cc:508] SendRPC name:[fc_0.b_0@GRAD.trainer_13], ep:[10.182.13.150:61991], status:[-1] meets grpc error, error_code:13 error_message:Unable to parse request error_details:
Wed Nov 6 11:23:00 2019[1,27]:Check failure stack trace:
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4c0f6f31d google::LogMessage::Fail()
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4c0f72dcc google::LogMessage::SendToLog()
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4c0f6ee43 google::LogMessage::Flush()
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4c0f742de google::LogMessageFatal::~LogMessageFatal()
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4c204a8f5 paddle::operators::distributed::GRPCClient::Proceed()
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb3ecfaa8a0 execute_native_thread_routine
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4f80111c3 start_thread
Wed Nov 6 11:23:00 2019[1,27]: @ 0x7fb4f763912d __clone
Wed Nov 6 11:23:00 2019[1,27]: @ (nil) (unknown)
mpirun noticed that process rank 27 with PID 44175 on node 10.182.104.144 exited on signal 6 (Aborted).
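For reading the rank tags in these logs, here is a minimal sketch of how MPI ranks appear to map to roles. It rests on an assumption inferred only from the excerpt above, not from the Paddle documentation: even ranks act as parameter servers, odd ranks act as trainers, and the id is rank // 2. The excerpt is consistent with this: rank 79 logs trainer_id 39, rank 0 reports a server-side error, and rank 27 fails while sending fc_0.b_0@GRAD.trainer_13. The function name role_for_rank is hypothetical.

```python
def role_for_rank(rank):
    """Guess the role of an MPI rank under the ASSUMED even/odd split.

    Assumption inferred from the log excerpt, not a documented guarantee:
    even ranks serve parameters, odd ranks train, and the id is rank // 2.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    kind = "server" if rank % 2 == 0 else "trainer"
    return kind, rank // 2


if __name__ == "__main__":
    # Ranks that appear in the log excerpt.
    for rank in (0, 27, 79):
        print(rank, role_for_rank(rank))
```

If the assumption holds, the fatal error on rank 27 comes from trainer 13, which matches the variable name in the SendRPC message.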
7 answers
mznpcxlj1#
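According to the log, the current program is empty. If this is training, please check that exe.run(startup_program) was called; if it is evaluation, please check that the weights were loaded correctly.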
bpzcxfmw2#
@heavengate This is a training run.
if fleet.is_server():
    init_path = self.model_conf.get("train", "%s_init_from_model" % network_name)
    if init_path is not None and len(init_path) > 0:
        # Warm start: load from the directory that contains the model file.
        last_slash = init_path.rfind('/')
        dirname = init_path[:last_slash]
        fleet.init_server(dirname)
    else:
        fleet.init_server()
    fleet.run_server()
elif fleet.is_worker():
    exe = fluid.Executor(place)
    self.train_loop(
        main_program=main_program,
        exe=exe,
        subnet=subnet,
        network_name=network_name,
        place=place)
    fleet.stop_worker()

def train_loop(self, main_program, exe, subnet, network_name, place):
    """Per-pass training loop."""
    exe.run(fluid.default_startup_program())
The first statement in train_loop() is exe.run(fluid.default_startup_program()).
uqxowvwt3#
On the trainer, call fleet.init_worker() before run(startup_program).
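A minimal sketch of that call order on the trainer side. To stay runnable without a Paddle install, fleet and exe are passed in as arguments instead of being imported; run_worker and train_fn are hypothetical names for illustration, and only init_worker(), startup_program, exe.run() and stop_worker() are taken from the code in this thread.

```python
def run_worker(fleet, exe, train_fn):
    """Run one trainer process in the order suggested in this answer.

    `fleet` and `exe` are the objects used elsewhere in this thread;
    `train_fn` is a hypothetical user-supplied training loop.
    """
    fleet.init_worker()             # must come before the startup program
    exe.run(fleet.startup_program)  # initialize parameters on this trainer
    train_fn(exe)                   # the actual training passes
    fleet.stop_worker()             # tell the servers this trainer is done
```

In the asker's code this would correspond to calling fleet.init_worker() inside the fleet.is_worker() branch, before train_loop() runs the startup program.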
lawou6xi4#
@MrChengmo The error log differs slightly from before: it now contains the additional lines below. Some of the new lines:
Wed Nov 6 19:33:14 2019[1,27]:server not ready, wait 3 sec to retry...
Wed Nov 6 19:33:14 2019[1,27]:not ready endpoints:['10.182.18.148:61991', '10.182.8.25:61991', '10.182.12.38:61991', '10.182.70.15:61991', '10.182.69.28:61991', '10.182.120.18:61991', '10.182.111.146:61991', '10.182.110.139:61991', '10.182.120.15:61991', '10.182.78.22:61991', '10.182.21.15:61991', '10.182.8.152:61991', '10.182.78.28:61991', '10.182.22.162:61991', '10.182.69.12:61991', '10.182.70.18:61991', '10.182.76.38:61991', '10.182.78.142:61991', '10.182.69.35:61991', '10.182.8.38:61991', '10.182.74.25:61991', '10.182.69.142:61991', '10.182.7.32:61991', '10.182.69.148:61991', '10.182.79.18:61991', '10.182.76.152:61991', '10.182.69.32:61991', '10.182.79.162:61991', '10.182.8.155:61991', '10.182.78.35:61991', '10.182.110.153:61991', '10.182.70.22:61991', '10.182.8.15:61991', '10.182.75.15:61991']
Other log lines are the same as before; excerpts:
Wed Nov 6 19:33:14 2019[1,54]:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Wed Nov 6 19:33:14 2019[1,42]:/home/disk1/normandy/maybach/app-user-20191106191413-7928/workspace/1/2/3/python/lib/python2.7/site-packages/paddle/fluid/executor.py:790: UserWarning: The current program is empty.
Wed Nov 6 19:33:14 2019[1,78]:I1106 19:33:14.284071 8040 grpc_server.cc:472] Server listening on 10.182.8.22:61991 selected port: 61991
Wed Nov 6 19:33:20 2019[1,71]:I1106 19:33:20.108784 38182 rpc_client.h:106] init rpc client with trainer_id 35
Wed Nov 6 19:33:35 2019[1,0]:E1106 19:33:35.401274 43675 variable_response.cc:100] recved var should not on current server: fc_0.b_0@GRAD.trainer_14
Wed Nov 6 19:33:35 2019[1,29]:F1106 19:33:35.406313 43586 grpc_client.cc:508] SendRPC name:[fc_0.b_0@GRAD.trainer_14], ep:[10.182.18.148:61991], status:[-1] meets grpc error, error_code:13 error_message:Unable to parse request error_details:
Wed Nov 6 19:33:35 2019[1,29]:***Check failure stack trace:***
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f15c9ca231d google::LogMessage::Fail()
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f15c9ca5dcc google::LogMessage::SendToLog()
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f15c9ca1e43 google::LogMessage::Flush()
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f15c9ca72de google::LogMessageFatal::~LogMessageFatal()
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f15cad7d8f5 paddle::operators::distributed::GRPCClient::Proceed()
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f1587fce8a0 execute_native_thread_routine
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f16afe771c3 start_thread
Wed Nov 6 19:33:35 2019[1,29]: @ 0x7f16af49f12d __clone
Wed Nov 6 19:33:35 2019[1,29]: @ (nil) (unknown)
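With 100 processes writing to one stream, it can help to group the error lines by rank. Below is a rough sketch that assumes the line layout seen in the excerpts above: a timestamp, then a [job,rank]: tag added by mpirun, then the process output, where glog lines start with a severity letter (I, W, E or F) and a four-digit date. The regular expressions and the name errors_by_rank are illustrative and were written against these excerpts only.

```python
import re

# Prefix seen in this thread, e.g. "Wed Nov 6 19:33:35 2019[1,29]:".
TAG = re.compile(r"^(?P<ts>.*?)\[(?P<job>\d+),(?P<rank>\d+)\]:(?P<msg>.*)$")
# glog lines begin with a severity letter and an MMDD date, e.g. "F1106 ".
GLOG = re.compile(r"^(?P<sev>[IWEF])\d{4} ")


def errors_by_rank(lines):
    """Map MPI rank -> list of glog ERROR/FATAL messages from that rank."""
    found = {}
    for line in lines:
        tagged = TAG.match(line)
        if not tagged:
            continue  # line without an mpirun tag
        msg = tagged.group("msg")
        glog = GLOG.match(msg)
        if glog and glog.group("sev") in "EF":
            found.setdefault(int(tagged.group("rank")), []).append(msg)
    return found


if __name__ == "__main__":
    import sys
    # Usage: python this_script.py < job.log
    for rank, msgs in sorted(errors_by_rank(sys.stdin).items()):
        print(rank, len(msgs), msgs[0])
```

Applied to the excerpt above, this would surface the E line from rank 0 and the F line from rank 29, which are the two sides of the same failed SendRPC.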
Code updated according to the answer above:
a8jjtwal5#
elif fleet.is_worker():
    exe = fluid.Executor(place)
    fleet.init_worker()
    exe.run(fleet.startup_program)
    self.train_loop(
        main_program=main_program,
        exe=exe,
        subnet=subnet,
        network_name=network_name,
        place=place)
    fleet.stop_worker()
3ks5zfa06#
@seiriosPlus I changed the original exe.run(fluid.default_startup_program()) to exe.run(fleet.startup_program); the error messages in the log are the same.
cu6pst1q7#