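I am following the official manual for multi-machine training: http://paddlepaddle.org/documentation/docs/zh/1.2/user_guides/howto/training/save_load_variables.html
On a single machine, loading the previously saved model with the default program works fine, and training can continue: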
main_program = fluid.default_main_program()
exe = fluid.Executor(place)
startup_prog = fluid.default_startup_program()
exe.run(startup_prog)
fluid.io.load_persistables(exe, 'thirdparty/continue_model-pass-0', startup_prog)
With the multi-machine get_pserver_program, however, it raises an error:
logger.info("run pserver with continue model")
prog = t.get_pserver_program(current_endpoint)
startup = t.get_startup_program(current_endpoint, pserver_program=prog)
exe.run(startup)
fluid.io.load_persistables(exe, args.continue_model_path, startup)
exe.run(prog)
logger.info("pserver starting")
The error is raised when execution reaches fluid.io.load_persistables(exe, args.continue_model_path, startup):
File "train.py", line 269, in <module>
train()
File "train.py", line 254, in train
fluid.io.load_persistables(exe, args.continue_model_path, startup)
File "/home/work/anaconda2/lib/python2.7/site-packages/paddle/fluid/io.py", line 503, in load_persistables
filename=filename)
File "/home/work/anaconda2/lib/python2.7/site-packages/paddle/fluid/io.py", line 377, in load_vars
filename=filename)
File "/home/work/anaconda2/lib/python2.7/site-packages/paddle/fluid/io.py", line 387, in load_vars
new_var = _clone_var_in_block_(load_block, each_var)
File "/home/work/anaconda2/lib/python2.7/site-packages/paddle/fluid/io.py", line 85, in _clone_var_in_block_
lod_level=var.lod_level,
File "/home/work/anaconda2/lib/python2.7/site-packages/paddle/fluid/framework.py", line 418, in lod_level
return self.desc.lod_level()
paddle.fluid.core.EnforceNotMet: Getting 'lod_level' is not supported by the type of var seg_lr_Factors@GRAD. at [/paddle/paddle/fluid/framework/var_desc.cc:173]
PaddlePaddle Call Stacks:
0 0x7fa5a5f0eab6p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7fa5a5fb3b42p paddle::framework::VarDesc::GetLoDLevel() const + 162
2 0x7fa5a5f705cfp void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<int, paddle::framework::VarDesc, , pybind11::name, pybind11::is_method, pybind11::sibling>(int (paddle::framework::VarDesc::*)() const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::VarDesc const*)#1}, int, paddle::framework::VarDesc const*, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<int, paddle::framework::VarDesc, , pybind11::name, pybind11::is_method, pybind11::sibling>(int (paddle::framework::VarDesc::*)() const, pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::VarDesc const*)#1}&&, int (*)(paddle::framework::VarDesc const*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 143
3 0x7fa5a5f43cb4p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 2596
4 0x7fa5cb840eecp PyEval_EvalFrameEx + 33468
5 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
6 0x7fa5cb7cafdap
7 0x7fa5cb7a6773p PyObject_Call + 67
8 0x7fa5cb7a685cp
9 0x7fa5cb7a6952p PyObject_CallFunction + 146
10 0x7fa5cb7e2c94p _PyObject_GenericGetAttrWithDict + 180
11 0x7fa5cb83d9cbp PyEval_EvalFrameEx + 19867
12 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
13 0x7fa5cb83f482p PyEval_EvalFrameEx + 26706
14 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
15 0x7fa5cb83f482p PyEval_EvalFrameEx + 26706
16 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
17 0x7fa5cb83f482p PyEval_EvalFrameEx + 26706
18 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
19 0x7fa5cb83f482p PyEval_EvalFrameEx + 26706
20 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
21 0x7fa5cb83f482p PyEval_EvalFrameEx + 26706
22 0x7fa5cb8424e9p PyEval_EvalCodeEx + 2025
23 0x7fa5cb84270ap PyEval_EvalCode + 26
24 0x7fa5cb85b93dp
25 0x7fa5cb85cab8p PyRun_FileExFlags + 120
26 0x7fa5cb85dcd8p PyRun_SimpleFileExFlags + 232
27 0x7fa5cb86fd3cp Py_Main + 2988
28 0x7fa5caa8e445p __libc_start_main + 245
29 0x560601c1e87fp
Why can the single-machine program load the model and continue training, while the pserver approach cannot?
9 answers
x33g5p2x1#
One addition: seg_lr_Factors is an embedding.
holgip5t2#
Hello, we have received your question and will investigate the cause as soon as possible.
tkclm6bt3#
The pserver startup program is somewhat different from the single-machine one. Let me find you an example.
ddarikpa4#
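seg_lr_Factors@GRAD is a gradient, so why is it being loaded back? How did you save the model on the single machine? Could you post a screenshot of what was saved?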
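To test the idea that the gradient variable is what trips the loader, one option is to load with a filter that skips gradients. The sketch below is only an illustration under assumptions: the helper name is_loadable_parameter and the FakeVar stand-in are hypothetical, and whether skipping @GRAD variables is enough to resume pserver training in this setup has not been verified. The only Paddle call referenced is fluid.io.load_vars with its predicate argument, shown in a comment so the block runs without Paddle installed.

```python
from collections import namedtuple

GRAD_SUFFIX = "@GRAD"  # suffix seen in the error message (seg_lr_Factors@GRAD)


def is_loadable_parameter(var):
    """Return True for persistable variables that are not gradients.

    `var` only needs `.name` and `.persistable`, which fluid Variables expose.
    """
    return bool(getattr(var, "persistable", False)) and GRAD_SUFFIX not in var.name


# Hypothetical stand-in for fluid variables, used only for this demo.
FakeVar = namedtuple("FakeVar", ["name", "persistable"])

candidates = [
    FakeVar("seg_lr_Factors", True),       # the embedding parameter
    FakeVar("seg_lr_Factors@GRAD", True),  # its gradient, should be skipped
    FakeVar("tmp_0", False),               # non-persistable temporary
]
kept = [v.name for v in candidates if is_loadable_parameter(v)]
print(kept)

# With Paddle installed, the same predicate could be passed to load_vars
# (assumption: `prog` is the pserver program from the question):
# fluid.io.load_vars(exe, args.continue_model_path,
#                    main_program=prog, predicate=is_loadable_parameter)
```

If the error disappears with the filter in place, that would point at the gradient variable rather than at the saved files themselves.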
ymzxtsji5#
fluid.io.save_persistables(exe, model_dir, train_program)
That is the statement I used to save it.
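Instead of a screenshot, the saved directory can also be listed from Python. This is a rough sketch that assumes save_persistables was called without a filename argument, in which case each variable is written to its own file named after the variable. The helper split_saved_files and the demo file name fc_0.w_0 are made up for illustration.

```python
import os
import tempfile


def split_saved_files(model_dir, grad_marker="@GRAD"):
    """Split the file names in model_dir into (parameters, gradients)."""
    names = sorted(os.listdir(model_dir))
    grads = [n for n in names if grad_marker in n]
    params = [n for n in names if grad_marker not in n]
    return params, grads


# Demo on a throwaway directory; point model_dir at the real save path instead.
demo_dir = tempfile.mkdtemp()
for name in ("seg_lr_Factors", "fc_0.w_0"):  # fc_0.w_0 is a hypothetical name
    open(os.path.join(demo_dir, name), "w").close()

params, grads = split_saved_files(demo_dir)
print("parameters:", params)
print("gradients:", grads)
```

An empty gradients list for the real model directory would suggest that the @GRAD variable comes from the pserver program's variable list, not from the saved model.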
w8biq8rn6#
Is there any follow-up on this?
pjngdqdw7#
A few things need to be confirmed:
yxyvkwin8#
There are some problems with load here. They have been fixed in the latest develop branch. Here are two approaches:
xytpbqjk9#
I'll give it a try, thanks.