Paddle: after reloading a model with load_persistables and resuming training, MomentumOp reports inconsistent dimensions

hxzsmxv2, posted 2021-11-29 in Java
  • Version and environment information:

   1) PaddlePaddle version: 1.2
   4) System environment: Python 2.7

  • Training information

   1) Single machine / single GPU
 
During training I saved the model with save_persistables; then the program was stopped. The second time I ran the program I loaded the model with load_persistables, and training then failed with an error saying the dimensions in the optimizer's MomentumOp are inconsistent.

EnforceNotMet Traceback (most recent call last)
<ipython-input-1-0317d6de6533> in <module>()
    649         total_batch_count += 1
    650         t1 = time.time()
--> 651         loss = exe.run(train_program, feed=feeder.feed(data), fetch_list=train_fetch_list)
    652         period = time.time() - t1
    653         loss = np.mean(np.array(loss))
/opt/conda/envs/py27-paddle1.2.0/lib/python2.7/site-packages/paddle/fluid/executor.pyc in run(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
    470 
    471         self._feed_data(program, feed, feed_var_name, scope)
--> 472         self.executor.run(program.desc, scope, 0, True, True)
    473         outs = self._fetch_data(fetch_list, fetch_var_name, scope)
    474         if return_numpy:
EnforceNotMet: Enforce failed. Expected param_dim == ctx->GetInputDim("Velocity"), but received param_dim:32 != ctx->GetInputDim("Velocity"):21, 1.
Param and Velocity of MomentumOp should have the same dimension. at [/paddle/paddle/fluid/operators/optimizers/momentum_op.h:64]
PaddlePaddle Call Stacks: 
0       0x7f70fa297826p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1       0x7f70fb30ade9p paddle::operators::MomentumOp::InferShape(paddle::framework::InferShapeContext*) const + 1945
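
The check that fails here is Momentum's requirement that each parameter and its Velocity accumulator have identical dimensions; the Velocity that was loaded back (21) no longer matches its parameter (32). A hypothetical way to see which persistables are affected is to print the shapes of the velocity variables in the training Program (called train_program later in this thread); the 'velocity' substring in accumulator names is an assumption about Paddle 1.x naming:

# hypothetical debugging snippet: list the Momentum accumulators and their shapes
for var in train_program.list_vars():
    if var.persistable and 'velocity' in var.name.lower():
        print('%s %s' % (var.name, var.shape))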

yqlxgs2m 1#

The code for setting up the optimizer, saving the model, and loading the model is shown below:

import os
import paddle.fluid as fluid

# train_parameters and logger are module-level globals defined elsewhere in the script
def optimizer_setting():
    # Momentum optimizer with an exponentially decaying learning rate
    learning_strategy = train_parameters['learning_strategy']
    learning_rate = fluid.layers.exponential_decay(learning_rate=learning_strategy['learning_rate'],
                                                   decay_steps=learning_strategy['decay_steps'],
                                                   decay_rate=learning_strategy['decay_rate'])
    optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=learning_rate, momentum=0.1)
    return optimizer

def save_model(base_dir, base_name, feed_var_list, target_var_list, program, exe):
    # export the pruned inference model (graph plus parameters only)
    fluid.io.save_inference_model(dirname=base_dir,
                                  params_filename=base_name + '-params',
                                  model_filename=base_name + '-model',
                                  feeded_var_names=feed_var_list,
                                  target_vars=target_var_list,
                                  main_program=program,
                                  executor=exe)
    # save all persistables (parameters plus optimizer state) for resuming training
    fluid.io.save_persistables(dirname=base_dir,
                               filename=base_name + '-retrain',
                               main_program=program,
                               executor=exe)

def load_pretrained_params(exe, program):
    retrain_param_file = os.path.join(train_parameters['save_model_dir'],
                                      train_parameters['model_prefix'] + '-retrain')
    if os.path.exists(retrain_param_file):
        # resume from the checkpoint written by save_persistables
        logger.info('load param from retrain model')
        print('load param from retrain model')
        fluid.io.load_persistables(executor=exe,
                                   dirname=train_parameters['save_model_dir'],
                                   main_program=program,
                                   filename=train_parameters['model_prefix'] + '-retrain')
    elif train_parameters['pretrained']:
        # otherwise fall back to a pretrained model, loading only variables whose files exist
        logger.info('load param from pretrained model')
        print('load param from pretrained model')

        def if_exist(var):
            return os.path.exists(os.path.join(train_parameters['pretrained_model_dir'], var.name))
        fluid.io.load_vars(exe, train_parameters['pretrained_model_dir'], main_program=program,
                           predicate=if_exist)

y1aodyip 2#

Please paste the code where you call load_persistables.
Also, you wrote that the program stopped after saving with save_persistables: did it stop because of an error, or did you stop it normally?


tzxcd3kk 3#

Hi, is the model you are loading a pruned one?


yshpjwxd 4#

train_program = fluid.Program()
start_program = fluid.Program()
eval_program = fluid.Program()
# build the training and evaluation programs
feeder, loss, locs, confs, box, box_var = build_program(train_program, start_program, True)
cur_map, accum_map = build_program(eval_program, start_program, False)
eval_program = eval_program.clone(for_test=True)

logger.info("build executor and init params")
exe = fluid.Executor(place)
fluid.ParallelExecutor()
exe.run(start_program)
train_fetch_list = [loss.name]
eval_fetch_list = [cur_map.name, accum_map.name]
# restore parameters (and optimizer state) before entering the training loop
load_pretrained_params(exe, train_program)

successive_count = 0
stop_train = False
total_batch_count = 0
for pass_id in range(train_parameters["num_epochs"]):
    logger.info("current pass: %d, start read image", pass_id)
    batch_id = 0
    for step_id, data in enumerate(batch_reader()):

I stopped the program myself, because I felt some of the hyperparameters were set poorly, but I still want to resume training from the parameters of that interrupted run.


r1wp621o 5#

It should be the complete model.


dsf9zpds 6#

Did you modify the network structure after calling save_persistables?


jutyujz0 8#

Was the MomentumOptimizer re-created?


yr9zkbsy 9#

No. Are you saying it needs to be?
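
For context, "re-creating" the optimizer here presumably means running optimizer_setting() and minimize() again on the rebuilt training program in the resume run, so the Momentum Velocity persistables exist with the right shapes before load_persistables fills them. A minimal sketch, assuming build_program does not already apply the optimizer internally (if it does, nothing extra is needed); build_program, optimizer_setting and load_pretrained_params are the helpers from the code earlier in this thread:

import paddle.fluid as fluid

train_program = fluid.Program()
start_program = fluid.Program()

feeder, loss, locs, confs, box, box_var = build_program(train_program, start_program, True)

with fluid.program_guard(train_program, start_program):
    # re-create the MomentumOptimizer in the resume run as well, so the Velocity
    # persistable variables exist in train_program (with shapes matching the
    # parameters) before load_persistables tries to restore them
    optimizer = optimizer_setting()
    optimizer.minimize(loss)

place = fluid.CPUPlace()                    # assumption: CPU; use fluid.CUDAPlace(0) for GPU
exe = fluid.Executor(place)
exe.run(start_program)                      # initialize every persistable first
load_pretrained_params(exe, train_program)  # then overwrite them from the checkpoint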


uqzxnwby 10#

In the end I found the cause: I was calling fluid.io.save_inference_model first and then fluid.io.save_persistables. When training later loads the model stored by save_persistables via fluid.io.load_persistables, this ordering causes the problem.


ldxq2e6h 11#

Yes. I swapped the two calls, calling save_persistables first and then save_inference_model, and now I can load the model and resume training.
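
Based on that workaround, a reordered save_model would look like the sketch below: write the full persistables checkpoint first, then export the pruned inference model. That save_inference_model may alter the program it is given (which would explain the mismatched Velocity shapes) is an assumption drawn from the symptom reported here, not from documented behavior.

import paddle.fluid as fluid

def save_model(base_dir, base_name, feed_var_list, target_var_list, program, exe):
    # 1) save every persistable (parameters plus optimizer state such as the
    #    Momentum Velocity variables) from the untouched training program
    fluid.io.save_persistables(dirname=base_dir,
                               filename=base_name + '-retrain',
                               main_program=program,
                               executor=exe)
    # 2) only then export the pruned inference model; if this call modifies the
    #    program, the checkpoint above is no longer affected (assumption based
    #    on the behavior reported in this thread)
    fluid.io.save_inference_model(dirname=base_dir,
                                  params_filename=base_name + '-params',
                                  model_filename=base_name + '-model',
                                  feeded_var_names=feed_var_list,
                                  target_vars=target_var_list,
                                  main_program=program,
                                  executor=exe)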
