Paddle machine translation on multi-node CPU: job terminates abnormally after 1 pass

0qx6xfy6 posted on 2021-11-30 in Java

paddle version 0.14
Code repository: https://github.com/xuezhong/transformer-nist
Job configuration:
1 pserver, 2 trainers, 8 cores
Launch script:

FLAGS_rpc_deadline=1800000 python -u thirdparty/model/transformer_cloud/train.py --src_vocab_fpath ./thirdparty/nist06n/cn_30001.dict --trg_vocab_fpath ./thirdparty/nist06n/en_30001.dict --train_file_pattern './train/part-*' --val_file_pattern './test/part-*' --batch_size 2048 --use_token_batch True  --special_token '_GO' '_EOS' '_UNK' --pass_num=1000 --iterations=1000 --device CPU

Trainer log:

Total examples: 28796480, total time: 253549.41292, 113.57345 examples/sed

epoch: 0, val avg loss: 8.104059, val ppl: 3307.866803, consumed 323412.915918s
F0717 09:26:40.817395  4816 grpc_client.cc:301] Get name:[fc_34.b_0], ep:[10.90.245.27:30006] meets grpc error:Deadline Exceeded

***Check failure stack trace:***

    @     0x7f25d8a6dacd  google::LogMessage::Fail()
    @     0x7f25d8a7157c  google::LogMessage::SendToLog()
    @     0x7f25d8a6d5f3  google::LogMessage::Flush()
    @     0x7f25d8a72a8e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f25d9831ac7  paddle::operators::distributed::GRPCClient::Proceed()
    @     0x7f26238cd470  (unknown)
    @     0x7f262e645851  start_thread
    @     0x7f262dd0890d  clone
    @              (nil)  (unknown)
[/root/paddlejob/paddle_k8s : 180] [start_trainer]
[FATAL]: execute user cmd failed

*********************Shell Script Stack Trace********************

 @: [/root/paddlejob/paddle_k8s: 39] check_return
 @: [/root/paddlejob/paddle_k8s: 180] start_trainer
 @: [/root/paddlejob/paddle_k8s: 203] main
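
For reference, the numbers in the launch command and the log can be compared directly. The sketch below is only a rough check, under two assumptions that the post does not confirm: that FLAGS_rpc_deadline is given in milliseconds, and that the "consumed" value on the "epoch: 0" line is measured from the same start as the "total time" on the "Total examples" line. The helper names are hypothetical and used for illustration only.

```python
# Values copied from the launch command and the trainer log above.
RPC_DEADLINE_MS = 1800000          # FLAGS_rpc_deadline (assumed to be milliseconds)
TRAIN_TIME_S = 253549.41292        # "total time" on the "Total examples" line
EPOCH_CONSUMED_S = 323412.915918   # "consumed" on the "epoch: 0" line


def deadline_seconds(deadline_ms):
    """Convert the RPC deadline from milliseconds to seconds."""
    return deadline_ms / 1000.0


def gap_after_training(train_s, consumed_s):
    """Time between the end of training and the end of the epoch report.

    Assumption: both values are measured from the same starting point.
    """
    return consumed_s - train_s


deadline_s = deadline_seconds(RPC_DEADLINE_MS)
gap_s = gap_after_training(TRAIN_TIME_S, EPOCH_CONSUMED_S)

print("RPC deadline: %.0f s" % deadline_s)
print("Gap between training end and epoch report: %.1f s" % gap_s)
print("Gap exceeds deadline: %s" % (gap_s > deadline_s))
```

If those assumptions hold, the gap between the two log lines is much longer than the configured deadline, which would be consistent with the "Deadline Exceeded" error on the next Get request to the pserver. This is one possible reading of the log, not a confirmed diagnosis.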
