Paddle version: 0.14
Code repository: https://github.com/xuezhong/transformer-nist
Job configuration:
1 pserver, 2 trainers, 8 cores
Launch script:
FLAGS_rpc_deadline=1800000 python -u thirdparty/model/transformer_cloud/train.py --src_vocab_fpath ./thirdparty/nist06n/cn_30001.dict --trg_vocab_fpath ./thirdparty/nist06n/en_30001.dict --train_file_pattern './train/part-*' --val_file_pattern './test/part-*' --batch_size 2048 --use_token_batch True --special_token '_GO' '_EOS' '_UNK' --pass_num=1000 --iterations=1000 --device CPU
Trainer log:
Total examples: 28796480, total time: 253549.41292, 113.57345 examples/sed
epoch: 0, val avg loss: 8.104059, val ppl: 3307.866803, consumed 323412.915918s
F0717 09:26:40.817395 4816 grpc_client.cc:301] Get name:[fc_34.b_0], ep:[10.90.245.27:30006] meets grpc error:Deadline Exceeded
***Check failure stack trace:***
@ 0x7f25d8a6dacd google::LogMessage::Fail()
@ 0x7f25d8a7157c google::LogMessage::SendToLog()
@ 0x7f25d8a6d5f3 google::LogMessage::Flush()
@ 0x7f25d8a72a8e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f25d9831ac7 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f26238cd470 (unknown)
@ 0x7f262e645851 start_thread
@ 0x7f262dd0890d clone
@ (nil) (unknown)
[/root/paddlejob/paddle_k8s : 180] [start_trainer]
[FATAL]: execute user cmd failed
*********************Shell Script Stack Trace********************
@: [/root/paddlejob/paddle_k8s: 39] check_return
@: [/root/paddlejob/paddle_k8s: 180] start_trainer
@: [/root/paddlejob/paddle_k8s: 203] main
1 answer
The model training code does not call complete.
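A minimal sketch of the pattern this answer points at, assuming the trainer's executor exposes a complete() method that should run once training ends. StubExecutor and train are hypothetical names used only for illustration; they are not the Paddle API, and the real train.py may be organized differently. The point is only that the call sits in a finally block, so it runs even if the loop raises.

```python
class StubExecutor:
    """Hypothetical stand-in for the trainer's executor (not the Paddle API)."""

    def __init__(self):
        self.steps = 0
        self.completed = False

    def run(self):
        # Placeholder for one training iteration.
        self.steps += 1

    def complete(self):
        # Assumption: in the real job, this is where the trainer signals
        # that it has finished.
        self.completed = True


def train(exe, pass_num, iterations):
    """Run the training loop and always call complete() afterwards."""
    try:
        for _ in range(pass_num):
            for _ in range(iterations):
                exe.run()
    finally:
        # Runs on normal exit and on exceptions alike.
        exe.complete()
    return exe


if __name__ == "__main__":
    exe = train(StubExecutor(), pass_num=2, iterations=3)
    print(exe.steps, exe.completed)
```

In the actual script, the equivalent change would be to call complete on the real executor after the last pass finishes.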