Paddle: training fails after several epochs with "an illegal memory access was encountered"

e5njpo68  posted on 2021-11-30 in Java

paddlecloud job-0bb5eedc1f205017

terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
  what():  

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::framework::details::OpHandleBase::~OpHandleBase()
3   paddle::framework::details::FetchOpHandle::~FetchOpHandle()
4   paddle::framework::ir::Node::~Node()
5   paddle::framework::ir::Node::~Node()
6   paddle::framework::details::ClearFetchOp(paddle::framework::ir::Graph*, std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> >*)
7   paddle::framework::details::FastThreadedSSAGraphExecutor::ExecutionFinal(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> >*)
8   paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
9   paddle::framework::details::ScopeBufferedMonitor::Apply(std::function<void ()()> const&, bool)
10  paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
11  paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&)
----------------------
Error Message Summary:
----------------------
Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.
  - New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
  - Recommended issue content: all error stack information: an illegal memory access was encountered at (/paddle/paddle/fluid/framework/details/op_handle_base.cc:39)

W0620 16:58:14.349057   878 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0620 16:58:14.349076   878 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0620 16:58:14.349079   878 init.cc:214] The detail failure signal is:

W0620 16:58:14.349082   878 init.cc:217]***Aborted at 1592643494 (unix time) try "date -d @1592643494" if you are using GNU date***
W0620 16:58:14.351125   878 init.cc:217] PC: @                0x0 (unknown)
W0620 16:58:14.351239   878 init.cc:217]***SIGABRT (@0x36e) received by PID 878 (TID 0x7fbe93e04700) from PID 878; stack trace:***
W0620 16:58:14.353410   878 init.cc:217]     @     0x7fbe933d7bb0 (unknown)
W0620 16:58:14.355901   878 init.cc:217]     @     0x7fbe92951f29 __GI_raise
W0620 16:58:14.357455   878 init.cc:217]     @     0x7fbe9295334a __GI_abort
W0620 16:58:14.358215   878 init.cc:217]     @     0x7fbddfb0ca8d __gnu_cxx::__verbose_terminate_handler()
W0620 16:58:14.358928   878 init.cc:217]     @     0x7fbddfb0abe6 (unknown)
W0620 16:58:14.359678   878 init.cc:217]     @     0x7fbddfb09b69 (unknown)
W0620 16:58:14.360306   878 init.cc:217]     @     0x7fbddfb0a5c1 __gxx_personality_v0
W0620 16:58:14.360981   878 init.cc:217]     @     0x7fbe1a5a8383 (unknown)
W0620 16:58:14.361657   878 init.cc:217]     @     0x7fbe1a5a8457 _Unwind_Resume
W0620 16:58:14.367496   878 init.cc:217]     @     0x7fbd9418ab9c paddle::framework::details::OpHandleBase::~OpHandleBase()
W0620 16:58:14.369899   878 init.cc:217]     @     0x7fbd94145011 paddle::framework::details::FetchOpHandle::~FetchOpHandle()
W0620 16:58:14.372714   878 init.cc:217]     @     0x7fbd91a72a89 paddle::framework::ir::Node::~Node()
W0620 16:58:14.377493   878 init.cc:217]     @     0x7fbd91a72c31 paddle::framework::ir::Node::~Node()
W0620 16:58:14.396554   878 init.cc:217]     @     0x7fbd94147956 paddle::framework::details::ClearFetchOp()
W0620 16:58:14.398361   878 init.cc:217]     @     0x7fbd9414394a paddle::framework::details::FastThreadedSSAGraphExecutor::ExecutionFinal()
W0620 16:58:14.403327   878 init.cc:217]     @     0x7fbd9414252a paddle::framework::details::FastThreadedSSAGraphExecutor::Run()
W0620 16:58:14.404551   878 init.cc:217]     @     0x7fbd9408afe7 _ZNSt17_Function_handlerIFvvEZN6paddle9framework7details29ScopeBufferedSSAGraphExecutor3RunERKSt6vectorISsSaISsEEEUlvE_E9_M_invokeERKSt9_Any_data
W0620 16:58:14.408218   878 init.cc:217]     @     0x7fbd9408ffdf paddle::framework::details::ScopeBufferedMonitor::Apply()
W0620 16:58:14.409902   878 init.cc:217]     @     0x7fbd9408bd86 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
W0620 16:58:14.412961   878 init.cc:217]     @     0x7fbd91b739a8 paddle::framework::ParallelExecutor::Run()
W0620 16:58:14.413442   878 init.cc:217]     @     0x7fbd91789a18 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL22pybind11_init_core_avxERNS_6moduleEEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEEE199_S9_INS6_9LoDTensorESaISF_EEIS8_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESY_
W0620 16:58:14.414839   878 init.cc:217]     @     0x7fbd917debb1 pybind11::cpp_function::dispatcher()
W0620 16:58:14.416537   878 init.cc:217]     @     0x7fbe936efce8 PyEval_EvalFrameEx
W0620 16:58:14.418002   878 init.cc:217]     @     0x7fbe936f237d PyEval_EvalCodeEx
W0620 16:58:14.419442   878 init.cc:217]     @     0x7fbe936efd70 PyEval_EvalFrameEx
W0620 16:58:14.420891   878 init.cc:217]     @     0x7fbe936f237d PyEval_EvalCodeEx
W0620 16:58:14.422331   878 init.cc:217]     @     0x7fbe936efd70 PyEval_EvalFrameEx
W0620 16:58:14.423780   878 init.cc:217]     @     0x7fbe936f237d PyEval_EvalCodeEx
W0620 16:58:14.425217   878 init.cc:217]     @     0x7fbe936efd70 PyEval_EvalFrameEx
W0620 16:58:14.426668   878 init.cc:217]     @     0x7fbe936f237d PyEval_EvalCodeEx
W0620 16:58:14.428104   878 init.cc:217]     @     0x7fbe936f24b2 PyEval_EvalCode
W0620 16:58:14.429548   878 init.cc:217]     @     0x7fbe9371c1c2 PyRun_FileExFlags
busg9geu 1#

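Can this problem be reproduced on a single machine? If so, could you share a minimal piece of code that reproduces it on a single machine?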

svdrlsy4 3#

After switching to python3 and enabling gradient clipping, training runs normally.
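
For anyone unfamiliar with the fix mentioned here: clipping by global norm rescales all gradients together whenever their combined L2 norm exceeds a threshold. The answer does not say which clipping strategy or threshold was actually used, so this is only a minimal pure-Python sketch of the idea. The function name `clip_by_global_norm` and the threshold of 1.0 are illustrative assumptions, not Paddle API. In Paddle 1.x the built-in counterpart should be `fluid.clip.GradientClipByGlobalNorm`, but check the documentation for your version.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm.

    grads: list of lists of floats, one inner list per parameter (hypothetical layout).
    Returns (clipped_grads, global_norm_before_clipping).
    """
    # Joint L2 norm over every element of every gradient.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in grads], global_norm
    # Otherwise shrink every gradient by the same factor.
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.0, 12.0]]  # made-up values; joint norm is 13.0
    clipped, norm_before = clip_by_global_norm(grads, clip_norm=1.0)
    print("norm before clipping:", norm_before)
    print("clipped gradients:", clipped)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its magnitude is bounded, which is why this is a common first remedy when a run diverges after some epochs.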

yyhrrdl8 4#

I have not tested the two changes separately, but I suspect the gradients are the more likely cause.
