Paddle: ERNIE trains fine on a single GPU, but multi-GPU training fails

nhaq1z21 posted on 2022-10-20 in Other

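OS: CentOS 6.3, 4 cores, 12 GB RAM
Python: 2.7
GPU: CUDA 8, cuDNN 5.1
Paddle: 1.4
Code source: https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE (essentially unmodified)
Error: occurs during multi-GPU training. I have confirmed that no other program is occupying the GPUs.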

ztigrdn8 #1

@Kevinlsy Judging from the error, this looks like an NCCL problem. Please confirm that your NCCL environment is set up correctly.
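
As a starting point for checking the NCCL environment, here is a minimal sketch that collects the settings that commonly affect NCCL initialization. The helper name `nccl_env_report` is hypothetical and not part of Paddle or NCCL. It only reads environment variables and looks for `libnccl.so*` files in the directories listed in `LD_LIBRARY_PATH`, so it will miss libraries installed in other locations.

```python
import glob
import os


def nccl_env_report(environ=None):
    """Collect settings that commonly affect NCCL initialization.

    Hypothetical helper for illustration; it is not a Paddle or NCCL API.
    """
    env = os.environ if environ is None else environ
    report = {
        # GPUs visible to the process (None means the variable is unset).
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        # Every NCCL_* variable that is currently set, e.g. NCCL_DEBUG.
        "nccl_vars": {k: v for k, v in env.items() if k.startswith("NCCL_")},
        # libnccl files found in the LD_LIBRARY_PATH directories.
        "libnccl_candidates": [],
    }
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if directory:
            pattern = os.path.join(directory, "libnccl.so*")
            report["libnccl_candidates"].extend(sorted(glob.glob(pattern)))
    return report


if __name__ == "__main__":
    for key, value in sorted(nccl_env_report().items()):
        print("%s: %s" % (key, value))
```

If more than one NCCL version shows up in `libnccl_candidates`, the process may be loading a different library than the one you expect. Setting `NCCL_DEBUG=INFO` before launching training makes NCCL print its initialization log, which may help narrow this down.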

aelbi1ox #2

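@kuke I switched to NCCL 2.3.5 and the problem persists. Two GPUs now work, but 3 or 4 GPUs still produce the same error. With the same NCCL version on another machine with an identical GPU configuration, even two GPUs fail. (Very strange.)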
PC: @ 0x0 (unknown)

SIGSEGV (@0x50) received by PID 12833 (TID 0x7f56686c9700) from PID 80; stack trace:

@ 0x7f56682a0160 (unknown)
@ 0x7f54bff14ea7 freeRing()
@ 0x7f54bff09c20 commFree()
@ 0x7f54bff10cc5 ncclCommInitAll
@ 0x7f561a1cd190 paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f561a1c8ee0 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f561a0b1638 ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS9_8CPUPlaceENS9_15CUDAPinnedPlaceENS6_6detail7variant5void_ESF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_EESaISG_EERKS5_ISsSaISsEERKSsPNS8_9framework5ScopeERS5_IST_SaIST_EERKNSR_7details17ExecutionStrategyERKNSX_13BuildStrategyEPNSR_2ir5GraphEEE7executeINS_6class_INSR_16ParallelExecutorEJEEEJELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderESK_SO_SQ_ST_SW_S10_S13_S16_E_vJS1J_SK_SO_SQ_ST_SW_S10_S13_S16_EJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES21
@ 0x7f561a04d72e pybind11::cpp_function::dispatcher()
@ 0x422f1a PyObject_Call
@ 0x4273fd instancemethod_call
@ 0x422f1a PyObject_Call
@ 0x48164f slot_tp_init
@ 0x47ed7a type_call
@ 0x422f1a PyObject_Call
@ 0x4b216a PyEval_EvalFrameEx
@ 0x4ba048 PyEval_EvalCodeEx
@ 0x4b6c57 PyEval_EvalFrameEx
@ 0x4ba048 PyEval_EvalCodeEx
@ 0x4b6c57 PyEval_EvalFrameEx
@ 0x4ba048 PyEval_EvalCodeEx
@ 0x52e26f function_call
@ 0x422f1a PyObject_Call
@ 0x4273fd instancemethod_call
@ 0x422f1a PyObject_Call
@ 0x48164f slot_tp_init
@ 0x47ed7a type_call
@ 0x422f1a PyObject_Call
@ 0x4b216a PyEval_EvalFrameEx
@ 0x4ba048 PyEval_EvalCodeEx
@ 0x4b6c57 PyEval_EvalFrameEx
@ 0x4ba048 PyEval_EvalCodeEx
@ 0x4ba172 PyEval_EvalCode
script/run_lcqmc.sh: line 56: 12833 Segmentation fault
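
Because the crash depends on how many GPUs are used, one way to narrow it down is to try every subset of GPUs and see whether a particular device is always involved in the failing runs. The sketch below only generates candidate `CUDA_VISIBLE_DEVICES` values. The helper name `gpu_subsets` is hypothetical, and the device ids 0 to 3 are an assumption about a 4-GPU machine.

```python
import itertools


def gpu_subsets(gpu_ids, size):
    """Return CUDA_VISIBLE_DEVICES strings for every subset of `size` GPUs.

    Hypothetical helper for illustration; it is not a Paddle or NCCL API.
    """
    return [",".join(str(g) for g in combo)
            for combo in itertools.combinations(gpu_ids, size)]


if __name__ == "__main__":
    # Assumption: a 4-GPU machine with device ids 0..3.
    for size in (2, 3, 4):
        for devices in gpu_subsets(range(4), size):
            print("CUDA_VISIBLE_DEVICES=%s" % devices)
```

Each printed value can be exported before launching `script/run_lcqmc.sh`. If the script sets `CUDA_VISIBLE_DEVICES` itself, that line would need to be edited instead, because it would override the exported value.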
