Paddle test_sync_batch_norm_op random failure (Segmentation fault)

9udxz4iz  posted on 2023-02-04  in  Other

Describe the Bug

test_sync_batch_norm_op intermittently crashes with a segmentation fault on 2 x P100. Error message:

Start 1308: test_sync_batch_norm_op
1/1 Test #1308: test_sync_batch_norm_op ..........***Failed   51.02 sec
W1230 08:00:16.132694  9679] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 12.0, Runtime API Version: 11.7
W1230 08:00:16.132745  9679] device: 0, cuDNN Version: 8.4.
I1230 08:00:19.037106  9679] New Executor is Running.
W1230 08:00:19.142438  9679] Cannot enable P2P access from 0 to 1
W1230 08:00:19.142482  9679] Cannot enable P2P access from 1 to 0
I1230 08:00:21.305244  9679] set enable_sequential_execution:1
I1230 08:00:21.307850  9679] ParallelExecutor is Running (RunAndMerge).
I1230 08:00:23.322465  9679] set enable_sequential_execution:1
I1230 08:00:23.563740  9679] Standalone Executor is Used.
I1230 08:00:23.772176  9679] set enable_sequential_execution:1
W1230 08:00:23.774336  9679] Find all_reduce operators: 3. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 3.
I1230 08:00:24.357183  9679] set enable_sequential_execution:1
W1230 08:00:24.359194  9679] Find all_reduce operators: 3. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 3.
I1230 08:00:24.924245  9679] set enable_sequential_execution:1
I1230 08:00:25.247583  9679] set enable_sequential_execution:1
I1230 08:00:25.542920  9679] set enable_sequential_execution:1
W1230 08:00:25.546025  9679] Find all_reduce operators: 3. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 3.
I1230 08:00:25.948423  9679] set enable_sequential_execution:1
W1230 08:00:25.950547  9679] Find all_reduce operators: 3. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 3.
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/nn/layer/ UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance."
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/fluid/ UserWarning: The data type of 'input' in conv2d only support float16 in GPU now.
  % (input_name, op_name, extra_message)
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/fluid/ UserWarning: The data type of 'Out' in guassian_random only support float16 in GPU now.
  % (input_name, op_name, extra_message)
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/fluid/ UserWarning: The data type of 'input' in batch_norm only support float16 in GPU now.
  % (input_name, op_name, extra_message)
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/fluid/ UserWarning: The data type of 'x' in cast only support float16 in GPU now.
  % (input_name, op_name, extra_message)
/home/scratch.tizheng_sw/tmp/fix_ut/paddle-develop/build/python/paddle/fluid/ UserWarning: Standalone executor is not used for data parallel

C++ Traceback (most recent call last):
0   paddle::framework::ScopePool::Clear()
1   paddle::framework::ScopePool::DeleteScope(paddle::framework::Scope*)
2   paddle::framework::Scope::~Scope()
3   std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
4   paddle::framework::Variable::PlaceholderImpl<phi::DenseTensor>::~PlaceholderImpl()
5   std::_Sp_counted_deleter<phi::Allocation*, std::function<void (phi::Allocation*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
6   paddle::memory::allocation::CUDAAllocator::FreeImpl(phi::Allocation*)
7   paddle::platform::RecordedGpuMallocHelper::Free(void*, unsigned long)
8   std::_Rb_tree<void*, void*, std::_Identity<void*>, std::less<void*>, std::allocator<void*> >::erase(void* const&)

Error Message Summary:
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1672387226 (unix time) try "date -d @1672387226" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x18) received by PID 9679 (TID 0x7f0eeead4740) from PID 24 ***]

Segmentation fault

0% tests passed, 1 tests failed out of 1

Label Time Summary:
RUN_TYPE=DIST    =  51.02 sec*proc (1 test)

How to reproduce

  • Docker image: paddlepaddle/paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4
  • Hardware: Tesla P100-PCIE-16GB x 2
  • Build options:
cmake -B${BUILD_DIR} -S${BASE_DIR} \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_FLAGS="-t0" \
    -DCUDA_ARCH_NAME=Manual \
    -DCUDA_ARCH_BIN="60 80" \

Additional Supplementary Information

The failure rate is low (approximately once in every 100 runs).
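Because the crash shows up only about once per 100 runs, a single passing run says very little. The sketch below is one possible way to hunt for it, not part of the original report: `runs_needed` estimates how many runs are required to see at least one failure, assuming independent runs with a constant failure rate, and `repeat_until_fail` reruns a command until it exits with a non-zero status. Both function names are hypothetical helpers introduced here for illustration.

```python
import math
import subprocess


def runs_needed(failure_rate, confidence=0.95):
    """Runs required to observe at least one failure with the given confidence.

    Assumption: runs are independent and fail with a fixed probability,
    so P(no failure in n runs) = (1 - failure_rate) ** n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def repeat_until_fail(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns (run_index, combined_output) for the first run whose exit status
    is non-zero (a process killed by a signal such as SIGSEGV reports a
    negative return code on POSIX), or None if every run passed.
    """
    for i in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return i, result.stdout + result.stderr
    return None
```

With a failure rate of roughly 1%, `runs_needed(0.01)` gives 299 runs for 95% confidence. Assuming the working directory is the CMake build directory, a call such as `repeat_until_fail(["ctest", "-R", "test_sync_batch_norm_op", "--output-on-failure"], runs_needed(0.01))` would rerun the test until the first crash and return its log.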



Thanks, I will ask the owner of this unit test for help.
