To help your issue get resolved quickly, please search for similar issues before opening a new one: none found.
- Version and environment info:
Paddle version: 1.8.3
Paddle With CUDA: True
OS: debian stretch/sid
Python version: 3.7.7
CUDA version: 10.1.243
cuDNN version: None.None.None # Note: libcudnn.so.7.6.5 is installed on the system and its directory has been added to $LD_LIBRARY_PATH
Nvidia driver version: 418.74
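Since Paddle reports the cuDNN version as None.None.None, here is a minimal sketch for checking which cuDNN the dynamic loader actually resolves; the soname libcudnn.so.7 is an assumption based on the 7.6.5 noted above:

```python
# Minimal sketch: ask the libcudnn resolved by the dynamic loader for its version.
# Assumes the soname libcudnn.so.7 and that its directory is on $LD_LIBRARY_PATH.
import ctypes

libcudnn = ctypes.CDLL("libcudnn.so.7")          # raises OSError if not resolvable
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
v = libcudnn.cudnnGetVersion()                   # e.g. 7605 for cuDNN 7.6.5
print("cuDNN %d.%d.%d" % (v // 1000, (v % 1000) // 100, v % 100))
```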
- Problem description:
Running inference with tools/eval.py from PaddleDetection, on a single machine with one or multiple GPUs (Tesla V100 16G or TITAN RTX).
The model is Cascade RCNN (R101vd or R200vd backbone) with multi-scale test.
The following error occurs with roughly 30% probability (the same script does not reproduce it reliably, and I don't know what it depends on):
2020-08-05 12:50:51,695-INFO: start loading proposals
2020-08-05 12:50:52,457-INFO: loading roidb 2012_test
100%|████████████████████████████████████████| 970/970 [00:01<00:00, 601.75it/s]
2020-08-05 12:50:54,377-INFO: finish loading roidb from scope 2012_test
2020-08-05 12:50:54,378-INFO: finish loading roidbs, total num = 970
2020-08-05 12:50:54,379-INFO: set max batches to 0
2020-08-05 12:50:54,380-INFO: places would be ommited when DataLoader is not iterable
W0805 12:50:54.530522 4141844 device_context.cc:252] Please NOTE: device: 5, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0805 12:50:55.613425 4141844 device_context.cc:260] device: 5, cuDNN Version: 7.6.
W0805 12:51:24.223932 4141881 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0805 12:51:24.223980 4141881 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0805 12:51:24.223989 4141881 init.cc:221] The detail failure signal is:
W0805 12:51:24.224001 4141881 init.cc:224] *** Aborted at 1596603084 (unix time) try "date -d @1596603084" if you are using GNU date ***
W0805 12:51:24.228863 4141881 init.cc:224] PC: @ 0x0 (unknown)
W0805 12:51:24.346484 4141881 init.cc:224] *** SIGSEGV (@0x8) received by PID 4141844 (TID 0x7f012db3d700) from PID 8; stack trace: ***
W0805 12:51:24.351244 4141881 init.cc:224] @ 0x7f01e3671390 (unknown)
W0805 12:51:24.353901 4141881 init.cc:224] @ 0x7f012eda2747 (unknown)
W0805 12:51:24.356168 4141881 init.cc:224] @ 0x7f012ec98d4c (unknown)
W0805 12:51:24.358356 4141881 init.cc:224] @ 0x7f012e41b5fc (unknown)
W0805 12:51:24.360416 4141881 init.cc:224] @ 0x7f012e42b938 (unknown)
W0805 12:51:24.362363 4141881 init.cc:224] @ 0x7f012e41859a cudnnGetConvolutionForwardAlgorithm_v7
W0805 12:51:24.447378 4141881 init.cc:224] @ 0x7f019853ff45 paddle::operators::SearchAlgorithm<>::Find<>()
W0805 12:51:24.469980 4141881 init.cc:224] @ 0x7f01985e1889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0805 12:51:24.481895 4141881 init.cc:224] @ 0x7f01985e2b33 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0805 12:51:24.504448 4141881 init.cc:224] @ 0x7f019a561ac0 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.565385 4141881 init.cc:224] @ 0x7f019a5622b1 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.604465 4141881 init.cc:224] @ 0x7f019a55b261 paddle::framework::OperatorBase::Run()
W0805 12:51:24.635419 4141881 init.cc:224] @ 0x7f019a268f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0805 12:51:24.657658 4141881 init.cc:224] @ 0x7f019a210551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0805 12:51:24.673673 4141881 init.cc:224] @ 0x7f019a20e04f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0805 12:51:24.687579 4141881 init.cc:224] @ 0x7f019a20e314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0805 12:51:24.724630 4141881 init.cc:224] @ 0x7f0197001fb3 std::_Function_handler<>::_M_invoke()
W0805 12:51:24.769093 4141881 init.cc:224] @ 0x7f0196dfd647 std::__future_base::_State_base::_M_do_set()
W0805 12:51:24.773929 4141881 init.cc:224] @ 0x7f01e366ea99 __pthread_once_slow
W0805 12:51:24.780242 4141881 init.cc:224] @ 0x7f019a20a4e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0805 12:51:24.817785 4141881 init.cc:224] @ 0x7f0196dffaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0805 12:51:24.850741 4141881 init.cc:224] @ 0x7f01d4120421 execute_native_thread_routine_compat
W0805 12:51:24.857818 4141881 init.cc:224] @ 0x7f01e36676ba start_thread
W0805 12:51:24.862519 4141881 init.cc:224] @ 0x7f01e339d41d clone
W0805 12:51:24.870891 4141881 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)
21 answers

#1
> I'd like to confirm: does this problem also occur randomly when you predict on a single card with a single batch?

I just ran it 10+ times on a single TITAN RTX; only the first run hit the cudnnGetConvolutionForwardAlgorithm_v7 error.

> Could you set up the environment with Docker?

Sure, I'll just try docker pull paddlepaddle/paddle:1.8.3-gpu-cuda10.0-cudnn7.
#2
> @flishwang Try export FLAGS_selected_gpus=0,1

After switching to the NCCL 2.6.4 that ships inside the docker image, run_check passes and multi-GPU inference no longer errors out. This issue is resolved; it was most likely an NCCL version problem.
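For anyone verifying this, a minimal sketch for confirming which NCCL version a process actually loads; the soname libnccl.so.2 is an assumption:

```python
# Minimal sketch: query the NCCL library the process loads for its version.
# Assumes the soname libnccl.so.2 is resolvable by the dynamic loader.
import ctypes

nccl = ctypes.CDLL("libnccl.so.2")
version = ctypes.c_int()
ret = nccl.ncclGetVersion(ctypes.byref(version))  # ncclResult_t; 0 means ncclSuccess
assert ret == 0
v = version.value                                 # e.g. 2604 for NCCL 2.6.4
print("NCCL %d.%d.%d" % (v // 1000, (v % 1000) // 100, v % 100))
```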
#3
@flishwang Try export FLAGS_selected_gpus=0,1

#4
@flishwang Hi, could you use export FLAGS_selected_gpus to avoid the faulty GPU, and then run run_check to see what happens?
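A minimal sketch of that check (the GPU ids below are placeholders; pick ids that skip the faulty card, and set the env vars before paddle is imported):

```python
# Minimal sketch: restrict the visible GPUs, then run Paddle's install check.
# The GPU ids below are placeholders -- choose ids that avoid the faulty card.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # hide the bad GPU from CUDA entirely
os.environ["FLAGS_selected_gpus"] = "0,1"   # ids are relative to the visible set

import paddle.fluid as fluid
fluid.install_check.run_check()             # prints the result, raises on failure
```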
#5
This may be related to NCCL; see https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/install/install_Ubuntu.html#cpu-gpu for the version requirements. Also, if you can use Docker, we recommend the images listed here: https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/install/install_Docker.html#id3
#6
> Since we don't have an environment that can reproduce this on our side, it's rather tricky. Judging from the error message, the crash happens while calling cuDNN's convolution, so it may be related to cuDNN.

OK.
In addition, we have another machine on which one of the GPUs does not work properly.
Both model.with_data_parallel and fluid.install_check.run_check raise an NCCL error, regardless of whether the places passed to with_data_parallel include the broken card.
But with the remaining healthy GPUs, that machine trains and tests normally under MXNet.
Can this be located or fixed? (A minimal repro sketch follows.)
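For reference, a minimal sketch of the two failing calls (the model definition is omitted, and the places are hypothetical, deliberately excluding the broken card):

```python
# Minimal repro sketch of the two calls that raise the NCCL error.
import paddle.fluid as fluid

# 1) the install check alone already fails with an NCCL error
fluid.install_check.run_check()

# 2) compiling for data parallelism also fails, even when `places`
#    lists only the healthy GPUs
# ... build the model into the default main program here (omitted) ...
places = [fluid.CUDAPlace(0), fluid.CUDAPlace(1)]  # healthy cards only
exe = fluid.Executor(places[0])
exe.run(fluid.default_startup_program())
compiled = fluid.CompiledProgram(
    fluid.default_main_program()).with_data_parallel(places=places)
```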
The error log and nvidia-smi output are as follows:
#7
Since we don't have an environment that can reproduce this on our side, it's rather tricky. Judging from the error message, the crash happens while calling cuDNN's convolution, so it may be related to cuDNN.
#8
> This may be affected by the batch size or the amount of computation; the cascade rcnn + R200vd + multi-scale test combination can run into this. Try setting the batch size to 1 first and testing with a single scale.

We currently mainly use cascade + R101vd, multi-scale test, batch size = 1,
with input image scales of 3200, 2048, 1024 and 576,
so the amount of computation could indeed be a factor.
Our actual use case requires multi-scale testing of the images.
Cutting the 3200-scale images into tiles, detecting on each tile, and stitching the results back together (similar to the DOTA data pipeline) is probably not practical.
What causes the random errors? Is it a problem in the Paddle framework, or in cuDNN/CUDA?
Can this be fixed?
#9
This may be affected by the batch size or the amount of computation; the cascade rcnn + R200vd + multi-scale test combination can run into this. Try setting the batch size to 1 first and testing with a single scale.
#10
> OK, please confirm again in the Docker environment; that will make it easier for us to locate the problem. We have never reproduced this error locally before.

Based on the 1.8.3-gpu-cuda10.0-cudnn7 docker image, after installing Ubuntu packages such as openssh-server and git and running it on k8s, it still crashes randomly. Below are the logs from three crashes:
W0826 06:06:09.501305 4216 device_context.cc:260] device: 0, cuDNN Version: 7.6.
/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "tools/eval.py", line 243, in
main()
File "tools/eval.py", line 180, in main
sub_eval_prog, sub_keys, sub_values, resolution)
File "/home/bwang/projects/sniper-paddle/ppdet/utils/eval_utils.py", line 134, in eval_run
return_merged=False)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel<float>, paddle::operators::CUDNNConvOpKernel<double>, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
10 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
11 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
12 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const
Python Call Stacks (More useful to users):
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 2938, in conv2d
"data_format": data_format,
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 187, in _conv_norm
name=_name + '.conv2d.output.1')
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 462, in c1_stage
name=_name)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 490, incall*
res = self.c1_stage(res)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 210, in build_multi_scale
body_feats = self.backbone(im)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 363, in eval
return self.build_multi_scale(feed_vars)
File "tools/eval.py", line 105, in main
fetches = model.eval(feed_vars, multi_scale_test)
File "tools/eval.py", line 243, in
main()
Error Message Summary:
ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300)
[operator < conv2d > error]
2020-08-26 06:10:45,407-INFO: set max batches to 0
2020-08-26 06:10:45,408-INFO: places would be ommited when DataLoader is not iterable
W0826 06:10:45.596765 4864 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0826 06:10:45.601509 4864 device_context.cc:260] device: 0, cuDNN Version: 7.6.
W0826 06:11:05.817611 5102 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0826 06:11:05.817775 5102 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0826 06:11:05.817786 5102 init.cc:221] The detail failure signal is:
W0826 06:11:05.817793 5102 init.cc:224] *** Aborted at 1598422265 (unix time) try "date -d @1598422265" if you are using GNU date ***
W0826 06:11:05.822052 5102 init.cc:224] PC: @ 0x0 (unknown)
W0826 06:11:05.822432 5102 init.cc:224] *** SIGSEGV (@0x0) received by PID 4864 (TID 0x7f9ad84ca700) from PID 0; stack trace: ***
W0826 06:11:05.826328 5102 init.cc:224] @ 0x7f9baae06390 (unknown)
W0826 06:11:05.827157 5102 init.cc:224] @ 0x7f99df5111b8 (unknown)
W0826 06:11:05.828028 5102 init.cc:224] @ 0x7f99df51136a (unknown)
W0826 06:11:05.828830 5102 init.cc:224] @ 0x7f99deda26f0 (unknown)
W0826 06:11:05.829461 5102 init.cc:224] @ 0x7f99dec98d4c (unknown)
W0826 06:11:05.829979 5102 init.cc:224] @ 0x7f99de41b5fc (unknown)
W0826 06:11:05.830523 5102 init.cc:224] @ 0x7f99de41d429 cudnnGetConvolutionForwardWorkspaceSize
W0826 06:11:05.836208 5102 init.cc:224] @ 0x7f9a2843a8f0 paddle::operators::SearchAlgorithm<>::GetWorkspaceSize()
W0826 06:11:05.841645 5102 init.cc:224] @ 0x7f9a28451f5d paddle::operators::SearchAlgorithm<>::Find<>()
W0826 06:11:05.846259 5102 init.cc:224] @ 0x7f9a284f3889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0826 06:11:05.849913 5102 init.cc:224] @ 0x7f9a284f4b33 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0826 06:11:05.853875 5102 init.cc:224] @ 0x7f9a2a473ac0 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:11:05.859454 5102 init.cc:224] @ 0x7f9a2a4742b1 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:11:05.862534 5102 init.cc:224] @ 0x7f9a2a46d261 paddle::framework::OperatorBase::Run()
W0826 06:11:05.866932 5102 init.cc:224] @ 0x7f9a2a17af16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0826 06:11:05.870985 5102 init.cc:224] @ 0x7f9a2a122551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0826 06:11:05.877444 5102 init.cc:224] @ 0x7f9a2a12004f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0826 06:11:05.879237 5102 init.cc:224] @ 0x7f9a2a120314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0826 06:11:05.884681 5102 init.cc:224] @ 0x7f9a26f13fb3 std::_Function_handler<>::_M_invoke()
W0826 06:11:05.889741 5102 init.cc:224] @ 0x7f9a26d0f647 std::__future_base::_State_base::_M_do_set()
W0826 06:11:05.891779 5102 init.cc:224] @ 0x7f9baae03a99 __pthread_once_slow
W0826 06:11:05.893308 5102 init.cc:224] @ 0x7f9a2a11c4e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0826 06:11:05.898675 5102 init.cc:224] @ 0x7f9a26d11aa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0826 06:11:05.899813 5102 init.cc:224] @ 0x7f9ab2fb5c80 (unknown)
W0826 06:11:05.901708 5102 init.cc:224] @ 0x7f9baadfc6ba start_thread
W0826 06:11:05.903543 5102 init.cc:224] @ 0x7f9baab324dd clone
W0826 06:11:05.905416 5102 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)
2020-08-26 06:15:31,989-INFO: finish loading roidbs, total num = 970
2020-08-26 06:15:31,990-INFO: set max batches to 0
2020-08-26 06:15:31,990-INFO: places would be ommited when DataLoader is not iterable
W0826 06:15:32.250245 6523 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0826 06:15:32.255419 6523 device_context.cc:260] device: 0, cuDNN Version: 7.6.
W0826 06:15:52.560868 6768 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0826 06:15:52.561199 6768 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0826 06:15:52.561347 6768 init.cc:221] The detail failure signal is:
W0826 06:15:52.561537 6768 init.cc:224] *** Aborted at 1598422552 (unix time) try "date -d @1598422552" if you are using GNU date ***
W0826 06:15:52.565481 6768 init.cc:224] PC: @ 0x0 (unknown)
W0826 06:15:52.567220 6768 init.cc:224] *** SIGSEGV (@0x8) received by PID 6523 (TID 0x7f74b3f6f700) from PID 8; stack trace: ***
W0826 06:15:52.570947 6768 init.cc:224] @ 0x7f756889f390 (unknown)
W0826 06:15:52.571835 6768 init.cc:224] @ 0x7f739eda2747 (unknown)
W0826 06:15:52.572630 6768 init.cc:224] @ 0x7f739ec98d4c (unknown)
W0826 06:15:52.573356 6768 init.cc:224] @ 0x7f739e41b5fc (unknown)
W0826 06:15:52.574101 6768 init.cc:224] @ 0x7f739e42be5a (unknown)
W0826 06:15:52.574887 6768 init.cc:224] @ 0x7f739e41859a cudnnGetConvolutionForwardAlgorithm_v7
W0826 06:15:52.581318 6768 init.cc:224] @ 0x7f73e5eeaf45 paddle::operators::SearchAlgorithm<>::Find<>()
W0826 06:15:52.587302 6768 init.cc:224] @ 0x7f73e5f8c889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0826 06:15:52.592053 6768 init.cc:224] @ 0x7f73e5f8db33 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0826 06:15:52.596971 6768 init.cc:224] @ 0x7f73e7f0cac0 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:15:52.603899 6768 init.cc:224] @ 0x7f73e7f0d2b1 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:15:52.607750 6768 init.cc:224] @ 0x7f73e7f06261 paddle::framework::OperatorBase::Run()
W0826 06:15:52.613262 6768 init.cc:224] @ 0x7f73e7c13f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0826 06:15:52.618239 6768 init.cc:224] @ 0x7f73e7bbb551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0826 06:15:52.623728 6768 init.cc:224] @ 0x7f73e7bb904f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0826 06:15:52.625653 6768 init.cc:224] @ 0x7f73e7bb9314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0826 06:15:52.632129 6768 init.cc:224] @ 0x7f73e49acfb3 std::_Function_handler<>::_M_invoke()
W0826 06:15:52.638634 6768 init.cc:224] @ 0x7f73e47a8647 std::__future_base::_State_base::_M_do_set()
W0826 06:15:52.641016 6768 init.cc:224] @ 0x7f756889ca99 __pthread_once_slow
W0826 06:15:52.642813 6768 init.cc:224] @ 0x7f73e7bb54e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0826 06:15:52.649221 6768 init.cc:224] @ 0x7f73e47aaaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0826 06:15:52.652024 6768 init.cc:224] @ 0x7f7490a4ec80 (unknown)
W0826 06:15:52.659901 6768 init.cc:224] @ 0x7f75688956ba start_thread
W0826 06:15:52.663735 6768 init.cc:224] @ 0x7f75685cb4dd clone
W0826 06:15:52.667634 6768 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)
#11
OK, please confirm again in the Docker environment; that will make it easier for us to locate the problem. We have never reproduced this error locally before.
#12
Here is the error message from another crash:
2020-08-05 13:45:28,804-INFO: start loading proposals
2020-08-05 13:45:29,282-INFO: loading roidb 2012_test
100%|██████████| 970/970 [00:00<00:00, 1421.30it/s]
2020-08-05 13:45:30,310-INFO: finish loading roidb from scope 2012_test
2020-08-05 13:45:30,326-INFO: finish loading roidbs, total num = 970
2020-08-05 13:45:30,334-INFO: set max batches to 0
2020-08-05 13:45:30,342-INFO: places would be ommited when DataLoader is not iterable
W0805 13:45:30.530916 19928 device_context.cc:252] Please NOTE: device: 3, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0805 13:45:30.535248 19928 device_context.cc:260] device: 3, cuDNN Version: 7.6.
/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "tools/eval.py", line 243, in
main()
File "tools/eval.py", line 180, in main
sub_eval_prog, sub_keys, sub_values, resolution)
File "/home/bwang/projects/sniper-paddle/ppdet/utils/eval_utils.py", line 134, in eval_run
return_merged=False)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel<float>, paddle::operators::CUDNNConvOpKernel<double>, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
10 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
11 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
12 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const
Python Call Stacks (More useful to users):
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 2938, in conv2d
"data_format": data_format,
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 187, in _conv_norm
name=_name + '.conv2d.output.1')
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 462, in c1_stage
name=_name)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 490, incall*
res = self.c1_stage(res)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 193, in build_multi_scale
body_feats = self.backbone(im)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 345, in eval
return self.build_multi_scale(feed_vars)
File "tools/eval.py", line 105, in main
fetches = model.eval(feed_vars, multi_scale_test)
File "tools/eval.py", line 243, in
main()
Error Message Summary:
ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300)
[operator < conv2d > error]
#13
Could you set up the environment with Docker?

#14
I'd like to confirm: does this problem also occur randomly when you predict on a single card with a single batch?
#15
It crashed and exited again today, with a different error:
2020-08-12 22:24:51,236-INFO: set max batches to 0
2020-08-12 22:24:51,237-INFO: places would be ommited when DataLoader is not iterable
W0812 22:24:51.457595 58764 device_context.cc:252] Please NOTE: device: 3, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0812 22:24:51.462311 58764 device_context.cc:260] device: 3, cuDNN Version: 7.6.
W0812 22:25:10.694689 58836 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0812 22:25:10.694741 58836 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0812 22:25:10.694751 58836 init.cc:221] The detail failure signal is:
W0812 22:25:10.694756 58836 init.cc:224] *** Aborted at 1597242310 (unix time) try "date -d @1597242310" if you are using GNU date ***
W0812 22:25:10.702319 58836 init.cc:224] PC: @ 0x0 (unknown)
W0812 22:25:10.703627 58836 init.cc:224] *** SIGSEGV (@0x0) received by PID 58764 (TID 0x7ff0b87f4700) from PID 0; stack trace: ***
W0812 22:25:10.708917 58836 init.cc:224] @ 0x7ff2b336e390 (unknown)
W0812 22:25:10.713052 58836 init.cc:224] @ 0x7ff1ff5111b8 (unknown)
W0812 22:25:10.716704 58836 init.cc:224] @ 0x7ff1ff51136a (unknown)
W0812 22:25:10.719149 58836 init.cc:224] @ 0x7ff1feda26f0 (unknown)
W0812 22:25:10.720286 58836 init.cc:224] @ 0x7ff1fec98d4c (unknown)
W0812 22:25:10.721880 58836 init.cc:224] @ 0x7ff1fe41b5fc (unknown)
W0812 22:25:10.725221 58836 init.cc:224] @ 0x7ff1fe41d429 cudnnGetConvolutionForwardWorkspaceSize
W0812 22:25:10.745923 58836 init.cc:224] @ 0x7ff267a238f0 paddle::operators::SearchAlgorithm<>::GetWorkspaceSize()
W0812 22:25:10.766815 58836 init.cc:224] @ 0x7ff267a3af5d paddle::operators::SearchAlgorithm<>::Find<>()
W0812 22:25:10.783972 58836 init.cc:224] @ 0x7ff267adc889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0812 22:25:10.804280 58836 init.cc:224] @ 0x7ff267addb33 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0812 22:25:10.818567 58836 init.cc:224] @ 0x7ff269a5cac0 paddle::framework::OperatorWithKernel::RunImpl()
W0812 22:25:10.848381 58836 init.cc:224] @ 0x7ff269a5d2b1 paddle::framework::OperatorWithKernel::RunImpl()
W0812 22:25:10.879992 58836 init.cc:224] @ 0x7ff269a56261 paddle::framework::OperatorBase::Run()
W0812 22:25:10.905500 58836 init.cc:224] @ 0x7ff269763f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0812 22:25:10.914219 58836 init.cc:224] @ 0x7ff26970b551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0812 22:25:10.922811 58836 init.cc:224] @ 0x7ff26970904f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0812 22:25:10.930382 58836 init.cc:224] @ 0x7ff269709314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0812 22:25:10.944406 58836 init.cc:224] @ 0x7ff2664fcfb3 std::_Function_handler<>::_M_invoke()
W0812 22:25:10.954800 58836 init.cc:224] @ 0x7ff2662f8647 std::__future_base::_State_base::_M_do_set()
W0812 22:25:10.961282 58836 init.cc:224] @ 0x7ff2b336ba99 __pthread_once_slow
W0812 22:25:10.965298 58836 init.cc:224] @ 0x7ff2697054e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0812 22:25:10.975075 58836 init.cc:224] @ 0x7ff2662faaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0812 22:25:10.977896 58836 init.cc:224] @ 0x7ff2a361b421 execute_native_thread_routine_compat
W0812 22:25:10.989349 58836 init.cc:224] @ 0x7ff2b33646ba start_thread
W0812 22:25:10.993046 58836 init.cc:224] @ 0x7ff2b309a41d clone
W0812 22:25:10.996786 58836 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)