cudnnGetConvolutionForwardAlgorithm_v7 error during Paddle inference

qf9go6mv · posted on 2022-04-21

To get your issue resolved quickly, please first search for similar issues before filing a new one: none found.

  • Version and environment information:

Paddle version: 1.8.3
Paddle With CUDA: True
OS: debian stretch/sid
Python version: 3.7.7
CUDA version: 10.1.243
cuDNN version: None.None.None # Note: libcudnn.so.7.6.5 is installed on the system and its path has been added to $LD_LIBRARY_PATH
Nvidia driver version: 418.74

  • Problem description:

Running inference with tools/eval.py from PaddleDetection, on a single machine with one or more GPUs (Tesla V100 16G or Titan RTX).
The model is Cascade RCNN (backbone R101vd or R200vd) with multi-scale test.

The following error appears roughly 30% of the time (it cannot be reproduced reliably with the same script, and I do not know what it depends on):

2020-08-05 12:50:51,695-INFO: start loading proposals
2020-08-05 12:50:52,457-INFO: loading roidb 2012_test
100%|████████████████████████████████████████| 970/970 [00:01<00:00, 601.75it/s]
2020-08-05 12:50:54,377-INFO: finish loading roidb from scope 2012_test
2020-08-05 12:50:54,378-INFO: finish loading roidbs, total num = 970
2020-08-05 12:50:54,379-INFO: set max batches to 0
2020-08-05 12:50:54,380-INFO: places would be ommited when DataLoader is not iterable
W0805 12:50:54.530522 4141844 device_context.cc:252] Please NOTE: device: 5, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0805 12:50:55.613425 4141844 device_context.cc:260] device: 5, cuDNN Version: 7.6.
W0805 12:51:24.223932 4141881 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0805 12:51:24.223980 4141881 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0805 12:51:24.223989 4141881 init.cc:221] The detail failure signal is:

W0805 12:51:24.224001 4141881 init.cc:224]Aborted at 1596603084 (unix time) try "date -d @1596603084" if you are using GNU date
W0805 12:51:24.228863 4141881 init.cc:224] PC: @ 0x0 (unknown)
W0805 12:51:24.346484 4141881 init.cc:224]***SIGSEGV (@0x8) received by PID 4141844 (TID 0x7f012db3d700) from PID 8; stack trace:***
W0805 12:51:24.351244 4141881 init.cc:224] @ 0x7f01e3671390 (unknown)
W0805 12:51:24.353901 4141881 init.cc:224] @ 0x7f012eda2747 (unknown)
W0805 12:51:24.356168 4141881 init.cc:224] @ 0x7f012ec98d4c (unknown)
W0805 12:51:24.358356 4141881 init.cc:224] @ 0x7f012e41b5fc (unknown)
W0805 12:51:24.360416 4141881 init.cc:224] @ 0x7f012e42b938 (unknown)
W0805 12:51:24.362363 4141881 init.cc:224] @ 0x7f012e41859a cudnnGetConvolutionForwardAlgorithm_v7
W0805 12:51:24.447378 4141881 init.cc:224] @ 0x7f019853ff45 paddle::operators::SearchAlgorithm<>::Find<>()
W0805 12:51:24.469980 4141881 init.cc:224] @ 0x7f01985e1889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0805 12:51:24.481895 4141881 init.cc:224] @ 0x7f01985e2b33 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0805 12:51:24.504448 4141881 init.cc:224] @ 0x7f019a561ac0 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.565385 4141881 init.cc:224] @ 0x7f019a5622b1 paddle::framework::OperatorWithKernel::RunImpl()
W0805 12:51:24.604465 4141881 init.cc:224] @ 0x7f019a55b261 paddle::framework::OperatorBase::Run()
W0805 12:51:24.635419 4141881 init.cc:224] @ 0x7f019a268f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0805 12:51:24.657658 4141881 init.cc:224] @ 0x7f019a210551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0805 12:51:24.673673 4141881 init.cc:224] @ 0x7f019a20e04f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0805 12:51:24.687579 4141881 init.cc:224] @ 0x7f019a20e314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0805 12:51:24.724630 4141881 init.cc:224] @ 0x7f0197001fb3 std::_Function_handler<>::_M_invoke()
W0805 12:51:24.769093 4141881 init.cc:224] @ 0x7f0196dfd647 std::__future_base::_State_base::_M_do_set()
W0805 12:51:24.773929 4141881 init.cc:224] @ 0x7f01e366ea99 __pthread_once_slow
W0805 12:51:24.780242 4141881 init.cc:224] @ 0x7f019a20a4e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0805 12:51:24.817785 4141881 init.cc:224] @ 0x7f0196dffaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0805 12:51:24.850741 4141881 init.cc:224] @ 0x7f01d4120421 execute_native_thread_routine_compat
W0805 12:51:24.857818 4141881 init.cc:224] @ 0x7f01e36676ba start_thread
W0805 12:51:24.862519 4141881 init.cc:224] @ 0x7f01e339d41d clone
W0805 12:51:24.870891 4141881 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)


ntjbwcob1#

I'd like to confirm: does this problem also appear randomly when you run prediction with a single GPU and a single batch?

I just ran it 10+ times on a single TITAN RTX; only the first run hit the cudnnGetConvolutionForwardAlgorithm_v7 error.

Could you set up the environment with Docker?

Yes, I will try docker pull paddlepaddle/paddle:1.8.3-gpu-cuda10.0-cudnn7 directly.


wbrvyc0a2#

@flishwang Please try export FLAGS_selected_gpu=0,1

After switching to the NCCL 2.6.4 that ships inside the Docker image, run_check passes and multi-GPU inference no longer fails, so this problem is solved. It was most likely an NCCL version issue.
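
For reference, here is a minimal sketch (not from this thread) of how one could confirm which libnccl the process actually resolves and its version; it assumes libnccl.so.2 is on the dynamic loader path and uses the public ncclGetVersion API.

import ctypes

# Load whichever libnccl.so.2 the dynamic loader finds (e.g. the 2.6.4 copy
# shipped inside the Docker image) and query its version.
nccl = ctypes.CDLL("libnccl.so.2")
version = ctypes.c_int()
nccl.ncclGetVersion(ctypes.byref(version))
# NCCL 2.6.4 reports 2604 (major*1000 + minor*100 + patch in this release line).
print("NCCL version code:", version.value)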


m3eecexj3#

@flishwang Please try export FLAGS_selected_gpu=0,1


pdtvr36n4#

@flishwang Hi, could you use export FLAGS_selected_gpu to skip the faulty GPU, and then check whether run_check passes?
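
As an illustration only (the GPU indices here are placeholders, and the flag is spelled FLAGS_selected_gpus in Paddle's own documentation), the suggestion amounts to something like the following, executed before any Paddle program is built:

import os

# Hide the faulty card before paddle is imported; the indices are examples.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# Restrict which of the visible devices Paddle will use.
os.environ["FLAGS_selected_gpus"] = "0,1"

import paddle.fluid as fluid
# run_check builds a small single- and multi-GPU test program, so it exercises
# the same NCCL initialization path that fails here.
fluid.install_check.run_check()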


dxpyg8gm5#

This may be related to NCCL; see https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/install/install_Ubuntu.html#cpu-gpu for the relevant version requirements. Also, if Docker is an option, we recommend the images listed at https://www.paddlepaddle.org.cn/documentation/docs/zh/1.8/install/install_Docker.html#id3


acruukt96#

Since we do not have an environment that reproduces this on our side, it is hard to debug. From the error messages, the crash happens while calling the cuDNN convolution, so it may be related to cuDNN.

OK.
Also, we have another machine on which one GPU does not work properly.
Both model.with_data_parallel and fluid.install_check.run_check raise an NCCL error on it, regardless of whether the places passed to with_data_parallel include the broken card.
The remaining healthy GPUs on that machine train and test normally with MXNet, though.
Can this problem be located or fixed?
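
For context, a minimal, hypothetical sketch (Paddle 1.8 static-graph API; the network, shapes and GPU indices are made up) of the kind of program whose data-parallel compilation goes through the NCCL setup that fails here:

import numpy as np
import paddle.fluid as fluid

# Build a tiny conv network in the static graph API.
main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name="x", shape=[None, 3, 224, 224], dtype="float32")
    y = fluid.layers.conv2d(x, num_filters=8, filter_size=3)
    loss = fluid.layers.reduce_mean(y)

# Use only the healthy cards (indices are placeholders).
places = [fluid.CUDAPlace(i) for i in (4, 5)]
exe = fluid.Executor(places[0])
exe.run(startup_prog)

compiled = fluid.CompiledProgram(main_prog).with_data_parallel(
    loss_name=loss.name, places=places)

# The ParallelExecutor, and with it the NCCLContextMap seen in the C++ call
# stack below, is constructed lazily on the first run of the compiled program.
exe.run(compiled,
        feed={"x": np.zeros([2, 3, 224, 224], dtype="float32")},
        fetch_list=[loss.name])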
The error log and nvidia-smi output are below:

(paddle) bwang@hpc-training-6ff5b9b6cb-hhscf:~/projects/sniper-paddle$ python tools/eval.py --cfg experiments/cascade/translinemain_cascade_r101vd_gc_queue.yaml --gpu 4,5
2020-09-09 16:40:27,274-INFO: The 'num_classes'(number of classes) you set is 10, and 'with_background' in 'dataset' sets True.
So please note the actual number of categories is 9.
2020-09-09 16:40:28,070-INFO: args pass to the program are:{"config": "experiments/cascade/translinemain_cascade_r101vd_gc_queue.yaml", "opt": {}, "gpu": "4,5", "experiment": "unknown", "cache": "", "json_eval": false, "output_eval": null, "output_proposal": false}
2020-09-09 16:41:58,481-INFO: start loading proposals
2020-09-09 16:41:59,165-INFO: loading roidb 2012_test
100%|████████████████████████████████████████| 970/970 [00:01<00:00, 676.44it/s]
2020-09-09 16:42:01,036-INFO: finish loading roidb from scope 2012_test
2020-09-09 16:42:01,036-INFO: finish loading roidbs, total num = 970
2020-09-09 16:42:01,037-INFO: set max batches to 0
2020-09-09 16:42:01,038-INFO: places would be ommited when DataLoader is not iterable
W0909 16:42:01.243356  1164 device_context.cc:252] Please NOTE: device: 4, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0909 16:42:02.461562  1164 device_context.cc:260] device: 4, cuDNN Version: 7.6.
/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/eval.py", line 251, in <module>
    main()
  File "tools/eval.py", line 188, in main
    sub_eval_prog, sub_keys, sub_values, resolution)
  File "/home/bwang/projects/sniper-paddle/ppdet/utils/eval_utils.py", line 134, in eval_run
    return_merged=False)
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1156, in _run_impl
    program._compile(scope, self.place)
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 443, in _compile
    places=self._places)
  File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 396, in _compile_data_parallel
    self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
3   paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)

----------------------
Error Message Summary:
----------------------
ExternalError:  Nccl error, unhandled system error  at (/paddle/paddle/fluid/platform/nccl_helper.h:114)

terminate called without an active exception
W0909 16:42:32.513360  1256 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0909 16:42:32.513393  1256 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0909 16:42:32.513401  1256 init.cc:221] The detail failure signal is:

W0909 16:42:32.513413  1256 init.cc:224]***Aborted at 1599640952 (unix time) try "date -d @1599640952" if you are using GNU date***
W0909 16:42:32.518121  1256 init.cc:224] PC: @                0x0 (unknown)
W0909 16:42:32.518368  1256 init.cc:224]***SIGABRT (@0x27160000048c) received by PID 1164 (TID 0x7fd3e4b35700) from PID 1164; stack trace:***
W0909 16:42:32.522661  1256 init.cc:224]     @     0x7fd53fb2b390 (unknown)
W0909 16:42:32.526769  1256 init.cc:224]     @     0x7fd53f785428 gsignal
W0909 16:42:32.530611  1256 init.cc:224]     @     0x7fd53f78702a abort
W0909 16:42:32.563489  1256 init.cc:224]     @     0x7fd5305bf84a __gnu_cxx::__verbose_terminate_handler()
W0909 16:42:32.568449  1256 init.cc:224]     @     0x7fd5305bdf47 __cxxabiv1::__terminate()
W0909 16:42:32.574772  1256 init.cc:224]     @     0x7fd5305bdf7d std::terminate()
W0909 16:42:32.579522  1256 init.cc:224]     @     0x7fd5305bdc5a __gxx_personality_v0
W0909 16:42:32.599822  1256 init.cc:224]     @     0x7fd53d052b97 _Unwind_ForcedUnwind_Phase2
W0909 16:42:32.605825  1256 init.cc:224]     @     0x7fd53d052e7d _Unwind_ForcedUnwind
W0909 16:42:32.610039  1256 init.cc:224]     @     0x7fd53fb2a070 __GI___pthread_unwind
W0909 16:42:32.614220  1256 init.cc:224]     @     0x7fd53fb22845 __pthread_exit
W0909 16:42:32.685542  1256 init.cc:224]     @     0x561f80b6d059 PyThread_exit_thread
W0909 16:42:32.687439  1256 init.cc:224]     @     0x561f809f2c10 PyEval_RestoreThread.cold.799
W0909 16:42:32.692178  1256 init.cc:224]     @     0x7fd52af85cde (unknown)
W0909 16:42:32.694483  1256 init.cc:224]     @     0x561f80af3ab4 _PyMethodDef_RawFastCallKeywords
W0909 16:42:32.696527  1256 init.cc:224]     @     0x561f80af3bd1 _PyCFunction_FastCallKeywords
W0909 16:42:32.698681  1256 init.cc:224]     @     0x561f80b5a57b _PyEval_EvalFrameDefault
W0909 16:42:32.700711  1256 init.cc:224]     @     0x561f80a9f389 _PyEval_EvalCodeWithName
W0909 16:42:32.702916  1256 init.cc:224]     @     0x561f80aa04c5 _PyFunction_FastCallDict
W0909 16:42:32.705274  1256 init.cc:224]     @     0x561f80abfa73 _PyObject_Call_Prepend
W0909 16:42:32.707274  1256 init.cc:224]     @     0x561f80b0727a slot_tp_call
W0909 16:42:32.709513  1256 init.cc:224]     @     0x561f80b082db _PyObject_FastCallKeywords
W0909 16:42:32.711834  1256 init.cc:224]     @     0x561f80b5a146 _PyEval_EvalFrameDefault
W0909 16:42:32.714344  1256 init.cc:224]     @     0x561f80aa03fb _PyFunction_FastCallDict
W0909 16:42:32.716514  1256 init.cc:224]     @     0x561f80abfa73 _PyObject_Call_Prepend
W0909 16:42:32.718256  1256 init.cc:224]     @     0x561f80b0727a slot_tp_call
W0909 16:42:32.720507  1256 init.cc:224]     @     0x561f80b082db _PyObject_FastCallKeywords
W0909 16:42:32.722937  1256 init.cc:224]     @     0x561f80b5aa39 _PyEval_EvalFrameDefault
W0909 16:42:32.725335  1256 init.cc:224]     @     0x561f80a9f389 _PyEval_EvalCodeWithName
W0909 16:42:32.727396  1256 init.cc:224]     @     0x561f80aa04c5 _PyFunction_FastCallDict
W0909 16:42:32.729285  1256 init.cc:224]     @     0x561f80abfa73 _PyObject_Call_Prepend
W0909 16:42:32.731227  1256 init.cc:224]     @     0x561f80ab1fde PyObject_Call
Aborted (core dumped)
(paddle) bwang@hpc-training-6ff5b9b6cb-hhscf:~/projects/sniper-paddle$ 
(paddle) bwang@hpc-training-6ff5b9b6cb-hhscf:~/projects/sniper-paddle$ nvidia-smi
Wed Sep  9 16:43:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:3E:00.0 Off |                  N/A |
| 41%   50C    P8    25W / 280W |    251MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           Off  | 00000000:3F:00.0 Off |                  N/A |
| 41%   39C    P8    25W / 280W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           Off  | 00000000:40:00.0 Off |                  N/A |
| 41%   43C    P8   393W / 280W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           Off  | 00000000:60:00.0 Off |                  N/A |
| 41%   40C    P8    18W / 280W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN RTX           Off  | 00000000:62:00.0 Off |                  N/A |
| 40%   45C    P8    21W / 280W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN RTX           Off  | 00000000:63:00.0 Off |                  N/A |
| 41%   39C    P8    24W / 280W |      0MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
(paddle) bwang@hpc-training-6ff5b9b6cb-hhscf:~/projects/sniper-paddle$

lokaqttq7#

Since we do not have an environment that reproduces this on our side, it is hard to debug. From the error messages, the crash happens while calling the cuDNN convolution, so it may be related to cuDNN.


ttcibm8c8#

This may depend on the batch size or the amount of computation; the cascade rcnn + R200vd + multi-scale test combination can trigger it. You could first try setting the batch size to 1 and testing with a single scale.

We currently mainly use cascade + r101vd, multi-scale test, batch size = 1,
with input image scales of 3200, 2048, 1024 and 576,
so the amount of computation may indeed be a factor.

Our actual use case requires multi-scale testing of the images.
Cutting the 3200-scale images into small tiles, running detection on each tile, and stitching the results back together (similar to the DOTA data-processing pipeline) is probably not feasible for us.
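
Purely to illustrate what that DOTA-style tiling would involve (this helper is hypothetical and not part of our code):

import numpy as np

def tile_image(img, tile=1024, overlap=128):
    # Cut a large (H, W, C) image into overlapping tiles and record each
    # tile's top-left offset so detections can be mapped back and stitched
    # together afterwards.
    stride = tile - overlap
    h, w = img.shape[:2]
    tiles, offsets = [], []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            tiles.append(img[top:top + tile, left:left + tile])
            offsets.append((top, left))
    return tiles, offsets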

What is the cause of the random errors? Is it a problem in the Paddle framework, or in cuDNN/CUDA?
Can this problem be solved?


z31licg09#

This may depend on the batch size or the amount of computation; the cascade rcnn + R200vd + multi-scale test combination can trigger it. You could first try setting the batch size to 1 and testing with a single scale.


6ju8rftf10#

OK, please confirm again in a Docker environment so that we can narrow it down on our side. We have never reproduced this error locally.

Based on the 1.8.3-gpu-cuda10.0-cudnn7 Docker image, after installing Ubuntu packages such as openssh-server and git and deploying it on k8s, it still crashes randomly. Below are the logs from three crashes:

W0826 06:06:09.501305 4216 device_context.cc:260] device: 0, cuDNN Version: 7.6.
/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "tools/eval.py", line 243, in
main()
File "tools/eval.py", line 180, in main
sub_eval_prog, sub_keys, sub_values, resolution)
File "/home/bwang/projects/sniper-paddle/ppdet/utils/eval_utils.py", line 134, in eval_run
return_merged=False)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel<float>, paddle::operators::CUDNNConvOpKernel<double>, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
10 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
11 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
12 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

Python Call Stacks (More useful to users):

File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(args,kwargs)
File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 2938, in conv2d
"data_format": data_format,
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 187, in _conv_norm
name=_name + '.conv2d.output.1')
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 462, in c1_stage
name=_name)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 490, in
call
*
res = self.c1_stage(res)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 210, in build_multi_scale
body_feats = self.backbone(im)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 363, in eval
return self.build_multi_scale(feed_vars)
File "tools/eval.py", line 105, in main
fetches = model.eval(feed_vars, multi_scale_test)
File "tools/eval.py", line 243, in
main()

Error Message Summary:

ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300)
[operator < conv2d > error]

2020-08-26 06:10:45,407-INFO: set max batches to 0
2020-08-26 06:10:45,408-INFO: places would be ommited when DataLoader is not iterable
W0826 06:10:45.596765 4864 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0826 06:10:45.601509 4864 device_context.cc:260] device: 0, cuDNN Version: 7.6.
W0826 06:11:05.817611 5102 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0826 06:11:05.817775 5102 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0826 06:11:05.817786 5102 init.cc:221] The detail failure signal is:

W0826 06:11:05.817793 5102 init.cc:224]Aborted at 1598422265 (unix time) try "date -d @1598422265" if you are using GNU date
W0826 06:11:05.822052 5102 init.cc:224] PC: @ 0x0 (unknown)
W0826 06:11:05.822432 5102 init.cc:224]***SIGSEGV (@0x0) received by PID 4864 (TID 0x7f9ad84ca700) from PID 0; stack trace:***
W0826 06:11:05.826328 5102 init.cc:224] @ 0x7f9baae06390 (unknown)
W0826 06:11:05.827157 5102 init.cc:224] @ 0x7f99df5111b8 (unknown)
W0826 06:11:05.828028 5102 init.cc:224] @ 0x7f99df51136a (unknown)
W0826 06:11:05.828830 5102 init.cc:224] @ 0x7f99deda26f0 (unknown)
W0826 06:11:05.829461 5102 init.cc:224] @ 0x7f99dec98d4c (unknown)
W0826 06:11:05.829979 5102 init.cc:224] @ 0x7f99de41b5fc (unknown)
W0826 06:11:05.830523 5102 init.cc:224] @ 0x7f99de41d429 cudnnGetConvolutionForwardWorkspaceSize
W0826 06:11:05.836208 5102 init.cc:224] @ 0x7f9a2843a8f0 paddle::operators::SearchAlgorithm<>::GetWorkspaceSize()
W0826 06:11:05.841645 5102 init.cc:224] @ 0x7f9a28451f5d paddle::operators::SearchAlgorithm<>::Find<>()
W0826 06:11:05.846259 5102 init.cc:224] @ 0x7f9a284f3889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0826 06:11:05.849913 5102 init.cc:224] @ 0x7f9a284f4b33 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0826 06:11:05.853875 5102 init.cc:224] @ 0x7f9a2a473ac0 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:11:05.859454 5102 init.cc:224] @ 0x7f9a2a4742b1 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:11:05.862534 5102 init.cc:224] @ 0x7f9a2a46d261 paddle::framework::OperatorBase::Run()
W0826 06:11:05.866932 5102 init.cc:224] @ 0x7f9a2a17af16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0826 06:11:05.870985 5102 init.cc:224] @ 0x7f9a2a122551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0826 06:11:05.877444 5102 init.cc:224] @ 0x7f9a2a12004f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0826 06:11:05.879237 5102 init.cc:224] @ 0x7f9a2a120314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0826 06:11:05.884681 5102 init.cc:224] @ 0x7f9a26f13fb3 std::_Function_handler<>::_M_invoke()
W0826 06:11:05.889741 5102 init.cc:224] @ 0x7f9a26d0f647 std::__future_base::_State_base::_M_do_set()
W0826 06:11:05.891779 5102 init.cc:224] @ 0x7f9baae03a99 __pthread_once_slow
W0826 06:11:05.893308 5102 init.cc:224] @ 0x7f9a2a11c4e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0826 06:11:05.898675 5102 init.cc:224] @ 0x7f9a26d11aa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0826 06:11:05.899813 5102 init.cc:224] @ 0x7f9ab2fb5c80 (unknown)
W0826 06:11:05.901708 5102 init.cc:224] @ 0x7f9baadfc6ba start_thread
W0826 06:11:05.903543 5102 init.cc:224] @ 0x7f9baab324dd clone
W0826 06:11:05.905416 5102 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)

2020-08-26 06:15:31,989-INFO: finish loading roidbs, total num = 970
2020-08-26 06:15:31,990-INFO: set max batches to 0
2020-08-26 06:15:31,990-INFO: places would be ommited when DataLoader is not iterable
W0826 06:15:32.250245 6523 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0826 06:15:32.255419 6523 device_context.cc:260] device: 0, cuDNN Version: 7.6.
W0826 06:15:52.560868 6768 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0826 06:15:52.561199 6768 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0826 06:15:52.561347 6768 init.cc:221] The detail failure signal is:

W0826 06:15:52.561537 6768 init.cc:224]Aborted at 1598422552 (unix time) try "date -d @1598422552" if you are using GNU date
W0826 06:15:52.565481 6768 init.cc:224] PC: @ 0x0 (unknown)
W0826 06:15:52.567220 6768 init.cc:224]***SIGSEGV (@0x8) received by PID 6523 (TID 0x7f74b3f6f700) from PID 8; stack trace:***
W0826 06:15:52.570947 6768 init.cc:224] @ 0x7f756889f390 (unknown)
W0826 06:15:52.571835 6768 init.cc:224] @ 0x7f739eda2747 (unknown)
W0826 06:15:52.572630 6768 init.cc:224] @ 0x7f739ec98d4c (unknown)
W0826 06:15:52.573356 6768 init.cc:224] @ 0x7f739e41b5fc (unknown)
W0826 06:15:52.574101 6768 init.cc:224] @ 0x7f739e42be5a (unknown)
W0826 06:15:52.574887 6768 init.cc:224] @ 0x7f739e41859a cudnnGetConvolutionForwardAlgorithm_v7
W0826 06:15:52.581318 6768 init.cc:224] @ 0x7f73e5eeaf45 paddle::operators::SearchAlgorithm<>::Find<>()
W0826 06:15:52.587302 6768 init.cc:224] @ 0x7f73e5f8c889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0826 06:15:52.592053 6768 init.cc:224] @ 0x7f73e5f8db33 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0826 06:15:52.596971 6768 init.cc:224] @ 0x7f73e7f0cac0 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:15:52.603899 6768 init.cc:224] @ 0x7f73e7f0d2b1 paddle::framework::OperatorWithKernel::RunImpl()
W0826 06:15:52.607750 6768 init.cc:224] @ 0x7f73e7f06261 paddle::framework::OperatorBase::Run()
W0826 06:15:52.613262 6768 init.cc:224] @ 0x7f73e7c13f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0826 06:15:52.618239 6768 init.cc:224] @ 0x7f73e7bbb551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0826 06:15:52.623728 6768 init.cc:224] @ 0x7f73e7bb904f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0826 06:15:52.625653 6768 init.cc:224] @ 0x7f73e7bb9314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0826 06:15:52.632129 6768 init.cc:224] @ 0x7f73e49acfb3 std::_Function_handler<>::_M_invoke()
W0826 06:15:52.638634 6768 init.cc:224] @ 0x7f73e47a8647 std::__future_base::_State_base::_M_do_set()
W0826 06:15:52.641016 6768 init.cc:224] @ 0x7f756889ca99 __pthread_once_slow
W0826 06:15:52.642813 6768 init.cc:224] @ 0x7f73e7bb54e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0826 06:15:52.649221 6768 init.cc:224] @ 0x7f73e47aaaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0826 06:15:52.652024 6768 init.cc:224] @ 0x7f7490a4ec80 (unknown)
W0826 06:15:52.659901 6768 init.cc:224] @ 0x7f75688956ba start_thread
W0826 06:15:52.663735 6768 init.cc:224] @ 0x7f75685cb4dd clone
W0826 06:15:52.667634 6768 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)


ercv8c1e11#

OK, please confirm again in a Docker environment so that we can narrow it down on our side. We have never reproduced this error locally.


2ul0zpep12#

Here is the error output from another run:

2020-08-05 13:45:28,804-INFO: start loading proposals
2020-08-05 13:45:29,282-INFO: loading roidb 2012_test
100%|██████████| 970/970 [00:00<00:00, 1421.30it/s]
2020-08-05 13:45:30,310-INFO: finish loading roidb from scope 2012_test
2020-08-05 13:45:30,326-INFO: finish loading roidbs, total num = 970
2020-08-05 13:45:30,334-INFO: set max batches to 0
2020-08-05 13:45:30,342-INFO: places would be ommited when DataLoader is not iterable
W0805 13:45:30.530916 19928 device_context.cc:252] Please NOTE: device: 3, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0805 13:45:30.535248 19928 device_context.cc:260] device: 3, cuDNN Version: 7.6.
/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "tools/eval.py", line 243, in
main()
File "tools/eval.py", line 180, in main
sub_eval_prog, sub_keys, sub_values, resolution)
File "/home/bwang/projects/sniper-paddle/ppdet/utils/eval_utils.py", line 134, in eval_run
return_merged=False)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::operators::CUDNNConvOpKernel::Compute(paddle::framework::ExecutionContext const&) const
3 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel<float>, paddle::operators::CUDNNConvOpKernel<double>, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
6 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
7 paddle::framework::details::ComputationOpHandle::RunImpl()
8 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue > const&, unsigned long*)
10 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&)
11 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
12 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

Python Call Stacks (More useful to users):

File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(args,kwargs)
File "/home/bwang/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 2938, in conv2d
"data_format": data_format,
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 187, in _conv_norm
name=_name + '.conv2d.output.1')
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 462, in c1_stage
name=_name)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/backbones/resnet.py", line 490, in
call
*
res = self.c1_stage(res)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 193, in build_multi_scale
body_feats = self.backbone(im)
File "/home/bwang/projects/sniper-paddle/ppdet/modeling/architectures/cascade_rcnn.py", line 345, in eval
return self.build_multi_scale(feed_vars)
File "tools/eval.py", line 105, in main
fetches = model.eval(feed_vars, multi_scale_test)
File "tools/eval.py", line 243, in
main()

Error Message Summary:

ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (/paddle/paddle/fluid/operators/conv_cudnn_op.cu:300)
[operator < conv2d > error]


wrrgggsh13#

Could you set up the environment with Docker?


ewm0tg9j14#

I'd like to confirm: does this problem also appear randomly when you run prediction with a single GPU and a single batch?


hgb9j2n615#

It crashed and exited again today, with a different error:
2020-08-12 22:24:51,236-INFO: set max batches to 0
2020-08-12 22:24:51,237-INFO: places would be ommited when DataLoader is not iterable
W0812 22:24:51.457595 58764 device_context.cc:252] Please NOTE: device: 3, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0812 22:24:51.462311 58764 device_context.cc:260] device: 3, cuDNN Version: 7.6.
W0812 22:25:10.694689 58836 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0812 22:25:10.694741 58836 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0812 22:25:10.694751 58836 init.cc:221] The detail failure signal is:

W0812 22:25:10.694756 58836 init.cc:224]Aborted at 1597242310 (unix time) try "date -d @1597242310" if you are using GNU date
W0812 22:25:10.702319 58836 init.cc:224] PC: @ 0x0 (unknown)
W0812 22:25:10.703627 58836 init.cc:224]***SIGSEGV (@0x0) received by PID 58764 (TID 0x7ff0b87f4700) from PID 0; stack trace:***
W0812 22:25:10.708917 58836 init.cc:224] @ 0x7ff2b336e390 (unknown)
W0812 22:25:10.713052 58836 init.cc:224] @ 0x7ff1ff5111b8 (unknown)
W0812 22:25:10.716704 58836 init.cc:224] @ 0x7ff1ff51136a (unknown)
W0812 22:25:10.719149 58836 init.cc:224] @ 0x7ff1feda26f0 (unknown)
W0812 22:25:10.720286 58836 init.cc:224] @ 0x7ff1fec98d4c (unknown)
W0812 22:25:10.721880 58836 init.cc:224] @ 0x7ff1fe41b5fc (unknown)
W0812 22:25:10.725221 58836 init.cc:224] @ 0x7ff1fe41d429 cudnnGetConvolutionForwardWorkspaceSize
W0812 22:25:10.745923 58836 init.cc:224] @ 0x7ff267a238f0 paddle::operators::SearchAlgorithm<>::GetWorkspaceSize()
W0812 22:25:10.766815 58836 init.cc:224] @ 0x7ff267a3af5d paddle::operators::SearchAlgorithm<>::Find<>()
W0812 22:25:10.783972 58836 init.cc:224] @ 0x7ff267adc889 paddle::operators::CUDNNConvOpKernel<>::Compute()
W0812 22:25:10.804280 58836 init.cc:224] @ 0x7ff267addb33 ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators17CUDNNConvOpKernelIfEENSA_IdEENSA_INS7_7float16EEEEEclEPKcSH_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4
W0812 22:25:10.818567 58836 init.cc:224] @ 0x7ff269a5cac0 paddle::framework::OperatorWithKernel::RunImpl()
W0812 22:25:10.848381 58836 init.cc:224] @ 0x7ff269a5d2b1 paddle::framework::OperatorWithKernel::RunImpl()
W0812 22:25:10.879992 58836 init.cc:224] @ 0x7ff269a56261 paddle::framework::OperatorBase::Run()
W0812 22:25:10.905500 58836 init.cc:224] @ 0x7ff269763f16 paddle::framework::details::ComputationOpHandle::RunImpl()
W0812 22:25:10.914219 58836 init.cc:224] @ 0x7ff26970b551 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0812 22:25:10.922811 58836 init.cc:224] @ 0x7ff26970904f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0812 22:25:10.930382 58836 init.cc:224] @ 0x7ff269709314 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0812 22:25:10.944406 58836 init.cc:224] @ 0x7ff2664fcfb3 std::_Function_handler<>::_M_invoke()
W0812 22:25:10.954800 58836 init.cc:224] @ 0x7ff2662f8647 std::__future_base::_State_base::_M_do_set()
W0812 22:25:10.961282 58836 init.cc:224] @ 0x7ff2b336ba99 __pthread_once_slow
W0812 22:25:10.965298 58836 init.cc:224] @ 0x7ff2697054e2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0812 22:25:10.975075 58836 init.cc:224] @ 0x7ff2662faaa4 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0812 22:25:10.977896 58836 init.cc:224] @ 0x7ff2a361b421 execute_native_thread_routine_compat
W0812 22:25:10.989349 58836 init.cc:224] @ 0x7ff2b33646ba start_thread
W0812 22:25:10.993046 58836 init.cc:224] @ 0x7ff2b309a41d clone
W0812 22:25:10.996786 58836 init.cc:224] @ 0x0 (unknown)
Segmentation fault (core dumped)
