Paddle Python API inference inside multiprocess_reader raises a CUDA error

vohkndzv · posted on 2021-11-30

My requirement is to run inference with an inference model inside a reader, with the reader wrapped in multiprocess_reader. I tried two implementations and both fail with: Error: cudaSetDevice failed in paddle::platform::SetDeviceId, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038: initialization error at (/paddle/paddle/fluid/platform/gpu_info.cc:240)

The first approach defines a CUDAPlace and an Executor inside the reader; the second runs inference through the Python API inside the reader. The second approach works fine when the processes are created with a multiprocessing Pool, but fails in the multiprocess_reader scenario described here.

Minimal reproducible example (Paddle 1.7 post97):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
import paddle
import paddle.fluid as fluid
from paddle.fluid.core import PaddleTensor
from paddle.fluid.core import AnalysisConfig
from paddle.fluid.core import create_paddle_predictor

def reader():
    # The predictor is created inside the reader, i.e. inside the child
    # process started by multiprocess_reader.
    landmark_config = AnalysisConfig('lmk_model/model', 'lmk_model/params')
    landmark_config.switch_use_feed_fetch_ops(False)
    #landmark_config.disable_gpu()
    landmark_config.enable_use_gpu(100, 0)
    landmark_predictor = create_paddle_predictor(landmark_config)  # fails here
    for i in range(10):
        yield i

if __name__ == "__main__":
    place = fluid.CUDAPlace(0)  # CUDA is touched in the parent process here
    exe = fluid.Executor(place)
    train_reader = paddle.reader.multiprocess_reader([reader, reader])
    for data in train_reader():
        print(data)

6vl6ewon1#

In this example, does it work correctly without the multiprocess_reader decorator?


dwbf0jvd2#

Yes. Changing the for loop to for data in reader(): runs normally.


6tqwzwtp3#

Here is the error I get on my side, for analysis:

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args,**self._kwargs)
  File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 587, in _read_into_pipe
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 578, in _read_into_pipe
    for sample in reader():
  File "reader_demo.py", line 13, in reader
    landmark_predictor = create_paddle_predictor(landmark_config)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::SetDeviceId(int)
3   paddle::AnalysisConfig::fraction_of_gpu_memory_for_pool() const
4   std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
5   std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig>(paddle::AnalysisConfig const&)

----------------------
Error Message Summary:
----------------------
ExternalError:  Cuda error(3), initialization error.
  [Advise: Please search for the error code(3) on website( https://docs.nvidia.com/cuda/archive/10.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] at (/work/paddle/paddle/fluid/platform/gpu_info.cc:212)

ou6hu8tu4#

create_paddle_predictor is not thread-safe; try guarding it with a lock.
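
A minimal sketch of what "adding a lock" could look like, assuming a module-level multiprocessing.Lock shared by the reader processes (the lock and its placement are illustrative, not from the thread):

import multiprocessing

from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

# Hypothetical module-level lock; with fork-based multiprocessing the
# child processes inherit it from the parent.
predictor_lock = multiprocessing.Lock()

def reader():
    landmark_config = AnalysisConfig('lmk_model/model', 'lmk_model/params')
    landmark_config.switch_use_feed_fetch_ops(False)
    landmark_config.enable_use_gpu(100, 0)
    # Serialize predictor creation across processes.
    with predictor_lock:
        landmark_predictor = create_paddle_predictor(landmark_config)
    for i in range(10):
        yield i

(As the replies below show, even a single reader fails the same way, so serializing the call is unlikely to be the whole story here.)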


holgip5t5#

I changed it to:

#place = fluid.CUDAPlace(0)
#exe = fluid.Executor(place)
train_reader = paddle.reader.multiprocess_reader([reader])

and it reports the same error. So it looks like the problem is caused by multiprocess_reader itself?


ipakzgxi6#

My understanding is that multiprocess reader is multithreaded, right? You call create_paddle_predictor inside the reader, but that interface is not thread-safe.


v440hwme7#

The documentation says multiprocess_reader is multi-process. Besides, after my change above only a single reader is running, so thread safety shouldn't be a factor anyway.


svujldwt8#

multiprocess_reader is really just a simple multiprocessing decorator. It implements no inter-process synchronization or safety mechanisms and is only intended for the simple case where the original data-reading interface is slow, e.g. a user reading data from disk. It is also no longer recommended. A rough sketch of what it does is shown below.
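
Roughly speaking, it behaves like this simplified sketch (illustrative only, not Paddle's actual implementation): each reader runs in a forked child process and samples are streamed back through a queue, so anything the parent has already done with CUDA is inherited by those children.

import multiprocessing

def simple_multiprocess_reader(readers, queue_size=1000):
    # Illustrative stand-in for paddle.reader.multiprocess_reader.
    def _worker(reader_fn, queue):
        for sample in reader_fn():
            queue.put(sample)
        queue.put(None)  # sentinel: this reader is exhausted

    def _decorated():
        queue = multiprocessing.Queue(queue_size)
        procs = [multiprocessing.Process(target=_worker, args=(r, queue))
                 for r in readers]
        for p in procs:
            p.start()  # forks on Linux by default
        finished = 0
        while finished < len(readers):
            sample = queue.get()
            if sample is None:
                finished += 1
            else:
                yield sample
        for p in procs:
            p.join()

    return _decorated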


6g8kf2rb9#

You said above that the Pool-based multiprocessing works. How did you implement that? Can it not satisfy your requirement?
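
For reference, a Pool-based variant along the following lines (a sketch of what such an implementation might look like; the poster's actual Pool code is not shown in this thread) keeps all CUDA work out of the parent process:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
import multiprocessing

from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

def predict_one(i):
    # Each worker builds its own predictor; no CUDA state is shared
    # with or inherited from the parent.
    config = AnalysisConfig('lmk_model/model', 'lmk_model/params')
    config.switch_use_feed_fetch_ops(False)
    config.enable_use_gpu(100, 0)
    predictor = create_paddle_predictor(config)
    return i  # real code would run the predictor on the input here

if __name__ == "__main__":
    # 'spawn' gives each worker a clean process with no inherited
    # CUDA context, which sidesteps the initialization error.
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(2) as pool:
        print(pool.map(predict_one, range(10)))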


izj3ouym10#

@NHZlX From the sample code and the error message, can you tell why writing it this way raises a CUDA initialization error?


szqfcxe211#

@wang-kangkang Regarding the first approach ("define CUDAPlace and exe inside the reader"), what error does it report?


tag5nh1u12#

It's similar, but not exactly the same.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
import paddle
import paddle.fluid as fluid

def reader():
    # CUDAPlace/Executor are created inside the reader, i.e. in the
    # child process started by multiprocess_reader.
    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    [program, feed, fetch] = fluid.io.load_inference_model('lmk_model', exe, 'model', 'params')
    for i in range(10):
        yield i

if __name__ == "__main__":
    place = fluid.CUDAPlace(0)  # CUDA is also initialized in the parent
    exe = fluid.Executor(place)
    train_reader = paddle.reader.multiprocess_reader([reader])
    for data in train_reader():
        print(data)

The error is:

/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:782: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "test1.py", line 15, in <module>
    for data in train_reader():
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 614, in pipe_reader
    raise ValueError("multiprocess reader raises an exception")
ValueError: multiprocess reader raises an exception
Process Process-1:
Traceback (most recent call last):
  File "/ssd3/my/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/ssd3/my/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args,**self._kwargs)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 587, in _read_into_pipe
    six.reraise(*sys.exc_info())
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 578, in _read_into_pipe
    for sample in reader():
  File "test1.py", line 8, in reader
    [program, feed, fetch] = fluid.io.load_inference_model('lmk_model', exe, 'model', 'params')
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/io.py", line 1377, in load_inference_model
    load_persistables(executor, load_dirname, program, params_filename)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/io.py", line 917, in load_persistables
    filename=filename)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/io.py", line 742, in load_vars
    filename=filename)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/io.py", line 794, in load_vars
    executor.run(load_prog)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 783, in run
    six.reraise(*sys.exc_info())
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 778, in run
    use_program_cache=use_program_cache)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 831, in _run_impl
    use_program_cache=use_program_cache)
  File "/ssd3/my/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 905, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::GetCurrentDeviceId()
3   paddle::platform::CUDADeviceContext::CUDADeviceContext(paddle::platform::CUDAPlace)
4   std::_Function_handler<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > (), std::reference_wrapper<std::_Bind_simple<void paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()> > >::_M_invoke(std::_Any_data const&)
5   std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::__future_base::_Result_base::_Deleter>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > >::_M_invoke(std::_Any_data const&)
6   std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
7   std::__future_base::_Deferred_state<std::_Bind_simple<void paddle::platform::EmplaceDeviceContext<paddle::platform::CUDADeviceContext, paddle::platform::CUDAPlace>(std::map<paddle::platform::Place, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >, std::less<paddle::platform::Place>, std::allocator<std::pair<paddle::platform::Place const, std::shared_future<std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > > > > >*, paddle::platform::Place)::{lambda()#1} ()>, std::unique_ptr<paddle::platform::DeviceContext, std::default_delete<paddle::platform::DeviceContext> > >::_M_run_deferred()
8   paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
9   paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
10  paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
11  paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
12  paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

----------------------
Error Message Summary:
----------------------
Error: cudaGetDevice failed in paddle::platform::GetCurrentDeviceId, error code : 3, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038: initialization error at (/paddle/paddle/fluid/platform/gpu_info.cc:211)

quhf5bfb13#

Multiple processes here can't all use a single GPU card. Is each process using CUDAPlace(0)?
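
If the suggestion is to give each reader process its own card, one illustrative sketch (an assumption on my part, not a confirmed fix: make_reader and the device ids are hypothetical, and this can only help if the parent has not already initialized CUDA before the fork) is to restrict the visible devices per reader before any CUDA call in that process:

import os
import paddle

def make_reader(gpu_id):
    def reader():
        # Must happen before any CUDA initialization in this process;
        # a CUDA context inherited across fork cannot be re-targeted.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        from paddle.fluid.core import AnalysisConfig, create_paddle_predictor
        config = AnalysisConfig('lmk_model/model', 'lmk_model/params')
        config.enable_use_gpu(100, 0)  # device 0 of the restricted view
        predictor = create_paddle_predictor(config)
        for i in range(10):
            yield i
    return reader

# Hypothetical: one card per reader process.
train_reader = paddle.reader.multiprocess_reader([make_reader(0), make_reader(1)])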


lp0sw83n14#

Why is that? In my scenario I specifically need to run inference inside the readers of multiprocess_reader, so I have to get this approach working.
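
One direction that might unblock this (a sketch under the assumption that the failure comes from forking after CUDA has been initialized in the parent; it is not a fix confirmed in this thread, and spawn_multiprocess_reader is a hypothetical helper): replace multiprocess_reader with a spawn-based equivalent, so each reader process starts with a clean CUDA state and can create its predictor safely.

import multiprocessing

def _spawn_worker(reader_fn, queue):
    # Runs in a freshly spawned process: no CUDA context is inherited,
    # so creating a predictor inside reader_fn can work.
    for sample in reader_fn():
        queue.put(sample)
    queue.put(None)  # sentinel: this reader is exhausted

def spawn_multiprocess_reader(readers, queue_size=1000):
    # Same shape as the fork-based sketch above, but with 'spawn'.
    ctx = multiprocessing.get_context('spawn')

    def _decorated():
        queue = ctx.Queue(queue_size)
        procs = [ctx.Process(target=_spawn_worker, args=(r, queue))
                 for r in readers]
        for p in procs:
            p.start()
        finished = 0
        while finished < len(readers):
            sample = queue.get()
            if sample is None:
                finished += 1
            else:
                yield sample
        for p in procs:
            p.join()

    return _decorated

Note that spawn requires the reader functions to be picklable, i.e. defined at module top level.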


liwlm1x915#

One moment, let me investigate further.
