Paddle: out of memory raised when calling empty_cache() on a single GPU

xu3bshqb · posted 4 months ago · in: Other

Please ask your question

We have a machine with a single RTX 4090 running a semantic segmentation task. Before each run we clear the GPU memory by calling empty_cache(), but occasionally the run fails with an out-of-memory error. The full traceback is below; a sketch of the calling pattern it implies follows after it.
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 66, in build_component
    obj = self.build_component_impl(com_class, **params)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 80, in build_component_impl
    return component_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/backbones/hrnet.py", line 802, in HRNet_W48
    model = HRNet(
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/backbones/hrnet.py", line 97, in __init__
    self.conv_layer1_1 = layers.ConvBNReLU(
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/layers/layer_libs.py", line 44, in __init__
    self._conv = nn.Conv2D(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/conv.py", line 690, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/conv.py", line 156, in __init__
    self.weight = self.create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 781, in create_parameter
    return self._helper.create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/layer_helper_base.py", line 430, in create_parameter
    return self.main_program.global_block().create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 4381, in create_parameter
    initializer(param, self)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/initializer/initializer.py", line 40, in __call__
    return self.forward(param, block)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/initializer/normal.py", line 75, in forward
    out_var = _C_ops.gaussian(
OSError: (External) CUDA error(2), out of memory.
  [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 175, in use_cuda_device
    yield
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 191, in model_predict
    model, transforms = load_model_new(args_dict)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 51, in load_model_new
    model = builder.model
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/utils/utils.py", line 275, in __get__
    val = self.func(obj)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 153, in model
    return self.build_component(model_cfg)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 56, in build_component
    params[key] = self.build_component(val)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 72, in build_component
    raise RuntimeError(
RuntimeError: Tried to create a HRNet_W48 object, but the operation has failed. Please double check the arguments used to create the object.
The error message is:
(External) CUDA error(2), out of memory.
  [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 472, in __call__
    result = fn(*args, **kwargs)
  File "/home/soft/paddle_out/api/seg_utils/split_predict.py", line 372, in seg_split_predict
    img_result = use_paddleseg_model_to_predict_semantic_segmentation(paddleseg_args_dict,
  File "/home/soft/paddle_out/api/seg_utils/split_predict.py", line 200, in use_paddleseg_model_to_predict_semantic_segmentation
    return_dict = model_predict(paddleseg_args_dict, progress_bar=progress_bar, time_scale=time_scale)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 189, in model_predict
    with use_cuda_device(0):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 178, in use_cuda_device
    paddle.device.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/paddle/device/cuda/__init__.py", line 173, in empty_cache
    core.cuda_empty_cache()
OSError: (External) CUDA error(2), out of memory.
  [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/soft/paddle_out/api/predict/utils.py", line 357, in seg_predict_out
    seg_split_predict(image_path_list[0],
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 332, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 469, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 370, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 413, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7ff99741cc40 state=finished raised OSError>]
```
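The project code (model_predict.py) is not shown in the post; from the traceback, the pattern appears to be a `use_cuda_device` context manager that selects the GPU and calls `paddle.device.cuda.empty_cache()` on exit. The following is only a reconstruction of that assumed pattern, not the actual code:

```python
# Minimal sketch of the calling pattern implied by the traceback above.
# This is an assumption for illustration, not the original model_predict.py.
from contextlib import contextmanager

import paddle


@contextmanager
def use_cuda_device(device_id):
    paddle.device.set_device(f"gpu:{device_id}")
    try:
        yield
    finally:
        # empty_cache() itself issues CUDA calls, so if the GPU is already in an
        # out-of-memory state it can raise the same cudaErrorMemoryAllocation seen
        # in the traceback while the original build-time OOM is still being handled.
        paddle.device.cuda.empty_cache()
```

This also explains why two OOM errors appear in the traceback: the first one occurs while building HRNet_W48, and the second is raised by empty_cache() inside the context manager's cleanup.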


5fjcxozz #1

This service is deployed across 3 clusters, and the problem always occurs on the same node. Looking at the backend monitoring, overall GPU memory usage on that node peaked at only about 25%.
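Since that 25% figure is a node-level metric that does not distinguish processes, one way to narrow this down is to log what Paddle's own allocator holds right before the model is built (e.g. before load_model_new()), and compare it with the card's total memory. A diagnostic sketch using Paddle's public memory APIs (the helper name and where it is called are assumptions, not from the original code):

```python
# Diagnostic sketch (not from the original post): report what Paddle's allocator
# holds on a given GPU versus the card's total memory.
import paddle


def log_gpu_memory(device_id=0):
    gib = 1024 ** 3
    props = paddle.device.cuda.get_device_properties(device_id)
    allocated = paddle.device.cuda.memory_allocated(device_id)   # bytes held by live tensors
    reserved = paddle.device.cuda.memory_reserved(device_id)     # bytes cached by the allocator
    peak = paddle.device.cuda.max_memory_allocated(device_id)    # peak tensor usage so far
    print(
        f"gpu:{device_id} {props.name}: "
        f"allocated={allocated / gib:.2f} GiB, "
        f"reserved={reserved / gib:.2f} GiB, "
        f"peak allocated={peak / gib:.2f} GiB, "
        f"total={props.total_memory / gib:.2f} GiB"
    )
```

If allocated/reserved are small here but cudaMalloc still fails on that one node, the memory is most likely held by other processes on the same GPU (visible in nvidia-smi), which empty_cache() cannot release because it only frees this process's unused cached blocks.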
