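- Version and environment information:
1) PaddlePaddle version: 1.6.3
2) GPU: V100 32 GB, CUDA 10.0, cuDNN 7.6
4) System environment: Ubuntu 16.04, Python 3.6.9
- Training information
1) Single machine, multiple processes, multiple GPUs
I am using a text generation model. When evaluation runs during training, the processes on some of the cards abort with the following error: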
Exception in thread Thread-2:
Traceback (most recent call last):
File "/root/liwei85/installed-packages/Python3.6.9/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/root/liwei85/installed-packages/Python3.6.9/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/paddle/fluid/layers/io.py", line 474, in __provider_thread__
six.reraise(*sys.exc_info())
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/paddle/fluid/layers/io.py", line 455, in __provider_thread__
for tensors in func():
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 256, in wrapper
examples, batch_size, phase=phase, do_dec=do_dec, place=place):
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 217, in _prepare_batch_data
yield self._pad_batch_records(batch_records, do_dec, place)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 300, in _pad_batch_records
return self._prepare_infer_input(batch_records, place=place)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 350, in _prepare_infer_input
place, [range(trg_word.shape[0] + 1)] * 2)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 342, in to_lodtensor
data_tensor.set(data, place)
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)
Error Message Summary:
Error: cudaMemcpy failed in paddle::platform::GpuMemcpySync (0x7f2389053f00 -> 0x7f23cf286540, length: 60) error code : 2, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (/paddle/paddle/fluid/platform/gpu_info.cc:288)
Traceback (most recent call last):
File "./src/run.py", line 38, in <module>
run_graphsum(args)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/run_graphsum.py", line 418, in main
decode_path=args.decode_path + "/test_final_preds")
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/run_graphsum.py", line 618, in evaluate
preds.append(dec_out[i][0])
KeyError: 0
6 answers
k4ymrczo1#
Could you try:
result = numpy.array(dec_out)
and then look at the shape and contents of result to check whether result[i][0] is out of bounds?
wztqucjr2#
hof1towb3#
That may well be the cause. I suggest choosing a different execution mode depending on the data being read.
6qqygrtg4#
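What does "choosing a different execution mode depending on the data being read" mean? What conditions does multi-GPU prediction need to satisfy?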
tnkciper5#
You can declare two graphs: one that calls with_data_parallel and one that does not. Run the last batch with the non-parallel graph.
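A minimal sketch of that routing, assuming the final batch is the one that cannot be split across all cards. The names `mark_last`, `run_eval`, `run_parallel` and `run_single` are hypothetical and not part of Paddle. In Paddle 1.6 the two callables would presumably wrap `exe.run` on a `fluid.CompiledProgram(...).with_data_parallel(...)` program and on the plain test program, respectively.

```python
def mark_last(iterable):
    """Yield (item, is_last) pairs so the caller can special-case the final item."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for current in iterator:
        yield previous, False
        previous = current
    yield previous, True


def run_eval(batches, run_parallel, run_single):
    """Run every batch with run_parallel, except the last one, which uses run_single.

    run_parallel / run_single are hypothetical callables supplied by the caller;
    each takes one batch and returns that batch's decoded output.
    """
    outputs = []
    for batch, is_last in mark_last(batches):
        runner = run_single if is_last else run_parallel
        outputs.append(runner(batch))
    return outputs
```

With stand-in runners, `run_eval([[1, 2], [3, 4], [5]], p, s)` calls `p` on the first two batches and `s` on `[5]`. Whether only the last batch needs the non-parallel graph depends on how the reader forms batches, so it is worth checking the batch sizes it actually yields.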
dgtucam16#
I'll give it a try, thanks.