The Paddle version in the AI Studio background script-task runtime is 1.8. When I work on my assignment interactively on the platform, the code runs fine repeatedly with python work/basic_train.py (also on a GPU card), but once I move it to a background 4-card training job it always reports the error below and cannot continue.
Traceback (most recent call last):
  File "basic_train.py", line 423, in <module>
    main()
  File "basic_train.py", line 387, in main
    epoch)
  File "basic_train.py", line 288, in train
    train_loss_meter.update(loss.numpy()[0], n)
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)
Error Message Summary:
ExternalError: Cuda error(77), an illegal memory access was encountered.
[Advise: The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] at (/paddle/paddle/fluid/platform/gpu_info.cc:281)
/mnt
[INFO]: train job failed! train_ret: 1
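For context, the loss.numpy() call that fails is the device-to-host copy at the end of a dygraph training step. Below is a minimal sketch of what a Paddle 1.8 multi-card dygraph loop generally looks like; it is not the actual basic_train.py, and the toy Linear model and random data are stand-ins:

# Minimal sketch of a Paddle 1.8 multi-card dygraph loop (not basic_train.py).
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph import Linear, to_variable
from paddle.fluid.dygraph.parallel import DataParallel, Env, prepare_context

place = fluid.CUDAPlace(Env().dev_id)          # one process per card
with fluid.dygraph.guard(place):
    strategy = prepare_context()               # set up NCCL communication
    model = DataParallel(Linear(10, 1), strategy)
    opt = fluid.optimizer.SGD(0.01, parameter_list=model.parameters())

    for step in range(10):
        x = to_variable(np.random.rand(8, 10).astype('float32'))
        y = to_variable(np.random.rand(8, 1).astype('float32'))
        loss = fluid.layers.mean(fluid.layers.square(model(x) - y))

        loss = model.scale_loss(loss)          # scale loss by the card count
        loss.backward()
        model.apply_collective_grads()         # all-reduce gradients across cards
        opt.minimize(loss)
        model.clear_gradients()

        # equivalent of the line that raises Cuda error(77) in the log above
        loss_value = loss.numpy()[0]

A multi-card script task of this kind is typically started with something like python -m paddle.distributed.launch --selected_gpus=0,1,2,3 basic_train.py; the exact flags depend on the Paddle 1.8 launch module.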
3 answers
x8diyxa71#
Back when I used PyTorch, for a loss computed on CUDA I would usually move the loss to the CPU first and then convert it with loss.numpy(). I wonder whether that is what triggers the error here. But the same code has already run several times (and kept running) with GPU compute while working on the assignment interactively on AI Studio, and it never errored on this loss.numpy() call, so why does it fail as soon as it runs as a 4-card script task?
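To make the comparison concrete, here is a minimal sketch of the two habits; the loss tensors are stand-ins, not the ones in basic_train.py:

import numpy as np

# PyTorch habit mentioned above: copy the CUDA tensor to the host first.
import torch
pt_loss = torch.rand(1, device='cuda')
pt_value = pt_loss.detach().cpu().numpy()[0]

# Paddle 1.8 dygraph: VarBase.numpy() itself performs the device-to-host copy
# (the GpuMemcpySync frame in the C++ call stack above), so there is no
# separate .cpu() step before converting.
import paddle.fluid as fluid
with fluid.dygraph.guard(fluid.CUDAPlace(0)):
    pd_loss = fluid.dygraph.to_variable(np.array([0.5], dtype='float32'))
    pd_value = pd_loss.numpy()[0]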
p5fdfcr12#
I just tried again: in the background script task, even when I select a single P40 card, it reports the same error, whereas it does not error when running interactively on the platform.
aemubtdh3#
I pulled the loss.numpy() call out onto its own line and assigned it to a separate variable, and this time it errored at that line (log below; a sketch of the change follows the log):
2020-11-13 13:14:52,183-WARNING: Please NOTE: imperative mode can support return as list only. Change to return as list.
2020-11-13 13:14:52,186-WARNING: Please NOTE: imperative mode can support return as list only. Change to return as list.
W1113 13:14:52.188210 216 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W1113 13:14:52.192946 216 device_context.cc:260] device: 0, cuDNN Version: 7.6.
now in train
Traceback (most recent call last):
  File "basic_train.py", line 426, in <module>
  File "basic_train.py", line 390, in main
  File "basic_train.py", line 289, in train
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)
Error Message Summary:
ExternalError: Cuda error(77), an illegal memory access was encountered.
[Advise: The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] at (/paddle/paddle/fluid/platform/gpu_info.cc:281)
/mnt
[INFO]: train job failed! train_ret: 1
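The change described in this answer is presumably along these lines (a sketch only; loss, n, and train_loss_meter are taken from the first traceback, not defined here):

# Before: the device-to-host copy happens inside the meter update call
# (basic_train.py line 288 in the first traceback).
train_loss_meter.update(loss.numpy()[0], n)

# After: the copy is isolated on its own line; the Cuda error(77) is then
# reported at this assignment instead (line 289 in the second traceback).
loss_np = loss.numpy()
train_loss_meter.update(loss_np[0], n)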