Paddle: the train function runs fine on the AI Studio platform, but errors out in a 4-GPU script task

8ehkhllq  posted on 2021-11-29  in  Java
Follow (0) | Answers (3) | Views (550)

The Paddle version in the AI Studio background script task is 1.8. While doing assignments on the platform (also on a GPU card), my code ran repeatedly without problems via python work/basic_train.py, but once moved to a 4-GPU background script task it always fails with the error below and cannot continue.

Traceback (most recent call last):
  File "basic_train.py", line 423, in
    main()
  File "basic_train.py", line 387, in main
    epoch)
  File "basic_train.py", line 288, in train
    train_loss_meter.update(loss.numpy()[0], n)
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)

Error Message Summary:
ExternalError: Cuda error(77), an illegal memory access was encountered.
  [Advise: The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] at (/paddle/paddle/fluid/platform/gpu_info.cc:281)

/mnt
[INFO]: train job failed! train_ret: 1

x8diyxa7 1#

Back when I used PyTorch, for a loss computed on CUDA I would usually move it to the CPU first and then convert it with loss.numpy(). I wonder whether that is what triggers this error. But the same code has already run several times on GPU while I was doing assignments on the AI Studio platform (and it kept running), without this loss.numpy() call ever erroring, so why does it fail on 4 cards in a script task?
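One thing worth knowing here: GPU kernels launch asynchronously, so an illegal memory access from an earlier op often only surfaces at the first synchronizing call, which loss.numpy() is (it copies device memory to the host). The sketch below is a minimal, hedged illustration of that host-side read; loss_to_scalar and FakeLoss are hypothetical names for illustration, and CUDA_LAUNCH_BLOCKING is a standard CUDA runtime variable that makes kernel launches synchronous so the failing op raises at its real location:

```python
import os
import numpy as np

# Debugging aid: with synchronous launches, the op that actually performs
# the illegal access raises, instead of the error surfacing later at the
# first device->host copy such as loss.numpy().
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

def loss_to_scalar(loss):
    """Read a 1-element loss tensor as a Python float.

    Works for anything exposing .numpy() (Paddle 1.8 dygraph tensors do);
    falls back to plain numbers so it can be exercised on CPU.
    """
    if hasattr(loss, "numpy"):  # device -> host copy; this syncs the GPU
        arr = np.asarray(loss.numpy()).reshape(-1)
        return float(arr[0])
    return float(loss)

# CPU-only check with a stand-in "tensor"
class FakeLoss:
    def __init__(self, v):
        self._v = v
    def numpy(self):
        return np.array([self._v], dtype=np.float32)

print(loss_to_scalar(FakeLoss(0.5)))  # 0.5
```

So the loss.numpy() line in the traceback is likely just where the asynchronous error is reported, not where it originates.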

p5fdfcr1 2#

I just tried again: in a background script task, even running on a single P40 card I get the same error, yet it does not occur when working interactively on the platform.

aemubtdh 3#

I pulled the loss.numpy() call out into a separate assignment to another variable, and it errored again at that point:

2020-11-13 13:14:52,183-WARNING: Please NOTE: imperative mode can support return as list only. Change to return as list.
2020-11-13 13:14:52,186-WARNING: Please NOTE: imperative mode can support return as list only. Change to return as list.
W1113 13:14:52.188210   216 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W1113 13:14:52.192946   216 device_context.cc:260] device: 0, cuDNN Version: 7.6.
now in train
Traceback (most recent call last):
  File "basic_train.py", line 426, in
    main()
  File "basic_train.py", line 390, in main
    epoch)
  File "basic_train.py", line 289, in train
    lsfloat = loss.numpy()[0]
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):
0   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2   paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)

Error Message Summary:
ExternalError: Cuda error(77), an illegal memory access was encountered.
  [Advise: The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] at (/paddle/paddle/fluid/platform/gpu_info.cc:281)

/mnt
[INFO]: train job failed! train_ret: 1
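Since the error follows the loss computation wherever the sync point moves, a frequent culprit (an assumption here, since the training script itself is not shown) is an out-of-range label or index fed to a GPU kernel such as cross-entropy, which reads an invalid address and trips Cuda error(77). A cheap host-side guard before the loss op can confirm or rule this out; check_labels is a hypothetical helper name:

```python
import numpy as np

def check_labels(labels, num_classes):
    """Validate that every label lies in [0, num_classes).

    Out-of-range labels passed to a GPU cross-entropy kernel are a
    classic cause of 'an illegal memory access was encountered';
    catching them on the host gives a readable error instead.
    """
    labels = np.asarray(labels)
    bad = (labels < 0) | (labels >= num_classes)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} label value(s) out of range [0, {num_classes}): "
            f"min={labels.min()}, max={labels.max()}")

check_labels(np.array([0, 1, 2]), num_classes=3)  # passes silently
```

In the 4-card case each worker sees a different data shard, so a bad sample that the interactive single-card run never reached can surface only in the script task; running the guard over the full dataset once would settle it.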
