TensorFlow SelfAdjointEigV2 GPU op requires a large amount of temporary memory

2fjabf4q · posted 5 months ago

System information

  • OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from source or binary: source
  • TensorFlow version (use the command below): 2.9.0
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.5
  • GPU model and memory: GTX 1660 Ti

Describe the current behavior

A single call to tf.linalg.eigh() requires an amount of memory linear in the batch size, even though the matrices are processed one at a time. I believe the ScratchSpace is only released when the call ends, rather than after each matrix. I concluded this from the huge number of allocations the allocator reports during the OOM, and from reading the code.
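To see the scale of the problem: if each matrix in the batch holds its own scratch buffer until the op ends, total scratch grows linearly with the batch size. A back-of-the-envelope projection, using the 220,672-byte per-matrix allocation observed in the log below (these are values from this run, not a spec):

```python
# Rough projection of scratch memory if per-matrix buffers are kept
# until the end of the op (figures taken from the allocator log below).
SCRATCH_PER_MATRIX = 220_672   # bytes, observed in the BFC allocator log
BATCH_SIZE = 1024 * 256        # matrices in the repro tensor
GPU_LIMIT = 3_154_771_968      # allocator limit reported in the log

total_scratch = SCRATCH_PER_MATRIX * BATCH_SIZE
print(f"scratch needed: {total_scratch / 2**30:.1f} GiB")  # ~53.9 GiB
print(f"GPU limit:      {GPU_LIMIT / 2**30:.2f} GiB")      # ~2.94 GiB

# With buffer reuse (the proposed fix), a single allocation suffices:
print(f"with reuse:     {SCRATCH_PER_MATRIX / 2**10:.1f} KiB")
```

The batch would need roughly 18x the entire GPU memory limit in scratch space alone, which matches the allocator filling up and aborting partway through the batch.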

Contributing

  • Do you want to submit a PR? (yes/no): maybe
  • Briefly describe your candidate solution (if contributing):
  • **Best solution:** reuse the ScratchSpace for every matrix.
  • **Second-best solution:** free the ScratchSpace after each matrix in the batch.
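The difference between the current behavior and the two candidate solutions can be sketched with a toy allocator model (numbers are illustrative; 220,672 bytes matches the per-matrix chunk size in the log below):

```python
# Toy model of peak allocator usage for different scratch-space strategies.
# Not TensorFlow code -- just illustrates why per-matrix reuse/free helps.

def peak_usage(batch, scratch, strategy):
    """Return peak bytes in use for a given scratch-handling strategy."""
    in_use = peak = 0
    for _ in range(batch):
        in_use += scratch            # allocate scratch for this matrix
        peak = max(peak, in_use)
        if strategy == "free_each":  # second-best: free after each matrix
            in_use -= scratch
        # strategy == "keep_all" (current behavior): buffers stay
        # allocated until the op finishes.
    return peak

batch, scratch = 1024 * 256, 220_672
print(peak_usage(batch, scratch, "keep_all"))   # linear in batch size
print(peak_usage(batch, scratch, "free_each"))  # constant: one buffer
```

The best solution (reuse one buffer across matrices) has the same constant peak as "free_each", without paying an alloc/free round trip per matrix.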
Standalone code to reproduce the issue
```python
import tensorflow as tf

tensor = tf.random.uniform((1024 * 256, 4, 4))  # just make sure the batch size is big enough
tensor = tf.matmul(tensor, tensor, transpose_b=True)
t = tf.linalg.eigvalsh(tensor)
```
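Until the kernel is fixed, a workaround is to split the batch into chunks so that only a bounded number of scratch buffers is ever live. A minimal, framework-free sketch of the chunking pattern, using a closed-form 2x2 symmetric eigenvalue solver as a stand-in for the per-chunk call (with TensorFlow, the loop body would instead apply `tf.linalg.eigvalsh` to each slice):

```python
import math

def eigvals_sym2x2(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    mean = (a + d) / 2.0
    dev = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (mean - dev, mean + dev)

def batched_eigvals(matrices, chunk_size=4096):
    """Process the batch in chunks to bound transient scratch memory.

    With TensorFlow this would be tf.linalg.eigvalsh(batch[i:i + chunk_size]).
    """
    out = []
    for i in range(0, len(matrices), chunk_size):
        out.extend(eigvals_sym2x2(m) for m in matrices[i:i + chunk_size])
    return out

vals = batched_eigvals([[[2.0, 0.0], [0.0, 3.0]]] * 10, chunk_size=3)
print(vals[0])  # (2.0, 3.0)
```

This caps the live scratch allocations at one chunk's worth, at the cost of launching the op once per chunk.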

Other info / logs

See the allocator's summary below:

2022-01-03 16:18:32.407928: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 215.4KiB (rounded to 220672)requested by op SelfAdjointEigV2

2022-01-03 16:18:32.518897: I tensorflow/core/common_runtime/bfc_allocator.cc:1071]      Summary of in-use Chunks by size: 
2022-01-03 16:18:32.518906: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 256 totalling 256B
2022-01-03 16:18:32.518916: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 1280 totalling 1.2KiB
2022-01-03 16:18:32.518924: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 65536 totalling 64.0KiB
2022-01-03 16:18:32.518930: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 11841 Chunks of size 220672 totalling 2.43GiB
2022-01-03 16:18:32.518937: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 251648 totalling 245.8KiB
2022-01-03 16:18:32.518944: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 410880 totalling 401.2KiB
2022-01-03 16:18:32.518951: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 4194304 totalling 4.00MiB
2022-01-03 16:18:32.518958: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 268435456 totalling 512.00MiB
2022-01-03 16:18:32.518964: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 2.94GiB
2022-01-03 16:18:32.518970: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 3154771968 memory_limit_: 3154771968 available bytes: 0 curr_region_allocation_bytes_: 6309543936
2022-01-03 16:18:32.518982: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats: 
Limit:                      3154771968
InUse:                      3154771968
MaxInUse:                   3154771968
NumAllocs:                       11850
MaxAllocSize:                268435456
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-01-03 16:18:32.519244: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************************************************************************************
2022-01-03 16:18:32.519294: F ./tensorflow/core/util/gpu_solvers.h:533] Non-OK-status: context->allocate_temp(DataTypeToEnum<Scalar>::value, shape, &scratch_tensor_, alloc_attr) status: RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[55137] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

In particular, this line:

...r.cc:1074] 11841 Chunks of size 220672 totalling 2.43GiB

So I believe it crashed while trying to process matrix number 11842.
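The log figures are consistent with this reading:

```python
# Cross-check the allocator log: 11841 scratch chunks of 220672 bytes
# should account for the reported 2.43 GiB of in-use memory.
chunks, chunk_size = 11_841, 220_672
scratch_bytes = chunks * chunk_size
print(f"{scratch_bytes / 2**30:.2f} GiB")  # matches the "2.43GiB" log line

# The failed allocation was for shape [55137] floats, i.e. 220548 bytes,
# which the BFC allocator rounds up to the same 220672-byte chunk size:
assert 55_137 * 4 == 220_548
```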
Pointers:

  • The ScratchSpace request in the HeevdImpl call (line 635):

tensorflow/tensorflow/core/util/cuda_solvers.cc
Lines 620 to 643 in 322cba0:
```cpp
template <typename Scalar, typename BufSizeFnT, typename SolverFnT>
static inline Status HeevdImpl(BufSizeFnT bufsize, SolverFnT solver,
                               GpuSolver* cuda_solver, OpKernelContext* context,
                               cusolverDnHandle_t cusolver_dn_handle,
                               cusolverEigMode_t jobz, cublasFillMode_t uplo,
                               int n, Scalar* dev_A, int lda,
                               typename Eigen::NumTraits<Scalar>::Real* dev_W,
                               int* dev_lapack_info) {
  mutex_lock lock(handle_map_mutex);
  /* Get amount of workspace memory required. */
  int lwork;
  TF_RETURN_IF_CUSOLVER_ERROR(bufsize(cusolver_dn_handle, jobz, uplo, n,
                                      CUDAComplex(dev_A), lda,
                                      CUDAComplex(dev_W), &lwork));
  /* Allocate device memory for workspace. */
  auto dev_workspace =
      cuda_solver->GetScratchSpace<Scalar>(lwork, "", /* on_host */ false);
  /* Launch the solver kernel. */
  TF_RETURN_IF_CUSOLVER_ERROR(
      solver(cusolver_dn_handle, jobz, uplo, n, CUDAComplex(dev_A), lda,
             CUDAComplex(dev_W), CUDAComplex(dev_workspace.mutable_data()),
             lwork, dev_lapack_info));
  return Status::OK();
}
```

zfciruhq1#

@sachinprasadhs ,
I was able to execute the code in tf v2.7 and in the nightly build. Please find the gist here.

tv6aics12#

@tilakrayal @sachinprasadhs The Google Colab instance has more memory than I do. Changing it to

```python
tensor = tf.random.uniform((1024 * 256, 4, 4))  # just make sure the batch size is big enough
```

demonstrates the issue (note that the total number of floats is even lower than before: the key here is the batch size).
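For scale: the repro input itself is tiny, so the OOM cannot be explained by the tensor's element count. Only the number of matrices in the batch matters, because each one triggers its own scratch allocation:

```python
# The repro input is small compared with the scratch it triggers.
batch, n = 1024 * 256, 4
input_bytes = batch * n * n * 4          # float32 elements
print(f"{input_bytes / 2**20:.0f} MiB")  # 16 MiB, far below the ~3 GiB limit
```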

iih3973s3#

Able to reproduce the issue with TensorFlow 2.9.2.
