TensorFlow: running out of memory depends on data size

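I have a dummy model (a linear autoencoder). When trained on a dataset of 1,000 records it works fine, but on a dataset three orders of magnitude larger it runs out of GPU memory, even though the batch size is fixed and the computer has enough RAM to hold the data.
Am I doing something silly?
Note: it works fine on TF 2.5 but crashes on TF 2.6 through 2.9. It always works when training on the CPU.
The model is: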

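# Assumed import; the excerpt does not show where layers and models come from.
from tensorflow.keras import layers, models

def get_model(n_inputs: int) -> models.Model:
    # A single linear Dense layer that maps the input back onto itself.
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m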

I feed the data through the tf.data API:

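import numpy as np
import tensorflow as tf

def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    # The whole array is converted to a tensor here; this is the call that fails.
    dataset = tf.data.Dataset.from_tensor_slices(data)
    # The shuffle buffer spans the entire dataset.
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    # Autoencoder: the input is also the target.
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder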

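The full reproduction script is here. Running python benchmark.py works, but python benchmark.py --big does not.
I am using Python 3.9 on Fedora 36. The GPU is an Nvidia RTX 2070 with 8 GiB of memory. The driver version is 515.48.07 and the CUDA version is 11.7. nvidia-smi reports that most of the memory is free between runs, and the small version needs less than 800 MiB.
The full traceback is: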

2022-09-05 15:29:37.525261: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:29:54.002629: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-09-05 15:29:54.002987: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_******____________________________________________________________________________________________
Traceback (most recent call last):
  File "/home/david/[path]/benchmark.py", line 49, in <module>
    main(parser.parse_args().big)
  File "/home/david/[path]/benchmark.py", line 40, in main
    train_data_iterator = wrap_data(train_data)
  File "/home/david/[path]/benchmark.py", line 33, in wrap_data
    dataset = tf.data.Dataset.from_tensor_slices(data)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
    return TensorSliceDataset(tensors, name=name)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
    element = structure.normalize_element(element)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
    ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
    return func(*args, **kwargs)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
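
To put the numbers in the log above in perspective, here is a small arithmetic sketch. The byte count, batch size and prefetch depth come from the question; the record count of 1,000,000 is an assumption based on "three orders of magnitude larger" than 1,000 records, and the per-record size follows from it.

```python
# Figures quoted from the allocator warning and from wrap_data.
REQUESTED_BYTES = 16_384_000_000  # "Allocation of 16384000000 ..."
GPU_MEMORY_GIB = 8                # RTX 2070, as stated in the question
BATCH_SIZE = 16                   # shuffled.batch(16, ...)
PREFETCH_BATCHES = 5              # .prefetch(5)

# Assumption: the big dataset holds 1,000 * 10**3 = 1,000,000 records.
ASSUMED_RECORDS = 1_000_000

GIB = 2**30

# Size of the single tensor that _EagerConst tries to place on the GPU.
requested_gib = REQUESTED_BYTES / GIB

# What one record, one batch and the prefetch queue would occupy.
bytes_per_record = REQUESTED_BYTES // ASSUMED_RECORDS
bytes_per_batch = BATCH_SIZE * bytes_per_record
prefetched_bytes = PREFETCH_BATCHES * bytes_per_batch

print(f"whole dataset as one tensor: {requested_gib:.2f} GiB")
print(f"GPU memory:                  {GPU_MEMORY_GIB} GiB")
print(f"one record:                  {bytes_per_record} bytes")
print(f"one batch of {BATCH_SIZE}:             {bytes_per_batch / 2**10:.0f} KiB")
print(f"{PREFETCH_BATCHES} prefetched batches:        {prefetched_bytes / 2**20:.2f} MiB")
```

Under these assumptions the batches themselves are tiny; the failing request is the entire dataset being converted to one tensor inside from_tensor_slices, which is where the traceback ends.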

Specifying the GPU memory allocator as suggested does not help either:

2022-09-05 15:33:19.542433: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:33:25.973935: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:288] gpu_async_0 cuMemAllocAsync failed to allocate 16384000000 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
 Reported by CUDA: Free memory/Total memory: 1115357184/8369799168
2022-09-05 15:33:25.973961: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:293] Stats: Limit:                      6221922304
InUse:                        67126312
MaxInUse:                    201327628
NumAllocs:                          13
MaxAllocSize:                 67108864
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-09-05 15:33:25.973970: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2022-09-05 15:33:25.973974: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 5
2022-09-05 15:33:25.973976: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 2
2022-09-05 15:33:25.973979: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
2022-09-05 15:33:25.973982: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16384, 1
2022-09-05 15:33:25.973985: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 67108864, 1
Traceback (most recent call last):
  File "/home/david/[path]/benchmark.py", line 48, in <module>
    main(parser.parse_args().big)
  File "/home/david/[path]/benchmark.py", line 40, in main
    train_data_iterator = wrap_data(train_data)
  File "/home/david/[path]/benchmark.py", line 33, in wrap_data
    dataset = tf.data.Dataset.from_tensor_slices(data)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
    return TensorSliceDataset(tensors, name=name)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
    element = structure.normalize_element(element)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
    ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
    return func(*args, **kwargs)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
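
The allocator statistics in this second log can be checked with a little arithmetic. The sketch below copies the numbers from the log; the reading that the two largest histogram entries are the Dense kernel and bias of a float32 model with n_inputs = 4096 is an inference, not something the question states.

```python
# Values copied from the cuda_malloc_async log above.
REQUESTED_BYTES = 16_384_000_000
ALLOCATOR_LIMIT_BYTES = 6_221_922_304
# allocation_size_in_bytes -> number of allocations of that size
HISTOGRAM = {4: 5, 8: 2, 1028: 1, 16384: 1, 67108864: 1}

FLOAT32_BYTES = 4
# Assumption: the model was built with n_inputs = 4096 and float32 weights.
ASSUMED_N_INPUTS = 4096

# A Dense(n, n) layer holds an n x n kernel and a length-n bias.
kernel_bytes = ASSUMED_N_INPUTS * ASSUMED_N_INPUTS * FLOAT32_BYTES
bias_bytes = ASSUMED_N_INPUTS * FLOAT32_BYTES
kernel_matches = HISTOGRAM.get(kernel_bytes) == 1
bias_matches = HISTOGRAM.get(bias_bytes) == 1

# Number of float32 rows of that width that add up to the failed request.
implied_rows = REQUESTED_BYTES // (ASSUMED_N_INPUTS * FLOAT32_BYTES)

# How far the single request exceeds the allocator's limit.
over_limit_factor = REQUESTED_BYTES / ALLOCATOR_LIMIT_BYTES

print(f"kernel {kernel_bytes} bytes in histogram: {kernel_matches}")
print(f"bias {bias_bytes} bytes in histogram: {bias_matches}")
print(f"implied rows in the dataset: {implied_rows}")
print(f"request exceeds allocator limit: {REQUESTED_BYTES > ALLOCATOR_LIMIT_BYTES}")
```

If that reading is right, the model weights account for well under 100 MB, and the one failed request is more than twice the allocator's limit, so switching allocators would not be expected to change the outcome.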

Update: bug report
