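I have a dummy model (a linear autoencoder). It trains fine on a dataset of 1,000 records, but on a dataset three orders of magnitude larger it runs out of GPU memory, even though the batch size is fixed and the machine has enough RAM to hold the data.
Am I doing something silly?
Note: it runs fine on TF 2.5 but crashes on TF 2.6-2.9. Training on the CPU always works.
The model is: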
from tensorflow.keras import layers, models

def get_model(n_inputs: int) -> models.Model:
    # Single dense layer with linear activation: a linear autoencoder.
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m
I feed the data through the tf.data API:
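import numpy as np
import tensorflow as tf

def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    # The whole array is converted to a single tensor here (see the traceback).
    dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    # An autoencoder's target is its own input, so emit (x, x) pairs.
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder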
The full reproduction script is here. Running python benchmark.py works, but python benchmark.py --big does not.
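I am using Python 3.9 on Fedora 36. The GPU is an Nvidia RTX 2070 with 8 GiB of memory. Driver version: 515.48.07, CUDA version: 11.7. nvidia-smi reports that most of the memory is free between runs, and the small version needs less than 800 MiB.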
The full traceback is:
2022-09-05 15:29:37.525261: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:29:54.002629: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2022-09-05 15:29:54.002987: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_******____________________________________________________________________________________________
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 49, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
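The size in the log can be checked by hand, which helps show that the failing allocation is the entire dataset rather than a single batch. The sketch below only redoes the arithmetic; the record count of 1,000,000 is an assumption derived from "1000 records, three orders of magnitude larger" and is not stated in the log.

```python
# Figures copied from the log line and the hardware description above.
requested_bytes = 16_384_000_000  # "Allocation of 16384000000"
gpu_total_gib = 8                 # RTX 2070 with 8 GiB, as stated in the question

GIB = 2 ** 30
requested_gib = requested_bytes / GIB
print(f"requested: {requested_gib:.2f} GiB")
print(f"fits on the GPU: {requested_gib <= gpu_total_gib}")

# Assumption: the --big run holds 1,000,000 records (1000 records x 10**3).
assumed_records = 1_000_000
bytes_per_record = requested_bytes // assumed_records
print(f"bytes per record (under that assumption): {bytes_per_record}")
```

The request matches the 15.26GiB reported by the allocator, roughly twice the card's total memory, so a fixed batch size cannot help while the whole array is placed on the GPU by the _EagerConst op.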
Specifying the GPU memory allocator, as the log suggests, does not help either:
2022-09-05 15:33:19.542433: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 16384000000 exceeds 10% of free system memory.
2022-09-05 15:33:25.973935: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:288] gpu_async_0 cuMemAllocAsync failed to allocate 16384000000 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
Reported by CUDA: Free memory/Total memory: 1115357184/8369799168
2022-09-05 15:33:25.973961: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:293] Stats: Limit: 6221922304
InUse: 67126312
MaxInUse: 201327628
NumAllocs: 13
MaxAllocSize: 67108864
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-09-05 15:33:25.973970: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:56] Histogram of current allocation: (allocation_size_in_bytes, nb_allocation_of_that_sizes), ...;
2022-09-05 15:33:25.973974: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 4, 5
2022-09-05 15:33:25.973976: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 8, 2
2022-09-05 15:33:25.973979: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 1028, 1
2022-09-05 15:33:25.973982: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 16384, 1
2022-09-05 15:33:25.973985: E tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:59] 67108864, 1
Traceback (most recent call last):
File "/home/david/[path]/benchmark.py", line 48, in <module>
main(parser.parse_args().big)
File "/home/david/[path]/benchmark.py", line 40, in main
train_data_iterator = wrap_data(train_data)
File "/home/david/[path]/benchmark.py", line 33, in wrap_data
dataset = tf.data.Dataset.from_tensor_slices(data)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 809, in from_tensor_slices
return TensorSliceDataset(tensors, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4551, in __init__
element = structure.normalize_element(element)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/data/util/structure.py", line 125, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
return func(*args, **kwargs)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1640, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 48, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 267, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 279, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/david/.virtualenvs/ainet/lib64/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
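Both tracebacks fail inside tf.data.Dataset.from_tensor_slices, while copying the full array from the CPU to GPU:0 for the _EagerConst op. One possible workaround, which this thread does not confirm, is to pin that conversion to the CPU so the source tensor stays in host RAM and only the batches reach the GPU. The function name wrap_data_on_cpu and the small random array are hypothetical and exist only for illustration.

```python
import numpy as np
import tensorflow as tf


def wrap_data_on_cpu(data: np.ndarray, batch_size: int = 16) -> tf.data.Dataset:
    """Hypothetical variant of wrap_data that keeps the source tensor on the CPU."""
    # Pin the NumPy -> tensor conversion to the CPU so the full array is not
    # copied to the GPU as one constant.
    with tf.device("/CPU:0"):
        dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=len(data), reshuffle_each_iteration=True)
    batched = shuffled.batch(batch_size)
    # Autoencoder targets equal the inputs, so emit (x, x) pairs.
    return batched.map(lambda x: (x, x)).prefetch(tf.data.AUTOTUNE)


# Small stand-in array; the real script would pass its own training data.
small = np.random.rand(64, 8).astype(np.float32)
ds = wrap_data_on_cpu(small)
x, y = next(iter(ds))
print(x.shape, y.shape)
```

If host memory is also tight, a generator-based pipeline such as tf.data.Dataset.from_generator would avoid materialising the array as a single tensor at all, at the cost of slower input.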
Update: bug report
1 Answer
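Please check the build configuration of TensorFlow on your system, because TensorFlow 2.5 and 2.6-2.9 are compatible with CUDA 11.2 and cuDNN 8.1. (See this link and re-verify that you have met all the hardware and software requirements and path settings for installing TensorFlow with GPU support.)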
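To compare an installation against those versions, the CUDA and cuDNN versions a TensorFlow wheel was built with can be read from the package itself. This is a minimal sketch; the keys are read with .get because a CPU-only build may not include them, and the fallback strings are illustrative.

```python
import tensorflow as tf

# Build-time configuration of the installed TensorFlow package.
build = tf.sysconfig.get_build_info()

print("TensorFlow version:", tf.__version__)
print("Built against CUDA:", build.get("cuda_version", "not reported"))
print("Built against cuDNN:", build.get("cudnn_version", "not reported"))

# GPUs that TensorFlow can actually see at runtime.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)
```

Note that the CUDA version shown by nvidia-smi is the highest version the driver supports, not necessarily the toolkit version TensorFlow loads, so the two numbers can differ.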