How can I avoid this memory allocation error when training a Keras model?

kcugc4gi · asked 12 months ago

I have been following this guide, trying to learn how to create a POS tagger with Keras.
I'm using Python 3.9 and have installed TensorFlow 2.10, CUDA Toolkit 11.2, and cuDNN 8.2, since that is the last configuration natively supported on Windows 10.
I'm training on an NVIDIA GeForce RTX 2070 SUPER with 8 GB of VRAM, and my PC has 64 GB of RAM.
The data I use for training is the typical (token, POS tag) tuples, combined into a list of sentences:

[[("hello", "INTJ"), ("world", "NOUN"), ("!", "PUNCT")], [("oh", "INTJ"), ("hi", "INTJ")], ...]

The data is then split into test, validation, and training sets, vectorized with sklearn's DictVectorizer, and one-hot encoded as described in the guide I'm following.
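For context, the preprocessing looks roughly like the sketch below (simplified and only illustrative; the feature names such as "word", "is_first", and "is_last" are my own placeholders, and the actual feature dictionaries and train/validation/test split follow the guide):

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelBinarizer

# Toy input in the same (token, tag) sentence format as above.
sentences = [[("hello", "INTJ"), ("world", "NOUN"), ("!", "PUNCT")],
             [("oh", "INTJ"), ("hi", "INTJ")]]

features, tags = [], []
for sentence in sentences:
    for i, (token, tag) in enumerate(sentence):
        features.append({
            "word": token.lower(),
            "is_first": i == 0,
            "is_last": i == len(sentence) - 1,
        })
        tags.append(tag)

vectorizer = DictVectorizer(sparse=False)   # dense float matrix, one row per token
X = vectorizer.fit_transform(features)

binarizer = LabelBinarizer()                # one-hot matrix of POS tags
y = binarizer.fit_transform(tags)
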
I wrote the following function to construct the model:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

def construct_model(input_dim, hidden_neurons, output_dim):
    pos_model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
    ])
    pos_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return pos_model


I then load the data and fit the model with the following code:

from keras.wrappers.scikit_learn import KerasClassifier

if __name__ == "__main__":

    X_train = processed_data[0]
    y_train = processed_data[1]

    X_val = processed_data[2]
    y_val = processed_data[3]

    X_test = processed_data[4]
    y_test = processed_data[5]

    model_params = {
        'build_fn': construct_model,
        'input_dim': X_train.shape[1],
        'hidden_neurons': 512,
        'output_dim': y_train.shape[1],
        'epochs': 5,
        'batch_size': 256,
        'verbose': 1,
        'validation_data': (X_val, y_val),
        'shuffle': True
    }

    classifier = KerasClassifier(**model_params)
    pos_model = classifier.fit(X_train, y_train)


Whenever I try to fit the model, I get an error with a long traceback:

2023-12-25 09:44:36.255452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5973 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:0a:00.0, compute capability: 7.5
2023-12-25 09:57:57.132922: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 73.90GiB (rounded to 79346949120)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-12-25 09:57:57.134352: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] BFCAllocator dump for GPU_0_bfc
2023-12-25 09:57:57.134940: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (256):  Total Chunks: 12, Chunks in use: 12. 3.0KiB allocated for chunks. 3.0KiB in use in bin. 120B client-requested in use in bin.
2023-12-25 09:57:57.135083: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (512):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.135206: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1024):     Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135492: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2048):     Total Chunks: 2, Chunks in use: 2. 4.0KiB allocated for chunks. 4.0KiB in use in bin. 4.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135686: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4096):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136594: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8192):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136830: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16384):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137034: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (32768):    Total Chunks: 1, Chunks in use: 1. 34.0KiB allocated for chunks. 34.0KiB in use in bin. 34.0KiB client-requested in use in bin.
2023-12-25 09:57:57.137230: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (65536):    Total Chunks: 1, Chunks in use: 0. 67.8KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137483: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (131072):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137682: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (262144):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137801: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (524288):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138025: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1048576):  Total Chunks: 2, Chunks in use: 1. 2.90MiB allocated for chunks. 1.00MiB in use in bin. 1.00MiB client-requested in use in bin.
2023-12-25 09:57:57.138253: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2097152):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138490: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4194304):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138743: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8388608):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138971: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139212: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139371: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139585: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (134217728):    Total Chunks: 1, Chunks in use: 1. 205.92MiB allocated for chunks. 205.92MiB in use in bin. 205.92MiB client-requested in use in bin.
2023-12-25 09:57:57.139717: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (268435456):    Total Chunks: 2, Chunks in use: 0. 5.63GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.140005: I tensorflow/core/common_runtime/bfc_allocator.cc:1056] Bin for 73.90GiB was 256.00MiB, Chunk State: 
2023-12-25 09:57:57.141115: I tensorflow/core/common_runtime/bfc_allocator.cc:1062]   Size: 408.84MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 1.00MiB | Requested Size: 1.00MiB | in_use: 1 | bin_num: -1, next:   Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141309: I tensorflow/core/common_runtime/bfc_allocator.cc:1062]   Size: 5.23GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141420: I tensorflow/core/common_runtime/bfc_allocator.cc:1069] Next region of size 6263144448
2023-12-25 09:57:57.141763: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000000 of size 256 next 1
2023-12-25 09:57:57.141835: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000100 of size 1280 next 2
2023-12-25 09:57:57.141920: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000600 of size 256 next 3
2023-12-25 09:57:57.141987: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000700 of size 256 next 4
2023-12-25 09:57:57.142074: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000800 of size 256 next 6
2023-12-25 09:57:57.142135: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000900 of size 2048 next 7
2023-12-25 09:57:57.142194: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001100 of size 256 next 5
2023-12-25 09:57:57.142252: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001200 of size 256 next 8
2023-12-25 09:57:57.142581: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001300 of size 2048 next 12
2023-12-25 09:57:57.142678: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001b00 of size 256 next 13
2023-12-25 09:57:57.142789: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001c00 of size 256 next 11
2023-12-25 09:57:57.142910: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001d00 of size 256 next 17
2023-12-25 09:57:57.143002: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001e00 of size 256 next 18
2023-12-25 09:57:57.143142: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001f00 of size 256 next 14
2023-12-25 09:57:57.143262: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d002000 of size 256 next 19
2023-12-25 09:57:57.143450: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d002100 of size 69376 next 20
2023-12-25 09:57:57.143583: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d013000 of size 34816 next 21
2023-12-25 09:57:57.143682: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d01b800 of size 1990144 next 15
2023-12-25 09:57:57.143840: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d201600 of size 1048576 next 16
2023-12-25 09:57:57.143986: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d301600 of size 428696832 next 9
2023-12-25 09:57:57.144123: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 1326bd7b00 of size 215922688 next 10
2023-12-25 09:57:57.144276: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 13339c3300 of size 5615373568 next 18446744073709551615
2023-12-25 09:57:57.144457: I tensorflow/core/common_runtime/bfc_allocator.cc:1094]      Summary of in-use Chunks by size: 
2023-12-25 09:57:57.144711: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 12 Chunks of size 256 totalling 3.0KiB
2023-12-25 09:57:57.144796: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1280 totalling 1.2KiB
2023-12-25 09:57:57.144876: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 2 Chunks of size 2048 totalling 4.0KiB
2023-12-25 09:57:57.144950: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 34816 totalling 34.0KiB
2023-12-25 09:57:57.145023: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1048576 totalling 1.00MiB
2023-12-25 09:57:57.145094: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 215922688 totalling 205.92MiB
2023-12-25 09:57:57.145168: I tensorflow/core/common_runtime/bfc_allocator.cc:1101] Sum Total of in-use chunks: 206.96MiB
2023-12-25 09:57:57.145231: I tensorflow/core/common_runtime/bfc_allocator.cc:1103] total_region_allocated_bytes_: 6263144448 memory_limit_: 6263144448 available bytes: 0 curr_region_allocation_bytes_: 12526288896
2023-12-25 09:57:57.145763: I tensorflow/core/common_runtime/bfc_allocator.cc:1109] Stats: 
Limit:                      6263144448
InUse:                       217014528
MaxInUse:                    647770624
NumAllocs:                          33
MaxAllocSize:                215922688
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-12-25 09:57:57.146178: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_____*****_________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\admd9\PycharmProjects\codalab-sigtyp2024\train_pos_tagger.py", line 104, in <module>
    pos_model = classifier.fit(X_train, y_train)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 248, in fit
    return super().fit(x, y, **kwargs)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 175, in fit
    history = self.model.fit(x, y, **fit_args)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.


Clearly the error has to do with how memory is being allocated, possibly because the validation set is passed to the GPU all at once.
I've looked for solutions online, and most of them suggest reducing the batch size. I tried reducing the batch size all the way down to 2, but that didn't help, presumably because the whole array is converted to a single GPU tensor (the _EagerConst op in the traceback) before any batching takes place.
I also tried inserting the following block, which I found here, after loading the data, to allow TensorFlow to grow its GPU memory allocation on demand:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)


That didn't solve the problem either.
Finally, someone in this thread suggested calling gc.collect() in a loop so that RAM is freed after every iteration, but I'm not using a loop like the user in that thread, so I don't see how to make that work.
How can I solve this and train my model?

gk7wooem · answer #1

It looks like your entire dataset is being loaded into GPU memory. I suggest implementing a generator to load your data; it avoids passing the whole dataset to the GPU at once.
Have a look at this thread for some examples:
Failed copying input tensor from CPU to GPU in order to run GatherVe: Dst tensor is not initialized. [Op:GatherV2]
Edit:
I've posted the code from that example below.

from tensorflow.keras.utils import Sequence
import numpy as np   

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

train_gen = DataGenerator(X_train, y_train, 32)
test_gen = DataGenerator(X_test, y_test, 32)

history = model.fit(train_gen,
                    epochs=6,
                    validation_data=test_gen)

Edit 2:
Code to make it work with plain Keras (SciKeras removed):

if __name__ == "__main__":

    X_train = processed_data[0]
    y_train = processed_data[1]

    X_val = processed_data[2]
    y_val = processed_data[3]

    X_test = processed_data[4]
    y_test = processed_data[5]

    model_parameters = {
        'input_dim': X_train.shape[1],
        'hidden_neurons': 512,
        'output_dim': y_train.shape[1],
        'epochs': 5,
        'batch_size': 256,
        'verbose': 1,
        'shuffle': True
    }

    train_gen = DataGenerator(
        X_train,
        y_train, 
        model_parameters['batch_size']
    )
    test_gen = DataGenerator(
        X_test,
        y_test, 
        model_parameters['batch_size']
    )

    model = construct_model(
        input_dim=model_parameters['input_dim'],
        hidden_neurons=model_parameters['hidden_neurons'],
        output_dim=model_parameters['output_dim']
        )

    history = model.fit(
        train_gen,
        epochs=model_parameters['epochs'],
        verbose=model_parameters['verbose'],
        shuffle=model_parameters['shuffle'],
        validation_data=test_gen
    )
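One caveat that is not part of the original example: when fit() is given a keras.utils.Sequence (or any generator), Keras ignores the shuffle argument, so if you want the sample order to change between epochs you can shuffle inside the generator itself. A minimal sketch of such a variant (the class name and details are my own, assuming x_set and y_set are NumPy arrays):

import numpy as np
from tensorflow.keras.utils import Sequence

class ShufflingDataGenerator(Sequence):
    """Like DataGenerator above, but reshuffles the sample order after every epoch."""
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.indices = np.arange(len(self.x))

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_idx = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[batch_idx], self.y[batch_idx]

    def on_epoch_end(self):
        # Called by Keras at the end of each epoch; permute the index order.
        np.random.shuffle(self.indices)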
