I'm training my model on a remote server using the `GridSearchCV` API to tune some hyperparameters, namely `epochs`, `l_rate`, `batch_size`, and `patience`. Unfortunately, while tuning them, after a few iterations I get the following error:
```
Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0
to /job:localhost/replica:0/task:0/device:GPU:0
in order to run _EagerConst: Dst tensor is not initialized.
```
It seems the GPU memory of the server is not enough: this error is raised when the GPU memory is full, and the usual suggestion is to reduce the dataset size and/or the `batch_size`.
First, I reduced the `batch_size` to 2, 4, 8, and 16, but the error persists, since I now get:
```
W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran
out of memory trying to allocate 1.17GiB (rounded to 1258291200) requested
by op _EagerConst
If the cause is memory fragmentation maybe the environment variable
'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation
```
Then, following the suggestion, I set `os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'`, but the problem remains.
On the other hand, the problem does seem to go away if I reduce the dataset size, but *I have to use the whole dataset* (I can't afford to waste data).
My main ideas for solving this are:
1. Prevent a new model (and the related loss and training objects) from being re-created for every hyperparameter combination. This would be the optimal solution, since it would always reuse the same model (obviously making sure it is "reset" for each new combination), together with its loss and training state. It is probably also the most complex one, because I don't know whether the libraries I chose allow it.
2. Verify that the problem is caused by the data rather than the model (i.e., make sure the same data is not re-allocated for every hyperparameter combination while the old copies stay in memory). This could also be a cause, and I think the fix would be simpler than (or similar to) the previous one, but I consider it a less likely culprit. In any case, it is worth checking that this does not happen.
3. Reset the memory at every hyperparameter combination by invoking the garbage collector (I don't know whether that also works for the GPU). This is the simplest solution and probably the first thing I will try, but it won't necessarily work: if the libraries keep references to objects in memory (even when they are no longer used), the garbage collector cannot reclaim them.
Moreover, with the TensorFlow backend the current model is not destroyed, so I would also need to clear the session.
If you have any other ideas, feel free to share them. These are the functions involved:
```python
def grid_search_vae(x_train, latent_dimension):
    param_grid = {
        'epochs': [2500],
        'l_rate': [10 ** -4, 10 ** -5, 10 ** -6, 10 ** -7],
        'batch_size': [32, 64],  # [2, 4, 8, 16] won't fix the issue
        'patience': [30]
    }
    ssim_scorer = make_scorer(my_ssim, greater_is_better=True)
    grid = GridSearchCV(
        VAEWrapper(encoder=Encoder(latent_dimension), decoder=Decoder()),
        param_grid, scoring=ssim_scorer, cv=5, refit=False
    )
    grid.fit(x_train, x_train)
    return grid


def refit(fitted_grid, x_train, y_train, latent_dimension):
    best_epochs = fitted_grid.best_params_["epochs"]
    best_l_rate = fitted_grid.best_params_["l_rate"]
    best_batch_size = fitted_grid.best_params_["batch_size"]
    best_patience = fitted_grid.best_params_["patience"]

    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2)
    encoder = Encoder(latent_dimension)
    decoder = Decoder()
    vae = VAE(encoder, decoder, best_epochs, best_l_rate, best_batch_size)
    vae.compile(Adam(best_l_rate))
    early_stopping = EarlyStopping("val_loss", patience=best_patience)
    history = vae.fit(x_train, x_train, best_batch_size, best_epochs,
                      validation_data=(x_val, x_val), callbacks=[early_stopping])
    return history, vae
```
And here is the `main` code:
```python
if __name__ == '__main__':
    x_train, x_test, y_train, y_test = load_data("data", "labels")

    # Reducing data set size will fix the issue
    # new_size = 200
    # x_train, y_train = reduce_size(x_train, y_train, new_size)
    # x_test, y_test = reduce_size(x_test, y_test, new_size)

    latent_dimension = 25
    grid = grid_search_vae(x_train, latent_dimension)
    history, vae = refit(grid, x_train, y_train, latent_dimension)
```
Can you help me?
In case you need this information, these are the GPUs:
```
2023-09-18 11:21:25.628286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7347 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1
2023-09-18 11:21:25.629120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 7371 MB memory: -> device: 1, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1
2023-09-18 11:21:31.911969: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
```
I'm using TensorFlow as the Keras backend, i.e.:
```python
from keras import backend as K
K.backend()  # 'tensorflow'
```
I also tried adding:
```python
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```
at the top of the `main` code (as the very first instructions), but it didn't help.
If you need the code of the models, here it is:
```python
import numpy as np
import tensorflow as tf
from keras.initializers import he_uniform
from keras.layers import Conv2DTranspose, BatchNormalization, Reshape, Dense, Conv2D, Flatten
from keras.optimizers.legacy import Adam
from keras.src.callbacks import EarlyStopping
from skimage.metrics import structural_similarity as ssim
from sklearn.base import BaseEstimator
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV
from tensorflow import keras


class VAEWrapper:
    def __init__(self, **kwargs):
        self.vae = VAE(**kwargs)
        self.vae.compile(Adam())

    def fit(self, x, y, **kwargs):
        self.vae.fit(x, y, **kwargs)

    def get_config(self):
        return self.vae.get_config()

    def get_params(self, deep):
        return self.vae.get_params(deep)

    def set_params(self, **params):
        return self.vae.set_params(**params)


class VAE(keras.Model, BaseEstimator):
    def __init__(self, encoder, decoder, epochs=None, l_rate=None, batch_size=None, patience=None, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.epochs = epochs          # For grid search
        self.l_rate = l_rate          # For grid search
        self.batch_size = batch_size  # For grid search
        self.patience = patience      # For grid search
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(name="reconstruction_loss")
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    def call(self, inputs, training=None, mask=None):
        _, _, z = self.encoder(inputs)
        outputs = self.decoder(z)
        return outputs

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        data, labels = data
        with tf.GradientTape() as tape:
            # Forward pass
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)

            # Compute losses
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(
                    keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                )
            )
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss

        # Compute gradient
        grads = tape.gradient(total_loss, self.trainable_weights)

        # Update weights
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        # Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    def test_step(self, data):
        data, labels = data

        # Forward pass
        z_mean, z_log_var, z = self.encoder(data)
        reconstruction = self.decoder(z)

        # Compute losses
        reconstruction_loss = tf.reduce_mean(
            tf.reduce_sum(
                keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
            )
        )
        kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
        kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
        total_loss = reconstruction_loss + kl_loss

        # Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }


@keras.saving.register_keras_serializable()
class Encoder(keras.layers.Layer):
    def __init__(self, latent_dimension):
        super(Encoder, self).__init__()
        self.latent_dim = latent_dimension
        seed = 42
        self.conv1 = Conv2D(filters=64, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn1 = BatchNormalization()
        self.conv2 = Conv2D(filters=128, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn2 = BatchNormalization()
        self.conv3 = Conv2D(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
                            kernel_initializer=he_uniform(seed))
        self.bn3 = BatchNormalization()
        self.flatten = Flatten()
        self.dense = Dense(units=100, activation="relu")
        self.z_mean = Dense(latent_dimension, name="z_mean")
        self.z_log_var = Dense(latent_dimension, name="z_log_var")
        self.sampling = sample  # sampling helper, not shown in this snippet

    def call(self, inputs, training=None, mask=None):
        x = self.conv1(inputs)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.conv3(x)
        x = self.bn3(x)
        x = self.flatten(x)
        x = self.dense(x)
        z_mean = self.z_mean(x)
        z_log_var = self.z_log_var(x)
        z = self.sampling(z_mean, z_log_var)
        return z_mean, z_log_var, z


@keras.saving.register_keras_serializable()
class Decoder(keras.layers.Layer):
    def __init__(self):
        super(Decoder, self).__init__()
        self.dense1 = Dense(units=4096, activation="relu")
        self.bn1 = BatchNormalization()
        self.dense2 = Dense(units=1024, activation="relu")
        self.bn2 = BatchNormalization()
        self.dense3 = Dense(units=4096, activation="relu")
        self.bn3 = BatchNormalization()
        seed = 42
        self.reshape = Reshape((4, 4, 256))
        self.deconv1 = Conv2DTranspose(filters=256, kernel_size=3, activation="relu", strides=2, padding="same",
                                       kernel_initializer=he_uniform(seed))
        self.bn4 = BatchNormalization()
        self.deconv2 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=1, padding="same",
                                       kernel_initializer=he_uniform(seed))
        self.bn5 = BatchNormalization()
        self.deconv3 = Conv2DTranspose(filters=128, kernel_size=3, activation="relu", strides=2, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn6 = BatchNormalization()
        self.deconv4 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=1, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn7 = BatchNormalization()
        self.deconv5 = Conv2DTranspose(filters=64, kernel_size=3, activation="relu", strides=2, padding="valid",
                                       kernel_initializer=he_uniform(seed))
        self.bn8 = BatchNormalization()
        self.deconv6 = Conv2DTranspose(filters=1, kernel_size=2, activation="sigmoid", padding="valid",
                                       kernel_initializer=he_uniform(seed))

    def call(self, inputs, training=None, mask=None):
        x = self.dense1(inputs)
        x = self.bn1(x)
        x = self.dense2(x)
        x = self.bn2(x)
        x = self.dense3(x)
        x = self.bn3(x)
        x = self.reshape(x)
        x = self.deconv1(x)
        x = self.bn4(x)
        x = self.deconv2(x)
        x = self.bn5(x)
        x = self.deconv3(x)
        x = self.bn6(x)
        x = self.deconv4(x)
        x = self.bn7(x)
        x = self.deconv5(x)
        x = self.bn8(x)
        decoder_outputs = self.deconv6(x)
        return decoder_outputs
```
1 Answer
To fix your memory problem, try the following:
**Clear the GPU memory**: TensorFlow can hold on to GPU memory tightly. After each iteration, clear it like this:
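A minimal sketch of that cleanup step, assuming the standard `keras.backend.clear_session()` plus Python's built-in `gc` module (the helper name is just for illustration):

```python
import gc
from keras import backend as K

def clear_gpu_memory():
    # Drop Keras/TensorFlow global state so old models can be freed
    K.clear_session()
    # Reclaim any Python objects that are no longer referenced
    gc.collect()
```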
This also helps because, in a grid search, every hyperparameter combination may create a new instance of the model.
**Use mixed precision**: this can save memory and speed things up. Set a policy at the top of your code:
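For example, with the `tf.keras.mixed_precision` API:

```python
from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32
mixed_precision.set_global_policy('mixed_float16')
```

Note that with a custom `train_step` like yours, you may also need to wrap the optimizer in `mixed_precision.LossScaleOptimizer` to avoid float16 underflow in the gradients.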
In the model's last activation layer, use `dtype='float32'` to keep things numerically stable. For example:
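Applied to the final layer of your `Decoder`, that could look like:

```python
# Keep the final sigmoid output in float32 for numerical stability
# even when the rest of the model runs under mixed_float16
self.deconv6 = Conv2DTranspose(filters=1, kernel_size=2, activation="sigmoid",
                               padding="valid", dtype='float32',
                               kernel_initializer=he_uniform(seed))
```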
You can also try pipelining the dataset; this makes data loading more efficient.
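A sketch with `tf.data`, assuming the `(input, target)` pairs your `train_step` expects:

```python
import tensorflow as tf

def make_dataset(x, batch_size):
    # Stream batches to the GPU on demand instead of materializing
    # one giant device tensor up front
    return (tf.data.Dataset.from_tensor_slices((x, x))
            .shuffle(buffer_size=1024)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))

# e.g. vae.fit(make_dataset(x_train, best_batch_size), epochs=best_epochs)
```

With a pipeline like this, only a few batches at a time need to be resident on the GPU.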