Keras unknown error/crash: TensorFlow LSTM with GPU (no output after the first epoch starts)

Posted by jhdbpxl9 on 2022-11-13 in Other

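I am trying to train a model that uses an LSTM layer. I am running on a GPU, and all the required libraries load successfully.

When I build the model like this: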

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()

model.add(layers.LSTM(256, activation="relu", return_sequences=False))  # note the activation function
model.add(layers.Dropout(0.2))

model.add(layers.Dense(256, activation="relu"))
model.add(layers.Dropout(0.2))

model.add(layers.Dense(1))
model.add(layers.Activation(activation="sigmoid"))

model.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer="adam",
    metrics=["accuracy"]
)

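This works, but it uses activation="relu" on the LSTM layer, so it does not use the cuDNN implementation (CuDNNLSTM). If I remember correctly, that implementation is selected automatically when the activation function is tanh (the default).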
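For reference, the Keras documentation lists several constructor arguments that must keep their default values for the LSTM layer to use the fused cuDNN kernel. The sketch below checks a layer configuration against those values. The helper name `cudnn_blockers` and the plain-dict input are assumptions made for illustration, and missing keys are assumed to fall back to the defaults. It does not cover the remaining documented conditions (right-padded masks and eager execution in the outermost context).

```python
# Argument values that the Keras docs list as required for the cuDNN kernel.
CUDNN_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(config):
    """Return a description of each setting that rules out the cuDNN kernel.

    `config` is a plain dict of LSTM constructor arguments (hypothetical
    helper input); keys that are absent are assumed to keep their defaults.
    """
    blockers = []
    for key, required in CUDNN_REQUIREMENTS.items():
        actual = config.get(key, required)
        if actual != required:
            blockers.append(f"{key}={actual!r} (needs {required!r})")
    return blockers


# The first model in the question: relu activation on the LSTM layer.
print(cudnn_blockers({"units": 256, "activation": "relu"}))
# The second model: defaults only, so nothing blocks the cuDNN path.
print(cudnn_blockers({"units": 256}))
```

With a real layer, the dict returned by `layers.LSTM(...).get_config()` should be usable as input, assuming it serializes the built-in activations by name.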
Because the non-cuDNN path is painfully slow, I want to run the faster cuDNN-backed LSTM. My code:

model = keras.Sequential()

model.add(layers.LSTM(256, return_sequences=False))
model.add(layers.Dropout(0.2))

model.add(layers.Dense(256, activation="relu"))
model.add(layers.Dropout(0.2))

model.add(layers.Dense(1))
model.add(layers.Activation(activation="sigmoid"))

model.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer="adam",
    metrics=["accuracy"]
)

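It is basically the same, except that no activation function is passed, so tanh is used. But now the model does not train, and the end of the output looks like this: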

2021-04-19 22:41:46.046218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 22:41:46.046426: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-19 22:41:46.046642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-19 22:41:46.046942: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-19 22:41:46.047124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-19 22:41:46.047312: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-19 22:41:46.047489: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-19 22:41:46.047663: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-19 22:41:46.047936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-19 22:41:46.665456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-19 22:41:46.665712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-04-19 22:41:46.665876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-04-19 22:41:46.666186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2982 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-04-19 22:41:46.667505: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-19 22:42:07.374456: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/50
2021-04-19 22:42:08.922891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-19 22:42:09.272264: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-19 22:42:09.302667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll

Process finished with exit code -1073740791 (0xC0000409)

It just starts the first epoch, freezes for about a minute, and then exits with this strange exit code.
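The two numbers in the "Process finished" line are the same 32-bit value, once as a signed integer and once in hex. On Windows, 0xC0000409 is the NTSTATUS code STATUS_STACK_BUFFER_OVERRUN, which means the process was terminated from native code rather than by a Python exception, so no traceback is printed. A minimal sketch of the conversion follows; the helper name and the two-entry lookup table are illustrative only, and the code by itself does not identify which library failed.

```python
# A couple of well-known NTSTATUS values; this is not a complete table.
KNOWN_NTSTATUS = {
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(code):
    """Convert a signed 32-bit exit code to hex and look up a known name."""
    unsigned = code & 0xFFFFFFFF  # reinterpret as an unsigned 32-bit value
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X}", name


print(describe_exit_code(-1073740791))
```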

  • Input data shape: tf.Tensor([50985 29 7], shape=(3,), dtype=int32)
  • My GPU: Nvidia GTX 1050 Ti
  • CUDA: v11.3
  • OS: Windows 10
  • IDE: PyCharm

Finding a solution is difficult because no error is printed. Am I doing something wrong? Has anyone run into a similar problem? What might help?

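Edit: I have tried:

  • Running this model with fewer units (2 instead of 256) and a smaller batch_size
  • Downgrading TensorFlow to 2.4.0 with Python 3.7.1, CUDA to 11.0, and cuDNN to 8.0.1 (according to the list on the TensorFlow website, this should be a valid combination)
  • Restarting my PC :)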

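Answer #1 (nbysray5)

I found a solution.
When I downgraded TensorFlow to 2.1.0, CUDA to 10.1, and cuDNN to 7.6.5 (at the time, the fourth combination in this list on the TensorFlow website), it worked.
I do not know why it fails with the latest versions, or with the supposedly valid combination for TensorFlow 2.4.0.
It now runs fine, so my problem is solved. Still, I would like to know why the cuDNN-backed LSTM did not work for me on the newer versions, because I could not find this problem reported anywhere.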
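The version combinations mentioned in this thread can be written down as a small lookup for a quick sanity check. This is only a sketch: the table holds just the two combinations named above, not the full list from the TensorFlow website, and the function name `matches_listed_combo` is hypothetical. Note that the asker reported the 2.4.0 combination still crashing, so a match here does not guarantee a working setup.

```python
# Combinations named in this thread (major.minor only); not a complete table.
LISTED_COMBOS = {
    "2.4.0": {"cuda": "11.0", "cudnn": "8.0"},
    "2.1.0": {"cuda": "10.1", "cudnn": "7.6"},
}


def major_minor(version):
    """Reduce a version string such as '7.6.5' to '7.6'."""
    return ".".join(version.split(".")[:2])


def matches_listed_combo(tf_version, cuda_version, cudnn_version):
    """Return True/False for a listed TensorFlow version, or None if unlisted."""
    expected = LISTED_COMBOS.get(tf_version)
    if expected is None:
        return None
    return (
        major_minor(cuda_version) == expected["cuda"]
        and major_minor(cudnn_version) == expected["cudnn"]
    )


print(matches_listed_combo("2.1.0", "10.1", "7.6.5"))  # the working setup
print(matches_listed_combo("2.4.0", "11.3", "8.0.1"))  # CUDA 11.3 is not the listed 11.0
```

In newer TensorFlow releases, `tf.sysconfig.get_build_info()` should report the CUDA and cuDNN versions the installed wheel was built against, which may be a more direct source for these inputs than the system-wide install.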


Answer #2 (mnowg1ta)

I replaced

y1 = LSTM(64)(input)

with

y1 = RNN(tf.keras.layers.LSTMCell(64))(input)
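A fuller version of this substitution might look like the sketch below. The `build_model` function, its default shapes (taken from the 29 x 7 input in the question), and the final Dense layer are assumptions added to make the snippet self-contained. According to the Keras documentation, wrapping an LSTMCell in the generic RNN layer does not use the cuDNN kernel, so this likely avoids the crash by sidestepping cuDNN and will probably train more slowly than a working cuDNN LSTM.

```python
def build_model(timesteps=29, features=7, units=64):
    """Build a small binary classifier using RNN(LSTMCell) instead of LSTM.

    TensorFlow is imported inside the function so that the module can be
    loaded even where TensorFlow is not installed.
    """
    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(timesteps, features))
    # Generic RNN wrapper around a single LSTM cell (non-cuDNN code path).
    x = layers.RNN(layers.LSTMCell(units))(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        loss="binary_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_model().summary()` prints the layer shapes, and the returned model can be passed to `fit` in the same way as the Sequential models in the question.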
