System information
- OS platform and distribution (e.g., Linux Ubuntu 16.04): tested on Ubuntu 18.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use the command below): 2.7.0
- Python version: 3.7
- CUDA/cuDNN version: 11.0
- GPU model and memory: Tesla P100
Describe the current behavior
When I train a model with XLA compilation on multiple GPUs, the following error occurs.
2021-11-20 12:57:08.333476: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-20 12:57:09.302772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15397 MB memory: -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
2021-11-20 12:57:09.303502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15397 MB memory: -> device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b1:00.0, compute capability: 6.0
2021-11-20 12:57:10.556310: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:241 : INVALID_ARGUMENT: Trying to access resource _AnonymousVar3 (defined @ /home/sdb/wda/tf_xla/lib/python3.7/site-packages/keras/engine/base_layer_utils.py:129) located in device /job:localhost/replica:0/task:0/device:GPU:1 from device /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "/home/sdb/wda/TF2-jit-compile-on-multi-gpu/xla_tf_function_distributed.py", line 59, in <module>
train_step_dist(images, labels)
File "/home/sdb/wda/tf_xla/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/sdb/wda/tf_xla/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource _AnonymousVar3 (defined @ /home/sdb/wda/tf_xla/lib/python3.7/site-packages/keras/engine/base_layer_utils.py:129) located in device /job:localhost/replica:0/task:0/device:GPU:1 from device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_650]
Describe the expected behavior
I expect XLA to compile under MirroredStrategy, because the 2.5.0 RELEASE.md states: *XLA now compiles MirroredStrategy: the step function passed to strategy.run can now be annotated with jit_compile=True.*
Standalone code to reproduce the issue
import tensorflow as tf

tf.compat.v1.enable_eager_execution()

# Size of each input image, 28 x 28 pixels
IMAGE_SIZE = 28 * 28
# Number of distinct number labels, [0..9]
NUM_CLASSES = 10
# Number of examples in each training batch (step)
TRAIN_BATCH_SIZE = 100
# Number of training steps to run
TRAIN_STEPS = 1000

# Loads MNIST dataset.
train, test = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices(train).batch(TRAIN_BATCH_SIZE).repeat()

# Casting from raw data to the required datatypes.
def cast(images, labels):
    images = tf.cast(
        tf.reshape(images, [-1, IMAGE_SIZE]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    return (images, labels)

layer = tf.keras.layers.Dense(NUM_CLASSES)
optimizer = tf.keras.optimizers.Adam()

@tf.function(jit_compile=True)
def compiled_step(images, labels):
    images, labels = cast(images, labels)
    with tf.GradientTape() as tape:
        predicted_labels = layer(images)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=predicted_labels, labels=labels
        ))
    gradients = tape.gradient(loss, layer.trainable_variables)
    return loss, predicted_labels, gradients

@tf.function()
def train_step(images, labels):
    loss, pred, gradients = compiled_step(images, labels)
    optimizer.apply_gradients(zip(gradients, layer.trainable_variables))

strategy = tf.distribute.MirroredStrategy()

@tf.function(jit_compile=True)
def train_step_dist(image, labels):
    strategy.run(train_step, args=(image, labels))

for images, labels in train_ds:
    if optimizer.iterations > TRAIN_STEPS:
        break
    train_step_dist(images, labels)
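For reference: the repro above creates the Dense layer and optimizer outside strategy.scope(). A minimal sketch of the conventional MirroredStrategy creation pattern is below; this is an untested rearrangement, not a confirmed fix for the XLA error.

strategy = tf.distribute.MirroredStrategy()

# Hypothetical rearrangement: variables created inside strategy.scope() become
# MirroredVariables with a replica on each GPU, rather than ordinary variables
# pinned to whichever single device they happened to be created on.
with strategy.scope():
    layer = tf.keras.layers.Dense(NUM_CLASSES)
    optimizer = tf.keras.optimizers.Adam()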
8 Answers
sgtfey8w1#
Hi @xuanhuo! I did not find any issue in Colab. Could you try CUDA 11.2 and cuDNN 8.1 as suggested in this thread? Thanks!
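As a quick sanity check, assuming a TF 2.x binary, you can print the CUDA/cuDNN versions the installed wheel was actually built against via the public tf.sysconfig API:

import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow binary was compiled against,
# to confirm whether the suggested 11.2 / 8.1 combination is in place.
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))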
xxls0lw82#
@mohantym Thanks a lot. It now works correctly when I use one GPU, but still fails when I use two GPUs. How many GPUs did you use?
pdtvr36n3#
@xuanhuo! I used a single GPU in Colab. @sanatmpa1, could you please take a look at this issue? Although it did not reproduce in Colab, I have attached the Gist for reference.
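To confirm how many GPUs TensorFlow actually sees before running the repro, a short check like this sketch works:

import tensorflow as tf

# List the GPUs visible to TensorFlow; MirroredStrategy replicates across all
# of them by default unless an explicit devices= list is passed.
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible:", gpus)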
olmpazwi4#
Hi, does the issue appear with jit_compile=True only, or without it as well?
wfypjpf45#
Also @smit-hinsu
ncgqoxb06#
I ran into the same problem when running MirroredStrategy (tf-nightly) on two GPUs:
neskvpey7#
Hitting the same issue on TensorFlow 2.14! Is anyone using multiple GPUs?! I cannot verify TensorFlow 2.15 because the official Docker image completely breaks GPU support (there is already a separate open ticket about that).
pbgvytdp8#
Hi, are there any plans to fix this issue? It would be very useful for training models faster.