tensorflow: XLA compilation problem with MirroredStrategy

soat7uwm · posted 6 months ago in Other

System information

  • OS platform and distribution (e.g., Linux Ubuntu 16.04): I tested on Ubuntu 18.04.
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use the command below): 2.7.0
  • Python version: 3.7
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: Tesla P100

Describe the current behavior

When I train a model with XLA compilation on multiple GPUs, I get the following error.

2021-11-20 12:57:08.333476: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-20 12:57:09.302772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15397 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
2021-11-20 12:57:09.303502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15397 MB memory:  -> device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:b1:00.0, compute capability: 6.0
2021-11-20 12:57:10.556310: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:241 : INVALID_ARGUMENT: Trying to access resource _AnonymousVar3 (defined @ /home/sdb/wda/tf_xla/lib/python3.7/site-packages/keras/engine/base_layer_utils.py:129) located in device /job:localhost/replica:0/task:0/device:GPU:1 from device /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
  File "/home/sdb/wda/TF2-jit-compile-on-multi-gpu/xla_tf_function_distributed.py", line 59, in <module>
    train_step_dist(images, labels)
  File "/home/sdb/wda/tf_xla/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/sdb/wda/tf_xla/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Trying to access resource _AnonymousVar3 (defined @ /home/sdb/wda/tf_xla/lib/python3.7/site-packages/keras/engine/base_layer_utils.py:129) located in device /job:localhost/replica:0/task:0/device:GPU:1 from device /job:localhost/replica:0/task:0/device:GPU:0 [Op:__inference_train_step_dist_650]

Describe the expected behavior

I want to compile MirroredStrategy with XLA, because I found the following in the 2.5.0 RELEASE.md: *XLA now compiles with MirroredStrategy: the step function passed to strategy.run can now be annotated with jit_compile=True.* (A sketch of that variant follows the repro code below.)

Standalone code to reproduce the issue:

import tensorflow as tf
tf.compat.v1.enable_eager_execution()

# Size of each input image, 28 x 28 pixels
IMAGE_SIZE = 28 * 28
# Number of distinct number labels, [0..9]
NUM_CLASSES = 10
# Number of examples in each training batch (step)
TRAIN_BATCH_SIZE = 100
# Number of training steps to run
TRAIN_STEPS = 1000

# Loads MNIST dataset.
train, test = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices(train).batch(TRAIN_BATCH_SIZE).repeat()

# Casting from raw data to the required datatypes.
def cast(images, labels):
    images = tf.cast(
        tf.reshape(images, [-1, IMAGE_SIZE]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    return (images, labels)

layer = tf.keras.layers.Dense(NUM_CLASSES)
optimizer = tf.keras.optimizers.Adam()

@tf.function(jit_compile=True)
def compiled_step(images, labels):
    images, labels = cast(images, labels)

    with tf.GradientTape() as tape:
        predicted_labels = layer(images)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=predicted_labels, labels=labels
        ))
    gradients = tape.gradient(loss, layer.trainable_variables)
    return loss, predicted_labels, gradients

@tf.function()
def train_step(images, labels):
    loss, pred, gradients = compiled_step(images, labels)
    optimizer.apply_gradients(zip(gradients, layer.trainable_variables))

strategy = tf.distribute.MirroredStrategy()

@tf.function(jit_compile=True)
def train_step_dist(image, labels):
    strategy.run(train_step, args=(image, labels))

for images, labels in train_ds:
    if optimizer.iterations > TRAIN_STEPS:
        break
    train_step_dist(images, labels)
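
Note that the repro above places jit_compile=True on train_step_dist, the wrapper that calls strategy.run, whereas the 2.5.0 release note describes annotating the step function that is *passed to* strategy.run; the variables are also created outside strategy.scope(), which is what produces the anonymous per-device variables named in the error. A minimal sketch of the release-note variant follows; that it avoids the error on this 2.7.0 setup is an assumption, not something verified here:

import tensorflow as tf

IMAGE_SIZE = 28 * 28
NUM_CLASSES = 10
TRAIN_BATCH_SIZE = 100
TRAIN_STEPS = 1000

train, test = tf.keras.datasets.mnist.load_data()
train_ds = tf.data.Dataset.from_tensor_slices(train).batch(TRAIN_BATCH_SIZE).repeat()

strategy = tf.distribute.MirroredStrategy()

# Create the variables under strategy.scope() so each replica holds a
# mirrored copy instead of a single anonymous variable pinned to one GPU.
with strategy.scope():
    layer = tf.keras.layers.Dense(NUM_CLASSES)
    optimizer = tf.keras.optimizers.Adam()

# Per the release note, jit_compile=True goes on the step function that
# is handed to strategy.run, not on the wrapper around strategy.run.
@tf.function(jit_compile=True)
def train_step(images, labels):
    images = tf.cast(tf.reshape(images, [-1, IMAGE_SIZE]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    with tf.GradientTape() as tape:
        logits = layer(images)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=labels))
    gradients = tape.gradient(loss, layer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, layer.trainable_variables))

@tf.function
def train_step_dist(images, labels):
    strategy.run(train_step, args=(images, labels))

for images, labels in train_ds:
    if optimizer.iterations > TRAIN_STEPS:
        break
    train_step_dist(images, labels)

Whether optimizer.apply_gradients itself compiles under XLA in 2.7 may still vary; if it does not, the fallback is the split already used in the repro, keeping the gradient computation compiled and the apply step uncompiled.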

sgtfey8w 1#

Hi @xuanhuo! I could not reproduce the problem in Colab. Could you try CUDA 11.2 and cuDNN 8.1, as suggested in this thread? Thanks!


xxls0lw8 2#

@mohantym Thanks a lot. It now works correctly when I use one GPU, but fails when I use two GPUs. How many GPUs did you use?


pdtvr36n 3#

@xuanhuo! I used a single GPU in Colab. @sanatmpa1, could you take a look at this issue? Although it did not reproduce in Colab, I have attached the Gist for reference.


olmpazwi 4#

Hi, does the issue appear only with jit_compile=True, or without it as well?


ncgqoxb0 6#

I ran into the same problem when running MirroredStrategy on two GPUs (tf-nightly):

2023-06-30 09:27:16.791900: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at xla_ops.cc:566 : INVALID_ARGUMENT: Trying to access resource Resource-690-at-0x53c71be0 (defined @ /home/nicolas/.local/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py:113) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
File "/home/nicolas/.local/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1340, in apply_grad_to_update_var
    return self._update_step_xla(grad, var, id(self._var_key(var)))

2 root error(s) found.
  (0) INVALID_ARGUMENT:  Trying to access resource Resource-690-at-0x539f5c10 (defined @ /home/nicolas/.local/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py:113) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
         [[{{node Adam/update_1_212/StatefulPartitionedCall}}]]
         [[GroupCrossDeviceControlEdges_0/Adam/AssignAddVariableOp/_297]]
  (1) INVALID_ARGUMENT:  Trying to access resource Resource-690-at-0x539f5c10 (defined @ /home/nicolas/.local/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py:113) located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:1
 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
         [[{{node Adam/update_1_212/StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_84674]
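
The Cf. link in this log, https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device, describes the same root cause: an XLA-compiled computation cannot read a tf.Variable that lives on a different device. The traceback points at _update_step_xla, the XLA-compiled update step that the newer Keras optimizers (TF 2.11+) enable by default; a commonly suggested mitigation, offered here as a sketch rather than a verified fix for this setup, is to disable that path or fall back to the legacy optimizer:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Option 1: keep the new optimizer but turn off its XLA-compiled
    # update step (the _update_step_xla call in the traceback above).
    optimizer = tf.keras.optimizers.Adam(jit_compile=False)

    # Option 2: the legacy optimizer, which has no XLA update path.
    # optimizer = tf.keras.optimizers.legacy.Adam()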

neskvpey 7#

Same problem in TensorFlow 2.14! Is anyone using multiple GPUs?! I cannot verify TensorFlow 2.15 because the official Docker image completely breaks GPU support (there is already a separate open ticket about that).


pbgvytdp 8#

Hi, are there any plans to fix this issue? It would be very useful for training models faster.
