tensorflow: A model with both complex and real weights does not work when using tf.distribute.MirroredStrategy

u5rb5r59 posted 5 months ago in Other

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04, Debian Bullseye
  • Mobile device (e.g., iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on a mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.5.1-13-g386ce34a1c1 2.5.1
  • Python version: 3.8.10
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: 2x V100 32GB

Describe the current behavior

Trying to train a model that contains both complex (complex64) and real (float32) weights produces the following error:

Traceback (most recent call last):
  File "/storage/tf_complex64_bug.py", line 67, in <module>
    model2.fit(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1178, in fit
    tmp_logs = self.train_function(iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 763, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3279, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:850 train_function  *
        return step_function(self, iterator)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:840 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1285 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2833 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py:678 _call_for_each_replica
        return mirrored_run.call_for_each_replica(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:104 call_for_each_replica
        return _call_for_each_replica(strategy, fn, args, kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:245 _call_for_each_replica
        coord.join(threads)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/training/coordinator.py:389 join
        six.reraise(*self._exc_info_to_raise)
    /root/miniconda3/lib/python3.8/site-packages/six.py:703 reraise
        raise value
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
        yield
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:238 _call_for_each_replica
        merge_result = threads[0].merge_fn(distribution, *merge_args,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/utils.py:148 _all_reduce_sum_fn  **
        return distribution.extended.batch_reduce_to(ds_reduce_util.ReduceOp.SUM,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2402 batch_reduce_to
        return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py:767 _batch_reduce_to
        return cross_device_ops.batch_reduce(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:446 batch_reduce
        return self.batch_reduce_implementation(reduce_op, value_destination_pairs,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:874 batch_reduce_implementation
        return self._batch_all_reduce(reduce_op,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:887 _batch_all_reduce
        dense_results = self._do_batch_all_reduce(reduce_op, dense_values)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:910 _do_batch_all_reduce
        device_grad_packs, tensor_packer = _pack_tensors(grouped, self._num_packs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:820 _pack_tensors
        device_grad_packs = tensor_packer.pack(device_grads)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:747 pack
        concat_grads = array_ops.concat(flat_grads, 0)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:1768 concat
        return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/gen_array_ops.py:1227 concat_v2
        _, _, _op, _outputs = _op_def_library._apply_op_helper(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:466 _apply_op_helper
        raise TypeError("%s that don't all match." % prefix)

    TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, float32, complex64, complex64, float32, float32] that don't all match.
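
The failure happens during gradient aggregation: the stack trace shows _pack_tensors in cross_device_ops.py concatenating all dense gradients into a single buffer before the all-reduce, and ConcatV2 requires every input to share one dtype. A minimal sketch of the same dtype clash (illustrative only, not the library code itself):

import tensorflow as tf

# Illustrative: concatenating mixed-dtype tensors fails just like the
# gradient packer does (cross_device_ops.py: _pack_tensors -> concat).
tf.concat([tf.zeros([2], tf.float32), tf.zeros([2], tf.complex64)], axis=0)
# Raises a dtype-mismatch error; in graph mode it is the TypeError
# "Tensors in list passed to 'values' of 'ConcatV2' Op ... that don't all
# match" seen above.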

Describe the expected behavior

The model should train successfully.

Contributing

  • Do you want to contribute a PR? (yes/no):
  • Briefly describe your candidate solution (if contributing):

Standalone code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

import tensorflow as tf
import numpy as np

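# Initializer that draws independent real and imaginary parts from a
# uniform distribution and combines them into a complex initial value.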
def complex_uniform_initializer(scale=0.05):
    real_initializer = tf.keras.initializers.RandomUniform(-scale,scale)
    def initializer(shape,dtype):
        if dtype == tf.complex64:
            dtype = tf.float32
        elif dtype == tf.complex128:
            dtype = tf.float64
        real = real_initializer(shape,dtype)
        imag = real_initializer(shape,dtype)
        return tf.dtypes.complex(real,imag)
    return initializer

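# Keras layer whose kernel and bias are complex64 trainable variables;
# mixing these with the float32 variables of the surrounding Dense layers
# is what triggers the bug.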
class ComplexDenseLayer(tf.keras.layers.Layer):

    def __init__(self, out_units, activation=None):
        super().__init__()
        self.out_units = out_units
        self.activation = activation

    def build(self, input_shape):
        inp_units = input_shape[-1]
        initializer = complex_uniform_initializer()
        self.w = self.add_weight(shape=[inp_units, self.out_units],
                                 initializer = initializer,
                                 dtype=tf.complex64, trainable=True)
        self.b = self.add_weight(shape=[self.out_units],
                                 initializer = initializer,
                                 dtype=tf.complex64, trainable=True)

    def call(self, inp):
        x = tf.einsum('bi,ij->bj', inp, self.w)
        x = tf.nn.bias_add(x, self.b)
        # Guard against the default activation=None.
        return self.activation(x) if self.activation is not None else x


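# Real-valued input and output with a complex-valued layer in between, so
# the model holds both float32 and complex64 trainable weights.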
def model(input_units, intermediate_units, output_units):
    inp = tf.keras.layers.Input((input_units,))
    xreal = tf.keras.layers.Dense(intermediate_units)(inp)
    ximag = tf.keras.layers.Dense(intermediate_units)(inp)
    x = tf.cast(xreal, 'complex64') + 1j*tf.cast(ximag,'complex64')
    x = ComplexDenseLayer(intermediate_units, activation = lambda w: w * tf.math.conj(w))(x)
    x = tf.math.real(x)
    x = tf.keras.layers.Dense(output_units)(x)
    return tf.keras.Model(inp,x) 

nsamples = 100
bsize = 10
ninp,nintermediate,nout = 16,128,16
inp = np.random.rand(nsamples, ninp)
tar = np.random.rand(nsamples, nout)
data = tf.data.Dataset.from_tensor_slices((inp,tar)).batch(bsize)

# Single GPU training works fine
model1 = model(ninp,nintermediate,nout)
model1.summary()
model1.compile(loss='mse', optimizer='adam')
model1.fit(data)

# Distributed training fails
distribute_strategy = tf.distribute.MirroredStrategy()
with distribute_strategy.scope():
    model2 = model(ninp,nintermediate,nout)
    model2.summary()
    model2.compile(loss='mse', optimizer='adam')
    model2.fit(data)
yx2lnoni #1

Could you please try with the latest stable version of TensorFlow, v2.7, and let us know if you still face the same error? Thanks!

e0bqpujr #2

@tilakrayal, I currently cannot test on the 2.7 GPU build, because the only machine I have with multiple physical GPUs only allows installing TF via conda. On that machine I tested the same script with TF 2.6.2, and the problem persists.
On CPU-only TF 2.7 I can confirm the problem with cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(). However, I found that setting cross_device_ops=tf.distribute.ReductionToOneDevice() works around the problem on both GPU TF 2.6.2 and CPU TF 2.7.
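
A minimal sketch of that workaround, reusing model(), ninp, nintermediate, nout, and data from the repro above (presumably it works because ReductionToOneDevice reduces each gradient tensor separately on one device instead of packing them into a single buffer):

import tensorflow as tf

# Workaround: avoid the packed all-reduce by reducing gradients on a
# single device, so float32 and complex64 gradients are never
# concatenated together.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
with strategy.scope():
    model2 = model(ninp, nintermediate, nout)  # model() from the repro above
    model2.compile(loss='mse', optimizer='adam')
model2.fit(data)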

imzjd6km #3

I was able to run the example code you provided with TensorFlow 2.7 without any issues. Please take a look at the gist here and confirm. Thanks!

8wigbo56 #4

I confirm that the problem is still present in TensorFlow 2.7, but it appears to be limited to GPUs. (Note that reproducing it on a GPU Colab session requires 2 logical GPUs and cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(); otherwise I can only reproduce it on my local machine, where two physical GPUs are present.)
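
For reference, a sketch of one way to carve two logical GPUs out of a single physical GPU for such a Colab session (the memory_limit values here are arbitrary assumptions):

import tensorflow as tf

# Split the one physical GPU into two logical GPUs so MirroredStrategy
# has two replicas to mirror across.
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096),
     tf.config.LogicalDeviceConfiguration(memory_limit=4096)])

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())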

vshtjzan #5

The problem is still present in TensorFlow 2.8, as shown in the gist from my previous post. Do you know whether a fix might land soon? Here is the stack trace obtained from Colab:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
    yield
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 236, in _call_for_each_replica
    **merge_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 689, in wrapper
    return converted_call(f, args, kwargs, options=options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 458, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/utils.py", line 152, in _all_reduce_sum_fn
    grads_and_vars)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2456, in batch_reduce_to
    return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 768, in _batch_reduce_to
    options=self._communication_options.merge(options))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 444, in batch_reduce
    options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 872, in batch_reduce_implementation
    [v[0] for v in value_destination_pairs])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 884, in _batch_all_reduce
    dense_results = self._do_batch_all_reduce(reduce_op, dense_values)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 907, in _do_batch_all_reduce
    device_grad_packs, tensor_packer = _pack_tensors(grouped, self._num_packs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 817, in _pack_tensors
    device_grad_packs = tensor_packer.pack(device_grads)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cross_device_ops.py", line 744, in pack
    concat_grads = array_ops.concat(flat_grads, 0)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 458, in _apply_op_helper
    raise TypeError(f"{prefix} that don't all match.")
TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, float32, complex64, complex64, float32, float32] that don't all match.

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-1-f6e3c733669f> in <module>()
     83     model2.summary()
     84     model2.compile(loss='mse', optimizer='adam')
---> 85     model2.fit(data)

1 frames

/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
   1145           except Exception as e:  # pylint:disable=broad-except
   1146             if hasattr(e, "ag_error_metadata"):
-> 1147               raise e.ag_error_metadata.to_exception(e)
   1148             else:
   1149               raise

TypeError: in user code:

    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.7/dist-packages/six.py", line 703, in reraise
        raise value
    File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/utils.py", line 152, in _all_reduce_sum_fn  **
        grads_and_vars)

    TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, float32, complex64, complex64, float32, float32] that don't all match.
