System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04, Debian Bullseye
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on a mobile device: N/A
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.5.1-13-g386ce34a1c1 2.5.1
- Python version: 3.8.10
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: 11.0
- GPU model and memory: 2x V100 32GB
Describe the current behavior
When trying to train a model that contains both complex-valued (complex64) and real-valued (float32) weights, the following error is produced:
Traceback (most recent call last):
  File "/storage/tf_complex64_bug.py", line 67, in <module>
    model2.fit(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1178, in fit
    tmp_logs = self.train_function(iterator)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 763, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3279, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:850 train_function *
        return step_function(self, iterator)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:840 step_function **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1285 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2833 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py:678 _call_for_each_replica
        return mirrored_run.call_for_each_replica(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:104 call_for_each_replica
        return _call_for_each_replica(strategy, fn, args, kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:245 _call_for_each_replica
        coord.join(threads)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/training/coordinator.py:389 join
        six.reraise(*self._exc_info_to_raise)
    /root/miniconda3/lib/python3.8/site-packages/six.py:703 reraise
        raise value
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
        yield
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py:238 _call_for_each_replica
        merge_result = threads[0].merge_fn(distribution, *merge_args,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/utils.py:148 _all_reduce_sum_fn **
        return distribution.extended.batch_reduce_to(ds_reduce_util.ReduceOp.SUM,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2402 batch_reduce_to
        return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py:767 _batch_reduce_to
        return cross_device_ops.batch_reduce(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:446 batch_reduce
        return self.batch_reduce_implementation(reduce_op, value_destination_pairs,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:874 batch_reduce_implementation
        return self._batch_all_reduce(reduce_op,
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:887 _batch_all_reduce
        dense_results = self._do_batch_all_reduce(reduce_op, dense_values)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:910 _do_batch_all_reduce
        device_grad_packs, tensor_packer = _pack_tensors(grouped, self._num_packs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:820 _pack_tensors
        device_grad_packs = tensor_packer.pack(device_grads)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/distribute/cross_device_ops.py:747 pack
        concat_grads = array_ops.concat(flat_grads, 0)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:1768 concat
        return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/ops/gen_array_ops.py:1227 concat_v2
        _, _, _op, _outputs = _op_def_library._apply_op_helper(
    /root/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:466 _apply_op_helper
        raise TypeError("%s that don't all match." % prefix)

    TypeError: Tensors in list passed to 'values' of 'ConcatV2' Op have types [float32, float32, float32, float32, complex64, complex64, float32, float32] that don't all match.
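The final TypeError comes from gradient packing: before the cross-device all-reduce, MirroredStrategy concatenates all dense gradients into a single tensor, and ConcatV2 requires every input to share one dtype. A minimal sketch that isolates the same dtype check outside the training loop (an illustrative assumption, not part of the original report):

import tensorflow as tf

# The gradient packer concatenates per-variable gradients into one tensor.
# ConcatV2 rejects a list mixing float32 and complex64 -- exactly the mix
# produced by a model with both real and complex trainable weights.
tf.concat([tf.zeros([2], tf.float32), tf.zeros([2], tf.complex64)], axis=0)
# Raises a dtype-mismatch error (the exact exception type and message differ
# between eager and graph mode).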
Describe the expected behavior
The model should train successfully.
- Do you want to contribute a PR? (yes/no):
- Briefly describe your candidate solution (if contributing):
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
import tensorflow as tf
import numpy as np

# Initializer that draws real and imaginary parts from the same uniform range.
def complex_uniform_initializer(scale=0.05):
    real_initializer = tf.keras.initializers.RandomUniform(-scale, scale)
    def initializer(shape, dtype):
        # Keras passes the complex dtype; sample both parts in the matching
        # real dtype and combine them.
        if dtype == tf.complex64:
            dtype = tf.float32
        elif dtype == tf.complex128:
            dtype = tf.float64
        real = real_initializer(shape, dtype)
        imag = real_initializer(shape, dtype)
        return tf.dtypes.complex(real, imag)
    return initializer

class ComplexDenseLayer(tf.keras.layers.Layer):
    def __init__(self, out_units, activation=None):
        super().__init__()
        self.out_units = out_units
        self.activation = activation

    def build(self, input_shape):
        inp_units = input_shape[-1]
        initializer = complex_uniform_initializer()
        self.w = self.add_weight(shape=[inp_units, self.out_units],
                                 initializer=initializer,
                                 dtype=tf.complex64, trainable=True)
        self.b = self.add_weight(shape=[self.out_units],
                                 initializer=initializer,
                                 dtype=tf.complex64, trainable=True)

    def call(self, inp):
        x = tf.einsum('bi,ij->bj', inp, self.w)
        x = tf.nn.bias_add(x, self.b)
        return self.activation(x)

# Model mixing float32 Dense layers with a complex64 layer in between.
def model(input_units, intermediate_units, output_units):
    inp = tf.keras.layers.Input((input_units,))
    xreal = tf.keras.layers.Dense(intermediate_units)(inp)
    ximag = tf.keras.layers.Dense(intermediate_units)(inp)
    x = tf.cast(xreal, 'complex64') + 1j * tf.cast(ximag, 'complex64')
    x = ComplexDenseLayer(intermediate_units,
                          activation=lambda w: w * tf.math.conj(w))(x)
    x = tf.math.real(x)
    x = tf.keras.layers.Dense(output_units)(x)
    return tf.keras.Model(inp, x)

nsamples = 100
bsize = 10
ninp, nintermediate, nout = 16, 128, 16
inp = np.random.rand(nsamples, ninp)
tar = np.random.rand(nsamples, nout)
data = tf.data.Dataset.from_tensor_slices((inp, tar)).batch(bsize)

# Single-GPU training works fine
model1 = model(ninp, nintermediate, nout)
model1.summary()
model1.compile(loss='mse', optimizer='adam')
model1.fit(data)

# Distributed training fails
distribute_strategy = tf.distribute.MirroredStrategy()
with distribute_strategy.scope():
    model2 = model(ninp, nintermediate, nout)
    model2.summary()
    model2.compile(loss='mse', optimizer='adam')
model2.fit(data)
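One way to sidestep the mixed-dtype gradient list entirely, sketched below under the assumption that keeping every trainable variable float32 lets gradient packing succeed (the class name ComplexDenseLayerSplit and this variant are hypothetical, not from the report): store real and imaginary parts as separate float32 weights and assemble the complex tensor in call().

class ComplexDenseLayerSplit(tf.keras.layers.Layer):
    def __init__(self, out_units, activation=None):
        super().__init__()
        self.out_units = out_units
        self.activation = activation

    def build(self, input_shape):
        inp_units = int(input_shape[-1])
        init = tf.keras.initializers.RandomUniform(-0.05, 0.05)
        # Two float32 weights per complex parameter, so every gradient in the
        # all-reduce has the same dtype.
        self.w_real = self.add_weight(shape=[inp_units, self.out_units],
                                      initializer=init, dtype=tf.float32,
                                      trainable=True)
        self.w_imag = self.add_weight(shape=[inp_units, self.out_units],
                                      initializer=init, dtype=tf.float32,
                                      trainable=True)
        self.b_real = self.add_weight(shape=[self.out_units], initializer=init,
                                      dtype=tf.float32, trainable=True)
        self.b_imag = self.add_weight(shape=[self.out_units], initializer=init,
                                      dtype=tf.float32, trainable=True)

    def call(self, inp):
        # Assemble the complex weight on the fly; gradients flow back to the
        # float32 parts through tf.complex.
        w = tf.complex(self.w_real, self.w_imag)
        b = tf.complex(self.b_real, self.b_imag)
        x = tf.einsum('bi,ij->bj', inp, w)
        x = tf.nn.bias_add(x, b)
        return self.activation(x) if self.activation is not None else x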
5 answers

yx2lnoni #1
Please try the latest stable TensorFlow release, v2.7, and let us know whether you face the same error. Thanks!
e0bqpujr #2
@tilakrayal, I currently cannot test a 2.7 GPU build, since the only machine I own with multiple physical GPUs restricts TF installs to conda. On that machine I tested the same script with TF 2.6.2 and the problem persists.
On CPU TF 2.7 I can confirm that the issue is present with cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(). However, I found that setting cross_device_ops=tf.distribute.ReductionToOneDevice() works around the issue on both GPU TF 2.6.2 and CPU TF 2.7 (see the sketch below).
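A minimal sketch of how that workaround slots into the repro script above (reusing the model() helper and data from the issue body):

# Reduce on one device instead of packing/concatenating gradients across
# replicas, which avoids the mixed-dtype ConcatV2 call.
distribute_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
with distribute_strategy.scope():
    model2 = model(ninp, nintermediate, nout)
    model2.compile(loss='mse', optimizer='adam')
model2.fit(data)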
imzjd6km #3
I was able to run the example code you provided on TensorFlow 2.7 without any issue. Please check the gist here and confirm. Thanks!
8wigbo56 #4
I confirm that the issue is still present in TensorFlow 2.7, but it appears to be GPU-only. (Note that reproducing it with 2 logical GPUs requires a GPU Colab session and cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(); with two physical GPUs I can only reproduce it on my local machine. A sketch of the two-logical-GPU setup follows.)
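For reference, a sketch of splitting one physical GPU into two logical GPUs so that MirroredStrategy sees two replicas in a single-GPU Colab session (the 4096 MB memory limits are arbitrary assumptions):

import tensorflow as tf

# Must run before the GPU has been initialized (i.e., before any op uses it).
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096),
     tf.config.LogicalDeviceConfiguration(memory_limit=4096)])

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())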
vshtjzan #5
The issue is still present in TensorFlow 2.8, as shown in the gist from my previous post. Do you know whether a fix is likely to land soon? Here is the stack trace obtained from Colab: