TensorFlow: non-deterministic random numbers generated in Eager and Graph mode

Posted by 1szpjjfi 6 months ago in Other

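Issue type

Bug

Have you reproduced the bug with TF nightly?

Source

Binary

TensorFlow version

2.11.0

Custom code

OS platform and distribution

Linux CentOS

Mobile device

No response

Python version

3.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behaviour?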

Executing the identical Keras model in eager and graph mode produces different results when the identical seed is passed to `tf.random.uniform`. I created a small example that just generates two random outputs. In one case I use different seed values, in the other I use identical seeds.

As the output shows, when the seeds differ I get the same result from `model(data)` and `model.predict(data)`. When the seeds are identical, however, eager mode gives different results.

I tried to pinpoint the source of this, and I think I found it. I modified the PhiloxRandom class to print a line whenever one of its functions is called. Each line shows the `this` pointer, the function being called, and the `key[0:2]` and `columns[0:4]` values. `key` is initialized from the global seed set with `tf.random.set_seed(...)`, and `columns[2:4]` from the `seed` value passed to `tf.random.uniform`.

What we see is that, when the seeds differ, TensorFlow creates two instances of `PhiloxRandom(...)`, one with 0x7C (== 124) and one with 0x7B (== 123) as seed. The same happens in graph mode. I think this is the expected behavior.

When the seeds are identical, eager mode creates only one PhiloxRandom instance and calls it twice. The second `tf.random.uniform(...)` therefore starts from an already advanced state (0x500 instead of 0 in the debug log), which is wrong. My assumption is that TensorFlow in eager mode caches the operator instance and simply reuses it when the "identical" layer is called. In this particular case, however, that causes non-deterministic results that differ from graph mode.
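The hypothesis above can be sketched with a toy model in plain Python. This is not TensorFlow's implementation: `ToyCounterRng`, `run_eager_like`, `run_graph_like`, the SHA-256 mixing, and the cache keyed by op seed are all hypothetical stand-ins, chosen only to show how reusing one stateful generator object changes the second draw while a fresh object per op does not.

```python
import hashlib
import struct


class ToyCounterRng:
    """Toy counter-based generator: output depends only on (key, op_seed, counter)."""

    def __init__(self, global_seed, op_seed):
        self.key = global_seed    # stands in for the global seed
        self.op_seed = op_seed    # stands in for the per-op `seed` argument
        self.counter = 0          # advanced after every draw

    def uniform(self, n):
        out = []
        for i in range(n):
            block = struct.pack("<QQQ", self.key, self.op_seed, self.counter + i)
            word = int.from_bytes(hashlib.sha256(block).digest()[:8], "little")
            out.append(word / 2**64)  # map the 64-bit word to [0, 1)
        self.counter += n  # later draws on this object continue from here
        return out


def run_eager_like(global_seed, op_seeds, n=5):
    """Assumed eager behaviour: ops with equal seeds share one cached generator."""
    cache = {}
    results = []
    for seed in op_seeds:
        if seed not in cache:
            cache[seed] = ToyCounterRng(global_seed, seed)
        results.append(cache[seed].uniform(n))
    return results


def run_graph_like(global_seed, op_seeds, n=5):
    """Assumed graph behaviour: every op occurrence gets its own generator."""
    return [ToyCounterRng(global_seed, seed).uniform(n) for seed in op_seeds]


if __name__ == "__main__":
    for seeds in ([123, 124], [123, 123]):
        same = run_eager_like(314159, seeds) == run_graph_like(314159, seeds)
        print(seeds, "eager-like == graph-like:", same)
```

In this toy model the two modes agree when the op seeds differ and disagree on the second draw when they are equal, which is the same pattern as in the logs below.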

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

# see: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

def create(seed_a, seed_b):
    # Functional model with two random outputs, one per tf.random.uniform call.
    inp = tf.keras.Input(shape=(5,))
    x = inp / inp # == 1.0
    a = tf.random.uniform(tf.shape(x), dtype=x.dtype, minval=-5, maxval=5, seed=seed_a) * x
    b = tf.random.uniform(tf.shape(x), dtype=x.dtype, minval=-5, maxval=5, seed=seed_b) * x
    return tf.keras.Model(inputs=[inp], outputs=[a, b])

def compare(seed_a, seed_b):
    print(f"## RUNNING {seed_a} {seed_b} ##")
    data = np.ones((1, 5), np.float32)

    model = create(seed_a, seed_b)

    eager = model(data)          # direct call (eager mode)
    graph = model.predict(data)  # predict (graph mode)

    # A is the eager result, B the graph result for each model output.
    for i, (a, b) in enumerate(zip(eager, graph)):
        print(f'#{i}, A: {a.numpy()}, B: {b}')

# Different local seeds: both modes agree.
tf.random.set_seed(314159)
compare(123, 124)

# Identical local seeds: eager and graph results differ.
tf.random.set_seed(314159)
compare(123, 123)

Relevant log output

# EXECUTION RESULTS --------------------------------------------------------- #

## RUNNING 123 124 ##
#0, A: [[ 0.64956665  4.783145   -4.816675    4.7801285   1.6770649 ]], B: [[ 0.64956665  4.783145   -4.816675    4.7801285   1.6770649 ]]
#1, A: [[ 1.321938   4.436612   2.8036633 -2.5721383 -3.595649 ]], B: [[ 1.321938   4.436612   2.8036633 -2.5721383 -3.595649 ]]

## RUNNING 123 123 ##
#0, A: [[ 2.2466993   2.8111506   1.4583216  -0.34343958  2.204914  ]], B: [[ 0.64956665  4.783145   -4.816675    4.7801285   1.6770649 ]]
#1, A: [[ 0.64956665  4.783145   -4.816675    4.7801285   1.6770649 ]], B: [[ 0.64956665  4.783145   -4.816675    4.7801285   1.6770649 ]]

# /EXECUTION RESULTS -------------------------------------------------------- #

# DEBUG --------------------------------------------------------------------- #

## RUNNING 123 124 ##

### EAGER
0x7fffb27e8170 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007C 00000000
0x4101a50 Skip(1280)
0x7fffb27e86f0 Skip(0)
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007C 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 5150EBBA 4EF8C9E4 4D63E30B 919F139E
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007C 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / BC11F9C7 A411A7B1 99682563 B50F8A71
0x7fffb27e8170 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x412a130 Skip(1280)
0x7fffb27e86f0 Skip(0)
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 5EC85080 05FD3969 A08258B8 61FD2F86
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 4DD57768 F92C9439 A139802D 55996F6F

### GRAPH
0x7fffb27e9890 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007C 00000000
0x7fffb27e9890 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x441f8b0 Skip(1280)
0x7f8e1fffda30 Skip(0)
0x7f8e1fffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7f8e1fffda30 OperatorEND(): 8FF812B0 96A522AD / 5EC85080 05FD3969 A08258B8 61FD2F86
0x7f8e1fffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007B 00000000
0x7f8e1fffda30 OperatorEND(): 8FF812B0 96A522AD / 4DD57768 F92C9439 A139802D 55996F6F
0x4454e30 Skip(1280)
0x7f8dfbffda30 Skip(0)
0x7f8dfbffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007C 00000000
0x7f8dfbffda30 OperatorEND(): 8FF812B0 96A522AD / 5150EBBA 4EF8C9E4 4D63E30B 919F139E
0x7f8dfbffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007C 00000000
0x7f8dfbffda30 OperatorEND(): 8FF812B0 96A522AD / BC11F9C7 A411A7B1 99682563 B50F8A71

## RUNNING 123 123 ##
### EAGER
0x7fffb27e8170 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x4167510 Skip(1280)
0x7fffb27e86f0 Skip(0)
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 5EC85080 05FD3969 A08258B8 61FD2F86
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 4DD57768 F92C9439 A139802D 55996F6F
0x4167510 Skip(1280)
0x7fffb27e86f0 Skip(0)
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000500 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / C7DCC1FC 4C63FB94 2352AAA1 D83B9A9E
0x7fffb27e86f0 OperatorBEGIN(): 0004CB2F 00000000 / 00000501 00000000 0000007B 00000000
0x7fffb27e86f0 OperatorEND(): 8FF812B0 96A522AD / 8DDC3910 CEB432E2 6CF2E131 6770FC94

### GRAPH
0x7fffb27e9890 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7fffb27e9890 PhiloxRandom(uint64_t, uint64_t): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x440ed40 Skip(1280)
0x7f8e1fffda30 Skip(0)
0x7f8e1fffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7f8e1fffda30 OperatorEND(): 8FF812B0 96A522AD / 5EC85080 05FD3969 A08258B8 61FD2F86
0x7f8e1fffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007B 00000000
0x7f8e1fffda30 OperatorEND(): 8FF812B0 96A522AD / 4DD57768 F92C9439 A139802D 55996F6F
0x4500200 Skip(1280)
0x7f8dfbffda30 Skip(0)
0x7f8dfbffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000000 00000000 0000007B 00000000
0x7f8dfbffda30 OperatorEND(): 8FF812B0 96A522AD / 5EC85080 05FD3969 A08258B8 61FD2F86
0x7f8dfbffda30 OperatorBEGIN(): 0004CB2F 00000000 / 00000001 00000000 0000007B 00000000
0x7f8dfbffda30 OperatorEND(): 8FF812B0 96A522AD / 4DD57768 F92C9439 A139802D 55996F6F

# /DEBUG -------------------------------------------------------------------- #

oalqel3c #1

Thanks for reporting this issue.
I was able to reproduce it in TF 2.11 and TF Nightly 2.13.0-dev20230209. Please see the following code snippet: here
@SuryanarayanaY, could you please look into this issue? Thanks.

m3eecexj #2

Hi @mergian,
Do you have any plans to fix this? Please feel free to create a PR for this issue. Thanks!

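mrfwxfqh #3

Hey @SuryanarayanaY,
Unfortunately I don't have a good idea. As far as I understand it, the problem is the following:
In eager mode, TensorFlow uses the local seed to check whether a random op is "unique". So in my example, TF allocates the local seed object PhiloxRandom for 123 once and uses that same object in both calls.
In graph mode, by contrast, TensorFlow creates a local seed object for every occurrence of tf.random.uniform(...) in the graph, regardless of the seed argument.
That said: semantically, I think eager mode is actually correct, because the user specified the same local seed, so TF should use the same local seed object in both calls. The problem may be that, when a distribution strategy is applied, the first and second call to tf.random.uniform(..., seed=123) might not execute on the same device, so sharing one local seed object would be impossible or at least difficult to implement.
On the other hand, I don't think eager mode can tell whether the two calls are the same one.
A "quick fix" would be to document this behaviour and make users aware that they should use different local seeds in their random calls to get correct behaviour.
Other solutions would probably require API changes. For example, I looked at how PyTorch generates random numbers (https://pytorch.org/docs/stable/generated/torch.rand.html). The user can provide a torch.Generator object, which is a local seed object, or it falls back to the global seed object. I assume this approach was not adopted in TensorFlow in order to allow local random state in distributed computation.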

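yi0zb3m4 #4

For distribution strategies, this is already documented as follows.
Do not use tf.compat.v1.Session and tf.distribute.experimental.ParameterServerStrategy, which can introduce non-determinism. Besides ops (including tf.data ops), these are the only known potential sources of non-determinism within TensorFlow (if you find more, please file an issue). Note that tf.compat.v1.Session is required to use the TF1 API, so determinism cannot be guaranteed when using the TF1 API.
Further details are here: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism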

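pieyvz9o #7

Thanks for posting these two guides. I agree that the newer Generator or Stateless versions should not produce this issue. However: as long as tf.random.uniform is not deprecated, I think this bug should be mentioned in the corresponding documentation: https://www.tensorflow.org/api_docs/python/tf/random/uniform
Also, the tf.random.uniform page does not mention that using these functions is strongly discouraged.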
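As a rough illustration of the stateless alternative mentioned above, here is a minimal sketch using `tf.random.stateless_uniform`. It is not the code from the original report: the helper name `sample` and the seed pair `[314159, 123]` are chosen here for illustration only. Because the whole seed is passed as an argument, no generator state is kept between calls, so repeated calls with the same seed should give the same values whether run eagerly or inside a `tf.function`.

```python
import numpy as np
import tensorflow as tf


def sample(seed):
    # Stateless ops take the full seed as a shape-[2] integer tensor;
    # nothing is stored between calls.
    return tf.random.stateless_uniform([1, 5], seed=seed, minval=-5.0, maxval=5.0)


# The same function traced into a graph.
graph_sample = tf.function(sample)

seed = tf.constant([314159, 123], dtype=tf.int32)  # illustrative seed pair

eager_a = sample(seed).numpy()
eager_b = sample(seed).numpy()
graph_a = graph_sample(seed).numpy()
graph_b = graph_sample(seed).numpy()

print("eager calls agree:  ", np.array_equal(eager_a, eager_b))
print("graph calls agree:  ", np.array_equal(graph_a, graph_b))
print("eager matches graph:", np.array_equal(eager_a, graph_a))
```

To get two different random tensors with this approach, pass two different seed pairs explicitly, which is the same advice as the "quick fix" suggested in answer #3.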
