Describe the Bug
This is our model training code. The output of its reciprocal layer differs greatly from PyTorch's.
import paddle
import paddle.nn as nn


class Model_1715507020(nn.Layer):
    def __init__(self):
        super(Model_1715507020, self).__init__()
        self.conv1_mutated = paddle.nn.Conv2DTranspose(in_channels=1, out_channels=6, kernel_size=[5, 5], stride=[1, 1], padding=[0, 0], output_padding=[0, 0], dilation=[1, 1], groups=1, bias_attr=None)
        self.relu1 = paddle.nn.ReLU()
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0], ceil_mode=False)
        self.conv2_mutated = paddle.nn.Conv2D(in_channels=6, out_channels=16, kernel_size=[6, 8], stride=[1, 1], padding=[0, 0], dilation=[1, 1], groups=1, bias_attr=None)
        self.relu2_mutated = paddle.nn.Softsign()
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0], ceil_mode=False)
        self.flatten = paddle.nn.Flatten()
        self.linear1_mutated = paddle.nn.Linear(in_features=320, out_features=120)
        self.relu3 = paddle.nn.ReLU()
        self.linear2 = paddle.nn.Linear(in_features=120, out_features=84)
        # Element-wise 1/x used in place of an activation layer
        self.relu4_mutated = paddle.reciprocal
        self.tail_flatten = paddle.nn.Flatten()
        self.tail_fc = paddle.nn.Linear(in_features=84, out_features=10)

    def forward(self, input):
        conv1_output = self.conv1_mutated(input)
        relu1_output = self.relu1(conv1_output)
        maxpool1_output = self.pool1(relu1_output)
        conv2_output = self.conv2_mutated(maxpool1_output)
        relu2_output = self.relu2_mutated(conv2_output)
        maxpool2_output = self.pool2(relu2_output)
        flatten_output = self.flatten(maxpool2_output)
        fc1_output = self.linear1_mutated(flatten_output)
        relu3_output = self.relu3(fc1_output)
        fc2_output = self.linear2(relu3_output)
        relu4_output = self.relu4_mutated(fc2_output)
        tail_flatten_output = self.tail_flatten(relu4_output)
        tail_fc_output = self.tail_fc(tail_flatten_output)
        tail_fc_output = tail_fc_output
        return tail_fc_output
Output differences
fc2_output.npz 0.00019089877605438232
relu4_output.npz 2942823.25
output.npz 717287.0
Paddle's results also differ from those of several other frameworks.
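A possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: the reciprocal 1/x has derivative -1/x^2, so if fc2_output contains values near zero, a difference as small as the roughly 1.9e-4 reported for fc2_output can become a very large difference after the reciprocal. The plain-Python sketch below illustrates this with made-up input values; `reciprocal_gap` is a hypothetical helper, not part of the repository.

```python
def reciprocal_gap(x, delta):
    """Return |1/(x + delta) - 1/x| for a nonzero scalar x."""
    return abs(1.0 / (x + delta) - 1.0 / x)


# Roughly the fc2_output difference reported above.
delta = 1.9e-4

# Hypothetical activation values: the closer x is to zero,
# the more the same input difference is amplified by 1/x.
for x in (1.0, 1e-2, 1e-4):
    print(f"x={x:g}  gap after reciprocal={reciprocal_gap(x, delta):.6g}")
```

If this is what is happening, comparing the fc2_output values closest to zero in both frameworks would be a natural next check.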
The gradients are also inconsistent with PyTorch:
tail_fc.bias: gradient mismatch, diff: 0.004240369889885187
conv1_mutated.bias: gradient mismatch, diff: 38971648.0
linear1_mutated.bias: gradient mismatch, diff: 154666512.0
linear2.weight: gradient mismatch, diff: 271438272.0
linear1_mutated.weight: gradient mismatch, diff: 41292152.0
conv1_mutated.weight: gradient mismatch, diff: 109647968.0
tail_fc.weight: gradient mismatch, diff: 2942.853759765625
conv2_mutated.bias: gradient mismatch, diff: 348457472.0
conv2_mutated.weight: gradient mismatch, diff: 223159072.0
linear2.bias: gradient mismatch, diff: 682456960.0
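For readers who want to produce a similar per-parameter report, here is a minimal sketch. It assumes the diff is the maximum absolute element-wise difference and that the gradients have already been flattened into plain lists of floats; the post does not say how the repository computes it, and `max_abs_diff`, `grad_report`, and the tolerance are hypothetical.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("gradient shapes differ")
    return max(abs(x - y) for x, y in zip(a, b))


def grad_report(grads_a, grads_b, tol=1e-4):
    """Return (name, diff) for each shared parameter whose diff exceeds tol."""
    mismatches = []
    for name in sorted(grads_a.keys() & grads_b.keys()):
        diff = max_abs_diff(grads_a[name], grads_b[name])
        if diff > tol:
            mismatches.append((name, diff))
    return mismatches


# Made-up gradients from two hypothetical frameworks.
paddle_grads = {"w": [1.0, 2.0], "b": [0.5]}
torch_grads = {"w": [1.0, 2.5], "b": [0.5]}
for name, diff in grad_report(paddle_grads, torch_grads):
    print(f"{name}: gradient mismatch, diff: {diff}")
```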
Reproduction code
https://github.com/PhyllisJi/MoCoDiff_Bug/tree/paddle-issue%2364606
The repository contains detailed reproduction steps.
Additional Supplementary Information
Paddle version: 2.6.1
2 Answers
1sbrub3j1#
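Hello, and thank you for the detailed feedback. However, using the reproduction code you provided, I could not reproduce the "Paddle output is inconsistent with PyTorch" error with Paddle 2.6.1. Following the repository's README, I got the following results: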
As for gradient alignment, I looked at grad_diff.py and found that the npz files it reads for the stored gradient values are not updated when layer_diff.py and grad_diff.py are run: https://github.com/PhyllisJi/MoCoDiff_Bug/blob/paddle-issue%2364606/paddle_bug/grad_diff.py#L29
So the code that checks gradient alignment does not appear to do its job correctly. Please correct me if I have misunderstood.
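One way to confirm whether a script really rewrites the files it is expected to refresh is to compare modification times before and after running it. The sketch below uses only the standard library; `snapshot_mtimes`, `stale_files`, and the file names in the demo are hypothetical and do not come from the repository.

```python
import os
import tempfile


def snapshot_mtimes(directory, suffix=".npz"):
    """Map each file name ending in `suffix` to its modification time in ns."""
    return {
        name: os.stat(os.path.join(directory, name)).st_mtime_ns
        for name in sorted(os.listdir(directory))
        if name.endswith(suffix)
    }


def stale_files(before, after):
    """Names present in both snapshots whose modification time did not change."""
    return sorted(name for name in before if after.get(name) == before[name])


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.npz", "b.npz"):
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(b"placeholder")
        before = snapshot_mtimes(tmp)
        # Simulate a script that refreshes only a.npz by bumping its mtime.
        new_ns = before["a.npz"] + 1_000_000_000
        os.utime(os.path.join(tmp, "a.npz"), ns=(new_ns, new_ns))
        after = snapshot_mtimes(tmp)
        return stale_files(before, after)


print(demo())
```

In practice the snapshots would be taken around a real run of the script under test, and any name reported as stale would point to a file that was read without being regenerated.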
tuwxkamq2#
This is the environment we used:
The results we obtained are as follows:
The repository has also been updated: running layer_diff.py now updates the npz files that store the gradient values.