keras: Difference between MultiHeadAttention and Attention layer in Tensorflow

Posted by dzhpxtsq on 2023-03-02 in Other


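What is the difference between the following layers in Tensorflow: tf.keras.layers.Attention, tf.keras.layers.MultiHeadAttention and tf.keras.layers.AdditiveAttention?
Also, how can I implement tf.keras.layers.MultiHeadAttention using basic layers such as Dense, Add, LayerNormalization, etc.? I want to understand the exact operations happening in this tutorial.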

keras

Source: https://stackoverflow.com/questions/75590491/difference-between-multiheadattention-and-attention-layer-in-tensorflow

1 Answer


yjghlzjz1#

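https://paperswithcode.com/ is a good resource for understanding the nuances of different deep learning terms and implementations.
The general definition of the attention mechanism in Transformer models: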

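Attention mechanisms are a component used in neural networks to model long-range interaction, for example across a text in NLP. The key idea is to build shortcuts between a context vector and the input, to allow a model to attend to different parts. - paperswithcode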

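In my own words, the "shortcuts" of attention are created by a sequence of matrix multiplications that take the "query" (the input) to the "value" (the target you want to map the input to). In between sits the "key", which acts as the signal the query should, in theory, use to project itself onto the value. The common output of an attention mechanism is a vector/matrix/tensor representation that encodes this shortcut.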
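The query/key/value description above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: it uses an unscaled dot product as the score (which, as far as I know, corresponds to the default use_scale=False of tf.keras.layers.Attention), and the helper names softmax and dot_product_attention are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dot_product_attention(query, key, value):
    """query: (Tq, d), key: (Tv, d), value: (Tv, dv).

    Returns the attended output (Tq, dv) and the attention weights (Tq, Tv).
    """
    scores = query @ key.T               # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ value, weights      # weighted average of the values

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 16))   # 8 target positions, 16 features
key = rng.normal(size=(4, 16))     # 4 source positions
value = rng.normal(size=(4, 16))

output, weights = dot_product_attention(query, key, value)
print(output.shape, weights.shape)  # (8, 16) (8, 4)
```

Each row of weights says how much one query position draws from each source position; the output row is the corresponding weighted average of the values.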

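There are many variants of these "shortcuts" (aka attention mechanisms), as researchers try to find the best way to connect query + key -> value.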

Attention vs MultiHeadAttention

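In my own words, the main difference between general attention and multi-head attention is the redundancy built into the multi-head inputs. If single-head (general) attention maps one Q + K to V, multi-head attention can be thought of as creating multiple Qs with corresponding multiple Ks, and creating multiple shortcuts to the corresponding Vs.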
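As a rough answer to the "basic layers" part of the question, the multi-head idea can be written out with plain matrix operations. The sketch below is an assumption-laden simplification rather than the Keras source: each projection plays the role of a bias-free Dense layer, and biases, dropout and masking are left out. The names multi_head_attention, init_params and the params dictionary are hypothetical. The Add and LayerNormalization layers mentioned in the question belong to the Transformer block that wraps the attention layer, not to multi-head attention itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def init_params(rng, d_model, num_heads, key_dim):
    # One (d_model -> key_dim) projection per head for Q, K and V,
    # plus one (key_dim -> d_model) output projection per head.
    shape_in = (num_heads, d_model, key_dim)
    return {
        "wq": 0.1 * rng.normal(size=shape_in),
        "wk": 0.1 * rng.normal(size=shape_in),
        "wv": 0.1 * rng.normal(size=shape_in),
        "wo": 0.1 * rng.normal(size=(num_heads, key_dim, d_model)),
    }

def multi_head_attention(query, value, params, key=None):
    """query: (Tq, d_model), value: (Tv, d_model); key defaults to value."""
    key = value if key is None else key
    # Project the inputs once per head: (num_heads, T, key_dim).
    q = np.einsum("td,hdk->htk", query, params["wq"])
    k = np.einsum("td,hdk->htk", key, params["wk"])
    v = np.einsum("td,hdk->htk", value, params["wv"])
    # Scaled dot-product scores per head: (num_heads, Tq, Tv).
    scores = np.einsum("hqk,hvk->hqv", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Per-head weighted average of the projected values: (num_heads, Tq, key_dim).
    heads = np.einsum("hqv,hvk->hqk", weights, v)
    # Summing over heads and key_dim is equivalent to concatenating the heads
    # and applying a single Dense layer back to d_model.
    output = np.einsum("hqk,hkd->qd", heads, params["wo"])
    return output, weights

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
source = rng.normal(size=(4, 16))
params = init_params(rng, d_model=16, num_heads=2, key_dim=2)

output, weights = multi_head_attention(target, source, params)
print(output.shape, weights.shape)  # (8, 16) (2, 8, 4)
```

The shapes mirror the example in this answer: the output keeps the query's shape, while the weights gain a leading axis with one attention map per head.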

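In code, Attention and MultiHeadAttention are called in much the same way, and for the inputs below their output_tensor results have the same shape. Note that the printed results are symbolic tensors, so what they show is matching shapes; the numeric values also depend on the projection weights inside MultiHeadAttention: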

import tensorflow as tf
from tensorflow.keras.layers import Attention, MultiHeadAttention

layer = MultiHeadAttention(num_heads=1, key_dim=2)
target = tf.keras.Input(shape=[8, 16])
source = tf.keras.Input(shape=[4, 16])
output_tensor, weights = layer(target, source,
                               return_attention_scores=True)



layer_vanilla = Attention()
target_vanilla = tf.keras.Input(shape=[8, 16])
source_vanilla = tf.keras.Input(shape=[4, 16])
output_tensor_vanilla, weights_vanilla = layer_vanilla([target_vanilla, source_vanilla],
                               return_attention_scores=True)

print(output_tensor)
print(output_tensor_vanilla)

[out]:

KerasTensor(type_spec=TensorSpec(shape=(None, 8, 16), dtype=tf.float32, name=None), name='multi_head_attention_6/attention_output/add:0', description="created by layer 'multi_head_attention_6'")

KerasTensor(type_spec=TensorSpec(shape=(None, 8, 16), dtype=tf.float32, name=None), name='attention_3/MatMul_1:0', description="created by layer 'attention_3'")

https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention

Attention vs AdditiveAttention

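AdditiveAttention is an interesting one; it is the OG attention mechanism:
Additive Attention, also known as Bahdanau Attention, uses a one-hidden-layer feed-forward network to calculate the attention alignment score
More details: https://paperswithcode.com/method/additive-attention
Before the "IMOW" (in my own words) explanation, let's look at the code: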

from tensorflow.keras.layers import AdditiveAttention

layer_bdn = AdditiveAttention()
target_bdn = tf.keras.Input(shape=[8, 16])
source_bdn = tf.keras.Input(shape=[4, 16])
output_tensor_bdn, weights_bdn = layer_bdn([target_bdn, source_bdn],
                               return_attention_scores=True)

print(output_tensor_bdn)

[out]:

<KerasTensor: shape=(None, 8, 16) dtype=float32 (created by layer 'additive_attention')>

Comparing the implementations:

https://github.com/keras-team/keras/blob/v2.11.0/keras/layers/attention/attention.py#L30-L204
https://github.com/keras-team/keras/blob/v2.11.0/keras/layers/attention/additive_attention.py#L30-L178

https://www.diffchecker.com/5i9Viqm9/

The general Attention has:

scores = self.concat_score_weight * tf.reduce_sum(
    tf.tanh(self.scale * (q_reshaped + k_reshaped)), axis=-1
)

where concat_score_weight is created if self.score_mode == "concat":

if self.score_mode == "concat":
    self.concat_score_weight = self.add_weight(
        name="concat_score_weight",
        shape=(),
        initializer="ones",
        dtype=self.dtype,
        trainable=True,
    )

But if self.use_scale is set to True, AdditiveAttention uses a Glorot initializer:

if self.use_scale:
    self.scale = self.add_weight(
        name="scale",
        shape=[dim],
        initializer="glorot_uniform",
        dtype=self.dtype,
        trainable=True,
    )

There are more nuances in the implementations, though.

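In my own words, additive attention is an earlier definition of the general attention mechanism. Both serve the same purpose as single-head attention, and if the initialization and scale are set to be equal, additive attention == general attention (in its "concat" score mode).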
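The claim above can be checked numerically with a small NumPy sketch. The concat_scores function follows the Attention snippet quoted earlier; additive_scores is my reading of the linked v2.11.0 AdditiveAttention source (a sum over features of scale * tanh(q + k)) and should be treated as an assumption. All function names here are hypothetical.

```python
import numpy as np

def additive_scores(query, key, scale=1.0):
    # Bahdanau-style score (assumed form): sum over features of scale * tanh(q + k).
    q = query[:, None, :]   # (Tq, 1, d)
    k = key[None, :, :]     # (1, Tv, d)
    return np.sum(scale * np.tanh(q + k), axis=-1)

def concat_scores(query, key, scale=1.0, concat_score_weight=1.0):
    # Score of the general Attention layer in "concat" mode, as quoted above.
    q = query[:, None, :]
    k = key[None, :, :]
    return concat_score_weight * np.sum(np.tanh(scale * (q + k)), axis=-1)

def dot_scores(query, key):
    # Plain dot-product score, for contrast.
    return query @ key.T

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 16))
key = rng.normal(size=(4, 16))

same_at_unit_scale = np.allclose(additive_scores(query, key),
                                 concat_scores(query, key))
same_at_scale_two = np.allclose(additive_scores(query, key, scale=2.0),
                                concat_scores(query, key, scale=2.0))
same_as_dot = np.allclose(additive_scores(query, key), dot_scores(query, key))

print(same_at_unit_scale)  # True: with every scale equal to 1 the formulas coincide
print(same_at_scale_two, same_as_dot)
```

With all scales at 1 the two tanh-based scores are identical. Once the scale differs from 1 they diverge, because one applies the scale outside the tanh and the other inside it, and neither matches the plain dot-product score.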

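Q: So which one should I use when choosing an attention layer?

A: It depends on the end goal. If the goal is to replicate the original Bahdanau paper, then AdditiveAttention is the closest match. If not, then the vanilla Attention is most probably what you want.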

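Q: What about multi-head?

A: In most cases you would end up using MultiHeadAttention, because:

AdditiveAttention is a kind of vanilla attention with a specific initialization and scoring operation
Attention is a kind of multi-head attention with the number of heads set to 1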
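The second point can be illustrated with a hedged NumPy sketch: a single attention head whose projections are all identity matrices reduces to plain dot-product attention, up to the 1/sqrt(d) scaling factor. The functions plain_attention and one_head_attention are hypothetical helpers, not Keras APIs, and a real MultiHeadAttention layer has learned (non-identity) projections, so its outputs will generally differ from those of Attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def plain_attention(query, key, value, scale=1.0):
    # Dot-product attention without any learned projections.
    weights = softmax(scale * (query @ key.T))
    return weights @ value

def one_head_attention(query, key, value, wq, wk, wv, wo):
    # A single attention head: project, attend with 1/sqrt(d) scaling, project back.
    q, k, v = query @ wq, key @ wk, value @ wv
    weights = softmax((q @ k.T) / np.sqrt(q.shape[-1]))
    return (weights @ v) @ wo

d = 16
rng = np.random.default_rng(0)
query = rng.normal(size=(8, d))
source = rng.normal(size=(4, d))
eye = np.eye(d)

one_head = one_head_attention(query, source, source, eye, eye, eye, eye)
plain = plain_attention(query, source, source, scale=1.0 / np.sqrt(d))
print(np.allclose(one_head, plain))  # True
```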

Answered 2023-03-02
