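I am trying to understand this implementation of a Vision Transformer in Keras.
Here is the full code.
I don't understand why patches = tf.reshape(patches, [batch_size, -1, patch_dims]) returns a tensor of shape (None, None, 108) instead of (None, 144, 108). In this case, is only a single patch returned?
Before the reshape, patches has shape (None, 12, 12, 108), where 12 and 12 are the number of patches along the height and width of the image.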

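import tensorflow as tf
from tensorflow.keras import layers


class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        # Dynamic batch size: a scalar tensor whose value is only known at run time.
        batch_size = tf.shape(images)[0]
        # Cut each image into non-overlapping patch_size x patch_size tiles.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flattened size of one patch: patch_size * patch_size * channels.
        patch_dims = patches.shape[-1]
        # Merge the grid of patches into a single patch axis.
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches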

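Later, this tensor is passed to PatchEncoder(), which sends the 108-element patch through a 64-dimensional dense layer. But shouldn't this be done for each of the 144 patches, rather than for just one patch (the patch returned by Patches())?
That way I would have an embedding layer in which each of the 144 patches gets a 64-dimensional vector whose elements differ from the others depending on the corresponding patch.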

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded

So I think the embedding layer should look like this, with different values for each patch depending on the values in the actual patch:

**Embedding layer that I think should be returned**
    0.[0 0 0 ... 0]
    1.[1 1 1 ... 1]
    .
    .
    .
    143.[143 143 143 ... 143]

rather than like this, where all the values for the initial patches are the same because of the shape returned by tf.reshape():

**Embedding layer that I think is returned but I don't understand if it makes sense**
    0.[0 0 0 ... 0]
    1.[0 0 0 ... 0]
    .
    .
    .
    143.[0 0 0 ... 0]

My question is: what is the point of passing a tensor of shape (None, None, 108) in this ViT implementation?
Here is the model summary as well:

input_3 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 data_augmentation (Sequential)  (None, 72, 72, 3)   7           ['input_3[0][0]']                
                                                                                                  
 patches_2 (Patches)            (None, None, 108)    0           ['data_augmentation[1][0]']      
                                                                                                  
 patch_encoder_2 (PatchEncoder)  (None, 144, 64)     16192       ['patches_2[0][0]']

1 Answer

g2ieeal71#

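In this Vision Transformer implementation, every patch first goes through the PatchEncoder layer, which consists of a projection layer and an embedding layer. The projection layer maps each 108-dimensional patch to a 64-dimensional vector, and the embedding layer adds a position encoding to each patch. The position encoding is a vector added to the patch representation to encode the patch's position in the image.
Note, however, that the same projection weights are used for every patch, while the position encoding is different for each patch. The projection layer is shared across all patches, so it applies the same 108-to-64 mapping to each of them; the resulting 64-dimensional vectors still differ from patch to patch because the patch contents differ. The position encoding, on the other hand, is unique to each patch because it depends on the patch's position in the image.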
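To make the shapes concrete, here is a minimal sketch of what PatchEncoder does to a batch of patches. The random input, the batch size of 2, and the variable names are assumptions for illustration only; the layer arguments mirror the ones in the question.

```python
import tensorflow as tf

NUM_PATCHES = 144      # 12 x 12 grid of patches, as in the question
PATCH_DIMS = 108       # flattened size of one patch
PROJECTION_DIM = 64

# Assumed stand-in for the output of Patches(): 2 images with random patch contents.
patches = tf.random.normal([2, NUM_PATCHES, PATCH_DIMS], seed=0)

projection = tf.keras.layers.Dense(units=PROJECTION_DIM)
position_embedding = tf.keras.layers.Embedding(
    input_dim=NUM_PATCHES, output_dim=PROJECTION_DIM
)

# Dense acts on the last axis, so all 144 patches are projected at once.
projected = projection(patches)                       # shape (2, 144, 64)

# One position vector per patch index 0..143.
positions = tf.range(start=0, limit=NUM_PATCHES, delta=1)
pos_vectors = position_embedding(positions)           # shape (144, 64)

# The (144, 64) position vectors broadcast over the batch axis.
encoded = projected + pos_vectors                     # shape (2, 144, 64)

# Same weights, different inputs: the projected vectors of two patches differ.
patches_differ = bool(
    tf.reduce_any(tf.not_equal(projected[0, 0], projected[0, 1]))
)
print(projected.shape, pos_vectors.shape, encoded.shape, patches_differ)
```

So the dense layer is not applied to a single patch: it is applied independently to each of the 144 patches along the last axis.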
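As for the (None, None, 108) in the summary, that appears to be only the static shape TensorFlow can infer while building the graph, not the number of patches actually produced. With both a dynamic batch_size and a -1 in the target shape, two dimensions are unknown at trace time, so the middle one is reported as None. A minimal sketch, assuming a patch size of 6 (72 / 12) and using helper names of my own (extract, reshape_dynamic, reshape_fixed):

```python
import tensorflow as tf

IMAGE_SIZE = 72
PATCH_SIZE = 6                                   # assumed: 72 / 12 = 6
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2    # 144

SPEC = [tf.TensorSpec([None, IMAGE_SIZE, IMAGE_SIZE, 3], tf.float32)]
static_shapes = {}   # filled in while the functions are traced


def extract(images):
    # Same extract_patches call as in the question's Patches layer.
    return tf.image.extract_patches(
        images=images,
        sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
        strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )


@tf.function(input_signature=SPEC)
def reshape_dynamic(images):
    # Reshape as in the question: dynamic batch size plus -1.
    patches = extract(images)
    batch_size = tf.shape(images)[0]
    out = tf.reshape(patches, [batch_size, -1, patches.shape[-1]])
    static_shapes["dynamic"] = out.shape.as_list()   # recorded at trace time
    return out


@tf.function(input_signature=SPEC)
def reshape_fixed(images):
    # Alternative: let -1 stand for the batch and fix the patch count.
    patches = extract(images)
    out = tf.reshape(patches, [-1, NUM_PATCHES, patches.shape[-1]])
    static_shapes["fixed"] = out.shape.as_list()     # recorded at trace time
    return out


images = tf.zeros([2, IMAGE_SIZE, IMAGE_SIZE, 3])
runtime_dynamic = reshape_dynamic(images).shape.as_list()
runtime_fixed = reshape_fixed(images).shape.as_list()
print(static_shapes, runtime_dynamic, runtime_fixed)
```

Both versions return all 144 patches per image at run time; only the reported static shape differs. That is also why patch_encoder_2 shows (None, 144, 64) in the summary: adding the (144, 64) position embedding pins the patch axis to 144.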

2023-03-18
