assembly: Logical shift between YMM registers

Posted by uplii1fm on 2023-11-19 in Other


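Can I load a 2048-bit number into 8 AVX ymm registers and shift it left and right across all of those registers?
I only need to shift by one bit at a time.
I tried to find accurate information on AVX, but the interaction between xmm/ymm/zmm and the carry flag seems unclear a lot of the time.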

assembly

Source: https://stackoverflow.com/questions/77391577/logical-shift-between-ymm-registers

2 answers


i5desfxk1#

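"I tried to find accurate information on AVX, but the interaction between xmm/ymm/zmm and the carry flag seems unclear a lot of the time."
That is the easy part: there is no interaction. SSE/AVX arithmetic does not involve the flags. A few specific instructions compare or test vectors (ptest) or scalars inside vectors (comiss etc.) and then set flags, but they are not very useful here.
One approach is to start at the top of your number rather than the bottom, load two slightly offset vectors (mostly overlapping, so that one is offset by one element relative to the other), and use one of the "concatenate and shift" instructions (e.g. vpshld, part of AVX-512 VBMI2) to do a left shift that shifts in bits from the previous element instead of zeros. Normally those bits come from the other vector operand rather than from the previous element, which is exactly why the second vector is loaded at an offset of one element. With AVX2 you can emulate this with a left shift, a right shift, and vpor.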
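The AVX2 emulation described above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the answer: the number is stored in memory as little-endian 64-bit limbs (limbs[0] is the least significant), the function name shiftLeft1Overlapped is made up, and the code falls back to a plain scalar loop when it is not compiled with AVX2 enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: shift a multi-limb number left by 1 bit, in place.
// Assumes little-endian limb order. Returns the bit shifted out of the top.
inline uint64_t shiftLeft1Overlapped( uint64_t* limbs, size_t count )
{
    assert( limbs != nullptr || count == 0 );
    if( count == 0 )
        return 0;
    const uint64_t carryOut = limbs[ count - 1 ] >> 63;
    size_t i = count;
#if defined(__AVX2__)
    // Walk from the top down, so the lower limbs are still unmodified when read.
    while( i >= 5 )
    {
        i -= 4;
        // Two mostly overlapping loads, offset by one element
        const __m256i cur = _mm256_loadu_si256( (const __m256i*)( limbs + i ) );
        const __m256i prev = _mm256_loadu_si256( (const __m256i*)( limbs + i - 1 ) );
        // Left shift, right shift, and vpor stand in for "concatenate and shift"
        const __m256i hi = _mm256_slli_epi64( cur, 1 );
        const __m256i lo = _mm256_srli_epi64( prev, 63 );
        _mm256_storeu_si256( (__m256i*)( limbs + i ), _mm256_or_si256( hi, lo ) );
    }
#endif
    // Scalar code for the remaining low limbs (all limbs when AVX2 is disabled);
    // limb 0 has no previous element, so a zero bit is shifted in.
    while( i > 0 )
    {
        --i;
        const uint64_t low = ( i > 0 ) ? ( limbs[ i - 1 ] >> 63 ) : 0;
        limbs[ i ] = ( limbs[ i ] << 1 ) | low;
    }
    return carryOut;
}
```

For the 2048-bit case from the question, count would be 32. The lowest four limbs are handled by the scalar loop because an offset load for them would read one element before the start of the array; padding the buffer with a zero limb would be another way around that.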

2023-11-19

y3bcpkx12#

It is possible, but not trivial.
Below is an AVX2 implementation in C++ that takes 5 instructions per register.

#include <immintrin.h>

// Shift AVX vector left by 1 bit
// The flag should contain either 0 or 1 in the lowest int32 lane, higher 96 bits are unused
inline __m256i shiftLeft1( const __m256i src, __m128i& carryFlag )
{
    // Shift 64 bit lanes right by 63 bits, i.e. isolate the high bit into low location
    __m256i right = _mm256_srli_epi64( src, 63 );
    // Cyclic permute across the complete vector
    right = _mm256_permute4x64_epi64( right, _MM_SHUFFLE( 2, 1, 0, 3 ) );

    // Deal with the carry flags
    const __m128i nextFlag = _mm256_castsi256_si128( right );
    right = _mm256_blend_epi32( right, _mm256_castsi128_si256( carryFlag ), 1 );
    carryFlag = nextFlag;

    // Shift 64 bit lanes left by 1 bit
    __m256i left = _mm256_slli_epi64( src, 1 );
    // Assemble the result
    return _mm256_or_si256( left, right );
}

// Shift AVX vector right by 1 bit
// The flag should contain either 0 or 0x80000000 in the highest int32 lane, lower 224 bits are unused
inline __m256i shiftRight1( const __m256i src, __m256i& carryFlag )
{
    // Shift 64 bit lanes left by 63 bits, i.e. isolate low bits into high location
    __m256i left = _mm256_slli_epi64( src, 63 );
    // Cyclic permute across the complete vector
    left = _mm256_permute4x64_epi64( left, _MM_SHUFFLE( 0, 3, 2, 1 ) );

    // Deal with the carry flags
    const __m256i nextFlag = left;
    left = _mm256_blend_epi32( left, carryFlag, 0b10000000 );
    carryFlag = nextFlag;

    // Shift 64 bit lanes right by 1 bit
    __m256i right = _mm256_srli_epi64( src, 1 );
    // Assemble the result
    return _mm256_or_si256( left, right );
}

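Most of these 5 instructions are very fast, with 1 cycle of latency. The exception is vpermq, which takes 3-6 cycles on most processors. Fortunately, the vpermq instruction does not depend on the carry flag, only on the input vector, so modern out-of-order processors should run this code well.
Usage example for a 1024-bit number in 4 vectors: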

// 1024 bits of data in 4 AVX registers
struct Blob1k
{
    __m256i v0, v1, v2, v3;
};

void shiftLeft1( Blob1k& blob )
{
    __m128i cf = _mm_setzero_si128();
    blob.v0 = shiftLeft1( blob.v0, cf );
    blob.v1 = shiftLeft1( blob.v1, cf );
    blob.v2 = shiftLeft1( blob.v2, cf );
    blob.v3 = shiftLeft1( blob.v3, cf );
}

void shiftRight1( Blob1k& blob )
{
    __m256i cf = _mm256_setzero_si256();
    blob.v3 = shiftRight1( blob.v3, cf );
    blob.v2 = shiftRight1( blob.v2, cf );
    blob.v1 = shiftRight1( blob.v1, cf );
    blob.v0 = shiftRight1( blob.v0, cf );
}
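The same carry chaining extends to the 2048-bit number from the question by running it over 8 vectors instead of 4. The following is a minimal sketch, not part of the answer: shiftLeft1Wide is a made-up name, the number is assumed to live in memory as little-endian 64-bit limbs, and a scalar loop is used when the code is not compiled with AVX2 enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>

// Same step as the answer's shiftLeft1: shift one vector left by 1 bit,
// taking the incoming bit from carryFlag and leaving the outgoing bit there.
static inline __m256i shiftLeft1( const __m256i src, __m128i& carryFlag )
{
    __m256i right = _mm256_srli_epi64( src, 63 );
    right = _mm256_permute4x64_epi64( right, _MM_SHUFFLE( 2, 1, 0, 3 ) );
    const __m128i nextFlag = _mm256_castsi256_si128( right );
    right = _mm256_blend_epi32( right, _mm256_castsi128_si256( carryFlag ), 1 );
    carryFlag = nextFlag;
    return _mm256_or_si256( _mm256_slli_epi64( src, 1 ), right );
}
#endif

// Hypothetical wrapper: shift vectorCount * 256 bits left by 1 bit, in place.
// limbs[0] is the least significant limb. Returns the bit shifted out of the top.
inline uint64_t shiftLeft1Wide( uint64_t* limbs, size_t vectorCount )
{
    assert( limbs != nullptr );
#if defined(__AVX2__)
    __m128i cf = _mm_setzero_si128();
    // Lowest vector first, so the carry travels towards the top
    for( size_t i = 0; i < vectorCount; i++ )
    {
        __m256i* const p = (__m256i*)( limbs + i * 4 );
        _mm256_storeu_si256( p, shiftLeft1( _mm256_loadu_si256( p ), cf ) );
    }
    // After the last step, the low lane of cf holds the outgoing bit
    return (uint64_t)( _mm_cvtsi128_si32( cf ) & 1 );
#else
    // Portable fallback with the same result
    uint64_t carry = 0;
    for( size_t i = 0; i < vectorCount * 4; i++ )
    {
        const uint64_t next = limbs[ i ] >> 63;
        limbs[ i ] = ( limbs[ i ] << 1 ) | carry;
        carry = next;
    }
    return carry;
#endif
}
```

Calling shiftLeft1Wide( limbs, 8 ) covers 2048 bits. Whether the compiler keeps all 8 vectors in ymm registers or goes through memory depends on the surrounding code. A right shift would follow the answer's shiftRight1 in the same way, starting from the highest vector.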


2023-11-19
