c++ 给定一个字节数组,我如何创建一个字节,其中每个位都包含来自输入的该位置的位的模式(平均)值?

zf2sa74q  于 2023-04-01  发布在  其他
关注(0)|答案(7)|浏览(125)

我试图找出快速和简单的方法来找到一个字节数组中的每个位的模式(“平均”)。
以下是我正在寻找的一个例子:

Byte 1  1010 1010
Byte 2  0101 0101
Byte n  1010 1000
Result  1010 1000

因此,如果位位置主要包含1,则答案中的位位置为1。如果位位置主要包含0,则答案为0。如果1和0的出现次数相等,则我不关心答案中的该位置的值。
对于我的用例来说,输入数量的数量级很小(大约10到20个输入),但是欢迎讨论您的方法的性能,因为它随输入数量而扩展。
我可以手动计算每个1和每个0,并以这种方式计算出来,但我希望有一种更优雅,可能更快的方法来完成它。

z3yyvxxp

z3yyvxxp1#

假设bytesvector<uint8_t>array<uint8_t>
保持一个名为counts的表,其中包含8个整数,所有整数都初始化为0。伪代码:

For each byte B in the input set of bytes
   for each bit b at index i in B
       if b is set:
          counts[i]++

然后使用counts构建最终结果:

vector<int> counts(8);
uint8_t result = 0;

for (uint8_t b : bytes)
{
    uint8_t mask = 0x80;
    size_t needed = (bytes.size() +1) / 2;

    for (size_t i = 0; i < 8; i++)
    {
        counts[i] += (mask & b) ? 1 : 0;
        if (counts[i] >= needed)
        {
            result = mask | result;
        }
        mask = mask >> 1;
    }
}

std::cout << "Result: " << result << "\n";
bz4sfanl

bz4sfanl2#

对于标准C++,已经有很多答案了。
有人在评论中说:“* 但我不认为有一个比特旋转的黑客可以让你在不检查单个比特的情况下做到这一点。*”,这给了我写这个的动力,它基于英特尔内部函数,而不是标准的C++。
您可以使用PDEP将位提取到U64中的字节中(如果您的输入中有超过256个元素,请相应地调整PDEP,以使用多个U64,并为每个位累加器“通道”增加位宽)。
Wikipedia link for Intel BMI,Intel intrinsic reference,related Stack Overflow question .
您可以将std::transform_reducestd::execution::par_unseq一起使用,其中转换是PDEP展开,reduce是求和操作(假设您基于不可能溢出的输入数量将每个通道的位宽设置得足够宽),然后在结束时,如果通道的值超过输入字节数的一半,然后将相应的输出位设置为1。可能有一种奇特的SIMD方法来完成这一部分,但这一步的性能影响是一个常数(我现在懒得找到这种方法)。

krugob8w

krugob8w3#

如果你有一个for循环,迭代0x 80,0x 40,0x 20,0x 10,0x 08,0x 04,0x 02,0x 01,我们可以把它作为一个掩码来检查数据中的每一位:

for (unsigned char mask = 0x80; mask; mask >>= 1 ) {
        if (data & mask) {
            // 1-case
        } else {
            // 0-case
        }
    }

我们可以通过使用条件来重构if语句,即

(data & mask) ? /* 1-case */ : /* 0-case */

当计算位数时,我们只需要计算足够多的位数,直到我们可以在移动到下一位之前确定特定位的大多数。如果数据的倾向变得明显,我们可以考虑提前退出实现。如果它是一个令人紧张的平均值,那么我们别无选择,只能迭代所有记录。
在下列情况下达到多数:

majority == n/2     , n even
majority == (n+1)/2 , n odd

在进行整数数学运算时,我们可以使用以下语句来涵盖这两种情况:

int majority = (n+1) >> 1;

当我们确定了一个1-case时,我们可以在结果中设置一个位:

result |= mask;

下面是一个C++的例子:

#include <stdio.h>
#include <bitset>
#include <iostream>

void print_data(unsigned char data) {
    std::cout << std::bitset<8>(data) << '\n';
}

void print_arr(unsigned char*data, int n) {
    for (int i = 0; i < n; i++, data++) {
        std::cout << "Byte " << i << " ";
        print_data(*data);
    }
}

unsigned char mean_arr(unsigned char*data, int n) {
    int majority = (n + 1) >> 1;
    unsigned char result = 0;
    for (unsigned char mask = 0x80; mask; mask >>= 1 ) {
        int count[2] = {0,0};
        unsigned char* ptr = data;
        for (int i = 0; i < n; i++, ptr++) {
            if (++count[*ptr & mask ? 1 : 0] < majority) continue;
            if (*ptr & mask) result |= mask;
            break;
        }
    }
    return result;
}

int main() {
    unsigned char data[3] = { 0xaa, 0x55, 0xa8 };
    print_arr(data, 3);
    unsigned char result = mean_arr(data, 3);
    std::cout << "Result ";
    print_data(result);
    return 0;
}

其输出:

Byte 0 10101010
Byte 1 01010101
Byte 2 10101000
Result 10101000
wfveoks0

wfveoks04#

C和C的解决方案(因为我后来才注意到C标签):

#include <stdio.h>
#include <stdint.h>

uint8_t bitvote(uint8_t *bytes, unsigned n)
{
    uint8_t bit = 1;
    uint8_t result = 0;
    // Iterate over each bit position
    for(unsigned i = 0; i < 8*sizeof(uint8_t); i++)
    {
        // For the current position, gather a "vote" count from each input
        // byte. A 0-bit counts as -1 and a 1-bit as +1.
        int vote = 0; 
        for(unsigned c = 0; c < n; c++)
        {
            vote += (bytes[c] & bit) ? 1 : -1;
        }
        // If the "vote" is larger than 0, there is a majority of 1-bits, so
        // set the corresponding bit in the result.
        // On equal count of 0's and 1's, the resulting bit is 0. To change
        // that to 1, use "vote >= 0".
        if(vote > 0) result |= bit;
        bit = bit << 1;
    }
    return result;
}

int main()
{
    uint8_t bytes[] = {0xaa, 0x55, 0xa8};  // Test vector of the OP
    uint8_t result = bitvote(bytes, sizeof(bytes)/sizeof(bytes[0]));
    printf("0x%02x", result);   // Prints 0xa8 matching the OP expected result

    return 0;
}

这种方法相当简单,但我认为不可能做出一些聪明的位操作动作。部分参数如下:
考虑两个比特,因此可能的组合是00,01,10,11。这给出了3种可能性:多数0、相等计数和多数1。因此,不可能将这些信息压缩到单个比特中。因此,如果我们逐个处理输入字节,我们就不能有大小仅为一个字节的中间状态。

6ojccjat

6ojccjat5#

从数学上讲,您可以通过乘法和一些位操作将8位字节的位“扩展”为64位无符号值。
然后,您可以通过添加64位数字来“并行”添加多达255个。
user提到的方法在phuclv的详细回答中显示(沿着使用intrinsic实现的一些示例):
https://stackoverflow.com/a/51750902/4944425
使用他们的公式,我们可以写如下:

// The following is valid only if the size of the span is less than 256,
// otherwise a temporary array must be used to accumulate the results
// every 255 elements.
auto bits_mode(std::span<std::uint8_t> src) -> std::uint8_t
{
  // Multiplying a byte by this 64-bit constant will "copy" it in 8
  // different positions with a gap of 1 bit between them.
  // 
  // E.g. given x = 10100100
  //                                                           10100100
  // * 1000000001000000001000000001000000001000000001000000001000000001
  // = 0010100100010100100010100100010100100010100100010100100010100100
  //   ^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^
  constexpr std::uint64_t mult {
    0b10000000'01000000'00100000'00010000'00001000'00000100'00000010'00000001
  };

  // The gaps between the copies is crucial, it let us retrieve the 
  // spread bits (with reversed endianness) with a simple mask:
  //
  // & 1000000010000000100000001000000010000000100000001000000010000000 
  // = 0000000000000000100000000000000000000000100000000000000010000000
  //   ^       ^       ^       ^       ^       ^       ^       ^
  constexpr std::uint64_t mask {
    0b10000000'10000000'10000000'10000000'10000000'10000000'10000000'10000000
  };

  // Shifting the intermediate gives us something we can safely sum up
  // to 255 times.
  //
  // >> 7
  // = 0000000000000000000000010000000000000000000000010000000000000001
  //          ^       ^       ^       ^       ^       ^       ^       ^
  constexpr auto expand = [] (std::uint8_t x) -> std::uint64_t {
    return ((x * mult) & mask) >> 7;
  };

  auto result{ std::transform_reduce( src.begin(), src.end()
                                    , std::uint64_t{}
                                    , std::plus<>{}, expand ) };

  // the resulting byte can be formed a bit at a time "traversing"
  // the bytes of the previous sum.
  std::uint64_t target{ src.size() / 2 };
  std::uint8_t mode{};
  for ( int i{8}; i--; )
  {
      mode |= ((result & 0xff) > target) << i;
      result >>= 8;
  }

  return mode;
}

地址:https://godbolt.org/z/K1MrYcnvK

9o685dep

9o685dep6#

有一种有点繁琐的方法可以在std::uint8_tstd::uint64_t上工作。(几乎)没有分支并且是原位的。然而,有一个警告,我在下面讨论。

创意

如果你有一系列的位,你要对它们进行排序,你可以查看中间值并得到平均值。事实上,你可以得到任何你想要的分位数。这是可行的,因为位是二进制的。
例如:01001010001 -〉00000001111 -〉中间位为0 -〉平均值为0。

分类

如果你有两个数字ab,你可以通过实现将所有1“向下”移动和将所有0“向上”移动的结果来按列排序位:

// pseudo-swap:
a2 = a&b;
b2 = a|b;

这保证了保留1和0的总数,并且它以位并行方式工作。* 并且是无分支的。*

问题

将其应用于n个值而不分支并不像人们想象的那么简单,简单的解决方案是将每个值彼此伪交换,这在纸面上具有二次复杂度,但也是无分支的(除了可以很容易地进行分支预测或展开的循环),并且除了现有的数组之外不需要额外的内存。我很确定有一种更好的方法来做到这一点,但我没想到。也许有人能进一步发展这个想法。

算法

该算法将其变为:

11010000
00100010
11100111
11010101
00100000
11111000
11101001

变成这样:

00000000
00000000
11100000
11100000 <--- your average
11110001
11111111
11111111

示例实现

void pseudoswap(std::uint8_t& a, std::uint8_t& b) {
  auto const fst = a & b;
  auto const snd = a | b;
  a = fst;
  b = snd;
}

template <std::size_t N>
void sieve(std::uint8_t (&xs)[N]) {
  for (auto i = 0; i < N; ++i) {
    for (auto j = i+1; j < N; ++j) {
      pseudoswap(xs[i],xs[j]);
    }
  }
}

看一下working demo on compiler explorer,它展示了对主要实现的一些很好的优化:

.L88:
        mov     rsi, r8
        add     r8, 1
        mov     rax, rsi
.L90:
        movzx   edx, BYTE PTR [rsi-1]
        movzx   ecx, BYTE PTR [rax]
        add     rax, 1
        mov     edi, edx
        or      edx, ecx
        and     edi, ecx
        mov     BYTE PTR [rsi-1], dil
        mov     BYTE PTR [rax-1], dl
        cmp     rbx, rax
        jne     .L90
        cmp     rbx, r8
        jne     .L88

对于小的数组大小,sieve循环似乎是展开的,在热路径中没有任何跳跃。

讨论

这听起来很糟糕,这是二次的。但是你说你的输入数组非常小,在10和20之间。这意味着复杂度bn和nn(b =一个字中的位数)是无法区分的,因为b ~ n。这样做的好处是:

  • 无数据相关分支
  • 无附加内存(无分配)
  • 热路径中无整数除法
  • 可以有任何分位数(不只是平均值)

如果你达到了拥有数百万或数十亿字节的地步,那么就切换到一个计数1 s的解决方案,因为这样的话,二次复杂度就会成为一个问题。

*如有疑问 * 基准

nzkunb0c

nzkunb0c7#

这是一个部分的答案,因为我的方法不能处理任意数量的输入字节。它可以处理任何数量的输入,有效的排序网络是已知的。我在下面的代码中展示了长度为2到16字节的数组的例子。
Knuth,TAOCP,第4A卷,第7.1.1节详细解释了三元多数运算的实用性(x ∧ y)∨(y ∧ z)∨(x ∧ z)。此函数按位应用于三个输入时,将实现asker请求的结果。Knuth更喜欢将该函数称为“中位数”而不是“多数”因为如果让∧对应于min,∨对应于max,那么当x ≤ y ≤ z时,恰好〈xyz〉= y。
这里有趣的观察是,构建排序网络的一种方法是从minmax原语。三个median3 (x,y,z) = min (max (min (y, z), x), max (y,z))的中位数,而min3 (x,y,z) = min (x, min (y, z))max3 (x,y,z) = max (x, max (y, z))
克努特指出,任何单调布尔函数都可以只用中值运算和常数0和1来表示,因此,五的中值可以用这种方式表示,根据克努特(第64页),最有效的排列是:∠ vwxyz ∠ = ∠ v ∠ xyz ∠ ∠ wx ∠ wyz ∠。
为了测试排序网络的位中值计算的可导出性,我使用了文献中的一个9输入网络,并将其转换为9位中值计算,它提供了asker所要求的结果。我还将以前工作中的一些搜索网络图转换为相应的mix / max操作序列。以及从TAOCP vol.3翻译的其他网络图。对于其他排序网络,我参考了Bert Dobbelaere's list,如代码注解中所述。Wikipedia article on sorting networks建议(接近)最佳排序网络已知多达20个输入,因此覆盖了提问者感兴趣的数组尺子范围。
至于效率,compiled with Clang 16下面代码中的byte_mode_16()编译为大约170条x86指令,具有大量指令级并行性,因此我 * 猜测 * 在现代x86-64 CPU上执行大约需要50个周期。在NVIDIA GPU上,LOP3指令支持任意三输入逻辑运算,相同的函数编译为大约80条指令。

#include <cstdio>
#include <cstdlib>
#include <climits>
#include <cstdint>

#define REPS      (1000000)
#define N         (16)
#define USE_KNUTH (0)

uint8_t byte_mode_ref (const uint8_t *bytes, int n)
{
    int count1 [CHAR_BIT] = {0, 0, 0, 0, 0, 0, 0, 0};
    uint8_t res = 0;

    for (int i = 0; i < n; i++) {
        uint8_t a = bytes [i];
        for (int j = 0; j < CHAR_BIT; j++) {
            uint8_t bit = (a >> j) & 1;
            count1 [j] += bit;
        }
    }
    for (int j = 0; j < CHAR_BIT; j++) {
        res |= (count1[j] > (n / 2)) << j;
    }
    return res;
}

#define MIN3(x,y,z)     (x & y & z)
#define MEDIAN3(x,y,z)  (((y & z) | x) & (y | z))
#define MAX3(x,y,z)     (x | y | z)

#define MIN2(x,y) (x & y)
#define MAX2(x,y) (x | y)
#define SWAP(i,j) do { int s = MIN2(a##i,a##j); int t = MAX2(a##i,a##j); a##i = s; a##j = t; } while (0)

uint8_t byte_mode_2 (const uint8_t *bytes)
{
    int x = bytes [0];
    int y = bytes [1];

    return MIN2 (x, y);
}

uint8_t byte_mode_3 (const uint8_t *bytes)
{
    int x = bytes [0];
    int y = bytes [1];
    int z = bytes [2];

    return MEDIAN3 (x, y, z);
}

uint8_t byte_mode_4 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];

    SWAP (0, 2); SWAP (1, 3);
    SWAP (0, 1); SWAP (2, 3);
    SWAP (1, 2);

    return a1;
}

uint8_t byte_mode_5 (const uint8_t *bytes)
{
#if USE_KNUTH
    int v = bytes [0];
    int w = bytes [1];
    int x = bytes [2];
    int y = bytes [3];
    int z = bytes [4];

    /* Knuth, TAOCP, Vol. 4a, p. 64 */
    return MEDIAN3 (v, MEDIAN3 (x, y, z), MEDIAN3 (w, x, (MEDIAN3 (w, y, z))));
#else // USE_KNUTH
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];

    SWAP (0, 3); SWAP (1, 4);
    SWAP (0, 2); SWAP (1, 3);
    SWAP (0, 1); SWAP (2, 4);
    SWAP (1, 2); SWAP (3, 4);
    SWAP (2, 3);
    
    return a2;
#endif // USE_KNUTH
}

uint8_t byte_mode_6 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
  
    SWAP (0, 5); SWAP (1, 3); SWAP (2,4);
    SWAP (1, 2); SWAP (3, 4); 
    SWAP (0, 3); SWAP (2, 5);
    SWAP (0, 1); SWAP (2, 3); SWAP (4, 5);
    SWAP (1, 2); SWAP (3, 4);

    return a2;
}

uint8_t byte_mode_7 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
  
    SWAP (0, 6); SWAP (2, 3); SWAP (4, 5);
    SWAP (0, 2); SWAP (1, 4); SWAP (3, 6);
    SWAP (0, 1); SWAP (2, 5); SWAP (3, 4);
    SWAP (1, 2); SWAP (4, 6);
    SWAP (2, 3); SWAP (4, 5);
    SWAP (1, 2); SWAP (3, 4); SWAP (5, 6);

    return a3;
}

uint8_t byte_mode_8 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
    int a7 = bytes [7];

    SWAP (0, 2); SWAP (1, 3); SWAP (4, 6); SWAP (5, 7);
    SWAP (0, 4); SWAP (1, 5); SWAP (2, 6); SWAP (3, 7);
    SWAP (0, 1); SWAP (2, 3); SWAP (4, 5); SWAP (6, 7);
    SWAP (2, 4); SWAP (3, 5);
    SWAP (1, 4); SWAP (3, 6);
    SWAP (1, 2); SWAP (3, 4); SWAP (5, 6);
    
    return a3;
}

uint8_t byte_mode_9 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
    int a7 = bytes [7];
    int a8 = bytes [8];
    int b0, b1, b2, b3, b4, b5, b6, b7, b8;
    int c2, c4, c6;

    /*
      Chaitali Chakrabarti and Li-Yu Wang, "Novel sorting network-based 
      architectures for rank order filters." IEEE Transactions on Very
      Large Scale Integration Systems, Vol. 2, No. 4, 1994, pp. 502-507
    */
    // SORT3 (0, 1, 2)
    b0 = MIN3    (a0, a1, a2);
    b1 = MEDIAN3 (a0, a1, a2);
    b2 = MAX3    (a0, a1, a2);
    // SORT3 (3, 4, 5)
    b3 = MIN3    (a3, a4, a5);
    b4 = MEDIAN3 (a3, a4, a5);
    b5 = MAX3    (a3, a4, a5);
    // SORT3 (6, 7, 8)
    b6 = MIN3    (a6, a7, a8);
    b7 = MEDIAN3 (a6, a7, a8);
    b8 = MAX3    (a6, a7, a8);
    // SORT3 (0, 3, 6)
    // c0 = MIN3    (b0, b3, b6);
    // c3 = MEDIAN3 (b0, b3, b6);
    c6 = MAX3    (b0, b3, b6);
    // SORT3 (1, 4, 7)
    // c1 = MIN3    (b1, b4, b7);
    c4 = MEDIAN3 (b1, b4, b7);
    // c7 = MAX3    (b1, b4, b7);
    // SORT3 (2, 5, 8)
    c2 = MIN3    (b2, b5, b8);
    // c5 = MEDIAN3 (b2, b5, b8);
    // c8 = MAX3    (b2, b5, b8);
    // final median computation
    // SORT3 (2, 4, 6)
    return MEDIAN3 (c2, c4, c6);
}

uint8_t byte_mode_10 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];

#if USE_KNUTH
    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (4, 9); SWAP (3, 8); SWAP (2, 7); SWAP (1, 6); SWAP (0, 5);
    SWAP (1, 4); SWAP (6, 9); SWAP (0, 3); SWAP (5, 8);
    SWAP (0, 2); SWAP (3, 6); SWAP (7, 9);
    SWAP (0, 1); SWAP (2, 4); SWAP (5, 7); SWAP (8, 9);
    SWAP (4, 6); SWAP (1, 2); SWAP (7, 8); SWAP (3, 5);
    SWAP (2, 5); SWAP (6, 8); SWAP (1, 3); SWAP (4, 7);
    SWAP (2, 3); SWAP (6, 7);
    SWAP (3, 4); SWAP (5, 6);
    SWAP (4, 5);
#else // USE_KNUTH
    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 8); SWAP (1, 9); SWAP (2, 7); SWAP (3, 5); SWAP (4, 6);
    SWAP (0, 2); SWAP (1, 4); SWAP (5, 8); SWAP (7, 9);
    SWAP (0, 3); SWAP (2, 4); SWAP (5, 7); SWAP (6, 9);
    SWAP (0, 1); SWAP (3, 6); SWAP (8, 9);
    SWAP (1, 5); SWAP (2, 3); SWAP (4, 8); SWAP (6, 7);
    SWAP (1, 2); SWAP (3, 5); SWAP (4, 6); SWAP (7, 8);
    SWAP (2, 3); SWAP (4, 5); SWAP (6, 7);
    SWAP (3, 4); SWAP (5, 6);
#endif // USE_KNUTH

    return a4;
}

uint8_t byte_mode_11 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 9); SWAP (1, 6); SWAP (2,  4); SWAP (3,  7); SWAP (5,  8);
    SWAP (0, 1); SWAP (3, 5); SWAP (4, 10); SWAP (6,  9); SWAP (7,  8);
    SWAP (1, 3); SWAP (2, 5); SWAP (4,  7); SWAP (8, 10);
    SWAP (0, 4); SWAP (1, 2); SWAP (3,  7); SWAP (5,  9); SWAP (6,  8);
    SWAP (0, 1); SWAP (2, 6); SWAP (4,  5); SWAP (7,  8); SWAP (9, 10);
    SWAP (2, 4); SWAP (3, 6); SWAP (5,  7); SWAP (8,  9);
    SWAP (1, 2); SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
    SWAP (2, 3); SWAP (4, 5); SWAP (6,  7);

    return a5;
}

uint8_t byte_mode_12 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];

#if USE_KNUTH
    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (0, 1); SWAP (2,  3); SWAP (4,  5); SWAP (6,  7); SWAP (8,  9); SWAP (10, 11);
    SWAP (1, 3); SWAP (5,  7); SWAP (9, 11); SWAP (0,  2); SWAP (4,  6); SWAP ( 8, 10);
    SWAP (1, 2); SWAP (5,  6); SWAP (9, 10);
    SWAP (1, 5); SWAP (6, 10); 
    SWAP (2, 6); SWAP (5,  9);  
    SWAP (1, 5); SWAP (6, 10);
    SWAP (0, 4); SWAP (7, 11);
    SWAP (3, 7); SWAP (4,  8);
    SWAP (0, 4); SWAP (7, 11);
    SWAP (1, 4); SWAP (7, 10); SWAP (3,  8);
    SWAP (2, 3); SWAP (8,  9);
    SWAP (2, 4); SWAP (7,  9); SWAP (3,  5); SWAP (6,  8);
    SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
#else // USE_KNUTH
    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 8); SWAP (1,  7); SWAP (2,  6); SWAP (3, 11); SWAP (4, 10); SWAP (5,   9);
    SWAP (0, 1); SWAP (2,  5); SWAP (3,  4); SWAP (6,  9); SWAP (7,  8); SWAP (10, 11);
    SWAP (0, 2); SWAP (1,  6); SWAP (5, 10); SWAP (9, 11);
    SWAP (0, 3); SWAP (1,  2); SWAP (4,  6); SWAP (5,  7); SWAP (8, 11); SWAP (9,  10);
    SWAP (1, 4); SWAP (3,  5); SWAP (6,  8); SWAP (7, 10);
    SWAP (1, 3); SWAP (2,  5); SWAP (6,  9); SWAP (8, 10);
    SWAP (2, 3); SWAP (4,  5); SWAP (6,  7); SWAP (8,  9);
    SWAP (4, 6); SWAP (5,  7);
    SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
#endif // USE_KNUTH

    return a5;
}

uint8_t byte_mode_13 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];

    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (0,  5); SWAP ( 8, 12); SWAP (1, 7); SWAP ( 3,  9); SWAP ( 2,  4); SWAP (6, 11);
    SWAP (0,  6); SWAP ( 7,  9); SWAP (1, 3); SWAP ( 5, 11); SWAP ( 2,  8); SWAP (4, 12);
    SWAP (0,  2); SWAP ( 4,  5); SWAP (6, 8); SWAP ( 9, 10); SWAP (11, 12); 
    SWAP (7,  8); SWAP (10, 12); SWAP (5, 9); SWAP ( 3, 11);
    SWAP (1,  5); SWAP ( 9, 11); SWAP (2, 3); SWAP ( 4,  7); SWAP ( 8, 10);
    SWAP (0,  1); SWAP ( 5,  6); SWAP (8, 9); SWAP (10, 11);
    SWAP (3,  6); SWAP ( 2,  5); SWAP (1, 4); SWAP ( 7,  8); SWAP ( 9, 10);
    SWAP (3,  7); SWAP ( 1,  2); SWAP (4, 5); SWAP ( 6,  8);
    SWAP (2,  4); SWAP ( 6,  7); SWAP (3, 5); SWAP ( 8,  9);
    SWAP (3,  4); SWAP ( 5,  6);

    return a6;
}

uint8_t byte_mode_14 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0,  1); SWAP (2, 3); SWAP (4,  5); SWAP (6,  7); SWAP ( 8,  9); SWAP (10, 11);  SWAP (12, 13);
    SWAP (0,  2); SWAP (1, 3); SWAP (4,  8); SWAP (5,  9); SWAP (10, 12); SWAP (11, 13);
    SWAP (0, 10); SWAP (1, 6); SWAP (2, 11); SWAP (3, 13); SWAP ( 5,  8); SWAP ( 7, 12);
    SWAP (1,  4); SWAP (2, 8); SWAP (3,  6); SWAP (5, 11); SWAP ( 7, 10); SWAP ( 9, 12);
    SWAP (0,  1); SWAP (3, 9); SWAP (4, 10); SWAP (5,  7); SWAP ( 6,  8); SWAP (12, 13);
    SWAP (1,  5); SWAP (2, 4); SWAP (3,  7); SWAP (6, 10); SWAP ( 8, 12); SWAP ( 9, 11);
    SWAP (1,  2); SWAP (3, 5); SWAP (4,  6); SWAP (7,  9); SWAP ( 8, 10); SWAP (11, 12);
    SWAP (2,  3); SWAP (4, 5); SWAP (6,  7); SWAP (8,  9); SWAP (10, 11);
    SWAP (3,  4); SWAP (5, 6); SWAP (7,  8); SWAP (9, 10);

    return a6;
}

uint8_t byte_mode_15 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];
    int a14 = bytes [14];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0,  6); SWAP (1, 10); SWAP (2,14); SWAP (3,  9); SWAP ( 4, 12); SWAP ( 5, 13); SWAP ( 7, 11);
    SWAP (0,  7); SWAP (2,  5); SWAP (3, 4); SWAP (6, 11); SWAP ( 8, 10); SWAP ( 9, 12); SWAP (13, 14);
    SWAP (1, 13); SWAP (2,  3); SWAP (4, 6); SWAP (5,  9); SWAP ( 7,  8); SWAP (10, 14); SWAP (11, 12);
    SWAP (0,  3); SWAP (1,  4); SWAP (5, 7); SWAP (6, 13); SWAP ( 8,  9); SWAP (10, 11); SWAP (12, 14);
    SWAP (0,  2); SWAP (1,  5); SWAP (3, 8); SWAP (4,  6); SWAP ( 7, 10); SWAP ( 9, 11); SWAP (12, 13);
    SWAP (0,  1); SWAP (2,  5); SWAP (3,10); SWAP (4,  8); SWAP ( 6,  7); SWAP ( 9, 12); SWAP (11, 13);
    SWAP (1,  2); SWAP (3,  4); SWAP (5, 6); SWAP (7,  9); SWAP ( 8, 10); SWAP (11, 12);
    SWAP (3,  5); SWAP (4,  6); SWAP (7, 8); SWAP (9, 10);
    SWAP (2,  3); SWAP (4,  5); SWAP (6, 7); SWAP (8,  9); SWAP (10, 11);

    return a7;
}

uint8_t byte_mode_16 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];
    int a14 = bytes [14];
    int a15 = bytes [15];

    /* Knuth TAOCP, vol. 3, p. 229 */
    SWAP (0,  1); SWAP ( 2,  3); SWAP ( 4,  5); SWAP ( 6,  7); SWAP ( 8,  9); SWAP (10, 11); SWAP (12, 13); SWAP (14, 15);
    SWAP (1,  3); SWAP ( 5,  7); SWAP ( 9, 11); SWAP (13, 15); SWAP ( 0,  2); SWAP ( 4,  6); SWAP ( 8, 10); SWAP (12, 14);
    SWAP (3,  7); SWAP (11, 15); SWAP ( 2,  6); SWAP (10, 14); SWAP ( 1,  5); SWAP ( 9, 13); SWAP ( 0,  4); SWAP ( 8, 12);
    SWAP (7, 15); SWAP ( 6, 14); SWAP ( 5, 13); SWAP ( 4, 12); SWAP ( 3, 11); SWAP ( 2, 10); SWAP ( 1,  9); SWAP ( 0,  8);
    SWAP (1,  2); SWAP ( 3, 12); SWAP (13, 14); SWAP ( 7, 11); SWAP ( 4,  8); SWAP ( 5, 10); SWAP ( 6,  9);
    SWAP (1,  4); SWAP (11, 14); SWAP ( 7, 13); SWAP ( 2,  8); SWAP ( 6, 12); SWAP ( 3, 10); SWAP ( 5,  9);
    SWAP (3,  5); SWAP ( 7,  9); SWAP (11, 13); SWAP ( 2,  4); SWAP ( 6,  8); SWAP (10, 12); 
    SWAP (5,  8); SWAP ( 9, 12); SWAP ( 3,  6); SWAP ( 7, 10); 
    SWAP (3,  4); SWAP ( 5,  6); SWAP ( 7,  8); SWAP ( 9, 10); SWAP (11, 12);

    return a7;
}

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{    
    uint8_t res, ref, arr [N];
    printf ("Testing byte arrays of length %d\n", N);
    for (int i = 0; i < REPS; i++) {
        for (int j = 0; j < N; j++) {
            arr [j] = KISS;
        }
        ref = byte_mode_ref (arr, N);
#if N==2
        res = byte_mode_2 (arr);
#elif N==3
        res = byte_mode_3 (arr);
#elif N==4
        res = byte_mode_4 (arr);
#elif N==5
        res = byte_mode_5 (arr);
#elif N==6
        res = byte_mode_6 (arr);
#elif N==7
        res = byte_mode_7 (arr);
#elif N==8
        res = byte_mode_8 (arr);
#elif N==9
        res = byte_mode_9 (arr);
#elif N==10
        res = byte_mode_10 (arr);
#elif N==11
        res = byte_mode_11 (arr);
#elif N==12
        res = byte_mode_12 (arr);
#elif N==13
        res = byte_mode_13 (arr);
#elif N==14
        res = byte_mode_14 (arr);
#elif N==15
        res = byte_mode_15 (arr);
#elif N==16
        res = byte_mode_16 (arr);
#else
#error unsupported value of N
#endif
        if (res != ref) {
            printf ("Error: res=%02x ref=%02x bytes: ", res, ref);
            for (int j = 0; j < N; j++) {
                printf ("%02x ", arr[j]);
            }
            printf ("\nTest FAILED\n");
            return EXIT_FAILURE;
        }
    }
    printf ("Test PASSED\n");
    return EXIT_SUCCESS;
}

相关问题