C++: getting the high part of a 64-bit integer multiplication

Posted by 3duebb1j on 2022-12-20 in Other


In C++, say that:

uint64_t i;
uint64_t j;

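Then i * j yields a uint64_t whose value is the lower part of the product of i and j, i.e. (i * j) mod 2^64. Now, what if I want the higher part of the product? I know there is an assembly instruction that does something like this for 32-bit integers, but I am not at all familiar with assembly, so I was hoping for some help.
What is the most efficient way to write something like: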

uint64_t k = mulhi(i, j);

c++

Source: https://stackoverflow.com/questions/28868367/getting-the-high-part-of-64-bit-integer-multiplication

5 answers


bq9c1y661#

Here is the asm for ARMv8 (AArch64):

// High (p1) and low (p0) product
uint64_t p0, p1;
// multiplicand and multiplier
uint64_t a = ..., b = ...;

p0 = a*b; asm ("umulh %0,%1,%2" : "=r"(p1) : "r"(a), "r"(b));

Here is the asm for the old DEC compiler:

p0 = a*b; p1 = asm("umulh %a0, %a1, %v0", a, b);

If you have BMI2 on x86 and want to use mulxq:

asm ("mulxq %3, %0, %1" : "=r"(p0), "=r"(p1) : "d"(a), "r"(b));

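The generic x86 multiply uses mulq (the multiplier constraint is "rm" rather than "g", because mul cannot take an immediate operand):

asm ("mulq %3" : "=a"(p0), "=d"(p1) : "a"(a), "rm"(b) : "cc");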
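The snippets above are fragments, so here is a minimal sketch of how they might be wrapped into one function. The name mulhi_asm, the choice of predefined macros, and the schoolbook fallback branch are assumptions for illustration, not part of the answer; the inline-asm branches assume a GNU-compatible compiler.

```cpp
#include <cassert>
#include <cstdint>

// mulhi_asm is a hypothetical wrapper name, not something defined in the answer.
static inline uint64_t mulhi_asm(uint64_t a, uint64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t lo, hi;
    // MUL r/m64 computes RDX:RAX = RAX * operand and modifies the flags.
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "a"(a), "rm"(b) : "cc");
    (void)lo;  // the low half is not needed here
    return hi;
#elif defined(__GNUC__) && defined(__aarch64__)
    uint64_t hi;
    // UMULH writes the upper 64 bits of the 128-bit product.
    __asm__("umulh %0, %1, %2" : "=r"(hi) : "r"(a), "r"(b));
    return hi;
#else
    // Assumed fallback: four 32x32->64 multiplies with a manual carry.
    const uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    const uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    const uint64_t m1 = a_hi * b_lo, m2 = a_lo * b_hi;
    const uint64_t carry =
        ((uint64_t)(uint32_t)m1 + (uint32_t)m2 + ((a_lo * b_lo) >> 32)) >> 32;
    return a_hi * b_hi + (m1 >> 32) + (m2 >> 32) + carry;
#endif
}

// Small self-check using values whose high halves are easy to derive by hand.
static inline void mulhi_asm_self_check() {
    assert(mulhi_asm(0, 5) == 0);
    assert(mulhi_asm(1ULL << 32, 1ULL << 32) == 1);  // 2^32 * 2^32 = 2^64
    assert(mulhi_asm(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFFFFFFFFFEULL);
}
```

Wrapping the asm in a function lets the compiler pick registers for a and b at each call site, and the fallback keeps the code building on targets that match neither branch.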


gajydyqb2#

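If you are using gcc and your version supports 128-bit numbers (try __uint128_t), then doing the 128-bit multiply and extracting the upper 64 bits is likely the most efficient way to get the result.
If your compiler does not support 128-bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful about overflowing 64-bit integers.
The simple and portable solution he proposes is to break each of a and b into two 32-bit numbers and then multiply those 32-bit numbers using the 64-bit multiply operation. If we write: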

uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;

then it is obvious that:

a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;

and:

a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
      = ((a_hi * b_hi) << 64) +
        ((a_hi * b_lo) << 32) +
        ((b_hi * a_lo) << 32) +
          a_lo * b_lo

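provided the calculation is performed using 128-bit (or wider) arithmetic.
But this problem requires that we perform all the calculations using 64-bit arithmetic, so we have to worry about overflow.
Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32-bit numbers, each of their pairwise products fits in an unsigned 64-bit number without overflow. The sums of those products in the expansion above, however, do not.
The following code implements mulhi(a, b) when the arithmetic must be performed modulo 2^64: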

uint64_t    a_lo = (uint32_t)a;
uint64_t    a_hi = a >> 32;
uint64_t    b_lo = (uint32_t)b;
uint64_t    b_hi = b >> 32;

uint64_t    a_x_b_hi =  a_hi * b_hi;
uint64_t    a_x_b_mid = a_hi * b_lo;
uint64_t    b_x_a_mid = b_hi * a_lo;
uint64_t    a_x_b_lo =  a_lo * b_lo;

uint64_t    carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                         (uint64_t)(uint32_t)b_x_a_mid +
                         (a_x_b_lo >> 32) ) >> 32;

uint64_t    multhi = a_x_b_hi +
                     (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                     carry_bit;

return multhi;

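As Yakk points out, if you don't mind the upper 64 bits being slightly too low, you can omit the calculation of the carry. Note that despite its name, carry_bit is not a single bit: it is the sum of three 32-bit values shifted right by 32, so it can be 0, 1, or 2.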
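The code above ends in a bare return, so it needs a function around it to compile. A minimal sketch of that wrapping follows; the function name mulhi_portable and the shortened variable names are assumptions, and the arithmetic is otherwise the same as above.

```cpp
#include <cassert>
#include <cstdint>

// mulhi_portable is a hypothetical name for the fragment above wrapped in a function.
static inline uint64_t mulhi_portable(uint64_t a, uint64_t b) {
    const uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    const uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    const uint64_t hi_hi = a_hi * b_hi;  // weight 2^64
    const uint64_t hi_lo = a_hi * b_lo;  // weight 2^32
    const uint64_t lo_hi = a_lo * b_hi;  // weight 2^32
    const uint64_t lo_lo = a_lo * b_lo;  // weight 2^0

    // Three values below 2^32 are added, so the sum fits in 64 bits
    // and the carry into the high half is 0, 1, or 2.
    const uint64_t carry =
        ((uint64_t)(uint32_t)hi_lo + (uint32_t)lo_hi + (lo_lo >> 32)) >> 32;

    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + carry;
}

// Self-check with hand-derived values.
static inline void mulhi_portable_self_check() {
    assert(mulhi_portable(0, 123) == 0);
    assert(mulhi_portable(1ULL << 32, 1ULL << 32) == 1);
    // (2^33 - 1)^2 = 2^66 - 2^34 + 1, whose high half is 3; the carry here is 2.
    assert(mulhi_portable(0x1FFFFFFFFULL, 0x1FFFFFFFFULL) == 3);
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, whose high half is 2^64 - 2.
    assert(mulhi_portable(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFFFFFFFFFEULL);
}
```

The input 0x1FFFFFFFF is a useful test value because it exercises a carry of 2, which a version that treats the carry as a single bit would get wrong.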


2jcobegt3#

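**TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction.** No need to mess around with inline asm.
Unfortunately, current compilers *don't* optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)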

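GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms (or __uint128_t in older versions). GCC does not implement this type on 32-bit platforms, though.
This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has an all-zero upper half, so you don't get a 128-bit multiply built from three 64-bit multiplies.)
MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it is only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention IPF (IA-64) having _umul128 available, but I don't have MSVC for Itanium available.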

#define HAVE_FAST_mul64 1

#ifdef __SIZEOF_INT128__     // GNU C
 static inline
 uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int128 prod =  a * (unsigned __int128)b;
     return prod >> 64;
 }

#elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
   // MSVC for x86-64 or AArch64
   // possibly also  || defined(_M_IA64) || defined(_WIN64)
   // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux

  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
  #define mulhi64 __umulh

#elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umul128
  // incorrectly say that _umul128 is available for ARM
  // which would be weird because there's no single insn on AArch32
  #include <intrin.h>
  static inline
  uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int64 HighProduct;
     (void)_umul128(a, b, &HighProduct);
     return HighProduct;
  }

#else

# undef HAVE_FAST_mul64
  uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
  // or you might want to define @craigster0's version here so it can inline.
#endif

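For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction plus a couple of movs to deal with the calling convention (which should optimize away once this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):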

# x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
     # MSVC makes basically the same function, but with different regs for x64 __fastcall
    mov     rax, rsi
    mul     rdi              # RDX:RAX = RAX * RDI
    mov     rax, rdx
    ret

(Or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi puts the high half in RAX directly. gcc is dumb and still uses an extra mov.)
For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

test_var:
    umulh   x0, x0, x1
    ret

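With a compile-time constant power-of-2 multiplier, we usually get the expected right shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).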

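Unfortunately, current compilers *don't* optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64; i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacking of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.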

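A full-multiply mul r64 instruction is 2 uops on Intel CPUs, but still only 3 cycles of latency, the same as imul r64,r64, which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.
Check it out on the Godbolt compiler explorer at the link above.
gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, which is more efficient than the unsigned __int128 multiply.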
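The power-of-2 case can be checked by hand: the high half of a * 2^k is simply a >> (64 - k) for 0 < k < 64. A minimal sketch of that identity follows; the names mulhi_pow2 and shift_matches_multiply are hypothetical, and the comparison against a widening multiply is only compiled when the compiler defines __SIZEOF_INT128__.

```cpp
#include <cassert>
#include <cstdint>

// High half of a * 2^k. Valid only for 0 < k < 64 (a shift by 64 would be undefined).
static inline uint64_t mulhi_pow2(uint64_t a, unsigned k) {
    return a >> (64 - k);
}

// Compares the shift against a widening multiply where a 128-bit type exists.
static inline bool shift_matches_multiply(uint64_t a, unsigned k) {
#ifdef __SIZEOF_INT128__
    const unsigned __int128 prod = (unsigned __int128)a * ((uint64_t)1 << k);
    return (uint64_t)(prod >> 64) == mulhi_pow2(a, k);
#else
    (void)a;
    (void)k;
    return true;  // no 128-bit type available to compare against
#endif
}

static inline void mulhi_pow2_self_check() {
    // Multiplying by 16 (k = 4) keeps the top 4 bits of a as the high half.
    assert(mulhi_pow2(0xF000000000000001ULL, 4) == 0xF);
    assert(shift_matches_multiply(0xF000000000000001ULL, 4));
    assert(shift_matches_multiply(UINT64_MAX, 63));
}
```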


g6ll5ycj4#

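This is a unit-tested version that I came up with tonight that provides the full 128-bit product. On inspection it seems simpler than most of the other solutions online (e.g. in the Botan library and in other answers here), because it takes advantage of the fact that the MIDDLE PART does not overflow, as explained in the code comments.
For context, I wrote it for this GitHub project: https://github.com/catid/fp61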

//------------------------------------------------------------------------------
// Portability Macros

// Compiler-specific force inline keyword
#ifdef _MSC_VER
# define FP61_FORCE_INLINE inline __forceinline
#else
# define FP61_FORCE_INLINE inline __attribute__((always_inline))
#endif

//------------------------------------------------------------------------------
// Portable 64x64->128 Multiply
// CAT_MUL128: r{hi,lo} = x * y

// Returns low part of product, and high part is set in r_hi
FP61_FORCE_INLINE uint64_t Emulate64x64to128(
    uint64_t& r_hi,
    const uint64_t x,
    const uint64_t y)
{
    const uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    const uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    const uint64_t p11 = x1 * y1, p01 = x0 * y1;
    const uint64_t p10 = x1 * y0, p00 = x0 * y0;
    /*
        This is implementing schoolbook multiplication:

                x1 x0
        X       y1 y0
        -------------
                   00  LOW PART
        -------------
                00
             10 10     MIDDLE PART
        +       01
        -------------
             01 
        + 11 11        HIGH PART
        -------------
    */

    // 64-bit product + two 32-bit values
    const uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;

    /*
        Proof that 64-bit products can accumulate two more 32-bit values
        without overflowing:

        Max 32-bit value is 2^32 - 1.
        PSum = (2^32-1) * (2^32-1) + (2^32-1) + (2^32-1)
             = 2^64 - 2^32 - 2^32 + 1 + 2^32 - 1 + 2^32 - 1
             = 2^64 - 1
        Therefore it cannot overflow regardless of input.
    */

    // 64-bit product + two 32-bit values
    r_hi = p11 + (middle >> 32) + (p01 >> 32);

    // Add LOW PART and lower half of MIDDLE PART
    return (middle << 32) | (uint32_t)p00;
}

#if defined(_MSC_VER) && defined(_WIN64)
// Visual Studio 64-bit

# include <intrin.h>
# pragma intrinsic(_umul128)
# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = _umul128(x, y, &(r_hi));

#elif defined(__SIZEOF_INT128__)
// Compiler supporting 128-bit values (GCC/Clang)

# define CAT_MUL128(r_hi, r_lo, x, y)                   \
    {                                                   \
        unsigned __int128 w = (unsigned __int128)x * y; \
        r_lo = (uint64_t)w;                             \
        r_hi = (uint64_t)(w >> 64);                     \
    }

#else
// Emulate 64x64->128-bit multiply with 64x64->64 operations

# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = Emulate64x64to128(r_hi, x, y);

#endif // End CAT_MUL128
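The answer describes the code as unit-tested but does not show the tests. A minimal sketch of how such a cross-check might look is given below. It is not the project's actual test code: emulate_mul128 is a compact copy of the emulation above, and next_input and cross_check_mul128 are hypothetical helpers. The pseudo-random comparison only runs where unsigned __int128 is available.

```cpp
#include <cassert>  // for assert() in caller-side checks
#include <cstdint>

// Compact copy of the emulation above: returns the low half, stores the high half in r_hi.
static inline uint64_t emulate_mul128(uint64_t& r_hi, uint64_t x, uint64_t y) {
    const uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    const uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    const uint64_t p11 = x1 * y1, p01 = x0 * y1;
    const uint64_t p10 = x1 * y0, p00 = x0 * y0;
    // A 64-bit product plus two 32-bit values cannot overflow 64 bits.
    const uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;
    r_hi = p11 + (middle >> 32) + (p01 >> 32);
    return (middle << 32) | (uint32_t)p00;
}

// Deterministic shift-xor generator for test inputs (hypothetical helper).
static inline uint64_t next_input(uint64_t& s) {
    s ^= s << 13;
    s ^= s >> 7;
    s ^= s << 17;
    return s;
}

// Returns true if the emulation matches the reference on one fixed edge case
// and, where a 128-bit type exists, on n pseudo-random input pairs.
static inline bool cross_check_mul128(int n) {
    uint64_t hi = 0;
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so hi = 2^64 - 2 and lo = 1.
    if (emulate_mul128(hi, UINT64_MAX, UINT64_MAX) != 1 ||
        hi != 0xFFFFFFFFFFFFFFFEULL) {
        return false;
    }
#ifdef __SIZEOF_INT128__
    uint64_t state = 0x9E3779B97F4A7C15ULL;  // arbitrary nonzero seed
    for (int i = 0; i < n; ++i) {
        const uint64_t x = next_input(state), y = next_input(state);
        const unsigned __int128 ref = (unsigned __int128)x * y;
        const uint64_t lo = emulate_mul128(hi, x, y);
        if (lo != (uint64_t)ref || hi != (uint64_t)(ref >> 64)) return false;
    }
#else
    (void)n;  // no 128-bit reference available on this compiler
#endif
    return true;
}
```

A caller could run assert(cross_check_mul128(1000)); as part of a test suite. The all-ones input is worth keeping as a fixed case because it drives the middle sum to the bound used in the overflow proof.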


o2g1uqev5#

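Long multiplication should give acceptable performance.
Split a*b into (hia+loa)*(hib+lob). This gives four 32-bit multiplies plus some shifts. Do them in 64 bits, handle the carries manually, and you get the high portion.
Note that an approximation of the high portion can be done with fewer multiplies: accurate to within 2^33 or so with 1 multiply, and to within 1 with 3 multiplies.
I do not think there is a portable alternative.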
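The two approximations are not spelled out in the answer, so the following is one possible reading, offered as a sketch under that assumption. The function names are hypothetical. The 1-multiply version keeps only the product of the upper halves; the 3-multiply version drops only the product of the lower halves and adds the two cross terms with an explicit wrap check, which keeps the result within 1 of the exact high half.

```cpp
#include <cassert>
#include <cstdint>

// One multiply: only a_hi * b_hi. Never exceeds the exact high half,
// and falls short of it by less than 2^33.
static inline uint64_t mulhi_approx_1mul(uint64_t a, uint64_t b) {
    return (a >> 32) * (b >> 32);
}

// Three multiplies: a_lo * b_lo is dropped. The cross terms are summed as a
// 65-bit value (64-bit sum plus a wrap flag), so the result is the exact
// high half or one less.
static inline uint64_t mulhi_approx_3mul(uint64_t a, uint64_t b) {
    const uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    const uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    const uint64_t m1 = a_hi * b_lo;
    const uint64_t mid = m1 + a_lo * b_hi;         // may wrap modulo 2^64
    const uint64_t wrapped = (mid < m1) ? 1 : 0;   // 1 if the sum wrapped
    return a_hi * b_hi + (mid >> 32) + (wrapped << 32);
}

static inline void mulhi_approx_self_check() {
    // Exact high half of (2^33 - 1)^2 is 3.
    assert(mulhi_approx_1mul(0x1FFFFFFFFULL, 0x1FFFFFFFFULL) == 1);
    assert(mulhi_approx_3mul(0x1FFFFFFFFULL, 0x1FFFFFFFFULL) == 2);
    // Exact high half of (2^64 - 1)^2 is 0xFFFFFFFFFFFFFFFE.
    assert(mulhi_approx_1mul(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFE00000001ULL);
    assert(mulhi_approx_3mul(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFFFFFFFFFDULL);
}
```

Summing the cross terms before shifting is what keeps the 3-multiply error at 1; shifting each cross term separately and then adding can be low by 2, as the input 0x1FFFFFFFF shows.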
