C++ CUDA Thrust iterators: how can I fill and copy device_vectors efficiently using iterators?

vktxenjb · published 2023-08-09 · category: Other

My project involves many basic operations such as `fill` and `copy`. However, I am new to CUDA programming, and my current implementation just uses for loops over `device_vector`s, which is far less efficient than using iterators.

My question is: how can I implement the following functions using iterators (e.g. counting/permutation_iterator, etc.)?

1. Batched `fill` / `sequence` / `transform` of values, starting from a specified index.
Toy example:

len1 = 3, len2 = 5;
vecA.size() = (len1 + 2) * len2;
vecB.size() = len2;
// {} are just to help show the difference

// fill vecA in batch (len1) from index 1 using each value of vecB, vecA` is original vector
vecB = [    1,               5,               2,               4,               2         ]
vecA`= [1, {1, 1, 1}, 1, 1, {1, 1, 1}, 1, 1, {1, 1, 1}, 1, 1, {1, 1, 1}, 1, 1, {1, 1, 1}, 1]
vecA = [1, {1, 1, 1}, 1, 1, {5, 5, 5}, 1, 1, {2, 2, 2}, 1, 1, {4, 4, 4}, 1, 1, {2, 2, 2}, 1]

// multiply values in vecA with 2 in batch (len1) from index 1, vecC` is original vector
vecC.size() = (len1 + 2) * len2;
vecC`= [1, {1, 1, 1}, 1, 1, {2, 2, 2}, 1, 1, {3, 3, 3}, 1, 1, {4, 4, 4}, 1, 1,  {5,  5,  5}, 1]
vecC = [1, {2, 2, 2}, 1, 1, {4, 4, 4}, 1, 1, {6, 6, 6}, 1, 1, {8, 8, 8}, 1, 1, {10, 10, 10}, 1]

// sequence vecD(len1 * len2) in batch (len1)
vecD = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

The code below uses a for loop, which I guess is far less efficient than an iterator-based version:

size_t len1 = 3, len2 = 5;
thrust::device_vector<int> vecA((len1 +2) * len2);
thrust::device_vector<int> vecC((len1 +2) * len2);

thrust::device_vector<int> vecB(len2);

size_t offset1 = 0;
size_t offset2 = 0;

for (size_t i = 0; i < len2; i++)
{
   offset1 = i * (len1 + 2) + 1; // +1 means from specified index 1 in each batch

   // note: vecB.begin()[i] copies one value device-to-host on every iteration
   thrust::fill_n(vecA.begin() + offset1, len1, vecB.begin()[i]);

   thrust::transform(vecC.begin() + offset1, vecC.begin() + offset1 + len1, vecC.begin() + offset1, scalar_mult_functor<int>(2));
}

// sequence
thrust::device_vector<int> vecD(len1 * len2);
for (size_t i = 0; i < len2; i++)
{
   offset1 = i * len1;
   offset2 = (i + 1) * len1;
   thrust::sequence(vecD.begin() + offset1, vecD.begin() + offset2);
}

  2. Copy sub-vectors in batches into another vector.
    Toy example:
len1 = 2, len2 = 4, len3 = 5;
// copy values of vecA(len1 * len3) in batch (len1) to vecB(len2 * len3)
vecA = [1, 2,       3, 4,       5, 6,       7, 8,       9, 10       ]
vecB = [1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 5, 6, 7, 8, 7, 8, 9, 10, 9, 10]


To achieve this, I simply use two for loops to copy the values, but this is obviously inefficient.

  3. Reduce-sum the vector values in batches of `len1` (summing the batches element-wise, not reducing down to a single value).
    Toy example:
len1 = 4, len2 = 3;
vecA.size() = len1 * len2;
vecB.size() = len1
vecA = [1, 2, 3, 4, 1, 1, 1, 1, 2, 2, 2, 2]
vecB = {1, 2, 3, 4} + {1, 1, 1, 1} + {2, 2, 2, 2} = [4, 5, 6, 7]


The operations above are more like what one would do to 2D vectors with STL vectors on the CPU. I went through some tutorials and tried to implement them with iterators, but got wrong results.
One more question: to perform the reduce-sum over vecB in batches (see the detailed example below), is it like adding another for loop to this operation (a double for loop)?
Which approach is better, and how should the iterator code be modified?
Modified from the previous toy example:

len1 = 4, len2 = 3; len3 = 2;
vecA.size() = len1 * len2 * len3;
vecB.size() = len1 * len3;
vecA = [1, 2, 3, 4, 1, 1, 1, 1, 2, 2, 2, 2,          3, 2, 1, 0, 1, 1, 1, 1, 2, 2, 2, 2]
vecB = [{1, 2, 3, 4} + {1, 1, 1, 1} + {2, 2, 2, 2},  {3, 2, 1, 0} + {1, 1, 1, 1} + {2, 2, 2, 2}] 
     = [4, 5, 6, 7,   6, 5, 4, 3]


Unlike the previous example, here the reduce-sum is performed twice for vecB, in len3 = 2 batches.

First update
Comparison between the simple for loop and Thrust iterators

I compared the performance gap between the simple for loop and the methods below (provided by @paleonix).
This comparison may help beginners see the difference, or choose between the approaches when facing the same requirements.
The test was run on a server with a Tesla V100 GPU (the goal is to explore the gap, so the exact machine does not matter much).
Notes:
1. For the data sizes I only set a rough baseline, since every size differs between tests. More thoroughly, I should test different sizes (i.e., different len1, len2, etc.), but there are too many combinations. I will do that later if I have time.
2. I did not run the tests multiple times to average the results.
3. For the comparison between the fancy iterators and for_each_n, larger data sizes may be needed to tell them apart.
Here is a rough comparison (benchmark table image not reproduced):
From this comparison we can see:
1. As expected, the simple for loop approach is slower than the Thrust iterators. In particular, its performance degrades significantly as the data size grows.
2. For the reduction, for_each_n is slower than the batched reduce_by_key.

Second update
Comparison of reductions with different iterators when len1 is large and len2 is small

Recall that @paleonix mentioned the results would differ in this case:
For the special case of very big len1 (enough parallelism for modern GPUs), which probably means a rather small len2, one can use yet another for_each-based, kernel-like solution that just sums the values sequentially with coalesced accesses.
I ran some experiments; a rough comparison is shown in the table (table image not reproduced):


The results confirm @paleonix's comment.

Third update
Comparison of iterator-based and for_each-based reduce-sums when len1 * len3 is large

In this case, the reduce-sum is performed over vecB in batches (i.e., len3). (Table image not reproduced.)



wmtdaxz3


All statements about performance in this answer assume that the data size is big enough to fully utilize a modern GPU with Thrust's default CUDA device backend/system (as opposed to OMP or TBB).

Batched Fill and Batched Transform

For the example of filling batches into a vector of 1s, one could use fancy iterators with `thrust::transform` alone, but that is probably a bad idea due to alignment concerns and code complexity. Instead one can use `thrust::transform_if`, where the "stencil" is given by a simple counting iterator and all the logic goes into the predicate functor. To get a batched fill, one still needs another, more complicated iterator to read from the input vector (vecB).

const auto batched_indices_it =
    thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [len1_padded]
        __host__ __device__ (int idx) {
            return (idx - 1) / len1_padded; // corresponds to i
        });
const auto batched_fill_it = 
    thrust::make_permutation_iterator(
        vecB.cbegin(),
        batched_indices_it);

const auto is_valid =
    [len1, len1_padded]
    __host__ __device__ (int idx) {
        return (idx != 0) && ((idx - 1) % len1_padded < len1);
    };

// copy_if doesn't work here as it "compactifies" the output
// so use transform_if with identity operation
thrust::transform_if(
    batched_fill_it, batched_fill_it + vecA.size(),
    thrust::make_counting_iterator(0),
    vecA.begin(),
    thrust::identity<int>{},
    is_valid);

thrust::transform_if(
    vecC.cbegin(), vecC.cend(),
    thrust::make_counting_iterator(0),
    vecC.begin(),
    scalar_mult_functor<int>{2},
    is_valid);

Alternatively, one can just use `thrust::for_each` without any complicated fancy iterators, which may be easier to read/understand; performance should be comparable/the same for these trivially parallel operations (it might even be faster than the two transforms due to the kernel fusion implemented in the sample code below):

const auto A_ptr = vecA.data();
const auto B_ptr = vecB.data();
const auto C_ptr = vecC.data();
const auto op = scalar_mult_functor<int>(2);

thrust::for_each_n(
    thrust::make_counting_iterator(0), vecA.size(),
    [len1, len1_padded, A_ptr, B_ptr, C_ptr, op]
    __host__ __device__ (int idx) {
        if ((idx != 0) && ((idx - 1) % len1_padded < len1)) {
            A_ptr[idx] = B_ptr[(idx - 1) / len1_padded];
            C_ptr[idx] = op(C_ptr[idx]);
        }
    });

Batched Sequence-/Iota-/Counting-Iterator

The batched sequence is relatively simple to implement as a fancy iterator, and ideally it should not even be copied to memory, but used directly in the next operation that needs it as input (kernel fusion):

const auto batched_sequence_it = 
    thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [len1] __host__ __device__ (const int idx) { return idx % len1; });

Batched Repeat-Copy

The batched repeat is slightly more complicated, but has the same basic form of fancy iterator as used for the batched fill:

const auto batched_transpose_idx_it =
    thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [len1, len2] __host__ __device__ (const int idx) {
            const int batch_id = idx / len2;
            const int batch_el = idx % len1;
            return batch_id * len1 + batch_el;
        });
const auto batched_A_transpose_it = 
    thrust::make_permutation_iterator(
        vecA.cbegin(),
        batched_transpose_idx_it);


It is likewise ready for lazy evaluation in an upcoming operation.

Batched Reduction

This is the hardest one, at least in terms of getting good performance. The following implementation using `thrust::reduce_by_key` is generally applicable, but probably not very fast, because the permutation iterator disturbs coalescing, which is very important to the performance of bandwidth-bound kernels.

const auto transpose_idx_it =
    thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [len1, len2] __host__ __device__ (int idx) { // "transpose"
            const int batch_id = idx / len2;
            const int batch_el = idx % len2;
            return batch_el * len1 + batch_id;
        });
const auto A_transpose_it =
    thrust::make_permutation_iterator(
        vecA.cbegin(),
        transpose_idx_it);

thrust::reduce_by_key(
    thrust::make_counting_iterator(0),
    thrust::make_counting_iterator(len1 * len2),
    A_transpose_it,
    thrust::make_discard_iterator(),
    vecB.begin(),
    [len2] __host__ __device__ (int idx, int idy) {
        return idx / len2 == idy / len2;
    });


The A_transpose_it iterator permutes the elements of vecA such that the elements that should be summed appear next to each other to the reduction algorithm consuming it. Accordingly, the predicate functor/lambda comparing the "keys" (counting iterator) makes it look as if that were the case.
I don't think a general, performant reduction for this problem is achievable with the algorithms currently in Thrust. You may have more luck with libraries that were actually written to accommodate multi-dimensional data, like MatX (C++ API) or cuTENSOR (C API).
For the special case of very big len1 (enough parallelism for modern GPUs), which probably means a rather small len2, one can use yet another for_each-based, kernel-like solution that just sums the values sequentially with coalesced accesses:

const auto A_ptr = vecA.data();
const auto B_ptr = vecB.data();

thrust::for_each_n(
    thrust::make_counting_iterator(0), len1,
    [len1, len2, A_ptr, B_ptr]
    __host__ __device__ (int idx) {
        int sum = 0;
        // "grid-stride loop"
        for (int i = idx; i < len1 * len2; i += len1) {
            sum += A_ptr[i];
        }
        B_ptr[idx] = sum;
    });

Update: Batched Batched Reduction

Given OP's benchmark results, which are even clearer than I expected (10,000 isn't particularly big in the GPU world), I would use the same coalesced for_each strategy as for the previous reduction. In this scenario len1 * len3 is assumed to be very big, but ideally len1 is also a multiple of 32. Even when that is not the case, it will probably still be much better than the segmented reduction (reduce_by_key) due to the latter's general complexity/overhead.

const auto A_ptr = vecA.data();
const auto B_ptr = vecB.data();

thrust::for_each_n(
    thrust::make_counting_iterator(0), len1 * len3,
    [len1, len2, A_ptr, B_ptr]
    __host__ __device__ (int idx) {
        const int outer_batch_id = idx / len1;
        const int batch_el = idx % len1;
        const int begin = outer_batch_id * len1 * len2 + batch_el;
        const int end = begin + len1 * len2;
        int sum = 0;
        for (int i = begin; i < end; i += len1) {
            sum += A_ptr[i];
        }
        B_ptr[idx] = sum;
    });


For completeness, I have also added a generalized version using `reduce_by_key` to the full code below.

Full Source Code

Compile with `nvcc --extended-lambda -std=c++17` (with CUDA 12, I needed `-rdc=true` for some reason).

#include <cassert>

#include <iostream>
#include <iterator>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

template <typename T>
class scalar_mult_functor {
    T scalar_;
    public:
    scalar_mult_functor(T scalar) : scalar_{scalar} {}
    __host__ __device__ T operator()(T in) const {
        return scalar_ * in;
    }
};

void part1() {
    const int len1 = 3, len2 = 5;
    const int len1_padded = len1 + 2;
    thrust::device_vector<int> vecA(len1_padded * len2, 1);
    thrust::device_vector<int> vecB(
        thrust::make_counting_iterator(2),
        thrust::make_counting_iterator(len2 + 2));

    thrust::device_vector<int> vecC(len1_padded * len2, 1);

    // not sure if compiler is able to merge "/" and "%" from
    // different functors
    const auto batched_indices_it =
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [len1_padded]
            __host__ __device__ (int idx) {
                return (idx - 1) / len1_padded; // corresponds to i
            });
    const auto batched_fill_it = 
        thrust::make_permutation_iterator(
            vecB.cbegin(),
            batched_indices_it);

    const auto is_valid =
        [len1, len1_padded]
        __host__ __device__ (int idx) {
            return (idx != 0) && ((idx - 1) % len1_padded < len1);
        };
    
    // copy_if doesn't work here as it "compactifies" the output
    // so use transform_if with identity operation
    thrust::transform_if(
        batched_fill_it, batched_fill_it + vecA.size(),
        thrust::make_counting_iterator(0),
        vecA.begin(),
        thrust::identity<int>{},
        is_valid);

    thrust::transform_if(
        vecC.cbegin(), vecC.cend(),
        thrust::make_counting_iterator(0),
        vecC.begin(),
        scalar_mult_functor<int>{2},
        is_valid);

#if 0

    // can't use device_vector in device code
    const auto A_ptr = vecA.data();
    const auto B_ptr = vecB.data();
    const auto C_ptr = vecC.data();
    const auto op = scalar_mult_functor<int>(2);
    // probably the easiest to read, no fancy iterators
    // fused both fill and transform
    thrust::for_each_n(
        thrust::make_counting_iterator(0), vecA.size(),
        [len1, len1_padded, A_ptr, B_ptr, C_ptr, op]
        __host__ __device__ (int idx) {
            if ((idx != 0) && ((idx - 1) % len1_padded < len1)) {
                A_ptr[idx] = B_ptr[(idx - 1) / len1_padded];
                C_ptr[idx] = op(C_ptr[idx]);
            }
        });

#endif

    // sequence
    thrust::device_vector<int> vecD(len1 * len2);

    auto batched_sequence_it = 
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [len1] __host__ __device__ (const int idx) { return idx % len1; });
    // I wouldn't actually do this copy in real code, as you could just use
    // batched_sequence_it in the following operations instead of writing
    // the values to memory (and reading them again later)
    thrust::copy_n(batched_sequence_it, vecD.size(), vecD.begin());

    // device2host transfer and printing
    thrust::host_vector<int> hA(vecA);
    thrust::host_vector<int> hC(vecC);
    thrust::host_vector<int> hD(vecD);
    auto out = std::ostream_iterator<int>(std::cout, ", ");
    thrust::copy(hA.cbegin(), hA.cend(), out); std::cout << '\n';
    thrust::copy(hC.cbegin(), hC.cend(), out); std::cout << '\n';
    thrust::copy(hD.cbegin(), hD.cend(), out); std::cout << '\n';
}

void part2() {
    const int len1 = 2, len2 = 4, len3 = 5;
    thrust::device_vector<int> vecA(
        thrust::make_counting_iterator(1),
        thrust::make_counting_iterator(1 + len1 * len3));

    const auto batched_transpose_idx_it =
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [len1, len2] __host__ __device__ (const int idx) {
                const int batch_id = idx / len2;
                const int batch_el = idx % len1;
                return batch_id * len1 + batch_el;
            });
    const auto batched_A_transpose_it = 
        thrust::make_permutation_iterator(
            vecA.cbegin(),
            batched_transpose_idx_it);
    // Again, if used only once, one might not want to write this to
    // memory at all, but if one does, one can avoid the unnecessary
    // initialization of vecB to 0, by passing the fancy iterator to
    // the constructor
    thrust::device_vector<int> vecB(
        batched_A_transpose_it,
        batched_A_transpose_it + len2 * len3);

    // device2host transfer and printing
    thrust::host_vector<int> hB(vecB);
    auto out = std::ostream_iterator<int>(std::cout, ", ");
    thrust::copy(hB.cbegin(), hB.cend(), out); std::cout << '\n';
}

void part3() {
    constexpr int A[] = {1, 2, 3, 4, 1, 1, 1, 1, 2, 2, 2, 2};
    thrust::host_vector<int> hA(A, A + sizeof(A) / sizeof(A[0]));
    const int len1 = 4, len2 = 3;
    assert(hA.size() == len1 * len2);

    thrust::device_vector<int> vecA(hA);
    thrust::device_vector<int> vecB(len1);

    // Due to this permutation iterator accesses to A are not aligned
    // which is bad for performance, but there is not much we can do
    // about that when len1 isn't big other than using a different
    // data layout from the start
    const auto transpose_idx_it =
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [len1, len2] __host__ __device__ (int idx) { // "transpose"
                const int batch_id = idx / len2;
                const int batch_el = idx % len2;
                return batch_el * len1 + batch_id;
            });
    const auto A_transpose_it =
        thrust::make_permutation_iterator(
            vecA.cbegin(),
            transpose_idx_it);

    thrust::reduce_by_key(
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(len1 * len2),
        A_transpose_it,
        thrust::make_discard_iterator(),
        vecB.begin(),
        [len2] __host__ __device__ (int idx, int idy) {
            return idx / len2 == idy / len2;
        });
        
#if 0

    const auto A_ptr = vecA.data();
    const auto B_ptr = vecB.data();
    // This sequential summation actually coalesces well
    // But only efficient for very big len1
    thrust::for_each_n(
        thrust::make_counting_iterator(0), len1,
        [len1, len2, A_ptr, B_ptr]
        __host__ __device__ (int idx) {
            int sum = 0;
            // "grid-stride loop"
            for (int i = idx; i < len1 * len2; i += len1) {
                sum += A_ptr[i];
            }
            B_ptr[idx] = sum;
        });

#endif

    // device2host transfer and printing
    thrust::host_vector<int> hB(vecB);
    auto out = std::ostream_iterator<int>(std::cout, ", ");
    thrust::copy(hB.cbegin(), hB.cend(), out); std::cout << '\n';
}

void part4() {
    constexpr int A[] = {1, 2, 3, 4, 1, 1, 1, 1, 2, 2, 2, 2,
                         3, 2, 1, 0, 1, 1, 1, 1, 2, 2, 2, 2};
    thrust::host_vector<int> hA(A, A + sizeof(A) / sizeof(A[0]));
    const int len1 = 4, len2 = 3, len3 = 2;
    assert(hA.size() == len1 * len2 * len3);

    thrust::device_vector<int> vecA(hA);
    thrust::device_vector<int> vecB(len1 * len3);

#if 0
    // Due to this permutation iterator accesses to A are not aligned
    // which is bad for performance, but there is not much we can do
    // about that when len1 isn't big other than using a different
    // data layout from the start
    const auto transpose_idx_it =
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [len1, len2] __host__ __device__ (int idx) { // "transpose"
                const int inner_len = len1 * len2;
                const int outer_batch_id = idx / inner_len;
                const int inner_idx = idx % inner_len;
                const int inner_batch_id = inner_idx / len2;
                const int batch_el = inner_idx % len2;
                return (outer_batch_id * len2 + batch_el) * len1 + inner_batch_id;
            });
    const auto A_transpose_it =
        thrust::make_permutation_iterator(
            vecA.cbegin(),
            transpose_idx_it);

    thrust::reduce_by_key(
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(len1 * len2 * len3),
        A_transpose_it,
        thrust::make_discard_iterator(),
        vecB.begin(),
        [len2] __host__ __device__ (int idx, int idy) {
            return idx / len2 == idy / len2;
        });

#endif

    const auto A_ptr = vecA.data();
    const auto B_ptr = vecB.data();
    // This sequential summation actually coalesces well, if len1 is a
    // multiple of 32 (warp size)
    // But only efficient for very big len1 * len3
    thrust::for_each_n(
        thrust::make_counting_iterator(0), len1 * len3,
        [len1, len2, A_ptr, B_ptr]
        __host__ __device__ (int idx) {
            const int outer_batch_id = idx / len1;
            const int batch_el = idx % len1;
            const int begin = outer_batch_id * len1 * len2 + batch_el;
            const int end = begin + len1 * len2;
            int sum = 0;
            for (int i = begin; i < end; i += len1) {
                sum += A_ptr[i];
            }
            B_ptr[idx] = sum;
        });

    // device2host transfer and printing
    thrust::host_vector<int> hB(vecB);
    auto out = std::ostream_iterator<int>(std::cout, ", ");
    thrust::copy(hB.cbegin(), hB.cend(), out); std::cout << '\n';
}

int main() {
    part1();
    part2();
    part3();
    part4();
}
