c++ CUDA AtomicCAS死锁 _大数据知识库

我有一个值为0的数组matrix，我想将其中的一些元素递增1。我想要递增的matrix的索引存储在数组indices中。我需要多次递增一些元素，因此我尝试为matrix中的每个元素使用一个互斥锁数组。但是当我启动我的代码时，程序挂起，我得到死锁。
我被这个问题困住了。我最终想做的是使用CUDA绘制一个重叠的连续笔触，因此我需要并行访问画布图像的相同像素。
下面是我的代码：

#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id
    int ind = indices[index]; // indices of target array A to increment    

    if (index < nof_indices) {
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0);
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of an array B
    int indices[6] = { 0,1,2,3,4,1 }; // array B; stores indices of an array A which to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with individual mutexes for each of array A elements

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all array A elements to 0    

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, &canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, &indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, &semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel << <1, 6 >> > (d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(&canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);

    return 0;
}

在这个例子中，结果数组matrix应该看起来像这样：{1, 2 ,1 ,1,1,0}，但我只有在运行维度为<< 6,1 >>的内核时才能得到它。
我用的是CUDA 12.1，Geforce RTX 3060
谢谢你
（只有当我将每个块的线程大小设置为1时才有效，但这不是我想要的）

在volta之前的执行模型中，这行代码是/可能是有问题的：

while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);

这个主题通常在this blog“独立线程调度”以及各种SO问题（如此问题和此问题）中讨论。
然而，正如在博客（和其他地方）中指出的那样，volta执行模型应该允许更灵活的范例。我相信这里的问题是由于nvcc的一个特性引起的：
为了帮助迁移，同时实现独立线程调度中详细说明的纠正操作，Volta开发人员可以使用编译器选项组合-arch=compute_60 -code=sm_70选择加入Pascal的线程调度。
如果你编译一个pre-volta架构，你就向编译器表明你想要pre-volta语义。这可能会对代码的执行行为产生影响，例如，在volta或更新的架构上执行，但编译的是volta之前的目标。
根据我的测试，如果我在CUDA 12.1上使用默认开关进行编译，则sm_75上的代码会死锁，默认情况下会选择sm_52目标（包括PTX）。但是，如果我为sm_75目标编译，代码将“正常”运行。
我认为如果你编译Volta或更新的目标，你的代码不会在RTX 3060上死锁。除非你有理由不这样做，一般的建议是编译时指定你希望运行的目标。

c++ CUDA AtomicCAS死锁

1条答案

相关问题

热门标签

最新问答