OK, here's the problem:
On a compute capability 1.1 CUDA GPU, I am trying to maintain a set of indices for each thread (the count can vary, but it is fixed at 4 here), and I keep them as a member of a struct variable.
My problem is that taking a reference to the struct gives wrong results when accessing the member array: I initialize the member array values to 0, and when I read an array value through the original struct variable I get the correct value (0), but when I read it through a reference to the struct variable I get garbage (-8193). This happens even if I use a class instead of a struct.
Why is tmp less than / not equal to 0?
C++ is not my primary language, so this may be a conceptual problem on my part, or it may be a quirk of working in CUDA.
struct DataIdx {
    int numFeats;
    int* featIdx;
};

extern __shared__ int sharedData[];

__device__ void mySubfn(struct DataIdx *myIdx);
__device__ void mySubfn2(struct DataIdx &myIdx);

__global__ void myFn() {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    DataIdx myIdx;  // instantiate the struct var in the context of the current thread
    myIdx.numFeats = 4;
    size_t idxArraySize = sizeof(int)*4;

    // get a pointer to my array for this thread. Parallel Nsight debugger shows
    // myIdx.featIdx address = 0x0000000000000000e0
    myIdx.featIdx = (int*)(&sharedData[tidx*idxArraySize]);

    myIdx.featIdx[0] = 0x0;              // set first value to 0
    int tmp = myIdx.featIdx[0];          // tmp is correctly eq to 0 in Nsight debugger -- As Expected!!
    tmp = 2*tmp; myIdx.featIdx[0] = tmp; // ensure compiler doesn't elide out tmp

    DataIdx *tmpIdx = &myIdx;    // create a pointer to my struct var
    tmp = tmpIdx->featIdx[0];    // expected 0, but tmp = -8193 in debugger !! why? debugger shows address of tmpIdx->featIdx = __devicea__ address=8
    tmpIdx->featIdx[0] = 0x0;
    tmp = tmpIdx->featIdx[0];    // tmp = -1; can't even read what we just set

    // forcing the same pointer value as myIdx.featIdx still gives a problem!
    // debugger shows address of tmpIdx->featIdx = __devicea__ address=8
    tmpIdx->featIdx = (int*)(&sharedData[tidx*idxArraySize]);
    tmp = tmpIdx->featIdx[0];    // tmp = -8193!! why != 0?

    DataIdx tmpIdxAlias = myIdx;
    tmp = tmpIdxAlias.featIdx[0]; // aliasing the original var gives correct results, tmp = 0

    myIdx.featIdx[0] = 0x0;
    mySubfn(&myIdx);  // this is a problem because it happens when passing the struct by reference to subfns
    mySubfn2(myIdx);
}

__device__ void mySubfn(struct DataIdx *myIdx) {
    int tmp = myIdx->featIdx[0];  // tmp == -8193!! should be 0
}

__device__ void mySubfn2(struct DataIdx &myIdx) {
    int tmp = myIdx.featIdx[0];   // tmp == -8193!! should be 0
}
1 Answer
I had to modify your code to get it to compile. On the lines that access featIdx through the struct pointer, the compiler is not able to determine that the pointer points to shared memory. Instead of a store to shared memory (R2G), it emits a store to the out-of-bounds global address 0x10. The Nsight CUDA memory checker catches the out-of-bounds store to global memory.
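To make the failure mode easier to see, here is a standalone distillation of the kernel above (it repeats the asker's declarations under different names so it compiles on its own; DataIdxRepro, sharedDataRepro and reproFn are just illustrative). Per the explanation above, on sm_1x the direct store is compiled as a shared-memory access, while the store through the struct pointer is the one that degenerates into a global access:

extern __shared__ int sharedDataRepro[];

struct DataIdxRepro {      // same layout as the asker's DataIdx
    int  numFeats;
    int* featIdx;          // a plain int* carries no address-space information
};

__global__ void reproFn() {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    // Direct indexing: the compiler can see this is the __shared__ array,
    // so it emits a shared-memory store.
    sharedDataRepro[tidx] = 0;

    // The same address hidden behind a struct member and a pointer to the
    // struct: on sm_1x the compiler can no longer prove the access is to
    // shared memory and falls back to a global store, which is what the
    // Nsight memory checker flags as out of bounds.
    DataIdxRepro idx;
    idx.featIdx = &sharedDataRepro[tidx];
    DataIdxRepro* p = &idx;
    p->featIdx[0] = 0;
}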
If you compile for compute_10,sm_10 (in fact anything <= 1.3), you should see a warning on every line where the compiler cannot determine that the access is to shared memory. If you add a cudaDeviceSynchronize after the launch, you should see the error code cudaErrorUnknown caused by the out-of-bounds memory access.
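For completeness, a minimal host-side sketch of that check (the launch configuration and the dynamic shared-memory size are made up for the example; myFn is the kernel from the question):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int threads = 64;
    const int blocks  = 1;
    // dynamic shared memory: 4 ints of index storage per thread
    // (adjust to match however the kernel actually indexes sharedData)
    size_t shmemBytes = threads * 4 * sizeof(int);

    myFn<<<blocks, threads, shmemBytes>>>();

    // The launch is asynchronous; the bad store only surfaces as an error
    // code once we synchronize and inspect the result.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    }
    return 0;
}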
__shared__ is a variable memory-space qualifier, not a type qualifier, so I don't know how you would tell the compiler that featIdx will always point to shared memory. For compute capability >= 2.0, the compiler should convert (int*)(&sharedData[tidx*idxArraySize]) into a generic pointer.
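One possible workaround on sm_1x, not taken from the answer above but following the same reasoning: keep an element offset into sharedData in the struct instead of a raw pointer, so every access is written against the __shared__ array itself and the compiler never loses the address-space information (DataIdxOffs, subfnOffs and myFnOffs below are hypothetical names):

extern __shared__ int sharedData[];

struct DataIdxOffs {    // offset-based variant of the asker's DataIdx
    int numFeats;
    int baseOffs;       // element offset of this thread's indices in sharedData
};

__device__ void subfnOffs(DataIdxOffs* idx) {
    // Accesses go through sharedData[], which is visibly __shared__,
    // even though the struct is passed by pointer.
    int tmp = sharedData[idx->baseOffs + 0];
    sharedData[idx->baseOffs + 0] = 2 * tmp;
}

__global__ void myFnOffs() {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    DataIdxOffs idx;
    idx.numFeats = 4;
    idx.baseOffs = tidx * 4;   // 4 ints of index storage per thread
    sharedData[idx.baseOffs] = 0;

    subfnOffs(&idx);
}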