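I am using the cublasSgemmStridedBatched API to perform a so-called "tensor contraction". I have a tensor A of shape 60000*20*9 and a tensor B of shape 9*32, both stored in row-major order. By definition, C = A * B should give a result tensor C of shape 60000*20*32. The code I wrote is as follows: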
int batch_count = 60000;
int M = 20;
int K = 9;
int N = 32;
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f;
float beta = 0.0f;
long long strideA = 20 * 9;  // elements between consecutive 20x9 slices of A
long long strideB = 0;       // B is shared by every batch
long long strideC = 20 * 32; // elements between consecutive 20x32 slices of C
// A(60000 * 20 * 9) * B(9 * 32) = C(60000 * 20 * 32)
cublasStatus_t ret = cublasSgemmStridedBatched(
    handle,
    CUBLAS_OP_N, // no transpose flag: cuBLAS already sees row-major B as column-major B^T
    CUBLAS_OP_N, // likewise, row-major A is seen as column-major A^T
    N,
    M,
    K,
    &alpha,
    B.data<float>(), // already on the GPU
    N, // lda: leading dimension of B^T (N x K)
    strideB,
    A.data<float>(), // already on the GPU
    K, // ldb: leading dimension of A^T (K x M)
    strideA,
    &beta,
    C.data<float>(), // already on the GPU
    N, // ldc: leading dimension of C^T (N x M)
    strideC,
    batch_count); // was batchCount, which is never declared above
cublasDestroy(handle);
if (ret != CUBLAS_STATUS_SUCCESS) {
    printf("cublasSgemmStridedBatched failed %d line (%d)\n", ret, __LINE__);
}
The code above does not work and keeps printing "cublasSgemmStridedBatched failed 7". According to the manual, status 7 stands for CUBLAS_STATUS_INVALID_VALUE. Any help or suggestions are appreciated!
1 Answer
Here is a minimal version that works and tests the result:
It reports a maximum relative error of 2.5e-7.