scipy 尝试通过cython提高一些矢量化numpy操作的性能

zbsbpyhn 于 2022-11-10 发布在其他

关注(0)|答案(1)|浏览(161)

我试图通过Cytonize来加速矢量化的numpy操作（或者至少看看我是否可以）。代码计算了给定两个距离矩阵（target_distances和map_distances从某个展平的坐标矢量计算得出）和一些关于距离类型的信息（它的范围可以从0到3，但在本例中我第一次尝试将其全部设为0）的某种压力。numpy版本是www.example.com

import scipy
import numpy as np
from scipy.spatial import distance

def decay(x, s=10):

    return scipy.special.expit(s*x)

def stress(z, target_distances, dim, distance_types, step,
           nrows, ncols):

    row_coords = np.reshape(z[:dim*nrows],(nrows,dim))
    col_coords = np.reshape(z[dim*nrows:dim*(nrows+ncols)],(ncols,dim))

    map_distances = distance.cdist(row_coords, col_coords).copy()
    error = target_distances - map_distances

    I0 = (distance_types==0) | (distance_types==3)
    I2 = distance_types == 2

    return np.sum(error[I0]**2) + np.sum((error[I2] + step)**2*decay(error[I2] + step))

在cythonized版本中，我的pyx文件如下所示（calculus.pyx）

import cython
cimport cython
from libc.stdlib cimport malloc, free

cdef extern from "ctools.h":
    double stress (double*, double**, int**, double, int, int, int)

@cython.boundscheck(False) 
@cython.wraparound(False)
@cython.nonecheck(False)     
def stress_cython(double[:] z, double[:,:] target_distances, int [:,:]
           distance_types, double step, int dim, int nrows, int ncols):

    cdef int i
    cdef int j
    cdef int N = (nrows+ncols)*dim

    z_C = <double*>malloc(sizeof(double)*N)
    target_distances_C = <double**>malloc(sizeof(double*)*nrows)
    distance_types_C = <int**>malloc(sizeof(int*)*nrows)

    for i in range(N):
        z_C[i] = z[i]

    for i in range(nrows):
        target_distances_C[i] = <double *>malloc(sizeof(double)*ncols)
        distance_types_C[i] = <int *>malloc(sizeof(int)*ncols)

        for j in range(ncols):
            target_distances_C[i][j] = target_distances[i,j]
            distance_types_C[i][j] = distance_types[i,j]

    stress_val = stress(z_C, target_distances_C, 
                        distance_types_C, step, nrows, ncols, dim)

    for i in range(nrows):
        free(target_distances_C[i])
        free(distance_types_C[i])

    free(target_distances_C)
    free(distance_types_C)

    return stress_val

外部ctools.c为


# include<stdio.h>

# include<math.h>

double decay(double x){

    return 1/(1+exp(-10*x));

    }

double**dist_pairs(double* z, int nrows, int ncols, int dims)
    {

    int i,j,d;
    double coord1, coord2;
    double**dist;

    dist = (double**)malloc(sizeof(double*)*nrows);
    for (i=0; i<nrows; i++){
        dist[i] = (double *)malloc(sizeof(double)*ncols);
        }

    for (i=0; i<nrows; i++){
       for (j=0; j<ncols; j++){

            dist[i][j] = 0;

            for (d=0; d<dims; d++){

                coord1 = z[d*(nrows+ncols) + i];
                coord2 = z[d*(nrows+ncols) + nrows + j];
                dist[i][j] += pow(coord1-coord2,2.0) ;
                }

            dist[i][j] = sqrt(dist[i][j]);
        }
    }

    return dist;

}

double stress(double* z, double**target_distances, int**distance_types, 
              double step, int nrows, int ncols, int dim){

    int i,j;
    double stress = 0.0;
    double err = 0.0;
    double**map_distances = dist_pairs(z, nrows, ncols, dim);

    for (i=0; i<nrows; i++){
        for (j=0; j<ncols; j++){

            if (distance_types[i][j]==0 || distance_types[i][j]==3){
                stress += pow(target_distances[i][j] - map_distances[i][j],2.0);
                }
            else if (distance_types[i][j]==2){
                err = target_distances[i][j] - map_distances[i][j] + step;
                stress += pow(step,2)*decay(step);

            }

        }

    }

    for (i=0; i<nrows; i++){
        free(map_distances[i]);
    }
    free(map_distances);

    return stress;

}

我做了下面的测试（test.py）

import numpy as np
import time 
from calculus import stress_cython
from pycalculus import stress

nrows = 3000
ncols = 2000
dim = 2
N = 100
is_discrete = 1.0

dt0 = 0
dt1 = 0
difs = []
for i in range(N):
    coordinates = np.random.rand(dim, nrows+ncols)
    coordinates_flat = coordinates.flatten()

    target_distances = np.random.rand(nrows, ncols)
    distance_types = np.zeros((nrows,ncols), dtype='i')

    t0 = time.time()
    stress1 = stress_cython(coordinates_flat, target_distances, distance_types, is_discrete, dim, nrows,
                   ncols)
    t1 = time.time()

    stress2 = stress(coordinates.T.flatten(), target_distances, dim, distance_types, is_discrete,
                     nrows,ncols)

    t2 = time.time()

    dt0 += t1-t0
    dt1 += t2-t1

    difs.append(stress1-stress2)

print(f'cython:{dt0:.2f} python:{dt1:.2f}')

而在我的机器中，计时结果非常相似（~ 4. 47 vs ~ 4. 62）。我不太明白为什么，是不是有一些并行计算在后台进行numpy和scipy？
当我检查我的核心使用情况时，似乎不是这样。我尝试通过

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

在运行www.example.com之前，在shell中test.py但没有产生任何影响。我在这里缺少的是矢量化操作期间的并行性，还是我只是在特灵优化已经优化的例程，以便在这些库中进行最佳扩展？
ps1：我用参数“-ffast-math”编译了cython代码
ps2：最后，我会让所有内核都处于不同的初始启动条件下，所以如果numpy自动并行化矢量化操作，它对我没有帮助。这就是我试图了解发生了什么的原因之一。

scipy

来源：https://stackoverflow.com/questions/73830642/trying-to-improve-performance-of-some-vectorized-numpy-operations-via-cython

1条答案

按热度按时间

k4aesqcs1#

编译器优化和指令集

首先，Cython通常使用-O2编译，这通常不是最佳的（除了最新版本的GCC外，矢量化功能未启用）。-O3可以给予更好的性能。此外，SSE指令集在x86-64处理器上默认使用，而现在几乎所有处理器都支持AVX（通常每个周期能够多计算2倍的项目）。（否则目标程序可能会崩溃）。你可以使用-march=native让编译器生成一个针对你的处理器优化的不可移植程序。Numpy可以在运行时检测到这一点，并在需要时使用AVX指令（尽管并非所有操作都是这样）。

代码问题

像pow(coord1-coord2, 2.0)这样的操作并不能保证被目标编译器优化。使用pow(coord1-coord2, 2)，甚至(coord1-coord2) * (coord1-coord2)肯定更好。
double**数据结构对于存储矩阵效率不高。最好分配一个大小为width * height * sizeof(double)的大型double*flatten数组。实际上，后者的好处是在内存中是连续的，阅读/写连续数据通常更快。此外，dist[i][j]会导致内存中的间接寻址效率低下，而dist[i*ncols+j]会导致更少/更快的指令。实际上，这在并行编码中更为重要，因为malloc往往不能与多个内核一起扩展，因此您应该避免多次并行调用它。
如果dims非常小（例如〈8），for (d=0; d<dims; d++)效率不高。展开循环以获得特定值可以显著提高性能。或者，可以将此循环与基于j的循环交换，以生成更快的代码（可能是矢量化的）。将其放在第一位可能会更快。
是否有一些并行计算在后台进行numpy和scipy？
目前，Numpy不应该并行化任何基于BLAS的操作（例如矩阵乘法）。至于Scipy，除了一些方法可以使用多个工作者（非常罕见，可调，并在文档中指定）之外，它是一样的。
你的代码可能是 * 内存绑定的 *（更具体地说是由于低效的缓存访问），所以访问内存中数据的方式非常重要。Numpy可以比你的代码进行更高效的访问。
请注意，如果代码受到内存带宽的限制，那么它将无法很好地扩展（因为在内存已经饱和的情况下，使用更多内核并不能使内存访问更快）。

进一步优化（更新）

Cython代码花了很多时间复制数组，因为它们在这里是连续的。你可以把它传递给C函数，因为数组是未修改的。复制数据通常很慢，尤其是大数组。
原来编译器（GCC）即使提供了-O3，实际上也无法对dist_pairs进行矢量化（而Numpy确实对代码进行了矢量化）。我试图修改代码以使矢量化变得简单，但即使在这种情况下，生成的代码也是低效的。一个解决方案是手动编写矢量化代码，即使这会使代码不再具有可移植性（仅在支持AVX的x86-64处理器上运行，即几乎所有使用时间不到10年的台式PC）。以下是一个（未经检查的）代码示例：
第一个
下面是生成的时间（使用-O3 -mavx -mfma：

cython:1.29 python:4.45

请注意，dist_pairs花费了60%的时间计算平方根。其余时间基本上是从内存加载/存储到内存。加快平方根计算的唯一方法是1.使用近似值或使用精度较低的数字（即float）或2.使用多线程。
通过合并dist_pairs和stress的循环，并逐行（以交错方式）操作，可以减少整体计算对内存的限制。计算出的行可以存储在快速CPU缓存中，而不是较慢的主RAM中。生成的代码应该是计算限制型的，并具有很好的伸缩性。
如果这还不够，您可以尝试在能够高效计算双精度运算的 * 服务器端 * GPU上运行此程序，因为平方根的计算速度应该更快，设备内存通常也更快。

赞(0）回复(0）举报 2022-11-10

我来回答

scipy 尝试通过cython提高一些矢量化numpy操作的性能

1条答案

编译器优化和指令集

代码问题

进一步优化（更新）

相关问题

热门标签

最新问答