Improving performance by vectorizing code and avoiding pandas apply

Posted by insrf1ej on 2023-02-02 in Other
import pandas as pd
import numpy as np

def impute_row_median(
    s: pd.Series,
    threshold: float
) -> pd.Series:
    '''For a vector of values, impute nans with median if %nan is below threshold'''
    nan_mask = s.isna()
    if nan_mask.any() and ((nan_mask.sum() / s.size) * 100) < threshold:
        s_median = s.median(skipna=True)
        s[nan_mask] = s_median
    return s  # dtype: float

df = pd.DataFrame(np.random.uniform(0, 1, size=(1000, 5)))
df = df.mask(df < 0.5)
df.apply(impute_row_median, axis=1, threshold=80)  # slow

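The apply call above is quite slow (I did not use timeit because I have nothing to compare it against). My usual approach is to avoid apply and use vectorized functions such as np.where instead, but I can't currently think of a way to do that here. Does anyone have any suggestions? Thanks!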

8hhllhi2


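To get the percentage of missing values per row, take the mean of the boolean mask. Then combine the 2D mask with the 1D row mask via NumPy broadcasting, and replace the missing values with DataFrame.mask: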

threshold = 80

mask = df.isna()
m = mask.mean(axis=1) * 100 < threshold 
df1 = df.mask(mask & m.to_numpy()[:, None], df.median(axis=1, skipna=True), axis=0)
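
To make the broadcasting step easier to follow, here is a minimal sketch on a hand-made 3x4 frame. The data, the names toy, row_ok and replace_here, and the threshold of 60 are all made up for illustration and are not part of the original answer. Row 0 is 25% NaN and should be imputed, row 1 is 75% NaN and should be left alone, and row 2 has no NaN at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the two masks.
toy = pd.DataFrame(
    [[1.0, np.nan, 3.0, 5.0],        # 25% NaN: below threshold, gets imputed
     [np.nan, np.nan, np.nan, 2.0],  # 75% NaN: above threshold, left as is
     [4.0, 6.0, 8.0, 10.0]]          # no NaN: unchanged
)
toy_threshold = 60  # illustrative value, not the 80 used in the question

nan_mask = toy.isna()                                 # 2D: True where a cell is NaN
row_ok = nan_mask.mean(axis=1) * 100 < toy_threshold  # 1D: True where the row's NaN share is below threshold

# [:, None] reshapes the 1D row mask to a column, so it broadcasts across all columns.
replace_here = nan_mask & row_ok.to_numpy()[:, None]

row_median = toy.median(axis=1, skipna=True)
# axis=0 aligns the Series of row medians with the row index.
toy_filled = toy.mask(replace_here, row_median, axis=0)
print(toy_filled)
```

Only the single NaN in row 0 ends up selected by replace_here, so it is the only cell that receives its row median.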

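The same can be done with numpy.where; both numpy.where variants appear in the timings below (In [131] and In [133]).

Performance comparison (10k rows, 50 columns):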

np.random.seed(2023)
df = pd.DataFrame(np.random.uniform(0, 1, size=(10000, 50)))
df = df.mask(df < 0.5)
In [130]: %timeit df.apply(impute_row_median, axis=1, threshold=80)
2.12 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [131]: %%timeit
     ...: a = df.to_numpy()
     ...: 
     ...: mask = np.isnan(a)
     ...: m = np.mean(mask, axis=1) * 100 < threshold
     ...: arr = np.where(mask & m[:, None], np.nanmedian(a, axis=1)[:, None], df)
     ...: 
     ...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
     ...: 
29.5 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [132]: %%timeit
     ...: threshold = 80
     ...: 
     ...: mask = df.isna()
     ...: m = mask.mean(axis=1) * 100 < threshold 
     ...: df1 = df.mask(mask & m.to_numpy()[:, None],df.median(axis=1, skipna=True),axis=0)
     ...: 
18.6 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [133]: %%timeit
     ...: mask = df.isna()
     ...: m = mask.mean(axis=1) * 100 < threshold
     ...: arr = np.where(mask & m.to_numpy()[:, None], 
     ...:                df.median(axis=1, skipna=True).to_numpy()[:, None], 
     ...:                df)
     ...: 
     ...: df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
     ...: 
     ...: 
10.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
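
The timings only matter if the fast version returns the same values as apply, so a quick sanity check may be worthwhile. The sketch below is one possible way to do that, under a few assumptions: impute_row_median_copy and impute_vectorized are hypothetical helper names, the row function works on a copy so the check does not depend on whether apply hands over views of the original data, and the frame is a small random one rather than the benchmark frame.

```python
import numpy as np
import pandas as pd

def impute_row_median_copy(s: pd.Series, threshold: float) -> pd.Series:
    """Same rule as impute_row_median from the question, applied to a copy (assumption)."""
    s = s.copy()
    nan_mask = s.isna()
    if nan_mask.any() and nan_mask.mean() * 100 < threshold:
        s[nan_mask] = s.median(skipna=True)
    return s

def impute_vectorized(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """The np.where variant from the answer, wrapped in a hypothetical helper."""
    mask = df.isna()
    m = mask.mean(axis=1) * 100 < threshold
    arr = np.where(mask & m.to_numpy()[:, None],
                   df.median(axis=1, skipna=True).to_numpy()[:, None],
                   df)
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

# Small random frame with roughly half of the cells set to NaN.
rng = np.random.default_rng(2023)
check_df = pd.DataFrame(rng.uniform(0, 1, size=(200, 6)))
check_df = check_df.mask(check_df < 0.5)

expected = check_df.apply(impute_row_median_copy, axis=1, threshold=80)
actual = impute_vectorized(check_df, threshold=80)

# equal_nan=True treats NaN in the same position as equal.
same = np.allclose(expected.to_numpy(), actual.to_numpy(), equal_nan=True)
print(same)
```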
