python-3.x: Pandas iterrows is too slow. Is there a faster alternative for the code below?

Posted by gorkyyrv on 2023-10-21 in Python

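I have a dataset of roughly 600,000 rows. Because it relies on pandas iterrows(), the code below takes a very long time to run. Is there an alternative suited to this specific code?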

    %%time
    import numpy as np

    df_inputed = df  # dataframe with many missing values
    for index, row in df_inputed.iterrows():
        sic = row['sic']
        year = row['year']
        quarter = row['quarter']
        for col in cols_to_check:  # columns except for date and pk columns
            value = row[col]
            if np.isnan(value):
                median = get_median(sic, year, quarter)  # assume operation is O(1) time
                if not np.isnan(median):
                    df_inputed.at[index, col] = median

Answer #1 (kqqjbcuj)

Use a combination of df.apply and pd.Series.fillna:

    import numpy as np

    cols_median = ['sic', 'year', 'quarter']

    def fill_with_median(x):
        # x is a single row; do the work only if filling is needed
        if x[cols_to_check].isna().any():
            med = x[cols_median].median()
            if not np.isnan(med):
                x[cols_to_check] = x[cols_to_check].fillna(med)
        return x

    df = df.apply(fill_with_median, axis=1)

Another approach is to use boolean masks to select only the rows that need filling:

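    # Rows with at least one missing value in cols_to_check
    m = df[cols_to_check].isna().any(axis=1)
    # Row-wise median of cols_median for those rows; drop rows whose median is NaN
    med_vals = df.loc[m, cols_median].median(axis=1).dropna()
    # Fill only the missing cells, leaving existing values in those rows untouched
    df.loc[med_vals.index, cols_to_check] = df.loc[med_vals.index, cols_to_check].apply(lambda s: s.fillna(med_vals))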
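
If get_median(sic, year, quarter) in the question stands for the median of each column within its (sic, year, quarter) group, the whole loop can probably be replaced by a groupby/transform. That reading is an assumption, since the question does not show get_median. The sketch below uses made-up toy data; the column names revenue and assets and the names group_keys and df_imputed are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real ~600k-row frame (hypothetical values).
df = pd.DataFrame({
    "sic":     [100, 100, 100, 200, 200],
    "year":    [2020, 2020, 2020, 2021, 2021],
    "quarter": [1, 1, 1, 2, 2],
    "revenue": [10.0, np.nan, 30.0, np.nan, np.nan],
    "assets":  [1.0, 2.0, np.nan, 5.0, np.nan],
})
cols_to_check = ["revenue", "assets"]    # columns to impute
group_keys = ["sic", "year", "quarter"]  # hypothetical name for the grouping columns

# Assumption: get_median(sic, year, quarter) is the per-column median
# within that group. transform() returns a frame aligned to df's index.
group_medians = df.groupby(group_keys)[cols_to_check].transform("median")

# fillna with an aligned DataFrame fills only the missing cells; cells whose
# group median is itself NaN (an all-missing group) stay NaN.
df_imputed = df.copy()
df_imputed[cols_to_check] = df[cols_to_check].fillna(group_medians)
print(df_imputed)
```

This avoids Python-level iteration over rows entirely, which is usually where the time goes with iterrows() or apply(axis=1). If get_median does something else, for example looking up a precomputed table, merging that table onto df by the three key columns and then calling fillna would be the analogous approach.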