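I have a dataset of roughly 600,000 rows. Because it relies on pandas iterrows(), the code below takes a very long time to run. Is there a faster alternative that fits the specific code shown below?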
%%time
import numpy as np

df_inputed = df.copy()  # df: DataFrame with many missing values
for index, row in df.iterrows():
    sic = row['sic']
    year = row['year']
    quarter = row['quarter']
    for col in cols_to_check:  # all columns except the date and pk columns
        value = row[col]
        if np.isnan(value):
            median = get_median(sic, year, quarter)  # assume this operation is O(1) time
            if not np.isnan(median):
                df_inputed.at[index, col] = median
1 Answer
Use a combination of df.apply and pd.Series.fillna. Another approach is to use boolean masks to select only the slices that need filling.
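The answer's original code did not survive on this page, so the following is only a minimal sketch of what the two approaches might look like, not the answerer's actual code. It assumes `get_median` is a cheap lookup keyed by `(sic, year, quarter)`; the `_MEDIANS` dictionary, the sample data, and the names `row_medians`, `filled_apply` and `filled_mask` are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's get_median(): an O(1) dict lookup.
# The real function is not shown in the question.
_MEDIANS = {(100, 2020, 1): 5.0, (200, 2020, 2): 7.0}

def get_median(sic, year, quarter):
    return _MEDIANS.get((sic, year, quarter), np.nan)

# Small illustrative frame; the real one has about 600,000 rows.
df = pd.DataFrame({
    "sic": [100, 100, 200, 300],
    "year": [2020, 2020, 2020, 2020],
    "quarter": [1, 1, 2, 3],
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [np.nan, 2.0, 3.0, np.nan],
})
cols_to_check = ["a", "b"]

# Look up one median per row, once, instead of once per missing cell.
row_medians = pd.Series(
    [get_median(s, y, q) for s, y, q in zip(df["sic"], df["year"], df["quarter"])],
    index=df.index,
)

# Approach 1: df.apply + Series.fillna.
# fillna aligns row_medians on the index; rows whose median is NaN stay NaN.
filled_apply = df.copy()
filled_apply[cols_to_check] = filled_apply[cols_to_check].apply(
    lambda col: col.fillna(row_medians)
)

# Approach 2: boolean masks that select only the cells needing a fill.
filled_mask = df.copy()
for col in cols_to_check:
    mask = filled_mask[col].isna() & row_medians.notna()
    filled_mask.loc[mask, col] = row_medians[mask]
```

Both variants leave a cell missing when no median is available, mirroring the `if not np.isnan(median)` check in the question. If the medians themselves come from the data, a `groupby(['sic', 'year', 'quarter'])` with `transform('median')` could replace the per-row lookup, but that depends on how `get_median` is actually defined.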