pandas: mapping quantiles from one column to another after groupby

Posted by egdjgwm8 on 2023-03-11 in Other

I need to map one column onto another based on quantiles. Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4],
                   'B': [9, 6, 7, 1, 5, 4, 0, 9, 5, 4, 5, 6, 1, 3, 1, 8, 5, 6, 10, 2, 3],
                   'C': [3.0, np.nan, 4.0, 2.0, 6.0, 4.0, 5.0, np.nan, 1.0, np.nan, 2.0, 3.0, 9.0, 8.0, 5.0, np.nan, np.nan, 0.0, np.nan, 1.0, 2.0]})

df is:

    A   B    C
0   1   9  3.0
1   1   6  NaN
2   1   7  4.0
3   1   1  2.0
4   1   5  6.0
5   2   4  4.0
6   2   0  5.0
7   2   9  NaN
8   3   5  1.0
9   3   4  NaN
10  3   5  2.0
11  3   6  3.0
12  3   1  9.0
13  3   3  8.0
14  4   1  5.0
15  4   8  NaN
16  4   5  NaN
17  4   6  0.0
18  4  10  NaN
19  4   2  1.0
20  4   3  2.0

Now I add a column holding the percentile rank of B within each group of A:
df['B_q']=df.groupby('A')['B'].rank(pct=True)
Now df is:

    A   B    C       B_q
0   1   9  3.0  1.000000
1   1   6  NaN  0.600000
2   1   7  4.0  0.800000
3   1   1  2.0  0.200000
4   1   5  6.0  0.400000
5   2   4  4.0  0.666667
6   2   0  5.0  0.333333
7   2   9  NaN  1.000000
8   3   5  1.0  0.750000
9   3   4  NaN  0.500000
10  3   5  2.0  0.750000
11  3   6  3.0  1.000000
12  3   1  9.0  0.166667
13  3   3  8.0  0.333333
14  4   1  5.0  0.142857
15  4   8  NaN  0.857143
16  4   5  NaN  0.571429
17  4   6  0.0  0.714286
18  4  10  NaN  1.000000
19  4   2  1.0  0.285714
20  4   3  2.0  0.428571
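
For readers unfamiliar with rank(pct=True): with the default method='average', tied values share the mean of their ranks, and the rank is then divided by the number of non-null values in the group. A minimal sketch on group A = 3 from the table above (the variable names are only for illustration):

```python
import pandas as pd

# B values of group A == 3 from the example; note the tie (two 5s).
b = pd.Series([5, 4, 5, 6, 1, 3])

# Default method='average': tied values share the mean of their ranks.
avg_rank = b.rank()

# Dividing by the number of non-null values gives the percentile rank.
manual = avg_rank / b.count()

print(manual.round(6).tolist())
print(b.rank(pct=True).round(6).tolist())
```

Both lines should print the same values, which are the B_q entries shown for rows 8 to 13.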

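What I want to do is fill the NaNs in C with quantile estimates. Take group A = 1: it contains one NaN in column C. To fill it, I look up the corresponding B_q, which is 0.6, and take the value of column C at quantile 0.6, i.e. pd.Series([2,3,4,6]).quantile(0.6) = 3.8.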
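
For reference, the 3.8 comes from the linear interpolation that Series.quantile uses by default. A minimal sketch of that arithmetic, where linear_quantile is a hypothetical helper written only for illustration and not part of pandas:

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Hypothetical helper: quantile of `values` using linear interpolation."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)          # fractional position in the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)    # clamp so q == 1 stays in range
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

c_known = [3.0, 4.0, 2.0, 6.0]       # non-NaN C values of group A == 1

print(linear_quantile(c_known, 0.6))
print(pd.Series(c_known).quantile(0.6))
```

Here the position is 0.6 * 3 = 1.8, so the result lies 80% of the way from the second sorted value (3) to the third (4).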
I don't know how to implement this idea. I tried something like this:

features_daily_con_all_a['NPTTM_quantile']=features_daily_con_all_a.groupby(['TradingDay','Industry'])['NPParentCompanyOwnersTTM'].rank(pct=True)
def quantile_get(gp):
    '''gp has columns NPTTM_quantile, Con_Earning, Con_at_quantile'''
    gp['Con_at_quantile']=gp['NPTTM_quantile'].apply(lambda q:(gp['Con_Earning'].dropna().quantile(q) if (gp['Con_Earning'].dropna().shape[0]!=0 and 0<=q and q<=1) else np.nan))
    return gp
features_daily_con_all_a['Con_at_quantile']=np.nan
features_daily_con_all_a['Con_at_quantile']=features_daily_con_all_a.groupby(['TradingDay','Industry']).apply(quantile_get)

But this approach is very slow. I'd like to know whether there is another solution. Thanks!

gfttwv5a #1

One option is to use one groupby to collect the list of quantiles needed in each group, and a second one to compute those quantiles:

m = df['C'].isna()
qs = df[m].groupby('A')['B_q'].agg(list)
vals = df[~m].groupby('A')['C'].apply(lambda g: g.quantile(qs[g.name]))

df['C'] = df['C'].fillna(df[['A', 'B_q']].merge(vals, left_on=['A', 'B_q'],
                                                right_index=True, how='left')
                         ['C'])

Output:

    A   B         C       B_q
0   1   9  3.000000  1.000000
1   1   6  3.800000  0.600000
2   1   7  4.000000  0.800000
3   1   1  2.000000  0.200000
4   1   5  6.000000  0.400000
5   2   4  4.000000  0.666667
6   2   0  5.000000  0.333333
7   2   9  5.000000  1.000000
8   3   5  1.000000  0.750000
9   3   4  3.000000  0.500000
10  3   5  2.000000  0.750000
11  3   6  3.000000  1.000000
12  3   1  9.000000  0.166667
13  3   3  8.000000  0.333333
14  4   1  5.000000  0.142857
15  4   8  3.714286  0.857143
16  4   5  1.714286  0.571429
17  4   6  0.000000  0.714286
18  4  10  5.000000  1.000000
19  4   2  1.000000  0.285714
20  4   3  2.000000  0.428571
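
One caveat: qs only has an entry for groups that contain at least one NaN, so qs[g.name] would raise a KeyError for a fully observed group. A possible guard, sketched here on a small made-up frame (the data and the names known and lookup are assumptions for illustration), is to restrict the second groupby to groups that actually need filling:

```python
import numpy as np
import pandas as pd

# Made-up data: group 1 has a gap in C, group 2 is fully observed.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [3, 1, 2, 5, 4],
                   'C': [10.0, np.nan, 30.0, 7.0, 8.0]})
df['B_q'] = df.groupby('A')['B'].rank(pct=True)

m = df['C'].isna()
qs = df[m].groupby('A')['B_q'].agg(list)

# Keep only observed rows from groups that have something to fill,
# so that qs[g.name] always exists inside the lambda.
known = df[~m & df['A'].isin(qs.index)]
vals = known.groupby('A')['C'].apply(lambda g: g.quantile(qs[g.name]))

# vals is indexed by (A, quantile); look up each row's (A, B_q) pair.
lookup = df[['A', 'B_q']].merge(vals, left_on=['A', 'B_q'],
                                right_index=True, how='left')['C']
df['C'] = df['C'].fillna(lookup)
print(df)
```

Groups whose C values are all missing still end up as NaN, because there is nothing to interpolate from.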

mxg2im7a #2

You can take the following approach:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4],
    'B': [9, 6, 7, 1, 5, 4, 0, 9, 5, 4, 5, 6, 1, 3, 1, 8, 5, 6, 10, 2, 3],
    'C': [9, np.nan, 7, 1, 5, 4, 0, np.nan, 5, np.nan, 5, 6, 1, 3, 1, np.nan, np.nan, 6, np.nan, 2, 3],
})

# Percentile rank of B within each group of A
df['B_q'] = df.groupby('A')['B'].rank(pct=True)

def quantile_get(gp):
    # Snapshot the observed C values before filling, so that values
    # filled in earlier iterations do not leak into later quantiles.
    known = gp['C'].dropna()
    for i, c in gp.iterrows():
        if pd.isna(c['C']) and not pd.isna(c['B']):
            gp.at[i, 'C'] = known.quantile(c['B_q']) if not known.empty else np.nan
    return gp

df = df.groupby('A').apply(quantile_get).reset_index(drop=True)
print(df)

This gives:

    A   B         C       B_q
0   1   9  9.000000  1.000000
1   1   6  6.600000  0.600000
2   1   7  7.000000  0.800000
3   1   1  1.000000  0.200000
4   1   5  5.000000  0.400000
5   2   4  4.000000  0.666667
6   2   0  0.000000  0.333333
7   2   9  4.000000  1.000000
8   3   5  5.000000  0.750000
9   3   4  5.000000  0.500000
10  3   5  5.000000  0.750000
11  3   6  6.000000  1.000000
12  3   1  1.000000  0.166667
13  3   3  3.000000  0.333333
14  4   1  1.000000  0.142857
15  4   8  4.714286  0.857143
16  4   5  2.714286  0.571429
17  4   6  6.000000  0.714286
18  4  10  6.000000  1.000000
19  4   2  2.000000  0.285714
20  4   3  3.000000  0.428571

9udxz4iz #3

You can use a boolean mask to select B_q and then Series.quantile to compute the quantile estimate:

b_q = gp['B_q'][gp['C'].isna()].values[0]
quantile_value = gp['C'][gp['C'].notna()].quantile(q=b_q)

You can then use fillna to replace the NaNs in column C with quantile_value:

gp['C'] = gp['C'].fillna(quantile_value)

The whole function looks like this:

def quantile_get(gp):
    b_q = gp['B_q'][gp['C'].isna()].values[0]
    quantile_value = gp['C'][gp['C'].notna()].quantile(q=b_q)
    gp['C'] = gp['C'].fillna(quantile_value)
    return gp

out = df.groupby('A').apply(quantile_get).reset_index(drop=True)

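Output (using the data frame from the answer by mxg2im7a). Because only the first missing row's B_q is used, every NaN in a group receives the same value, so rows 15, 16 and 18 are all filled with 4.714286; a group without any NaN would raise an IndexError at .values[0]:

    A   B         C       B_q
0   1   9  9.000000  1.000000
1   1   6  6.600000  0.600000
2   1   7  7.000000  0.800000
3   1   1  1.000000  0.200000
4   1   5  5.000000  0.400000
5   2   4  4.000000  0.666667
6   2   0  0.000000  0.333333
7   2   9  4.000000  1.000000
8   3   5  5.000000  0.750000
9   3   4  5.000000  0.500000
10  3   5  5.000000  0.750000
11  3   6  6.000000  1.000000
12  3   1  1.000000  0.166667
13  3   3  3.000000  0.333333
14  4   1  1.000000  0.142857
15  4   8  4.714286  0.857143
16  4   5  4.714286  0.571429
17  4   6  6.000000  0.714286
18  4  10  4.714286  1.000000
19  4   2  2.000000  0.285714
20  4   3  3.000000  0.428571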
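
Since a group can contain several missing rows, each with its own B_q, a variant that looks up one quantile per missing row may be closer to what the question asks for. The following is only a sketch under that assumption: fill_c_by_quantile is a hypothetical helper name, and it relies on np.quantile, whose default linear interpolation matches Series.quantile. It also skips groups that have nothing to fill or nothing to interpolate from.

```python
import numpy as np
import pandas as pd

def fill_c_by_quantile(df, key='A', target='C', q_col='B_q'):
    """Hypothetical helper: fill NaNs in `target`, one quantile per missing row."""
    filled = df[target].copy()
    for _, g in df.groupby(key):
        miss = g[target].isna().to_numpy()
        known = g.loc[~miss, target].to_numpy()
        if miss.any() and known.size:
            # np.quantile accepts an array of q values at once.
            filled.loc[g.index[miss]] = np.quantile(known, g.loc[miss, q_col].to_numpy())
    return filled

# Group A == 4 from the data used in this answer.
df = pd.DataFrame({'A': [4] * 7,
                   'B': [1, 8, 5, 6, 10, 2, 3],
                   'C': [1, np.nan, np.nan, 6, np.nan, 2, 3]})
df['B_q'] = df.groupby('A')['B'].rank(pct=True)
df['C'] = fill_c_by_quantile(df)
print(df)
```

Looping over the groups directly, rather than using groupby.apply, keeps the result aligned with the original index without any reset_index step.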
