Pandas条件回填Pandas

mbyulnm0  于 2022-12-09  发布在  其他
关注(0)|答案(3)|浏览(140)

我的 Dataframe 如下所示:

Name    Date    Condition   Status
A   17-12-2021      
A   18-12-2022      
A   19-12-2023      
A   20-12-2023  0   Attack
A   21-12-2023      
A   22-12-2024      
B   17-12-2021      
B   18-12-2022      
B   19-12-2023      
B   20-12-2023  2   Sprain
B   21-12-2023      
B   22-12-2024      
C   18-12-2022      
C   19-12-2023      
C   20-12-2023  1   Nausea
C   21-12-2023      
C   22-12-2024

建模假设是状态在实际显示效果之前开始,因此根据用于Map的条件回填什么:dict_map = {0:2,1:1,2:2,3:2,4:2}...即如果条件为0,则需要按2回填,而如果条件为1,则需要按1回填。
理想的输出如下。但是,当我做bfill时,它会一直填充。无论如何,要条件bfill()?:

Name    Date    Condition   Status
A   17-12-2021      
A   18-12-2022      Attack
A   19-12-2023      Attack
A   20-12-2023  0   Attack
A   21-12-2023      
A   22-12-2024      
B   17-12-2021      
B   18-12-2022      Sprain
B   19-12-2023      Sprain
B   20-12-2023  2   Sprain
B   21-12-2023      
B   22-12-2024      
C   18-12-2022      
C   19-12-2023      Nausea
C   20-12-2023  1   Nausea
C   21-12-2023      
C   22-12-2024
kknvjkwl

kknvjkwl1#

您可以使用反转的groupby.cumcount计算掩码,并在将字典中的值Map到map之后与bfill“条件”进行比较。将此掩码与where一起使用,可以仅将回填的状态保持在所需的限制范围内:

dict_map = {0:2, 1:1, 2:2, 3:2, 4:2}

mask = (df[::-1].groupby(df.loc[::-1, 'Condition'].notna().cumsum())
        .cumcount()
        .le(df['Condition'].map(dict_map).bfill())
       )

df['Status'] = df['Status'].bfill().where(mask)

输出量:

Name        Date  Condition  Status
0     A  17-12-2021        NaN     NaN
1     A  18-12-2022        NaN  Attack
2     A  19-12-2023        NaN  Attack
3     A  20-12-2023        0.0  Attack
4     A  21-12-2023        NaN     NaN
5     A  22-12-2024        NaN     NaN
6     B  17-12-2021        NaN     NaN
7     B  18-12-2022        NaN  Sprain
8     B  19-12-2023        NaN  Sprain
9     B  20-12-2023        2.0  Sprain
10    B  21-12-2023        NaN     NaN
11    B  22-12-2024        NaN     NaN
12    C  18-12-2022        NaN     NaN
13    C  19-12-2023        NaN  Nausea
14    C  20-12-2023        1.0  Nausea
15    C  21-12-2023        NaN     NaN
16    C  22-12-2024        NaN     NaN

中间体:

Name        Date  Condition  Status  cumsum/group  cumcount  condition_bfill   mask status_bfill
0     A  17-12-2021        NaN     NaN             3         3              2.0  False          NaN
1     A  18-12-2022        NaN    None             3         2              2.0   True       Attack
2     A  19-12-2023        NaN    None             3         1              2.0   True       Attack
3     A  20-12-2023        0.0  Attack             3         0              2.0   True       Attack
4     A  21-12-2023        NaN    None             2         5              2.0  False          NaN
5     A  22-12-2024        NaN    None             2         4              2.0  False          NaN
6     B  17-12-2021        NaN    None             2         3              2.0  False          NaN
7     B  18-12-2022        NaN    None             2         2              2.0   True       Sprain
8     B  19-12-2023        NaN    None             2         1              2.0   True       Sprain
9     B  20-12-2023        2.0  Sprain             2         0              2.0   True       Sprain
10    B  21-12-2023        NaN    None             1         4              1.0  False          NaN
11    B  22-12-2024        NaN    None             1         3              1.0  False          NaN
12    C  18-12-2022        NaN    None             1         2              1.0  False          NaN
13    C  19-12-2023        NaN    None             1         1              1.0   True       Nausea
14    C  20-12-2023        1.0  Nausea             1         0              1.0   True       Nausea
15    C  21-12-2023        NaN    None             0         1              NaN  False          NaN
16    C  22-12-2024        NaN    None             0         0              NaN  False          NaN
h5qlskok

h5qlskok2#

Series.map创建helper列new,其中DataFrame.iloc用于交换顺序,然后通过将非缺失值与累积和进行比较并按GroupBy.cumcount将计数器与向前填充缺失值进行比较(因为交换了顺序),按helper组创建掩码。最后向后填充缺失值并在Series.where中设置NaN s:

df1 = df.assign(new=df['Condition'].map(dict_map)).iloc[::-1]

m = df1.groupby(df1['new'].notna().cumsum()).cumcount().le(df1['new'].ffill()).iloc[::-1]
    
df['Status'] = df['Status'].bfill().where(m)
print (df)
   Name        Date  Condition  Status
0     A  17-12-2021        NaN     NaN
1     A  18-12-2022        NaN  Attack
2     A  19-12-2023        NaN  Attack
3     A  20-12-2023        0.0  Attack
4     A  21-12-2023        NaN     NaN
5     A  22-12-2024        NaN     NaN
6     B  17-12-2021        NaN     NaN
7     B  18-12-2022        NaN  Sprain
8     B  19-12-2023        NaN  Sprain
9     B  20-12-2023        2.0  Sprain
10    B  21-12-2023        NaN     NaN
11    B  22-12-2024        NaN     NaN
12    C  18-12-2022        NaN     NaN
13    C  19-12-2023        NaN  Nausea
14    C  20-12-2023        1.0  Nausea
15    C  21-12-2023        NaN     NaN
16    C  22-12-2024        NaN     NaN
mum43rcc

mum43rcc3#

另一种可能的解决方案:

dict_map = {0:2, 1:1, 2:2, 3:2, 4:2}

df.Status = df.groupby('Name', sort=False).apply(lambda x: x.Status.bfill(
    limit=dict_map[x['Condition'].max()])).reset_index(drop=True)

输出量:

Name        Date  Condition  Status
0     A  17-12-2021       <NA>     NaN
1     A  18-12-2022       <NA>  Attack
2     A  19-12-2023       <NA>  Attack
3     A  20-12-2023          0  Attack
4     A  21-12-2023       <NA>     NaN
5     A  22-12-2024       <NA>     NaN
6     B  17-12-2021       <NA>     NaN
7     B  18-12-2022       <NA>  Sprain
8     B  19-12-2023       <NA>  Sprain
9     B  20-12-2023          2  Sprain
10    B  21-12-2023       <NA>     NaN
11    B  22-12-2024       <NA>     NaN
12    C  18-12-2022       <NA>     NaN
13    C  19-12-2023       <NA>  Nausea
14    C  20-12-2023          1  Nausea
15    C  21-12-2023       <NA>     NaN
16    C  22-12-2024       <NA>     NaN

相关问题