Pandas groupby only within the same ID and when a column value is False

Posted by lx0bsm1f on 2023-04-04 in Other

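I have the following problem that I am currently trying to solve. I have a dataframe with a "serial_number" column that acts as the ID, an "update" column that is either True or False, and several numerical columns. I need to aggregate the numerical columns as follows: for each run of rows where "update" is False, sum or average their values into the next row where "update" is True (the "update" = True row itself is included in the aggregation).
To give you some extra context: these entries are used to train a machine learning model, but I have no target variable for the rows where "update" is False. I therefore need to sum or average their values into the next "update" = True row.
Thanks in advance!
For example, this would be the input table: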
| serial_number | model | numerical_mean | numerical_1_sum | numerical_2_sum | update |
| --- | --- | --- | --- | --- | --- |
| a | 2023-01-01 | 5 | 10 | 20 | False |
| a | 2023-01-02 | 10 | 15 | 10 | False |
| a | 2023-01-03 | 15 | 15 | 10 | True |
| b | 2023-01-01 | 10 | 15 | 10 | False |
| b | 2023-01-02 | 15 | 15 | 10 | True |
| b | 2023-01-03 | 15 | 15 | 10 | False |
| b | 2023-01-04 | 15 | 15 | 10 | True |
| b | 2023-01-05 | 15 | 15 | 10 | False |
| c | 2023-01-04 | 15 | 15 | 10 | True |
The resulting output should look like this:
| serial_number | date | numerical_mean | numerical_1_sum | numerical_2_sum | update |
| --- | --- | --- | --- | --- | --- |
| a | 2023-01-03 | 10 | 40 | 40 | True |
| b | 2023-01-02 | 12.5 | 30 | 20 | True |
| b | 2023-01-04 | 15 | 30 | 20 | True |
| c | 2023-01-04 | 15 | 15 | 10 | True |
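The number of rows in the output table equals the number of rows in the input table where "update" is True. So basically, within each serial_number, I am trying to sum or average all rows from the first "update" = False row up to and including the next "update" = True row.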
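For anyone who wants to reproduce the problem, the input table can be typed in as a DataFrame roughly as follows. This is a minimal sketch: the column names are taken from the result printed in the answer below, and keeping the dates in `model` as plain strings is an assumption, since the question does not state their dtype.

```python
import pandas as pd

# Input table from the question, entered by hand.
# Assumption: the dates in `model` are kept as ISO-formatted strings.
df = pd.DataFrame({
    "serial_number": ["a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "model": [
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-04",
    ],
    "numerical_mean": [5, 10, 15, 10, 15, 15, 15, 15, 15],
    "numerical_1_sum": [10, 15, 15, 15, 15, 15, 15, 15, 15],
    "numerical_2_sum": [20, 10, 10, 10, 10, 10, 10, 10, 10],
    "update": [False, False, True, False, True, False, True, False, True],
})

print(df)
```

The frame has nine rows, four of which have `update` set to True, so the desired output has four rows.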

pandas

Source: https://stackoverflow.com/questions/75868196/pandas-groupby-only-same-id-and-when-column-value-is-false

1 Answer

h6my8fg21#

Code

```python
# df is the input DataFrame from the question.
# Select the columns to aggregate, based on their name suffix.
c1 = df.filter(like='_sum')
c2 = df.filter(like='_mean')

# Build an aggregation dictionary that maps each column name
# to its aggregation function.
agg_dict = {
    'model': 'last',
    'update': 'any',
    **dict.fromkeys(c1, 'sum'),
    **dict.fromkeys(c2, 'mean'),
}

# Grouper that labels each block of rows ending in an update=True row:
# reverse the rows, then take the running count of True values.
b = df[::-1]['update'].cumsum()

# Group by serial_number and block label, then aggregate.
# The grouper is aligned to df on the index, so its reversed order is harmless.
result = df.groupby(['serial_number', b]).agg(agg_dict)

# Drop the block level and keep only groups that contain an update=True row.
# This removes trailing rows with no subsequent update=True row,
# for example (b, 2023-01-05).
result = result.droplevel(1).query('update').reset_index()
```
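To see why the reversed cumulative sum works as a grouper, it can help to print the block labels next to the `update` flags. The following is a small sketch using only the `serial_number` and `update` values from the question's table; the names `serial`, `update`, and `labels` are introduced here for illustration and are not part of the answer.

```python
import pandas as pd

serial = pd.Series(list("aaabbbbbc"), name="serial_number")
update = pd.Series(
    [False, False, True, False, True, False, True, False, True], name="update"
)

# Reverse the rows, then take a running count of True values.
# Every True row opens a new label, which is shared by the False rows
# that come directly before it in the original order.
b = update[::-1].cumsum()

labels = pd.DataFrame({"serial_number": serial, "update": update})
# Column assignment aligns on the index, so the reversed order of b is undone.
labels["block"] = b

print(labels)
```

In original row order the labels are 4, 4, 4, 3, 3, 2, 2, 1, 1. The trailing False row of `b` (row 7) gets the same label as the row of `c` (row 8), which is why `serial_number` is also a grouping key and why groups without any True row are filtered out afterwards.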

Result

```
  serial_number       model  update  numerical_1_sum  numerical_2_sum  numerical_mean
0             a  2023-01-03    True               40               40            10.0
1             b  2023-01-04    True               30               20            15.0
2             b  2023-01-02    True               30               20            12.5
3             c  2023-01-04    True               15               10            15.0
```
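Note that the two `b` rows come out in reverse chronological order, because the block labels count from the bottom of the frame. If the order from the question's expected output is wanted, one option is to sort afterwards. The sketch below re-enters the printed result by hand and assumes the dates in `model` are ISO-formatted strings, which sort chronologically; for other formats a conversion with `pd.to_datetime` would be needed first.

```python
import pandas as pd

# The result printed above, re-entered by hand for illustration.
result = pd.DataFrame({
    "serial_number": ["a", "b", "b", "c"],
    "model": ["2023-01-03", "2023-01-04", "2023-01-02", "2023-01-04"],
    "update": [True, True, True, True],
    "numerical_1_sum": [40, 30, 30, 15],
    "numerical_2_sum": [40, 20, 20, 10],
    "numerical_mean": [10.0, 15.0, 12.5, 15.0],
})

# Sort by ID, then by date within each ID, and renumber the rows.
ordered = result.sort_values(["serial_number", "model"], ignore_index=True)

print(ordered)
```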

Answered on 2023-04-04