How can I cumulatively sum values on a condition in polars or pandas? [with example]

Posted by qnyhuwrf on 2023-04-10 in Other


I have a problem that I would prefer to solve with polars, though pandas is fine too. Suppose we have the following dataset (sample):

{
  "date" : ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
  "customers" : [3, 4, 5, 3, 2],
  "is_reporting_day?" : [True, False, False, False, True]
}

To make it a bit clearer, here it is in table form:

| date | customers | is_reporting_day? |
| --------------|--------------|--------------|
| 2022-01-01 | 3 | True |
| 2022-01-02 | 4 | False |
| 2022-01-03 | 5 | False |
| 2022-01-04 | 3 | False |
| 2022-01-05 | 2 | True |

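What I want is this: if is_reporting_day? is True, keep the customer count as it is; if is_reporting_day? is False, sum the customers of all those non-reporting days (4 + 5 + 3 = 12) and add that total to the next reporting day, i.e. the next row where the value is True (12 + 2 = 14).
So after applying the transformation, it should look like this:
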
| date | customers | is_reporting_day? | customers (sum) |
| --------------|--------------|--------------|--------------|
| 2022-01-01 | 3 | True | 3 |
| 2022-01-05 | 2 | True | 14 |

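I tried using the cumsum() function together with a pl.when statement in polars, but that is the wrong logic, because it sums from the very beginning, i.e. from the first day (there are about 700 days).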

Note: the solution should be dynamic, i.e. the gap between reporting days varies; sometimes there is 1 non-reporting day in between, sometimes 2, and so on.
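To pin down the expected behaviour independently of either library, the rule can be sketched in plain Python. This is only an illustration of the desired result, not a polars or pandas solution; the function name `roll_up_to_reporting_days` and the tuple layout are made up for the example.

```python
# Plain-Python sketch of the desired rule (no polars/pandas involved).
rows = [
    ("2022-01-01", 3, True),
    ("2022-01-02", 4, False),
    ("2022-01-03", 5, False),
    ("2022-01-04", 3, False),
    ("2022-01-05", 2, True),
]


def roll_up_to_reporting_days(rows):
    """Carry the customers of non-reporting days into the next reporting day."""
    result = []
    pending = 0  # customers accumulated since the last reporting day
    for date, customers, is_reporting_day in rows:
        pending += customers
        if is_reporting_day:
            # Emit the reporting day with its own count and the rolled-up total.
            result.append((date, customers, pending))
            pending = 0
    return result


expected = roll_up_to_reporting_days(rows)
print(expected)
```

Because the running total is reset only when a reporting day is reached, the number of non-reporting days in between can be anything.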

Any ideas or input are highly appreciated! Thanks in advance!

Source: https://stackoverflow.com/questions/75939533/how-can-i-sum-cumulatively-values-on-condition-in-polars-or-pandas-with-exampl

3 Answers


wmvff8tz1#

Using @mozway's approach, it is almost the same in polars:

(df
 .groupby(
    pl.col("is_reporting_day?")
      .shift_and_fill(False, periods=1)
      .cumsum().alias("group"),
    maintain_order=True)
 .agg(
    pl.all().last(),
    sum = pl.sum("customers"))
)

shape: (2, 5)
┌───────┬─────────────────────┬───────────┬───────────────────┬─────┐
│ group ┆ date                ┆ customers ┆ is_reporting_day? ┆ sum │
│ ---   ┆ ---                 ┆ ---       ┆ ---               ┆ --- │
│ u32   ┆ datetime[ns]        ┆ i64       ┆ bool              ┆ i64 │
╞═══════╪═════════════════════╪═══════════╪═══════════════════╪═════╡
│ 0     ┆ 2022-01-01 00:00:00 ┆ 3         ┆ true              ┆ 3   │
│ 1     ┆ 2022-01-05 00:00:00 ┆ 2         ┆ true              ┆ 14  │
└───────┴─────────────────────┴───────────┴───────────────────┴─────┘
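The grouping key is the part that does the work here, so it may help to see it on its own. The following plain-Python sketch (standard library only, not polars) mimics the shift-then-cumulative-sum step on the example data; the variable names are illustrative.

```python
from itertools import accumulate

flags = [True, False, False, False, True]  # the is_reporting_day? column
customers = [3, 4, 5, 3, 2]

# Shift the flags down by one row, filling the first slot with False,
# so that a reporting day still belongs to the group it closes.
shifted = [False] + flags[:-1]

# Running count of True values: the group id increases on the row
# that follows each reporting day.
group = list(accumulate(int(flag) for flag in shifted))
print(group)

# Sum the customers per group id.
sums = {}
for key, value in zip(group, customers):
    sums[key] = sums.get(key, 0) + value
print(sums)
```

Without the shift, the id would already increase on the reporting day itself, and that day would be grouped with the rows after it instead of the rows before it.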

df.groupby(
   pl.when(pl.col("is_reporting_day?"))
     .then(pl.col("date"))
     .backward_fill()
     .alias("group"),
   maintain_order=True
).agg(
   pl.all().last(),
   sum = pl.sum("customers")
)

shape: (2, 5)
┌─────────────────────┬─────────────────────┬───────────┬───────────────────┬─────┐
│ group               ┆ date                ┆ customers ┆ is_reporting_day? ┆ sum │
│ ---                 ┆ ---                 ┆ ---       ┆ ---               ┆ --- │
│ datetime[ns]        ┆ datetime[ns]        ┆ i64       ┆ bool              ┆ i64 │
╞═════════════════════╪═════════════════════╪═══════════╪═══════════════════╪═════╡
│ 2022-01-01 00:00:00 ┆ 2022-01-01 00:00:00 ┆ 3         ┆ true              ┆ 3   │
│ 2022-01-05 00:00:00 ┆ 2022-01-05 00:00:00 ┆ 2         ┆ true              ┆ 14  │
└─────────────────────┴─────────────────────┴───────────┴───────────────────┴─────┘

If you want to keep the original rows, you can use .over().

df.with_columns(
   pl.cumsum("customers").over(
      pl.when(pl.col("is_reporting_day?"))
        .then(pl.col("date"))
        .backward_fill())
   .alias("cumsum")
)

shape: (5, 4)
┌─────────────────────┬───────────┬───────────────────┬────────┐
│ date                ┆ customers ┆ is_reporting_day? ┆ cumsum │
│ ---                 ┆ ---       ┆ ---               ┆ ---    │
│ datetime[ns]        ┆ i64       ┆ bool              ┆ i64    │
╞═════════════════════╪═══════════╪═══════════════════╪════════╡
│ 2022-01-01 00:00:00 ┆ 3         ┆ true              ┆ 3      │
│ 2022-01-02 00:00:00 ┆ 4         ┆ false             ┆ 4      │
│ 2022-01-03 00:00:00 ┆ 5         ┆ false             ┆ 9      │
│ 2022-01-04 00:00:00 ┆ 3         ┆ false             ┆ 12     │
│ 2022-01-05 00:00:00 ┆ 2         ┆ true              ┆ 14     │
└─────────────────────┴───────────┴───────────────────┴────────┘


z9smfwbn2#

Assuming the dates are already sorted, use groupby.agg:

out = (df.groupby(df['is_reporting_day?'].shift(fill_value=False).cumsum(), as_index=False)
         .agg({'date': 'max', 'customers': 'sum', 'is_reporting_day?': 'max'})
      )

Output:

         date  customers  is_reporting_day?
0  2022-01-01          3               True
1  2022-01-05         14               True
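To see why this works, it can help to build the frame and look at the grouping key by itself. The sketch below reconstructs the example data (the construction is an assumption, since the question only lists the values) and renames the key to `group`, a made-up name, so that it cannot clash with an existing column.

```python
import pandas as pd

# Example data rebuilt from the question (construction is an assumption).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"]
    ),
    "customers": [3, 4, 5, 3, 2],
    "is_reporting_day?": [True, False, False, False, True],
})

# Shift the flag down one row so a reporting day stays in the group it closes,
# then count the reporting days seen so far to get one group id per row.
key = df["is_reporting_day?"].shift(fill_value=False).cumsum().rename("group")
print(key.tolist())

# Aggregate per group id and drop the helper key from the result.
out = (
    df.groupby(key)
      .agg({"date": "max", "customers": "sum", "is_reporting_day?": "max"})
      .reset_index(drop=True)
)
print(out)
```

Since the key is derived only from the flag column, the spacing between reporting days does not matter, but the rows do need to be in date order.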

If you need both the original value of "customers" and the sum:

out = (df.groupby(df['is_reporting_day?'].shift(fill_value=False).cumsum(), as_index=False)
         .agg(**{'date': ('date', 'max'),
                 'customers': ('customers', 'last'),
                 'is_reporting_day?': ('is_reporting_day?', 'max'),
                 'customers_sum': ('customers', 'sum'),
                })
      )

Output:

         date  customers  is_reporting_day?  customers_sum
0  2022-01-01          3               True              3
1  2022-01-05          2               True             14

Alternative:

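# Label each row with the date of the next reporting day, then group on that label.
# 'date' is the grouping column here, so it is returned automatically
# and is not listed again in the aggregation.
out = (
 df.assign(date=df['date'].where(df['is_reporting_day?']).bfill())
   .groupby('date', as_index=False)
   .agg(**{'customers': ('customers', 'last'),
           'is_reporting_day?': ('is_reporting_day?', 'max'),
           'customers_sum': ('customers', 'sum'),
          })
)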


flvlnr443#

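# Assumes the flag column has been renamed to is_reporting_day (no "?") so attribute access works.
# A new group starts at the first non-reporting row that follows a reporting day.
col1=(df1.is_reporting_day.eq(False)&df1.is_reporting_day.shift().eq(True)).cumsum()

# Keep the last row of each group and attach the group's total as customers2.
df1.groupby(col1,group_keys=False).apply(lambda dd:dd.tail(1)
                                         .assign(customers2=dd['customers'].sum()))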

Output:

         date  customers  is_reporting_day  customers2
0  2022-01-01          3              True           3
4  2022-01-05          2              True          14
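If `apply` turns out to be slow on a longer frame, a variant without it might look like the sketch below. It is a substitution rather than the code above: it uses the shifted cumulative-sum key from the earlier pandas answer instead of `col1` (as written, `col1` appears to put two consecutive reporting days into the same group), and it filters on the flag instead of taking `tail(1)`, so trailing non-reporting rows are dropped. The frame construction and the renamed column `is_reporting_day` are assumptions taken from the output shown.

```python
import pandas as pd

# Example data with the flag column renamed as in the output above (assumption).
df1 = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
    "customers": [3, 4, 5, 3, 2],
    "is_reporting_day": [True, False, False, False, True],
})

# Group id that increases on the row after each reporting day.
key = df1["is_reporting_day"].shift(fill_value=False).cumsum()

# Broadcast each group's total back onto its rows, then keep the reporting days.
out = (
    df1.assign(customers2=df1.groupby(key)["customers"].transform("sum"))
       .loc[df1["is_reporting_day"]]
)
print(out)
```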

