Python: why are columns not mentioned in groupby and sum dropped?

Posted by pvcm50d1 on 2021-09-08 in Java


I have this DataFrame:

   InvoiceID          PaymentDate  TotalRevenue  Discount  Discount_Revenue
0   72A04E22  2020-07-03 17:25:13     1650000.0       0.0         1650000.0
1   54FCFCB9  2021-03-17 14:26:08     5500000.0       0.0         5500000.0
...

After the aggregation below, the PaymentDate column is dropped:

df.groupby(by=['InvoiceID'])[['TotalRevenue','Discount','Discount_Revenue']].sum().reset_index(drop=True, inplace=True)

How can I keep the columns that are not mentioned in the group by or in the aggregation function?

python pandas pandas-groupby

Source: https://stackoverflow.com/questions/68315319/why-column-not-mention-in-groupby-and-sum-will-be-dropped

1 answer


wljmcqd81#

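When you do a groupby with sum, you are aggregating the data: out of several rows of df that share the same InvoiceID, only one row is kept, and it holds the sum of the values across all of those rows.
Suppose this is your DataFrame, with the same row appearing twice: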

InvoiceID          PaymentDate  TotalRevenue  Discount  Discount_Revenue
0  72A04E22  2020-07-03 17:25:13     1650000.0       0.0         1650000.0
1  54FCFCB9  2021-03-17 14:26:08     5500000.0       0.0         5500000.0
2  54FCFCB9  2021-03-17 14:26:08     5500000.0       1.0         5500000.0

You can then see this effect when summing Discount, for example:

>>> df.groupby('InvoiceID')['Discount'].sum()
InvoiceID
54FCFCB9    1.0
72A04E22    0.0
Name: Discount, dtype: float64
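The example above can be reproduced with a short script. This is a minimal sketch: the values are copied from the table above, and parsing PaymentDate with pd.to_datetime is an assumption, since the question does not state the dtype of that column.

```python
import pandas as pd

# Sample data copied from the table above; the third row repeats the second
# invoice with a different Discount so that the aggregation is visible.
# Assumption: PaymentDate is stored as datetime (the dtype is not given).
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13",
        "2021-03-17 14:26:08",
        "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

# One row per InvoiceID; the two 54FCFCB9 rows collapse into a single sum.
discount_per_invoice = df.groupby("InvoiceID")["Discount"].sum()
print(discount_per_invoice)
```

Only the selected Discount column appears in the result, because it is the only column for which an aggregation was requested.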

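To answer your question specifically: the PaymentDate column is dropped because you did not specify how to aggregate it.
For columns that it makes no sense to add up, such as PaymentDate, you need to define another aggregation function. Do you want to keep the first payment date? The last one?
Note that InvoiceID has not disappeared in the example above; you remove it yourself with .reset_index(drop=True). Suppose we choose to keep the last payment date and call reset_index without drop=True so that the invoice ID is kept. Then we have: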

>>> invoice_groups = df.groupby('InvoiceID')
>>> invoices = invoice_groups.sum(numeric_only=True).join(invoice_groups['PaymentDate'].max()).reset_index()
>>> invoices
  InvoiceID  TotalRevenue  Discount  Discount_Revenue         PaymentDate
0  54FCFCB9    11000000.0       1.0        11000000.0 2021-03-17 14:26:08
1  72A04E22     1650000.0       0.0         1650000.0 2020-07-03 17:25:13

These are all the columns, each aggregated in some way (sum or max) from the rows of the original DataFrame.
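The same result can also be written as a single agg call that names one aggregation function per column. This is a sketch of an alternative, not the approach used in the answer above; it relies on pandas named aggregation, and it again assumes that PaymentDate holds datetime values.

```python
import pandas as pd

# Same sample data as in the answer; PaymentDate as datetime is an assumption.
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13",
        "2021-03-17 14:26:08",
        "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

# Named aggregation: each output column states its source column and the
# function used to combine the rows of a group.
invoices = (
    df.groupby("InvoiceID")
    .agg(
        TotalRevenue=("TotalRevenue", "sum"),
        Discount=("Discount", "sum"),
        Discount_Revenue=("Discount_Revenue", "sum"),
        PaymentDate=("PaymentDate", "max"),  # keep the last payment date
    )
    .reset_index()  # no drop=True, so InvoiceID comes back as a column
)
print(invoices)
```

Replacing "max" with "min" would keep the earliest payment date per invoice instead; which one is appropriate depends on what the date should mean after aggregation.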

