Creating sum of date ranges in Pandas

Posted by xriantvc on 2022-11-20 in Other

I have the following DataFrame, which has more than 3 million rows:

  VALID_FROM   VALID_TO  VALUE
0 2022-01-01 2022-01-02      5
1 2022-01-01 2022-01-03      2
2 2022-01-02 2022-01-04      7
3 2022-01-03 2022-01-06      3

I want to create one large date_range that holds the sum of the values for each timestamp.
For the DataFrame above, that would give:

       dates  val
0 2022-01-01    7
1 2022-01-02   14
2 2022-01-03   12
3 2022-01-04   10
4 2022-01-05    3
5 2022-01-06    3
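
To make the expected output concrete: each date's total is the sum of VALUE over every row whose VALID_FROM to VALID_TO range (both ends included) contains that date. The sketch below checks this for single dates on the sample data. The helper name total_on is made up for illustration and is not part of the original question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})


def total_on(day):
    """Sum VALUE over rows whose [VALID_FROM, VALID_TO] range contains `day`."""
    day = pd.Timestamp(day)
    covers = (df["VALID_FROM"] <= day) & (day <= df["VALID_TO"])
    return int(df.loc[covers, "VALUE"].sum())


# Rows 0, 1 and 2 cover 2022-01-02, so the total is 5 + 2 + 7 = 14.
print(total_on("2022-01-02"))
```

Calling this once per date would still be slow on 3 million rows; it only illustrates the definition of the desired result.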

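However, since the DataFrame has more than 3 million rows, I don't want to iterate over every row, and I don't know how to do this without iterating. Any suggestions?
Currently my code looks like this: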

new_df = pd.DataFrame()
for idx, row in dummy_df.iterrows():
    dr = pd.date_range(row["VALID_FROM"], end = row["VALID_TO"], freq = "D")
    tmp_df = pd.DataFrame({"dates": dr, "val": row["VALUE"]})
    new_df = pd.concat(objs=[new_df, tmp_df], ignore_index=True)
new_df.groupby("dates", as_index=False, group_keys=False).sum()

The result of the groupby is my desired output.

pandas

Source: https://stackoverflow.com/questions/74460294/creating-sum-of-date-ranges-in-pandas

2 Answers

#1 qvtsj1bj

If performance matters, repeat the rows with Index.repeat and DataFrame.loc, build the dates column from a counter produced by GroupBy.cumcount, and finally aggregate with sum:

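import pandas as pd

# Make sure both columns are real datetimes
df['VALID_FROM'] = pd.to_datetime(df['VALID_FROM'])
df['VALID_TO'] = pd.to_datetime(df['VALID_TO'])
# Repeat each row once per day in its range (both endpoints included)
df1 = df.loc[df.index.repeat(df['VALID_TO'].sub(df['VALID_FROM']).dt.days + 1)]
# Shift each repeated row by its position within its original row
df1['dates'] = df1['VALID_FROM'] + pd.to_timedelta(df1.groupby(level=0).cumcount(), unit='D')
# Sum VALUE per calendar day
df1 = df1.groupby('dates', as_index=False)['VALUE'].sum()
print(df1)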
       dates  VALUE
0 2022-01-01      7
1 2022-01-02     14
2 2022-01-03     12
3 2022-01-04     10
4 2022-01-05      3
5 2022-01-06      3
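
For readers who want to see what the repeat and cumcount steps produce, here is a minimal sketch that runs the same idea on the sample data from the question and prints the intermediate values. The variable names n_days, expanded, offset and result are chosen here for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})

# Number of days each row covers, endpoints included.
n_days = (df["VALID_TO"] - df["VALID_FROM"]).dt.days + 1
print(n_days.tolist())  # [2, 3, 3, 4]

# Repeat each row once per covered day; the original index labels are kept,
# so copies of the same source row share one label.
expanded = df.loc[df.index.repeat(n_days)].copy()

# cumcount numbers the copies of each source row 0, 1, 2, ...
offset = expanded.groupby(level=0).cumcount()
print(offset.tolist())  # [0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3]

# Start date plus day offset gives the calendar day each copy stands for.
expanded["dates"] = expanded["VALID_FROM"] + pd.to_timedelta(offset, unit="D")
result = expanded.groupby("dates", as_index=False)["VALUE"].sum()
print(result)
```

The expanded frame has one row per (source row, day) pair, 12 rows here, so memory use grows with the total number of covered days rather than with the number of source rows.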

Posted 2022-11-20

#2 js4nwp54

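One option is to build a list of dates, from the minimum to the maximum value in the original DataFrame, use a non-equi join with conditional_join to get the matches, and finally apply groupby and sum: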

# pip install pyjanitor
import pandas as pd
import janitor
# build the date pandas object:
dates = df.filter(like='VALID').to_numpy()
dates = pd.date_range(dates.min(), dates.max(), freq='1D')
dates = pd.Series(dates, name='dates')
# compute the inequality join between valid_from and valid_to, 
# followed by the aggregation on a groupby:
(df
.conditional_join(
    dates, 
    ('VALID_FROM', 'dates', '<='),
    ('VALID_TO','dates', '>='), 
    # if you have numba installed, 
    # it can improve performance
    use_numba=False, 
    df_columns='VALUE')
.groupby('dates')
.VALUE
.sum()
) 
dates
2022-01-01     7
2022-01-02    14
2022-01-03    12
2022-01-04    10
2022-01-05     3
2022-01-06     3
Name: VALUE, dtype: int64
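
To show what the non-equi join computes, the sketch below reproduces the same result in plain pandas with a cross join followed by a filter. This is not how conditional_join works internally and it is an illustration rather than a substitute: a cross join materialises every (row, date) pair, so it is only practical on small inputs. It assumes pandas 1.2 or later for how="cross", and the names pairs, inside and result are chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})

# Every calendar day between the earliest VALID_FROM and the latest VALID_TO.
dates = pd.date_range(df["VALID_FROM"].min(), df["VALID_TO"].max(), freq="D")
dates = pd.Series(dates, name="dates")

# Pair every row with every date, then keep only the pairs where the date
# falls inside the row's validity range (the two inequality conditions).
pairs = df.merge(dates.to_frame(), how="cross")
inside = (pairs["VALID_FROM"] <= pairs["dates"]) & (pairs["dates"] <= pairs["VALID_TO"])
result = pairs.loc[inside].groupby("dates")["VALUE"].sum()
print(result)
```

On the sample data this builds 4 x 6 = 24 pairs and keeps 12 of them before aggregating.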

Posted 2022-11-20
