DataFrame values as the sum between dates, using a lambda function

nbysray5 posted on 2021-09-29 in Java

I have two DataFrames, df1 and df2. df1 is a uniform date range with a second column whose default value is 0. It looks like this:

    df1:
             date  Val
    0  2020-02-01    0
    1  2020-02-02    0
    2  2020-02-03    0
    .
    .
    .

Meanwhile, df2 only records the events that occurred:

    df2:
                      date  Val
    0  2020-02-01-10:00:00   98
    1  2020-02-01-13:54:09   55
    2  2020-02-02-11:33:17   32
    .
    .
    .

I want df1.val[i] to be the sum of all values in df2 that fall between the two dates df1.date[i] and df1.date[i+1].
I wrote the following lambda function, but it doesn't work:

    df1['val'] = df1.apply( lambda row: df2[ (df2.date < df1.date[row.index])].sum() )

Any help?


busg9geu #1

Use pd.cut (this assumes both date columns are already datetimes):

    >>> df2.groupby(pd.cut(df2['date'], bins=df1['date'], labels=df1['date'][:-1])) \
    ...     ['Val'].sum().reset_index()
            date  Val
    0 2020-02-01  153
    1 2020-02-02   32
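
For anyone who wants to run the pd.cut idea end to end, here is a minimal sketch built on the sample rows from the question. Two things in it are assumptions that go beyond the answer above: right=False, so that each bin is the half-open interval [date[i], date[i+1]), and the final step that writes the sums back into df1 by position. The last row of df1 has no following date, so it keeps its default of 0.

```python
import pandas as pd

# Sample data from the question, parsed as datetimes.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "Val": 0,
})
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-02-01 10:00:00",
        "2020-02-01 13:54:09",
        "2020-02-02 11:33:17",
    ]),
    "Val": [98, 55, 32],
})

# Assign every event to the interval [date[i], date[i+1]) it falls into.
# right=False is an assumption: it keeps an event at exactly midnight
# in the interval that starts on that day.
bins = pd.cut(df2["date"], bins=df1["date"],
              labels=df1["date"][:-1], right=False)

# observed=False keeps empty intervals (sum 0), so there is exactly one
# sum per interval, in the same order as df1["date"][:-1].
sums = df2.groupby(bins, observed=False)["Val"].sum()

# Write the sums back by position; the last row has no interval.
df1.loc[df1.index[:-1], "Val"] = sums.to_numpy()
print(df1)
```

With the default right=True the bins would instead be (date[i], date[i+1]], which only matters for events that land exactly on a boundary.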

l0oc07j2 #2

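Here is some code with explanations. I know there are faster and easier ways, but this one is easy to understand. Just replace the dummy data with your own.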

    import pandas as pd
    import numpy as np

    if __name__ == "__main__":
        # Dummy data: two DataFrames with the columns date and Val
        df1 = pd.DataFrame({'date': pd.date_range(start='2000-01-01', end='2000-12-31',
                                                  periods=8, normalize=True),
                            'Val': np.zeros(8)})
        df2 = pd.DataFrame({'date': pd.date_range(start='2000-01-01', end='2000-12-31',
                                                  periods=8, normalize=True),
                            'Val': np.random.randint(0, 100, 8)})

        # Goal: df1.Val[i] is the sum of all values in df2 between the dates
        # df1.date[i] and df1.date[i+1].
        # The result is written back into df1['Val'], so df1 keeps its rows,
        # columns, index, and date order.

        print('-' * 50)
        print('df1')
        print(df1)
        print('-' * 50)
        print('df2')
        print(df2)

        for i in range(len(df1) - 1):
            # add the sum of all values in df2 before the end date of the interval (i.e. i + 1)
            df1.loc[i, 'Val'] += df2.loc[df2['date'] < df1['date'][i + 1], 'Val'].sum()
            # subtract the sum of all values before the start date of the interval (i.e. i);
            # using < on both bounds counts an event on a boundary date exactly once
            df1.loc[i, 'Val'] -= df2.loc[df2['date'] < df1['date'][i], 'Val'].sum()

        print('-' * 50)
        print('df1')
        print(df1)
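
The loop above amounts to "running total before the end date minus running total before the start date". As a rough sketch of one of the faster ways mentioned, the same difference can be computed without a Python loop using a cumulative sum and numpy.searchsorted. The sample rows from the question are used here instead of the random dummy data, and the variable names (events, running, pos, before) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, parsed as datetimes.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "Val": 0,
})
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-02-01 10:00:00",
        "2020-02-01 13:54:09",
        "2020-02-02 11:33:17",
    ]),
    "Val": [98, 55, 32],
})

# searchsorted needs the events in date order.
events = df2.sort_values("date")

# running[k] = sum of the first k event values (running[0] = 0).
running = np.concatenate([[0], events["Val"].cumsum().to_numpy()])

# pos[i] = number of events strictly before df1.date[i].
pos = np.searchsorted(events["date"].to_numpy(),
                      df1["date"].to_numpy(), side="left")

# before[i] = sum of all event values strictly before df1.date[i].
before = running[pos]

# Interval i covers [date[i], date[i+1]); the last row has no interval.
df1.loc[df1.index[:-1], "Val"] = before[1:] - before[:-1]
print(df1)
```

This does one sort and one binary search per date rather than filtering df2 twice per row, which should matter mostly when both frames are large.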

pkbketx9 #3

I can recommend duckdb for this task:

    import pandas as pd
    import duckdb

    df1 = pd.DataFrame()
    df2 = pd.DataFrame()

    df1['date'] = ['2020-02-01', '2020-02-02', '2020-02-03']
    df1['Val'] = [0, 0, 0]
    df1['date'] = pd.to_datetime(df1['date'])
    # date_2 is the next row's date, i.e. the end of each interval (NaT on the last row)
    df1['date_2'] = df1['date'].shift(-1)

    df2['date'] = ['2020-02-01-10:00:00', '2020-02-01-13:54:09', '2020-02-02-11:33:17']
    df2['Val'] = [98, 55, 32]
    # the date and time are joined by a hyphen, so spell out the format
    df2['date'] = pd.to_datetime(df2['date'], format='%Y-%m-%d-%H:%M:%S')

    # the left join keeps every df1 date; events are matched to the interval (date, date_2]
    test_df = duckdb.query(
        """select t.date, sum(t.Val) as Val
           from (
               select t1.date, t2.Val, t2.date as date_validation
               from df1 as t1
               left join df2 as t2 on t1.date < t2.date and t1.date_2 >= t2.date
           ) t
           group by t.date
           order by t.date"""
    ).to_df()