How to apply a rolling function by group in Pandas

nbysray5  posted on 2022-12-21  in  Other

I'm having trouble solving a look-back (rolling) problem on a DataFrame or groupby object.
Here is a simple example of my DataFrame:

           fruit     amount
20140101   apple     3
20140102   apple     5
20140102   orange    10
20140104   banana    2
20140104   apple     10
20140104   orange    4
20140105   orange    6
20140105   grape     1
…
20141231   apple     3
20141231   grape     2

For each day, I need to compute the average 'amount' of each fruit over the previous 3 days, and create the following DataFrame:

           fruit      average_in_last_3_days
20140104   apple      4
20140104   orange     10
...

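For example, for 20140104 the previous 3 days are 20140101, 20140102 and 20140103 (note that the dates in the DataFrame are not contiguous; 20140103 does not exist). The average amount is (3+5)/2 = 4 for apple and 10/1 = 10 for orange, and 0 for the rest.
The sample DataFrame is very simple, but the real one is far larger and more complex. I hope someone can shed some light on this. Thanks in advance!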
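
To make the target concrete, here is a minimal sketch that rebuilds the first rows of the sample and checks the hand calculation for 20140104. The column name `date` and the variable names are assumptions made for illustration; in the question the dates are the index.

```python
import pandas as pd

# First eight rows of the sample; the dates are placed in a 'date' column
# (an assumption, the question keeps them as the index).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

target = pd.Timestamp("2014-01-04")

# Keep only rows from the 3 days before the target day (target day excluded).
in_window = (df["date"] >= target - pd.Timedelta(days=3)) & (df["date"] < target)

# Average per fruit over the rows that actually exist in that window.
expected = df[in_window].groupby("fruit")["amount"].mean()
print(expected)
```

Only apple and orange have rows in that window, so they are the only fruits with a defined average; fruits with no rows in the window come out missing rather than 0 unless they are filled in afterwards.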

bxjv4tth1#

Suppose we start with a DataFrame like this:

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5

>>> import pandas as pd
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # add the dates missing from each group
                      .fillna(0)   # treat each missing day as an amount of 0
                      .rolling(3)
                      .sum())  # rolling sum over 3 consecutive days
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days', 
                     'level_1': 'date'}))  # rename the date index level to a date column

>>> temp.head()
   fruit        date  sum_of_3_days
0  apple  2017-06-01            NaN
1  apple  2017-06-02            NaN
2  apple  2017-06-03           17.0
3  apple  2017-06-04           28.0
4  apple  2017-06-05           36.0

# convert the date index into a date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
# merge returns a new DataFrame, so assign the result back
>>> df = df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0
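
The reindex-then-roll idea above can also be sketched without relying on the automatically generated `level_1` column name. This is a variant under stated assumptions, not the answer's exact code: the dates live in a `date` column, the data is a small subset of the frame above, and the names `full_range` and `pieces` are illustrative.

```python
import pandas as pd

# Small subset of the answer's data, with the dates in a 'date' column (assumption).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2017-06-01", "2017-06-03", "2017-06-04", "2017-06-02", "2017-06-03"]
        ),
        "fruit": ["apple", "apple", "apple", "grape", "grape"],
        "amount": [1, 16, 12, 14, 16],
    }
)

# Every calendar day between the first and last observation.
full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D")

pieces = []
for fruit, grp in df.groupby("fruit"):
    # Reindex this fruit to the full range; missing days count as 0.
    daily = grp.set_index("date")["amount"].reindex(full_range, fill_value=0)
    pieces.append(
        pd.DataFrame(
            {
                "fruit": fruit,
                "date": full_range,
                # Sum over the current day and the two days before it.
                "sum_of_3_days": daily.rolling(3).sum().to_numpy(),
            }
        )
    )

temp = pd.concat(pieces, ignore_index=True)
print(temp)
```

As in the answer, the first two days of each fruit are NaN because a full 3-day window is not yet available. Filling missing days with 0 is harmless for a sum, but it would pull a mean down, which matters for the average the question asks for.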

vawmfj5a2#

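I also wanted a rolling calculation on a groupby, which is how I landed on this page, but I believe I have a better solution than the previous suggestion.
You can do the following (this assumes the dates are in a column named 'date'):

pivoted_df = pd.pivot_table(df, index='date', columns='fruit', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack().reset_index()

The .stack() is not required, but it converts the pivot table back into a regular df.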
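
One caveat: `rolling(window=3)` on the pivoted table counts rows, not calendar days, so it only matches "3 days" when every day is present. Below is a hedged sketch of how the pivot approach might be adapted to the question's definition (mean over the previous 3 calendar days, skipping days without data). The `date` column, the `all_days` helper and the output name `avg_prev_3d` are assumptions for illustration.

```python
import pandas as pd

# First eight rows of the question's sample, with dates in a 'date' column (assumption).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

pivoted = df.pivot_table(index="date", columns="fruit", values="amount")

# Insert the missing calendar days so that one row corresponds to one day.
all_days = pd.date_range(pivoted.index.min(), pivoted.index.max(), freq="D", name="date")
pivoted = pivoted.reindex(all_days)

# shift(1) leaves out the current day; min_periods=1 averages whatever days have data.
avg_prev_3d = pivoted.shift(1).rolling(window=3, min_periods=1).mean()

# Back to long format, dropping fruit/day pairs with no data in the window.
result = avg_prev_3d.stack().dropna().rename("avg_prev_3d").reset_index()
print(result)
```

The long result also contains days such as 2014-01-03 that were absent from the input, because the table was reindexed to a full daily range; merge it back onto the original rows if only observed dates are wanted.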

w1jd8yoj3#

You can do it like this:

>>> df
>>>
           fruit  amount
20140101   apple       3
20140102   apple       5
20140102  orange      10
20140104  banana       2
20140104   apple      10
20140104  orange       4
20140105  orange       6
20140105   grape       1

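>>> g = df.set_index('fruit', append=True).groupby(level=1)
>>> # pd.rolling_mean was removed from pandas; .rolling(3, min_periods=1).mean() is the equivalent
>>> res = g['amount'].transform(lambda s: s.rolling(3, min_periods=1).mean()).reset_index('fruit')
>>> res

           fruit     amount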
20140101   apple   3.000000
20140102   apple   4.000000
20140102  orange  10.000000
20140104  banana   2.000000
20140104   apple   6.000000
20140104  orange   7.000000
20140105  orange   6.666667
20140105   grape   1.000000
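Update

As @cphlewis pointed out in the comments, my code above does not give the result you want: it averages over the last 3 rows of each fruit, including the current one, rather than over the previous 3 days. I looked into other approaches, and this is what I have found so far (I'm not sure about its performance, though):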

>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)  # the dates become a column named 'index'
>>> def avg_3_days(x):
        # rows of the same fruit dated within the 3 days before this row's date
        in_window = (df['index'] >= x['index'] - pd.Timedelta(days=3)) & (df['index'] < x['index'])
        return df[in_window & (df['fruit'] == x['fruit'])].amount.mean()

>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df

       index   fruit  amount  res
0 2014-01-01   apple       3  NaN
1 2014-01-02   apple       5    3
2 2014-01-02  orange      10  NaN
3 2014-01-04  banana       2  NaN
4 2014-01-04   apple      10    4
5 2014-01-04  orange       4   10
6 2014-01-05  orange       6    7
7 2014-01-05   grape       1  NaN

yx2lnoni4#

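import pandas as pd

df1.index = pd.to_datetime(df1.index, format='%Y%m%d')

def function1(ss: pd.Series):
    # drop the current (latest) day from the window, then average the rest
    return ss.loc[ss.index < ss.index.max()].mean()

# a 4-day window minus the current day leaves the previous 3 days
df1.reset_index().assign(av=df1.reset_index().groupby('fruit')
            .apply(lambda dd: dd.rolling('4D', on='index').amount.apply(function1))
            .droplevel(0)).set_index('index')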

          fruit  amount    av
index                           
2014-01-01   apple       3   NaN
2014-01-02   apple       5   3.0
2014-01-02  orange      10   NaN
2014-01-04  banana       2   NaN
2014-01-04   apple      10   4.0
2014-01-04  orange       4  10.0
2014-01-05  orange       6   7.0
2014-01-05   grape       1   NaN
2014-12-31   apple       3   NaN
2014-12-31   grape       2   NaN
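
A 4-day window with the current day removed covers the same days as a 3-day window that is closed on the left, so the custom function can possibly be avoided. The following is a sketch under assumptions, not a drop-in replacement: the dates are in a `date` column, each (fruit, date) pair occurs at most once, and `avg_prev_3d` is an illustrative name.

```python
import pandas as pd

# First eight rows of the question's sample, with dates in a 'date' column (assumption).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

rolled = (
    df.sort_values("date")          # time-based windows need sorted dates
      .set_index("date")
      .groupby("fruit")["amount"]
      .rolling("3D", closed="left")  # window [t - 3 days, t): current day excluded
      .mean()
      .rename("avg_prev_3d")
      .reset_index()                 # columns: fruit, date, avg_prev_3d
)

# Attach the rolling mean to the original rows.
result = df.merge(rolled, on=["fruit", "date"], how="left")
print(result)
```

Rows with no earlier data in the window get NaN, matching the NaN entries in the output above. If a fruit can appear more than once on the same date, the merge on `fruit` and `date` would duplicate rows and a different join key would be needed.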
