Python Pandas groupby limited cumulative sum

bnlyeluc  posted on 2023-02-02  in Python

Here is my DataFrame:

import pandas as pd
import numpy as np

data = {'c1':[-1,-1,1,1,np.nan,1,1,1,1,1,np.nan,-1],\
        'c2':[1,1,1,-1,1,1,-1,-1,1,-1,1,np.nan]}

index = pd.date_range('2000-01-01','2000-03-20', freq='W')

df = pd.DataFrame(index=index, data=data)

>>> df
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -1.0  1.0
2000-01-16  1.0  1.0
2000-01-23  1.0 -1.0
2000-01-30  NaN  1.0
2000-02-06  1.0  1.0
2000-02-13  1.0 -1.0
2000-02-20  1.0 -1.0
2000-02-27  1.0  1.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  1.0
2000-03-19 -1.0  NaN

Here is the cumulative sum by month:

df2 = df.groupby(df.index.to_period('m')).cumsum()

>>> df2
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -2.0  2.0
2000-01-16 -1.0  3.0
2000-01-23  0.0  2.0
2000-01-30  NaN  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0 -1.0
2000-02-27  4.0  0.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  0.0
2000-03-19  0.0  NaN

What I need is to ignore any increment that would push the running sum above 3 or below 0, much like the function below does:

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

The expected result is:

             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

I tried using apply with a lambda, but it does not work, and it is also slow on large DataFrames.

df.groupby(df.index.to_period('m')).apply(lambda x: cumsum2(x, 0, 3))

What is going wrong? Is there a faster way?

cbjzeqam1#

You can try accumulate from itertools, together with a custom function that clips the running sum to the range 0 to 3:

from itertools import accumulate

lb = 0  # lower bound
ub = 3  # upper bound

def cumsum2(dfm):
    def clip(bal, val):
        return np.clip(bal + val, lb, ub)
    return list(accumulate(dfm.to_numpy(), clip, initial=0))[1:]

out = df.fillna(0).groupby(df.index.to_period('m')).transform(cumsum2)

Output:

>>> out
             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0
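
To see what accumulate is doing here, a minimal sketch on the January values of c1 with NaN already replaced by 0. The helper name clipped_step is made up for illustration, and plain min/max stands in for np.clip:

```python
from itertools import accumulate

def clipped_step(total, value, low=0, high=3):
    # Add the next value, then pull the running total back into [low, high].
    return min(high, max(low, total + value))

january_c1 = [-1, -1, 1, 1, 0]  # NaN replaced by 0, as fillna(0) does

# initial=0 seeds the running total; [1:] drops that seed from the result.
running = list(accumulate(january_c1, clipped_step, initial=0))[1:]
print(running)  # [0, 0, 1, 2, 2]
```

Each step only needs the previous clipped total and the next value, which is why a two-argument function is enough. The initial keyword of accumulate requires Python 3.8 or later.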

5jdjgkvh2#

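In a tricky case like this we can resort to pandas.Series.rolling with a window size of 2, piping each window to a custom function that keeps every interim accumulation within the given thresholds: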

def cumsum_tsh(x, low=-float('Inf'), high=float('Inf')):
    def f(w):
        w[-1] = min(high, max(low, w[0] if w.size == 1 else w[0] + w[1]))
        return w[-1]
    return x.apply(lambda s: s.rolling(2, min_periods=1).apply(f))

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
    .apply(lambda x: cumsum_tsh(x, 0, 3))

>>> res
             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0
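
The function f appears to rely on writing the clipped value back into the rolling window so that the next window sees it. An explicit loop is one alternative that avoids that dependency. The following is a minimal sketch, assuming the rows of each month are adjacent (true for a sorted DatetimeIndex); the function name clipped_cumsum_by_group and the year * 12 + month key are illustrative choices, not part of the answer above:

```python
import numpy as np
import pandas as pd

def clipped_cumsum_by_group(frame, keys, low, high):
    """Clipped running sum that restarts whenever the group key changes.

    Assumes rows belonging to the same group are adjacent.
    """
    values = frame.fillna(0).to_numpy(dtype=float)
    keys = np.asarray(keys)
    out = np.empty_like(values)
    running = np.zeros(values.shape[1])
    for i in range(len(values)):
        if i == 0 or keys[i] != keys[i - 1]:
            running = np.zeros(values.shape[1])  # new group: restart at 0
        # One step for all columns at once, clipped to [low, high].
        running = np.clip(running + values[i], low, high)
        out[i] = running
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)

data = {'c1': [-1, -1, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, -1],
        'c2': [1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 1, np.nan]}
index = pd.date_range('2000-01-01', '2000-03-20', freq='W')
df = pd.DataFrame(index=index, data=data)

# Hypothetical month key: one integer per calendar month.
month_key = df.index.year * 12 + df.index.month
res = clipped_cumsum_by_group(df, month_key, 0, 3)
print(res)
```

The loop walks the rows once and handles all columns together, so it needs neither groupby nor rolling. Whether it is faster on a given DataFrame would have to be measured.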

8wigbo563#

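I tried various solutions and, for some reason, the fastest turned out to be working on the individual columns of the frames that groupby creates.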

def cumsum2(frame, low=-float('Inf'), high=float('Inf')):
    for col in frame.columns:
        sm = 0
        xs = []
        for e in frame[col]:
            sm += e
            if sm > high:
                sm = high
            if sm < low:
                sm = low
            xs.append(sm)
        frame[col] = xs
    return frame

res = (df.fillna(0)
         .groupby(df.index.to_period('m'), group_keys=False)
         .apply(cumsum2, 0, 3))
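
Looping over frame.columns is also what separates this from the attempt in the question. np.ndenumerate visits a 2-D array row by row, so when the question's function receives a whole sub-frame, a single running total ends up shared between c1 and c2. A minimal sketch of the difference on a small made-up array follows; the function is repeated from the question so the snippet runs on its own, and np.apply_along_axis is used here only to run it column by column:

```python
import numpy as np

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    # Function from the question: one running total over every element.
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

# Made-up data: the two columns play the role of c1 and c2.
group = np.array([[1.0, 1.0],
                  [1.0, -1.0]])

mixed = cumsum2(group, 0, 3)                                # one total for both columns
per_column = np.apply_along_axis(cumsum2, 0, group, 0, 3)   # one total per column

print(mixed.tolist())       # [[1.0, 2.0], [3.0, 2.0]]
print(per_column.tolist())  # [[1.0, 1.0], [2.0, 0.0]]
```

Only the per-column result matches what a clipped cumulative sum of each column should give.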
