numpy 是否可以使用步长大于1的panda.DataFrame.rolling？

cnh2zyt3 于 2022-11-23 发布在其他

关注(0)|答案(6)|浏览(209)

在R中，您可以使用指定的窗口计算滚动平均值，该窗口每次可以移动指定的量。
然而，也许我只是没有找到它的任何地方，但它似乎不像你可以在Pandas或其他一些Python库？
有人知道解决这个问题的方法吗？我给予个例子来说明我的意思：

这里我们有两周一次的数据，我计算的是两个月的移动平均值，每移动1个月，即2行。
所以在R中我会这样做：two_month__movavg=rollapply(mydata,4,mean,by = 2,na.pad = FALSE)在Python中没有等价物吗？
编辑1：

DATE  A DEMAND   ...     AA DEMAND  A Price
    0  2006/01/01 00:30:00  8013.27833   ...     5657.67500    20.03
    1  2006/01/01 01:00:00  7726.89167   ...     5460.39500    18.66
    2  2006/01/01 01:30:00  7372.85833   ...     5766.02500    20.38
    3  2006/01/01 02:00:00  7071.83333   ...     5503.25167    18.59
    4  2006/01/01 02:30:00  6865.44000   ...     5214.01500    17.53

numpy

来源：https://stackoverflow.com/questions/54301042/is-it-possible-to-use-pandas-dataframe-rolling-with-a-step-greater-than-1

6条答案

按热度按时间

daupos2t1#

因此，我知道这个问题已经提出很久了，因为我遇到了同样的问题，当处理长时间序列时，您确实希望避免对您不感兴趣的值进行不必要的计算。由于Pandas滚动方法不实现step参数，因此我使用numpy编写了一个解决方案。
它基本上是this link中的解和BENY提出的索引的组合。

def apply_rolling_data(data, col, function, window, step=1, labels=None):
    """Perform a rolling window analysis at the column `col` from `data`

    Given a dataframe `data` with time series, call `function` at
    sections of length `window` at the data of column `col`. Append
    the results to `data` at a new columns with name `label`.

    Parameters
    ----------
    data : DataFrame
        Data to be analyzed, the dataframe must stores time series
        columnwise, i.e., each column represent a time series and each
        row a time index
    col : str
        Name of the column from `data` to be analyzed
    function : callable
        Function to be called to calculate the rolling window
        analysis, the function must receive as input an array or
        pandas series. Its output must be either a number or a pandas
        series
    window : int
        length of the window to perform the analysis
    step : int
        step to take between two consecutive windows
    labels : str
        Name of the column for the output, if None it defaults to
        'MEASURE'. It is only used if `function` outputs a number, if
        it outputs a Series then each index of the series is going to
        be used as the names of their respective columns in the output

    Returns
    -------
    data : DataFrame
        Input dataframe with added columns with the result of the
        analysis performed

    """

    x = _strided_app(data[col].to_numpy(), window, step)
    rolled = np.apply_along_axis(function, 1, x)

    if labels is None:
        labels = [f"metric_{i}" for i in range(rolled.shape[1])]

    for col in labels:
        data[col] = np.nan

    data.loc[
        data.index[
            [False]*(window-1)
            + list(np.arange(len(data) - (window-1)) % step == 0)],
        labels] = rolled

    return data

def _strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    """returns an array that is strided
    """
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S*n, n))

赞(0）回复(0）举报 2022-11-23

oxf4rvwz2#

你可以再次使用滚动，只需要一点点工作与你分配索引
这里by = 2

by = 2

df.loc[df.index[np.arange(len(df))%by==1],'New']=df.Price.rolling(window=4).mean()
df
    Price    New
0      63    NaN
1      92    NaN
2      92    NaN
3       5  63.00
4      90    NaN
5       3  47.50
6      81    NaN
7      98  68.00
8     100    NaN
9      58  84.25
10     38    NaN
11     15  52.75
12     75    NaN
13     19  36.75

赞(0）回复(0）举报 2022-11-23

wrrgggsh3#

如果数据大小不是太大，这里有一个简单的方法：

by = 2
win = 4
start = 3 ## it is the index of your 1st valid value.
df.rolling(win).mean()[start::by] ## calculate all, choose what you need.

赞(0）回复(0）举报 2022-11-23

vuktfyat4#

现在，这对于一维数据数组来说有点大材小用了，但是你可以简化它并提取出你所需要的。由于Pandas可以依赖numpy，你可能想检查一下它们的滚动/跨步功能是如何实现的。20个连续数字的结果。一个7天的窗口，跨步/滑动2

z = np.arange(20)
    z   #array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    s = stride(z, (7,), (2,))

np.mean(s, axis=1)  # array([ 3.,  5.,  7.,  9., 11., 13., 15.])

这里是我使用的代码，没有文档的主要部分。它是从numpy中的strided函数的许多实现中派生出来的，可以在这个网站上找到。有变体和化身，这只是另一个。

def stride(a, win=(3, 3), stepby=(1, 1)):
    """Provide a 2D sliding/moving view of an array.
    There is no edge correction for outputs. Use the `pad_` function first."""
    err = """Array shape, window and/or step size error.
    Use win=(3,) with stepby=(1,) for 1D array
    or win=(3,3) with stepby=(1,1) for 2D array
    or win=(1,3,3) with stepby=(1,1,1) for 3D
    ----    a.ndim != len(win) != len(stepby) ----
    """
    from numpy.lib.stride_tricks import as_strided
    a_ndim = a.ndim
    if isinstance(win, int):
        win = (win,) * a_ndim
    if isinstance(stepby, int):
        stepby = (stepby,) * a_ndim
    assert (a_ndim == len(win)) and (len(win) == len(stepby)), err
    shp = np.array(a.shape)    # array shape (r, c) or (d, r, c)
    win_shp = np.array(win)    # window      (3, 3) or (1, 3, 3)
    ss = np.array(stepby)      # step by     (1, 1) or (1, 1, 1)
    newshape = tuple(((shp - win_shp) // ss) + 1) + tuple(win_shp)
    newstrides = tuple(np.array(a.strides) * ss) + a.strides
    a_s = as_strided(a, shape=newshape, strides=newstrides, subok=True).squeeze()
    return a_s

我没有指出，您可以创建一个输出，并将其作为列追加到panda中。

nans = np.full_like(z, np.nan, dtype='float')  # z is the 20 number sequence
means = np.mean(s, axis=1)   # results from the strided mean
# assign the means to the output array skipping the first and last 3 and striding by 2

nans[3:-3:2] = means        

nans # array([nan, nan, nan,  3., nan,  5., nan,  7., nan,  9., nan, 11., nan, 13., nan, 15., nan, nan, nan, nan])

赞(0）回复(0）举报 2022-11-23

8qgya5xd5#

滚动后使用Pandas.asfreq()

赞(0）回复(0）举报 2022-11-23

wh6knrhe6#

从pands 1.5.0开始，rolling()有一个step参数，应该可以达到这个目的。请参阅：https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

赞(0）回复(0）举报 2022-11-23