Pandas custom groupby shift that skips a horizon

mutmk8jj  posted on 2023-11-15  in Other

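I want a custom groupby shift function that first skips the first n days and then takes lags 1, 2, 3, and so on. It is important to note that some days are missing, and the lags should skip over those missing days.
Here is a sample df: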

import pandas as pd
import numpy as np

# Sample data
data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date': ['2023-01-01', '2023-01-03', '2023-01-04', '2023-02-01', '2023-02-02', '2023-02-05', '2023-02-06', 
             '2023-03-02', '2023-03-04'],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}

horizon = 2

df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

display(df)

Given horizon=2, or in other words, skipping 1 day before the shift operation starts, I would like the output to look like this:


[Image: expected output table]
Here is my failed attempt:

def custom_shift(group, lag):
    values = (group
              .reindex(pd.date_range(start=group.index.min(), end=group.index.max()), fill_value=np.nan)
              .shift(horizon-1)
              .dropna()
              .values
             )
    values = np.insert(values, 0, [np.nan]*(len(group.index) - len(values)))
    return pd.Series(values, index=group.index).shift(lag)

df['value_lag1'] = (df
            .set_index('date')
            .groupby('group')['value']
            .transform(custom_shift, lag=1)
            .reset_index(drop=True)
           )
df['value_lag2'] = (df
            .set_index('date')
            .groupby('group')['value']
            .transform(custom_shift, lag=2)
            .reset_index(drop=True)
           )
display(df)


km0tfn4u  #1

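As I understand it, you need to find the first row in each group where the gap to the previous date is at least two days (horizon = 2) and, from that row onward, fill in the earlier value selected by the lag. I can suggest the following solution: group by 'group' and set the required values.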

import pandas as pd
import numpy as np

dfg = df.groupby('group')

df['day'] = dfg['date'].diff().dt.days
df[['value_lag1', 'value_lag2']] = np.nan

horizon = 2

def f(x, lag, name):
    ind = x[x >= horizon].index
    if len(ind) > 0 and (ind[0] - lag) >= x.index[0]:
        ind = ind[0]
        df.loc[ind:x.index[-1], name] = df.loc[ind - lag, 'value']

dfg['day'].apply(f, 1, 'value_lag1')
dfg['day'].apply(f, 2, 'value_lag2')

Output:

  group       date  value  day  value_lag1  value_lag2
0     A 2023-01-01      1  NaN         NaN         NaN
1     A 2023-01-03      2  2.0         1.0         NaN
2     A 2023-01-04      3  1.0         1.0         NaN
3     B 2023-02-01      4  NaN         NaN         NaN
4     B 2023-02-02      5  1.0         NaN         NaN
5     B 2023-02-05      6  3.0         5.0         4.0
6     B 2023-02-06      7  1.0         5.0         4.0
7     C 2023-03-02      8  NaN         NaN         NaN
8     C 2023-03-04      9  2.0         8.0         NaN

Update

dfg = df.groupby('group')

df['day'] = dfg['date'].diff().dt.days

df.loc[df['day'] >= horizon, 'gr'] = 1
df['gr'] = df['gr'].fillna(0)
df['gr'] = dfg['gr'].cumsum().replace(0, np.nan)

def f(x, lag, name):
    ind = x.index[0] - lag
    if ind >= 0 and df.loc[ind, 'group'] == x.loc[x.index[0], 'group']:
        df.loc[x.index, name] = df.loc[ind, 'value']

for i in range(1, 3):  # up to 2 lags
    df.groupby(['group', 'gr']).apply(f, i, 'value_lag'+str(i))


An improved version of your solution

col = [f'date_diff_{i}' for i in range(1, horizon+1)]
arr = np.arange(1, horizon+1)
dfg = df.groupby('group')

# Create date_diff columns
for i in arr:
    df[col[i-1]] = dfg['date'].diff(i).dt.days

conditions = (df[col] >= horizon).values.T
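# Passing several periods to shift() in one call needs a recent pandas (2.1 or later, as far as I know)
values1 = dfg['value'].shift(arr).values.T
values2 = dfg['value'].shift(arr + 1).values.T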

df['value_lag1'] = np.select(conditions, values1, default=np.nan)
df['value_lag2'] = np.select(conditions, values2, default=np.nan)

df = df.drop(col, axis=1)
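
The version above relies on np.select taking, for each row, the first condition that is true. Here is a minimal sketch of that behaviour on toy arrays (the arrays are made up for illustration and are not part of the question's data):

```python
import numpy as np

# Two candidate conditions per row, checked in order.
conditions = [
    np.array([True, False, False]),
    np.array([True, True, False]),
]
# The value to use when the matching condition is the first one that holds.
choices = [
    np.array([10.0, 11.0, 12.0]),
    np.array([20.0, 21.0, 22.0]),
]

# Row 0: both conditions hold, so the first choice list wins.
# Row 1: only the second condition holds.
# Row 2: nothing holds, so the default is used.
result = np.select(conditions, choices, default=np.nan)
print(result)
```

In the lag code this means that the smallest row shift whose date gap reaches the horizon decides which shifted value is used.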

yhxst69z  #2

Here is my solution. Unfortunately, it is not very efficient.

import pandas as pd
import numpy as np

# Sample data
data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date': ['2023-01-01', '2023-01-03', '2023-01-04', '2023-02-01', '2023-02-02', '2023-02-05', '2023-02-06', 
             '2023-03-02', '2023-03-04'],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}

horizon = 2

df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

display(df)

# Create date_diff columns
for i in range(1, horizon+1):
    df[f'date_diff_{i}'] = df.groupby('group')['date'].diff(i).dt.days

# Construct conditions and values for np.select
conditions = [df[f'date_diff_{i}'] >= horizon for i in range(1, horizon+1)]
values1 = [df.groupby('group')['value'].shift(i) for i in range(1, horizon+1)]
values2 = [df.groupby('group')['value'].shift(i+1) for i in range(1, horizon+1)]

df['value_lag1'] = np.select(conditions, values1, default=np.nan)
df['value_lag2'] = np.select(conditions, values2, default=np.nan)
df = df.drop([f'date_diff_{i}' for i in range(1, horizon+1)], axis=1)
df
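
If the number of lags or the horizon grows, one possible alternative is to look up the lag positions directly with np.searchsorted instead of building one date_diff column per shift. The sketch below is only that, a sketch: it assumes "lag 1" means the latest row in the same group dated at least horizon days before the current row, and "lag k" means k-1 rows before that one. The helper name horizon_lags and its parameters are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def horizon_lags(df, horizon, n_lags, group_col="group", date_col="date", value_col="value"):
    """Hypothetical helper: add value_lag1..value_lag{n_lags} columns.

    Assumption: lag 1 is the latest row in the same group whose date is at
    least `horizon` days before the current row; lag k is k-1 rows earlier.
    """
    out = df.sort_values([group_col, date_col]).copy()
    lag_cols = [f"{value_col}_lag{k}" for k in range(1, n_lags + 1)]
    for col in lag_cols:
        out[col] = np.nan

    for _, g in out.groupby(group_col, sort=False):
        dates = g[date_col].to_numpy()
        values = g[value_col].to_numpy(dtype=float)
        cutoff = dates - np.timedelta64(horizon, "D")
        # Position of the last row dated on or before the cutoff (-1 if none).
        pos = np.searchsorted(dates, cutoff, side="right") - 1
        for k, col in enumerate(lag_cols):
            idx = pos - k
            valid = idx >= 0
            lagged = np.full(len(g), np.nan)
            lagged[valid] = values[idx[valid]]
            out.loc[g.index, col] = lagged

    # Restore the original row order.
    return out.sort_index()


data = {
    "group": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "date": ["2023-01-01", "2023-01-03", "2023-01-04", "2023-02-01", "2023-02-02",
             "2023-02-05", "2023-02-06", "2023-03-02", "2023-03-04"],
    "value": [1, 2, 3, 4, 5, 6, 7, 8, 9],
}
sample = pd.DataFrame(data)
sample["date"] = pd.to_datetime(sample["date"])

result = horizon_lags(sample, horizon=2, n_lags=2)
print(result)
```

On the sample data this should give the same value_lag1 and value_lag2 columns as the output in the first answer. It does not rely on dates being distinct calendar days, but it does assume a unique index and no missing dates within a group.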
