pandas: how to take the difference between the last and first values in a 30-minute window while keeping the other columns?

nuypyhwy · asked 2023-08-01

Parquet: Link to the sample dataset
CSV: Link to the sample dataset

The expected result should contain, for each 30-minute sample window, the difference between the last value and the first value (computed from the sample data above), i.e.:

Item    Value   AgentStatus OriginalTimestamp   AgentTimeStamp
0   Channel1.Device1.Tag1   847 good    2023-07-28T13:09:00.0098328+09:00   2023-07-28T13:09:00.0000000
1   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:00.0408696+09:00   2023-07-28T13:09:00.0000000
2   Channel1.Device1.Tag1   848 good    2023-07-28T13:09:05.0138770+09:00   2023-07-28T13:09:05.0000000
3   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:05.0454734+09:00   2023-07-28T13:09:05.0000000
4   Channel1.Device1.Tag1   849 good    2023-07-28T13:09:10.0073605+09:00   2023-07-28T13:09:10.0000000
5   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:10.0379516+09:00   2023-07-28T13:09:10.0000000
6   Channel1.Device1.Tag1   850 good    2023-07-28T13:09:15.0074263+09:00   2023-07-28T13:09:15.0000000
7   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:15.0387691+09:00   2023-07-28T13:09:15.0000000
8   Channel1.Device1.Tag1   851 good    2023-07-28T13:09:20.0176840+09:00   2023-07-28T13:09:20.0000000
9   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:20.0329268+09:00   2023-07-28T13:09:20.0000000
10  Channel1.Device1.Tag1   852 good    2023-07-28T13:09:25.0070191+09:00   2023-07-28T13:09:25.0000000
11  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:25.0384699+09:00   2023-07-28T13:09:25.0000000
12  Channel1.Device1.Tag1   853 good    2023-07-28T13:09:30.0109244+09:00   2023-07-28T13:09:30.0000000
13  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:30.0417249+09:00   2023-07-28T13:09:30.0000000
14  Channel1.Device1.Tag1   854 good    2023-07-28T13:09:35.0118763+09:00   2023-07-28T13:09:35.0000000
15  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:35.0429050+09:00   2023-07-28T13:09:35.0000000
16  Channel1.Device1.Tag1   855 good    2023-07-28T13:09:40.0027594+09:00   2023-07-28T13:09:40.0000000
17  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:40.0340476+09:00   2023-07-28T13:09:40.0000000
18  Channel1.Device1.Tag1   856 good    2023-07-28T13:09:45.0029277+09:00   2023-07-28T13:09:45.0000000
19  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:45.0336946+09:00   2023-07-28T13:09:45.0000000
20  Channel1.Device1.Tag1   857 good    2023-07-28T13:09:50.0153041+09:00   2023-07-28T13:09:50.0000000
21  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:50.0459796+09:00   2023-07-28T13:09:50.0000000
22  Channel1.Device1.Tag1   858 good    2023-07-28T13:09:55.0103680+09:00   2023-07-28T13:09:55.0000000
23  Channel1.Device1.Tag2   0   good    2023-07-28T13:09:55.0412343+09:00   2023-07-28T13:09:55.0000000
24  Channel1.Device1.Tag1   859 good    2023-07-28T13:10:00.0095407+09:00   2023-07-28T13:10:00.0000000
25  Channel1.Device1.Tag2   0   good    2023-07-28T13:10:00.0395870+09:00   2023-07-28T13:10:00.0000000
26  Channel1.Device1.Tag1   860 good    2023-07-28T13:10:05.0069727+09:00   2023-07-28T13:10:05.0000000
27  Channel1.Device1.Tag2   0   good    2023-07-28T13:10:05.0374699+09:00   2023-07-28T13:10:05.0000000
28  Channel1.Device1.Tag1   861 good    2023-07-28T13:10:10.0113827+09:00   2023-07-28T13:10:10.0000000
29  Channel1.Device1.Tag2   0   good    2023-07-28T13:10:10.0431140+09:00   2023-07-28T13:10:10.0000000
30  Channel1.Device1.Tag1   862 good    2023-07-28T13:10:15.0024582+09:00   2023-07-28T13:10:15.0000000
31  Channel1.Device1.Tag2   0   good    2023-07-28T13:10:15.0338704+09:00   2023-07-28T13:10:15.0000000
32  Channel1.Device1.Tag2   0   good    2023-07-28T13:11:15.0338704+09:01   2023-07-28T13:11:15.0000000
33  Channel1.Device1.Tag2   0   good    2023-07-28T13:12:15.0338704+09:02   2023-07-28T13:12:15.0000000
34  Channel1.Device1.Tag2   0   good    2023-07-28T13:15:15.0338704+09:03   2023-07-28T13:15:15.0000000
35  Channel1.Device1.Tag2   0   good    2023-07-28T13:20:15.0338704+09:04   2023-07-28T13:20:15.0000000
36  Channel1.Device1.Tag2   0   good    2023-07-28T13:21:15.0338704+09:05   2023-07-28T13:21:15.0000000
37  Channel1.Device1.Tag2   0   good    2023-07-28T13:22:15.0338704+09:06   2023-07-28T13:22:15.0000000
38  Channel1.Device1.Tag2   0   good    2023-07-28T13:25:15.0338704+09:07   2023-07-28T13:25:15.0000000
39  Channel1.Device1.Tag2   878 bad 2023-07-28T13:30:15.0338704+09:08   2023-07-28T13:30:15.0000000
40  Channel1.Device1.Tag2   0   good    2023-07-28T13:31:15.0338704+09:09   2023-07-28T13:31:15.0000000

Expected result / calculation:

After resampling into 30-minute windows, the result will contain data for the following two items:

  • For each item, for example Channel1.Device1.Tag2: the last Value minus the first Value of the 30-minute window, i.e.:

Last row value (878):

39  Channel1.Device1.Tag2   878 bad 2023-07-28T13:30:15.0338704+09:08   2023-07-28T13:30:15.0000000


First row value (0):

1   Channel1.Device1.Tag2   0   good    2023-07-28T13:09:00.0408696+09:00   2023-07-28T13:09:00.0000000

Expected result / DataFrame:

As can be seen, the values 878 and 3 correspond to Channel1.Device1.Tag2 and Channel1.Device1.Tag1: the difference (last value - first value) is taken, (878 - 0) and (850 - 847), while the other column values are kept.

What I have achieved so far:

I have got as far as retrieving the result, but I cannot merge or keep the other columns after the groupby/apply, and I cannot filter on the item (here Item = Channel1.Device1.Tag2, and so on).
Code:

df = df[['Item', 'Value', 'AgentStatus', 'AgentTimeStamp']].reset_index()
df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])
# First row of every 30-minute bin (drops the per-item grouping)
df = df.groupby(pd.Grouper(key='AgentTimeStamp', freq='30min')).first().fillna(0).reset_index()
# Last minus first Value, but grouped per calendar day and losing the other columns
value_from_30_min_window_per_day = df.groupby(df['AgentTimeStamp'].dt.day)['Value'].apply(lambda x: x.values[-1] - x.values[0])
value_from_30_min_window_per_day
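
For reference, a minimal sketch of one way to express the target computation, assuming the column names from the sample above: group by Item plus a 30-minute Grouper and use named aggregation so that the other columns survive (which row of each window supplies AgentStatus and OriginalTimestamp is an assumption here; this sketch keeps the last one):

import pandas as pd

# Sketch only: last-minus-first Value per Item per 30-minute window,
# keeping the remaining columns via named aggregation.
df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])
expected = (
    df.groupby(['Item', pd.Grouper(key='AgentTimeStamp', freq='30min')])
      .agg(Value=('Value', lambda s: s.iloc[-1] - s.iloc[0]),
           AgentStatus=('AgentStatus', 'last'),
           OriginalTimestamp=('OriginalTimestamp', 'last'))
      .reset_index()
)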

lskq00tm 1#

It is not entirely clear what you want; perhaps you can combine groupby with resample:

# Group by Item, then resample each group into 30-minute bins;
# origin='start' anchors the bins at each group's first timestamp and
# label='right' labels every bin with its right edge.
r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')

# Take the last row of each bin, replace Value with last - first,
# and restore the original column order.
out = (r.last()
        .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
        .reset_index('AgentTimeStamp')[df.columns]
      )

Output:

                       index                   Item  Value AgentStatus      AgentTimeStamp
Item
Channel1.Device1.Tag1      2  Channel1.Device1.Tag1      1        good 2023-07-28 13:39:00
Channel1.Device1.Tag2     39  Channel1.Device1.Tag2    878         bad 2023-07-28 13:39:00
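
Filtering on a single item, the other part of the question, is then an ordinary boolean mask on the Item column, e.g.:

out[out['Item'] == 'Channel1.Device1.Tag2']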

shstlldc 2#

Assuming you want the 30-minute windowed difference between observations for each item, you can achieve it with a self-join. This function worked for my case, so I hope it helps:

import pandas as pd

def get_lagged_df(df: pd.DataFrame,
                  id_col: str = 'id',
                  time_col: str = 'time',
                  val_col: str = 'val',
                  display_na_rows: bool = True) -> pd.DataFrame:

    # Self-join based on the ID
    joined_df = pd.merge(df, df, on=id_col, suffixes=('_end', '_start'))

    # Keep only the rows where the end time is later than the start time
    joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]

    # Difference between end time and start time in minutes.
    # Note: Timedelta.seconds ignores the days component, so pairs that are
    # a whole number of days plus 30 minutes apart also match (see the last
    # row of the example output below); total_seconds() would avoid that.
    lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end'])
                                   - pd.to_datetime(row[f'{time_col}_start'])).seconds / 60)

    # Get the time differences in minutes between end time and start time
    joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis=1)

    # Keep only the rows where the time difference is 30 minutes -> 30 min. window
    joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]

    # Difference between the last and the first observation of the 30 min. window
    joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']

    # Append N/A rows (i.e., rows without a start time, hence no time or value differences)
    if display_na_rows:
        df_w_na_rows = df.rename(columns={f'{val_col}': f'{val_col}_end',
                                          f'{time_col}': f'{time_col}_end'})
        joined_df = pd.merge(df_w_na_rows,
                             joined_df.drop(f'{val_col}_end', axis=1),
                             on=[id_col, f'{time_col}_end'],
                             how='left')

    # Adjust the column and row order
    joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
                            f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
                 .sort_values(by=[id_col, f'{time_col}_end'])
                 .reset_index(drop=True))

    return joined_df

Assume you have data in the following format:
| val | time |
| --- | --- |
| 0.638129 | 2023-01-01 00:00:00 |
| 0.047401 | 2023-01-01 00:01:00 |
| 0.781685 | 2023-01-01 00:02:00 |
| 0.839184 | 2023-01-01 00:03:00 |
| 0.526894 | 2023-01-01 00:04:00 |
| ... | ... |
| 0.370669 | 2023-01-01 00:56:00 |
| 0.565692 | 2023-01-01 00:57:00 |
| 0.121998 | 2023-01-01 00:58:00 |
| 0.872352 | 2023-01-01 00:59:00 |
| 0.624171 | 2023-01-02 01:00:00 |
where the values for ID 1 at timestamps 00:00 and 00:30 are the following:
| val | time |
| --- | --- |
| 0.638129 | 2023-01-01 00:00:00 |
| 0.184684 | 2023-01-01 00:30:00 |
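
A toy frame of this shape could be built along the following lines (purely illustrative; the id column, the random seed, and the one-value-per-minute layout are assumptions, and the actual values in these tables come from the answerer's own random data):

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the demo data: a single id with one
# random value per minute over roughly the hour shown above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': 1,
    'val': rng.random(61),
    'time': pd.date_range('2023-01-01 00:00:00', periods=61, freq='min'),
})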
When the function is applied without displaying the rows with missing values (i.e., rows that have no start time and hence no time or value differences), we get, for each ID, the difference between the observation at time $t$ and the one at time $t-30$:

get_lagged_df(df, display_na_rows=False)


| time_start | time_end | time_difference | val_start | val_end | val_difference |
| --- | --- | --- | --- | --- | --- |
| 2023-01-01 00:00:00 | 2023-01-01 00:30:00 | 30 | 0.638129 | 0.184684 | -0.453445 |
| 2023-01-01 00:01:00 | 2023-01-01 00:31:00 | 30 | 0.047401 | 0.613936 | 0.566535 |
| 2023-01-01 00:02:00 | 2023-01-01 00:32:00 | 30 | 0.781685 | 0.865400 | 0.083714 |
| 2023-01-01 00:03:00 | 2023-01-01 00:33:00 | 30 | 0.839184 | 0.089480 | -0.749704 |
| 2023-01-01 00:04:00 | 2023-01-01 00:34:00 | 30 | 0.526894 | 0.541837 | 0.014943 |
| ... | ... | ... | ... | ... | ... |
| 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30 | 0.812442 | 0.370669 | -0.441773 |
| 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30 | 0.352933 | 0.565692 | 0.212759 |
| 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30 | 0.186111 | 0.121998 | -0.064113 |
| 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30 | 0.671967 | 0.872352 | 0.200385 |
| 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30 | 0.008082 | 0.624171 | 0.616090 |
If we want to keep the rows that have no start time, we simply change the display_na_rows parameter to True:

get_lagged_df(df, display_na_rows=True)


| time_start | time_end | time_difference | val_start | val_end | val_difference |
| --- | --- | --- | --- | --- | --- |
| NaT | 2023-01-01 00:00:00 | NaN | NaN | 0.638129 | NaN |
| NaT | 2023-01-01 00:01:00 | NaN | NaN | 0.047401 | NaN |
| NaT | 2023-01-01 00:02:00 | NaN | NaN | 0.781685 | NaN |
| NaT | 2023-01-01 00:03:00 | NaN | NaN | 0.839184 | NaN |
| NaT | 2023-01-01 00:04:00 | NaN | NaN | 0.526894 | NaN |
| ... | ... | ... | ... | ... | ... |
| 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30.0 | 0.812442 | 0.370669 | -0.441773 |
| 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30.0 | 0.352933 | 0.565692 | 0.212759 |
| 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30.0 | 0.186111 | 0.121998 | -0.064113 |
| 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30.0 | 0.671967 | 0.872352 | 0.200385 |
| 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30.0 | 0.008082 | 0.624171 | 0.616090 |
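
Note the last row: time_end falls on 2023-01-02, yet time_difference is still 30. This is because Timedelta.seconds returns only the sub-day seconds component, so pairs that are a whole number of days plus 30 minutes apart also match. A quick illustration (not part of the answer's code):

import pandas as pd

delta = pd.Timestamp('2023-01-02 01:00:00') - pd.Timestamp('2023-01-01 00:30:00')
print(delta)                        # 1 days 00:30:00
print(delta.seconds / 60)           # 30.0  (the days component is dropped)
print(delta.total_seconds() / 60)   # 1470.0 (full elapsed minutes)

Using total_seconds() in the function instead would exclude such cross-day pairs.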
