How do I take the difference between the last and the first value of a 30-minute window in pandas, while keeping the other columns?

nuypyhwy · published 2023-08-01 · Answers (2)

Parquet: Link to the sample dataset
CSV: Link to the sample dataset
The expected result should hold the difference between the last and the first value of each 30-minute sample window (computed from the sample data above), i.e.:

    Item Value AgentStatus OriginalTimestamp AgentTimeStamp
    0 Channel1.Device1.Tag1 847 good 2023-07-28T13:09:00.0098328+09:00 2023-07-28T13:09:00.0000000
    1 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:00.0408696+09:00 2023-07-28T13:09:00.0000000
    2 Channel1.Device1.Tag1 848 good 2023-07-28T13:09:05.0138770+09:00 2023-07-28T13:09:05.0000000
    3 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:05.0454734+09:00 2023-07-28T13:09:05.0000000
    4 Channel1.Device1.Tag1 849 good 2023-07-28T13:09:10.0073605+09:00 2023-07-28T13:09:10.0000000
    5 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:10.0379516+09:00 2023-07-28T13:09:10.0000000
    6 Channel1.Device1.Tag1 850 good 2023-07-28T13:09:15.0074263+09:00 2023-07-28T13:09:15.0000000
    7 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:15.0387691+09:00 2023-07-28T13:09:15.0000000
    8 Channel1.Device1.Tag1 851 good 2023-07-28T13:09:20.0176840+09:00 2023-07-28T13:09:20.0000000
    9 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:20.0329268+09:00 2023-07-28T13:09:20.0000000
    10 Channel1.Device1.Tag1 852 good 2023-07-28T13:09:25.0070191+09:00 2023-07-28T13:09:25.0000000
    11 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:25.0384699+09:00 2023-07-28T13:09:25.0000000
    12 Channel1.Device1.Tag1 853 good 2023-07-28T13:09:30.0109244+09:00 2023-07-28T13:09:30.0000000
    13 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:30.0417249+09:00 2023-07-28T13:09:30.0000000
    14 Channel1.Device1.Tag1 854 good 2023-07-28T13:09:35.0118763+09:00 2023-07-28T13:09:35.0000000
    15 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:35.0429050+09:00 2023-07-28T13:09:35.0000000
    16 Channel1.Device1.Tag1 855 good 2023-07-28T13:09:40.0027594+09:00 2023-07-28T13:09:40.0000000
    17 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:40.0340476+09:00 2023-07-28T13:09:40.0000000
    18 Channel1.Device1.Tag1 856 good 2023-07-28T13:09:45.0029277+09:00 2023-07-28T13:09:45.0000000
    19 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:45.0336946+09:00 2023-07-28T13:09:45.0000000
    20 Channel1.Device1.Tag1 857 good 2023-07-28T13:09:50.0153041+09:00 2023-07-28T13:09:50.0000000
    21 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:50.0459796+09:00 2023-07-28T13:09:50.0000000
    22 Channel1.Device1.Tag1 858 good 2023-07-28T13:09:55.0103680+09:00 2023-07-28T13:09:55.0000000
    23 Channel1.Device1.Tag2 0 good 2023-07-28T13:09:55.0412343+09:00 2023-07-28T13:09:55.0000000
    24 Channel1.Device1.Tag1 859 good 2023-07-28T13:10:00.0095407+09:00 2023-07-28T13:10:00.0000000
    25 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:00.0395870+09:00 2023-07-28T13:10:00.0000000
    26 Channel1.Device1.Tag1 860 good 2023-07-28T13:10:05.0069727+09:00 2023-07-28T13:10:05.0000000
    27 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:05.0374699+09:00 2023-07-28T13:10:05.0000000
    28 Channel1.Device1.Tag1 861 good 2023-07-28T13:10:10.0113827+09:00 2023-07-28T13:10:10.0000000
    29 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:10.0431140+09:00 2023-07-28T13:10:10.0000000
    30 Channel1.Device1.Tag1 862 good 2023-07-28T13:10:15.0024582+09:00 2023-07-28T13:10:15.0000000
    31 Channel1.Device1.Tag2 0 good 2023-07-28T13:10:15.0338704+09:00 2023-07-28T13:10:15.0000000
    32 Channel1.Device1.Tag2 0 good 2023-07-28T13:11:15.0338704+09:01 2023-07-28T13:11:15.0000000
    33 Channel1.Device1.Tag2 0 good 2023-07-28T13:12:15.0338704+09:02 2023-07-28T13:12:15.0000000
    34 Channel1.Device1.Tag2 0 good 2023-07-28T13:15:15.0338704+09:03 2023-07-28T13:15:15.0000000
    35 Channel1.Device1.Tag2 0 good 2023-07-28T13:20:15.0338704+09:04 2023-07-28T13:20:15.0000000
    36 Channel1.Device1.Tag2 0 good 2023-07-28T13:21:15.0338704+09:05 2023-07-28T13:21:15.0000000
    37 Channel1.Device1.Tag2 0 good 2023-07-28T13:22:15.0338704+09:06 2023-07-28T13:22:15.0000000
    38 Channel1.Device1.Tag2 0 good 2023-07-28T13:25:15.0338704+09:07 2023-07-28T13:25:15.0000000
    39 Channel1.Device1.Tag2 878 bad 2023-07-28T13:30:15.0338704+09:08 2023-07-28T13:30:15.0000000
    40 Channel1.Device1.Tag2 0 good 2023-07-28T13:31:15.0338704+09:09 2023-07-28T13:31:15.0000000

Expected result / calculation:
After resampling into 30-minute windows, the result will contain data for the two items:

  • For each item, e.g. for Channel1.Device1.Tag2: the last Value minus the first Value of the 30-minute window, i.e.:

Last row value (878):

    39 Channel1.Device1.Tag2 878 bad 2023-07-28T13:30:15.0338704+09:08 2023-07-28T13:30:15.0000000


First row value (0):

Expected result / DataFrame (screenshot not reproduced here):
As can be seen, the values 878 and 3 correspond to Channel1.Device1.Tag2 and Channel1.Device1.Tag1 respectively, taking the difference (last value - first value): (878 - 0) and (850 - 847), while keeping the other column values.

What I have achieved so far:

I have got as far as retrieving the results, but I cannot merge or keep the other columns after the groupby/apply step, nor filter by item (here Channel1.Device1.Tag2, etc.).
Code:

    df = df[['Item', 'Value', 'AgentStatus', 'AgentTimeStamp']].reset_index()
    df
    df['AgentTimeStamp'] = pd.to_datetime(df['AgentTimeStamp'])
    df = df.groupby(pd.Grouper(key='AgentTimeStamp', freq='30min')).first().fillna(0).reset_index()
    value_from_30_min_window_per_day = df.groupby(df['AgentTimeStamp'].dt.day)['Value'].apply(lambda x: x.values[-1]-x.values[0])
    value_from_30_min_window_per_day
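The computation asked for above can be sketched with a `groupby` over the item plus a 30-minute `pd.Grouper`: aggregate `Value` as first/last per bin, subtract, and carry the other columns with `'last'`. This is only a sketch on a reduced, illustrative subset of the sample data, not the accepted solution:

```python
import pandas as pd

# Reduced, illustrative subset of the sample data above.
df = pd.DataFrame({
    'Item': ['Channel1.Device1.Tag1', 'Channel1.Device1.Tag2',
             'Channel1.Device1.Tag1', 'Channel1.Device1.Tag2'],
    'Value': [847, 0, 850, 878],
    'AgentStatus': ['good', 'good', 'good', 'bad'],
    'AgentTimeStamp': pd.to_datetime([
        '2023-07-28 13:09:00', '2023-07-28 13:09:00',
        '2023-07-28 13:09:15', '2023-07-28 13:30:15']),
})

# Per item and per 30-minute bin: first and last Value plus the other
# columns from the last row; empty bins (all-NaN) are dropped.
g = df.groupby(['Item', pd.Grouper(key='AgentTimeStamp', freq='30min')])
out = (g.agg(first_value=('Value', 'first'),
             last_value=('Value', 'last'),
             AgentStatus=('AgentStatus', 'last'))
        .dropna(subset=['last_value'])
        .reset_index())
out['Value'] = out['last_value'] - out['first_value']
print(out)
```

With this subset, the Tag1 row in the 13:00 bin gets `Value == 850 - 847 == 3` while keeping `AgentStatus` from the last row of the bin.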

lskq00tm (answer 1)

It is not entirely clear what you want; perhaps you can combine `groupby` with `resample`:

    r = df.groupby('Item').resample('30T', on='AgentTimeStamp', origin='start', label='right')
    out = (r.last()
           .assign(Value=lambda d: d[['Value']].sub(r[['Value']].first()))
           .reset_index('AgentTimeStamp')[df.columns]
          )

Output:

    index Item Value AgentStatus AgentTimeStamp
    Item
    Channel1.Device1.Tag1 2 Channel1.Device1.Tag1 1 good 2023-07-28 13:39:00
    Channel1.Device1.Tag2 39 Channel1.Device1.Tag2 878 bad 2023-07-28 13:39:00
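A side note on the two parameters used in this answer: `origin='start'` anchors the resample bins at the first timestamp instead of a clock boundary, and `label='right'` labels each bin by its right edge. A minimal sketch of that behavior on a synthetic series (not the question's data):

```python
import pandas as pd

# Synthetic series starting at 00:10, so the bins do not fall on :00/:30.
s = pd.Series([1, 2, 3, 4],
              index=pd.to_datetime(['2023-01-01 00:10', '2023-01-01 00:20',
                                    '2023-01-01 00:40', '2023-01-01 00:50']))

# Bins are anchored at 00:10 (the first timestamp), not at 00:00,
# and each bin is labelled by its right edge (00:40 and 01:10).
r = s.resample('30min', origin='start', label='right')
diff = r.last() - r.first()
print(diff)
```

Each 30-minute bin here holds two values, so last-minus-first is 1 in both bins; the point is only where the bin edges and labels land.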

shstlldc (answer 2)

Assuming you want the 30-minute-window difference of the observations for each item, you can achieve it with a self join. This function works for my case, so I hope it helps:

    import pandas as pd

    def get_lagged_df(df: pd.DataFrame,
                      id_col: str = 'id',
                      time_col: str = 'time',
                      val_col: str = 'val',
                      display_na_rows: bool = True) -> pd.DataFrame:
        # Self join based on the ID
        joined_df = pd.merge(df, df, on=id_col, suffixes=('_end', '_start'))
        # Keep only the rows where the end time is after the start time
        joined_df = joined_df[joined_df[f'{time_col}_end'] > joined_df[f'{time_col}_start']]
        # Difference between end time and start time in minutes
        lambda_func = lambda row: int((pd.to_datetime(row[f'{time_col}_end'])
                                       - pd.to_datetime(row[f'{time_col}_start'])).seconds / 60)
        joined_df[f'{time_col}_difference'] = joined_df.apply(lambda_func, axis=1)
        # Keep only the rows where the time difference is 30 minutes -> 30 min. window
        joined_df = joined_df[joined_df[f'{time_col}_difference'] == 30]
        # Difference between the first and the last observation of the 30 min. window
        joined_df[f'{val_col}_difference'] = joined_df[f'{val_col}_end'] - joined_df[f'{val_col}_start']
        # Append N/A rows (i.e., rows without any start time, hence no time and value differences)
        if display_na_rows:
            df_w_na_rows = df.rename(columns={f'{val_col}': f'{val_col}_end',
                                              f'{time_col}': f'{time_col}_end'})
            joined_df = pd.merge(df_w_na_rows,
                                 joined_df.drop(f'{val_col}_end', axis=1),
                                 on=[id_col, f'{time_col}_end'],
                                 how='left')
        # Adjust column and row order
        joined_df = (joined_df[[id_col, f'{time_col}_start', f'{time_col}_end', f'{time_col}_difference',
                                f'{val_col}_start', f'{val_col}_end', f'{val_col}_difference']]
                     .sort_values(by=[id_col, f'{time_col}_end'])
                     .reset_index(drop=True))
        return joined_df
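The core of the self-join idea can be sanity-checked in isolation by inlining its three steps (merge on the id, keep pairs exactly 30 minutes apart, subtract) on a small deterministic frame; the `id`/`time`/`val` names below simply mirror the function's default parameters:

```python
import pandas as pd

# Small deterministic frame: one id, a reading every 15 minutes.
df = pd.DataFrame({
    'id':   [1, 1, 1, 1],
    'time': pd.to_datetime(['2023-01-01 00:00', '2023-01-01 00:15',
                            '2023-01-01 00:30', '2023-01-01 00:45']),
    'val':  [10.0, 12.0, 17.0, 11.0],
})

# Self join on id, keep pairs exactly 30 minutes apart, subtract.
j = pd.merge(df, df, on='id', suffixes=('_end', '_start'))
j = j[j['time_end'] - j['time_start'] == pd.Timedelta(minutes=30)]
j['val_difference'] = j['val_end'] - j['val_start']
out = j[['id', 'time_start', 'time_end', 'val_difference']].reset_index(drop=True)
print(out)
```

Two pairs survive the filter: (00:00 → 00:30) with difference 17.0 − 10.0 = 7.0, and (00:15 → 00:45) with difference 11.0 − 12.0 = −1.0.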

Assume you have data in the following format:
| val | time |
| --- | ---- |
| 0.638129 | 2023-01-01 00:00:00 |
| 0.047401 | 2023-01-01 00:01:00 |
| 0.781685 | 2023-01-01 00:02:00 |
| 0.839184 | 2023-01-01 00:03:00 |
| 0.526894 | 2023-01-01 00:04:00 |
| ... | ... |
| 0.370669 | 2023-01-01 00:56:00 |
| 0.565692 | 2023-01-01 00:57:00 |
| 0.121998 | 2023-01-01 00:58:00 |
| 0.872352 | 2023-01-01 00:59:00 |
| 0.624171 | 2023-01-02 01:00:00 |
where the values for ID 1 at timestamps 00:00 and 00:30 are:
| val | time |
| --- | ---- |
| 0.638129 | 2023-01-01 00:00:00 |
| 0.184684 | 2023-01-01 00:30:00 |
When the function is applied without displaying the missing-value rows (i.e., rows that have no start time and therefore no differences), we get, for each ID, the difference between the observation at time $t$ and at time $t-30$:

    get_lagged_df(df, display_na_rows = False)


| time_start | time_end | time_difference | val_start | val_end | val_difference |
| --- | --- | --- | --- | --- | --- |
| 2023-01-01 00:00:00 | 2023-01-01 00:30:00 | 30 | 0.638129 | 0.184684 | -0.453445 |
| 2023-01-01 00:01:00 | 2023-01-01 00:31:00 | 30 | 0.047401 | 0.613936 | 0.566535 |
| 2023-01-01 00:02:00 | 2023-01-01 00:32:00 | 30 | 0.781685 | 0.865400 | 0.083714 |
| 2023-01-01 00:03:00 | 2023-01-01 00:33:00 | 30 | 0.839184 | 0.089480 | -0.749704 |
| 2023-01-01 00:04:00 | 2023-01-01 00:34:00 | 30 | 0.526894 | 0.541837 | 0.014943 |
| ... | ... | ... | ... | ... | ... |
| 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30 | 0.812442 | 0.370669 | -0.441773 |
| 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30 | 0.352933 | 0.565692 | 0.212759 |
| 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30 | 0.186111 | 0.121998 | -0.064113 |
| 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30 | 0.671967 | 0.872352 | 0.200385 |
| 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30 | 0.008082 | 0.624171 | 0.616090 |
If we want to keep the rows that have no start time, simply change the display_na_rows parameter to True:

    get_lagged_df(df, display_na_rows = True)


| time_start | time_end | time_difference | val_start | val_end | val_difference |
| --- | --- | --- | --- | --- | --- |
| NaT | 2023-01-01 00:00:00 | NaN | NaN | 0.638129 | NaN |
| NaT | 2023-01-01 00:01:00 | NaN | NaN | 0.047401 | NaN |
| NaT | 2023-01-01 00:02:00 | NaN | NaN | 0.781685 | NaN |
| NaT | 2023-01-01 00:03:00 | NaN | NaN | 0.839184 | NaN |
| NaT | 2023-01-01 00:04:00 | NaN | NaN | 0.526894 | NaN |
| ... | ... | ... | ... | ... | ... |
| 2023-01-01 00:26:00 | 2023-01-01 00:56:00 | 30.0 | 0.812442 | 0.370669 | -0.441773 |
| 2023-01-01 00:27:00 | 2023-01-01 00:57:00 | 30.0 | 0.352933 | 0.565692 | 0.212759 |
| 2023-01-01 00:28:00 | 2023-01-01 00:58:00 | 30.0 | 0.186111 | 0.121998 | -0.064113 |
| 2023-01-01 00:29:00 | 2023-01-01 00:59:00 | 30.0 | 0.671967 | 0.872352 | 0.200385 |
| 2023-01-01 00:30:00 | 2023-01-02 01:00:00 | 30.0 | 0.008082 | 0.624171 | 0.616090 |

