pandas: get the rows whose status changed to `[4,5,6,7,8]` => COMPLETED state [closed]

Posted by vpfxa7rd on 2023-10-14 in Other

Closed. This question needs details or clarity. It is not currently accepting answers.
Sample data:
| created_at | updated_at | id | event_id | status_id |
| --|--|--|--|--|
| 2022-11-18 9:46:21| 2022-11-18 9:46:21| 1 | 76 | 1 |
| 2022-11-18 9:46:22| 2022-11-18 9:46:22| 2 | 78 | 4 |
| 2022-11-18 9:46:22| 2022-11-18 9:56:22| 3 | 78 | 5 |
| 2022-11-18 9:46:23| 2022-11-18 9:46:23| 4 | 80 | 4 |
| 2022-11-18 9:46:23| 2022-11-18 9:46:23| 5 | 81 | 4 |
| 2022-11-18 9:46:24| 2022-11-18 9:46:24| 6 | 82 | 6 |
| 2022-11-18 9:46:24| 2022-11-18 9:46:24| 7 | 83 | 7 |
| 2022-11-18 9:47:57| 2022-11-18 9:47:57| 8 | 00 | 2 |
| 2022-11-18 9:47:57| 2022-11-18 9:47:57| 9 | 01 | 4 |
| 2022-11-18 9:47:58| 2022-11-18 9:47:58| 10 | 03 | 5 |
| 2023-03-24 21:22:34| 2023-03-24 21:22:34| 11 | 48 | 6 |
| 2023-03-24 21:22:34| 2023-03-24 21:22:35| 12 | 48 | 8 |
| 2023-03-24 21:22:34| 2023-03-24 21:22:37| 13 | 48 | 7 |
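Problem: status_id in [4,5,6,7,8] = COMPLETED state.
I want the number of events that went to the COMPLETED state, grouped by date.
However, if you look at events 78 and 48, they changed from 4->5 and 6->8->7 respectively. Since these are all transitions from one completed state to another, I want each of them to be counted as a single event.
Expected result:
| updated_date | count |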
| --|--|
| 2022-11-18| 7 |
| 2023-03-24| 1 |

  • Can this be done in SQL?
  • If not in SQL, and I somehow extract the table above as a df, can I do this with pandas/DataFrames/Spark?

8dtrkrch1#

Here's a pandas approach:

import pandas as pd

# convert to datetime if the column is not already of that type
df['updated_at'] = pd.to_datetime(df['updated_at'])

# we want the date of the event, so drop the time part
df['updated_date'] = df['updated_at'].dt.normalize()

# keep the last row per event and date (this relies on the existing row order),
# then keep that row only if its status is one of the completed states
complete_states = [4, 5, 6, 7, 8]
complete_events = (df.drop_duplicates(['event_id', 'updated_date'], keep='last')
                     .query('status_id in @complete_states'))

Then complete_events is:

    created_at           updated_at           id  event_id  status_id  updated_date
2   2022-11-18 9:46:22   2022-11-18 09:56:22   3        78          5  2022-11-18
3   2022-11-18 9:46:23   2022-11-18 09:46:23   4        80          4  2022-11-18
4   2022-11-18 9:46:23   2022-11-18 09:46:23   5        81          4  2022-11-18
5   2022-11-18 9:46:24   2022-11-18 09:46:24   6        82          6  2022-11-18
6   2022-11-18 9:46:24   2022-11-18 09:46:24   7        83          7  2022-11-18
8   2022-11-18 9:47:57   2022-11-18 09:47:57   9         1          4  2022-11-18
9   2022-11-18 9:47:58   2022-11-18 09:47:58  10         3          5  2022-11-18
12  2023-03-24 21:22:34  2023-03-24 21:22:37  13        48          7  2023-03-24

The number of such events is simply len(complete_events), which is 8 in this case.
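For anyone who wants to reproduce that count, here is a minimal self-contained sketch that rebuilds the question's sample rows and applies the same idea. A few things in it are assumptions rather than part of the original answer: the hours are zero-padded so the timestamps parse unambiguously, `event_id` is kept as a string so that `00`, `01` and `03` survive as written, an explicit sort by `updated_at` is added so that `keep="last"` does not depend on the input row order, and a plain boolean mask is used in place of `query`.

```python
import pandas as pd

# Sample rows from the question: (updated_at, event_id, status_id).
# Assumption: hours are zero-padded here; event_id is stored as a string.
rows = [
    ("2022-11-18 09:46:21", "76", 1),
    ("2022-11-18 09:46:22", "78", 4),
    ("2022-11-18 09:56:22", "78", 5),
    ("2022-11-18 09:46:23", "80", 4),
    ("2022-11-18 09:46:23", "81", 4),
    ("2022-11-18 09:46:24", "82", 6),
    ("2022-11-18 09:46:24", "83", 7),
    ("2022-11-18 09:47:57", "00", 2),
    ("2022-11-18 09:47:57", "01", 4),
    ("2022-11-18 09:47:58", "03", 5),
    ("2023-03-24 21:22:34", "48", 6),
    ("2023-03-24 21:22:35", "48", 8),
    ("2023-03-24 21:22:37", "48", 7),
]
df = pd.DataFrame(rows, columns=["updated_at", "event_id", "status_id"])
df["updated_at"] = pd.to_datetime(df["updated_at"])
df["updated_date"] = df["updated_at"].dt.normalize()

complete_states = [4, 5, 6, 7, 8]

# Latest row per (event, day); the stable sort makes keep="last" independent
# of the order in which the rows arrived.
latest = (
    df.sort_values("updated_at", kind="stable")
      .drop_duplicates(["event_id", "updated_date"], keep="last")
)

# Keep an event only if its latest state that day is a completed one.
complete_events = latest[latest["status_id"].isin(complete_states)]

print(len(complete_events))
print(sorted(complete_events["event_id"]))
```

Events 78 and 48 each appear once in `complete_events`, because only their latest row per day is kept before the status filter is applied.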

Update: to get the number of completed events per date, you can now do the following:

complete_events['updated_date'].value_counts()
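On the SQL part of the question, counting distinct events per date looks like it should be enough. The following is only a sketch under stated assumptions: the table name `event_status` is hypothetical, SQLite is used purely because it ships with Python, and the date is taken as the first 10 characters of `updated_at` because the sample timestamps do not zero-pad the hour. Unlike the pandas approach above, it counts an event if it reached any completed state on that date, not only if its last state that day is completed; on the sample data both readings give the same numbers.

```python
import sqlite3

# Sample rows from the question: (updated_at, event_id, status_id).
rows = [
    ("2022-11-18 9:46:21", "76", 1),
    ("2022-11-18 9:46:22", "78", 4),
    ("2022-11-18 9:56:22", "78", 5),
    ("2022-11-18 9:46:23", "80", 4),
    ("2022-11-18 9:46:23", "81", 4),
    ("2022-11-18 9:46:24", "82", 6),
    ("2022-11-18 9:46:24", "83", 7),
    ("2022-11-18 9:47:57", "00", 2),
    ("2022-11-18 9:47:57", "01", 4),
    ("2022-11-18 9:47:58", "03", 5),
    ("2023-03-24 21:22:34", "48", 6),
    ("2023-03-24 21:22:35", "48", 8),
    ("2023-03-24 21:22:37", "48", 7),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table name; only the columns the query needs are created.
conn.execute(
    "CREATE TABLE event_status (updated_at TEXT, event_id TEXT, status_id INTEGER)"
)
conn.executemany("INSERT INTO event_status VALUES (?, ?, ?)", rows)

# COUNT(DISTINCT event_id) collapses 4->5 and 6->8->7 into one event per date.
query = """
SELECT substr(updated_at, 1, 10) AS updated_date,
       COUNT(DISTINCT event_id)  AS event_count
FROM event_status
WHERE status_id IN (4, 5, 6, 7, 8)
GROUP BY updated_date
ORDER BY updated_date
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

On the sample data this yields 7 events for 2022-11-18 and 1 for 2023-03-24, matching the expected result in the question.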
