I have this data:
id, name, timestamp
1, David, 2022/01/01 10:00
2, David, 2022/01/01 10:30
3, Diego, 2022/01/01 10:59
4, David, 2022/01/01 10:59
5, David, 2022/01/01 11:01
6, Diego, 2022/01/01 12:00
7, David, 2022/01/01 12:00
8, David, 2022/01/01 12:05
9, Diego, 2022/01/01 12:30
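Basically, David and Diego are playing a game. They press a button from time to time, at the timestamps above.
A game can continue for one hour after the first press of the button. After that the count resets, and if they press the button again it counts as starting to play again.
So I want to flag a press with 0 (start) when it is the first press in an hour, and with 1 (playing) when it falls within that hour.
So in my case, I would expect this result: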
id, name, timestamp, status
1, David, 2022/01/01 10:00, 0 <--- David starts playing
2, David, 2022/01/01 10:30, 1 <--- David keeps playing the game that he started at the id 1
3, Diego, 2022/01/01 10:59, 0 <--- Diego starts playing
4, David, 2022/01/01 10:59, 1 <--- David keeps playing the game that he started at the id 1
5, David, 2022/01/01 11:01, 0 <--- David starts playing again
6, Diego, 2022/01/01 12:00, 0 <--- Diego starts playing again
7, David, 2022/01/01 12:00, 1 <--- David keeps playing the game that he started at the id 5
8, David, 2022/01/01 12:05, 0 <--- David starts playing again
9, Diego, 2022/01/01 12:30, 1 <--- Diego keeps playing the game that he started at the id 6
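To make the rule concrete, here is a minimal plain-Python sketch (not PySpark yet) that walks the presses in time order and remembers each player's current game start. The names `flag_presses`, `GAME_LENGTH` and `FMT` are made up for illustration, and it assumes that a press exactly one hour after a start counts as a new start, which the sample data does not settle.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %H:%M"
# Assumption: "within the hour" means strictly less than 60 minutes after the start.
GAME_LENGTH = timedelta(hours=1)

def flag_presses(rows):
    """rows: iterable of (id, name, timestamp string).
    Returns a list of (id, name, timestamp string, status)."""
    game_start = {}  # name -> datetime at which that player's current game started
    flagged = []
    # Process presses in time order; ties are broken by id.
    ordered = sorted(rows, key=lambda r: (datetime.strptime(r[2], FMT), r[0]))
    for row_id, name, ts in ordered:
        pressed_at = datetime.strptime(ts, FMT)
        start = game_start.get(name)
        if start is not None and pressed_at - start < GAME_LENGTH:
            status = 1  # still inside the game that began at `start`
        else:
            status = 0  # first press, or the previous game has expired
            game_start[name] = pressed_at
        flagged.append((row_id, name, ts, status))
    return flagged

rows = [
    (1, "David", "2022/01/01 10:00"),
    (2, "David", "2022/01/01 10:30"),
    (3, "Diego", "2022/01/01 10:59"),
    (4, "David", "2022/01/01 10:59"),
    (5, "David", "2022/01/01 11:01"),
    (6, "Diego", "2022/01/01 12:00"),
    (7, "David", "2022/01/01 12:00"),
    (8, "David", "2022/01/01 12:05"),
    (9, "Diego", "2022/01/01 12:30"),
]

for row in flag_presses(rows):
    print(*row, sep=", ")
```

The point to notice is that each start depends on the previous start of the same player, not on the previous row, so comparing a row only with the row before it would not be enough.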
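I need to do this transformation in pyspark, just to flag what I consider start playing and keep playing.
Could you help me with a SQL query? I can adapt it to pyspark later.
It doesn't need to be done in a single query/step.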
1 Answer
#1 lrl1mhuk
This is not a complete solution, but to give you an idea, I tried something like this:
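As one possible direction (a sketch under assumptions, not the code from this answer), the per-player rule can be applied in PySpark with a grouped pandas function. It assumes Spark 3.0+ with pandas and pyarrow installed, a DataFrame whose `timestamp` column is already a real timestamp, and that a press exactly one hour after a start begins a new game. `session_flags`, `flag_player` and `add_status` are hypothetical helper names.

```python
from datetime import datetime, timedelta

# Assumption: strictly less than one hour after the start counts as "still playing".
GAME_LENGTH = timedelta(hours=1)

def session_flags(timestamps):
    """Flags for ONE player's presses, given in ascending order: 0 = start, 1 = playing."""
    flags, start = [], None
    for ts in timestamps:
        if start is not None and ts - start < GAME_LENGTH:
            flags.append(1)
        else:
            flags.append(0)
            start = ts  # this press opens a new one-hour game
    return flags

def flag_player(pdf):
    """Grouped-map function: receives one player's rows as a pandas DataFrame."""
    pdf = pdf.sort_values(["timestamp", "id"])
    return pdf.assign(status=session_flags(pdf["timestamp"].tolist()))

def add_status(df):
    """df: Spark DataFrame with columns id (long), name (string), timestamp (timestamp)."""
    return df.groupBy("name").applyInPandas(
        flag_player,
        schema="id long, name string, timestamp timestamp, status long",
    )

# Quick check of the core rule without Spark, using David's presses from the question.
david = [datetime(2022, 1, 1, h, m)
         for h, m in [(10, 0), (10, 30), (10, 59), (11, 1), (12, 0), (12, 5)]]
print(session_flags(david))  # [0, 1, 1, 0, 1, 0]
```

With a DataFrame `df` built from the question's data, the call would look like `add_status(df).orderBy("id").show()`. If the timestamps are still strings, something like `to_timestamp(timestamp, 'yyyy/MM/dd HH:mm')` in a `selectExpr` should convert them first.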