I have a Spark DataFrame with two columns, start_time and end_time (both PySpark TimestampType). For each id, I want to split the span between start_time and end_time into one-minute intervals, applying some form of rounding (taking the ceiling in this case), and assign the result to a new column called minutes. How can this be done efficiently in PySpark?
// sample data (Scala)
val df0 = Seq(
("78aa", "2020-04-14", "2020-04-14 19:00:00", "2020-04-14 19:23:59"),
("78aa", "2020-04-14", "2020-04-14 19:24:00", "2020-04-14 19:26:59"),
("78aa", "2020-04-14", "2020-04-14 19:27:00", "2020-04-14 19:35:59"),
("78aa", "2020-04-14", "2020-04-14 19:36:00", "2020-04-14 19:55:00"),
("25aa", "2020-04-15", "2020-04-15 08:00:00", "2020-04-15 08:02:59"),
("25aa", "2020-04-15", "2020-04-15 11:03:00", "2020-04-15 11:11:59"),
("25aa", "2020-04-15", "2020-04-15 11:12:00", "2020-04-15 11:45:59"),
("25aa", "2020-04-15", "2020-04-15 11:46:00", "2020-04-15 11:47:00")
).toDF("id", "date", "start_time", "end_time")
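Since the question asks for PySpark, the Scala sample above can be rebuilt as plain Python rows (a sketch; building the actual DataFrame assumes an active SparkSession named `spark`):

```python
rows = [
    ("78aa", "2020-04-14", "2020-04-14 19:00:00", "2020-04-14 19:23:59"),
    ("78aa", "2020-04-14", "2020-04-14 19:24:00", "2020-04-14 19:26:59"),
    ("78aa", "2020-04-14", "2020-04-14 19:27:00", "2020-04-14 19:35:59"),
    ("78aa", "2020-04-14", "2020-04-14 19:36:00", "2020-04-14 19:55:00"),
    ("25aa", "2020-04-15", "2020-04-15 08:00:00", "2020-04-15 08:02:59"),
    ("25aa", "2020-04-15", "2020-04-15 11:03:00", "2020-04-15 11:11:59"),
    ("25aa", "2020-04-15", "2020-04-15 11:12:00", "2020-04-15 11:45:59"),
    ("25aa", "2020-04-15", "2020-04-15 11:46:00", "2020-04-15 11:47:00"),
]
cols = ["id", "date", "start_time", "end_time"]
# df = spark.createDataFrame(rows, cols)  # requires an active SparkSession `spark`
```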
Here is the desired output:
     date        id    start_time           end_time             minutes
1 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:00:00
2 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:01:00
3 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:02:00
4 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:03:00
5 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:04:00
6 2020-04-14 78aa 2020-04-14 19:00:00 2020-04-14 19:23:59 2020-04-14 19:05:00
1 Answer
7fyelxc51:
Check whether this helps -
It is written in Scala, but it can be ported to Python with minimal modification.
More explanation - here
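The answer's code block did not survive the page extraction, so here is a hedged sketch rather than the original author's code. In Spark 2.4+ the result can be produced with `explode` over `sequence(start, stop, step)` using a one-minute interval, e.g. `df.withColumn("minutes", F.explode(F.sequence(F.date_trunc("minute", F.col("start_time")), F.col("end_time"), F.expr("interval 1 minute"))))`. The per-row logic that expression implements, shown in plain Python for clarity:

```python
from datetime import datetime, timedelta

def minutes_between(start_str, end_str):
    """Expand a [start, end] span into one entry per minute.

    start is truncated to the whole minute, matching the desired output,
    where 19:00:00..19:23:59 yields 19:00:00, 19:01:00, ..., 19:23:00.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(start_str, fmt).replace(second=0)
    end = datetime.strptime(end_str, fmt)
    out = []
    t = start
    while t <= end:          # inclusive of end when it falls on a whole minute
        out.append(t.strftime(fmt))
        t += timedelta(minutes=1)
    return out
```

For the first sample row (19:00:00 to 19:23:59) this yields 24 minute marks, 19:00:00 through 19:23:00; in Spark each mark becomes its own row after `explode`.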