PySpark: forward fill only at a specific hour and minute of each date

Posted by xzlaal3s on 2021-05-24 in Spark

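How can I forward fill only when the time part of the timestamp is 00:00:00?
The row at 00:00:00 of each date has a null value because the sensor does not work properly at that time. Nulls also occur at other times, and those must be kept as they are.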

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00| Null|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00| Null|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| Null|
+---+-------------------+-----+

Desired DataFrame:

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00|    3|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00|    3|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| Null|
+---+-------------------+-----+
pes8fvy9 #1

You can use the lag function, which returns the value from the previous row within each id:

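from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rows of the same id, ordered by time, so lag() sees the previous reading
w = Window.partitionBy("id").orderBy("date")

# Only touch rows at 00:00:00 whose value is null; keep everything else as-is
df.withColumn(
    "value",
    F.when(
        F.col("date").cast("string").like("%00:00:00") & F.col("value").isNull(),
        F.lag("value").over(w),
    ).otherwise(F.col("value")),
).show()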

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00|    3|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00|    3|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| null|
+---+-------------------+-----+
