I want to fill the null values in my window with the last value of each window. But in some cases I think the last function does not work. Here is my data:
l = [
('100', '2012-01-02', None),
('110', '2012-01-02', None),
('100', '2012-01-03', 26),
('110', '2012-01-03', 251),
('100', '2012-01-04', 24),
('110', '2012-01-04', 242),
('100', '2012-01-05', 26),
('110', '2012-01-05', 254),
('100', '2012-01-06', 0),
('110', '2012-01-06', 254),
('100', '2012-01-07', 25),
('110', '2012-01-07', 256),
('100', '2012-01-08', 28),
('110', '2012-01-08', 0),
('100', '2012-01-09', 22),
('110', '2012-01-09', 289),
('100', '2012-01-10', 29),
('110', '2012-01-10', 276),
('100', '2012-01-11', 21),
('110', '2012-01-11', 259),
('100', '2012-01-12', 32),
('110', '2012-01-12', 280),
('100', '2012-01-13', 39),
('110', '2012-01-13', 290)
]
df = spark.createDataFrame(l, ['number', 'date', 'count'])
This is my expected output:
+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   100|2012-01-02|   39|
|   100|2012-01-03|   26|
|   100|2012-01-04|   24|
|   100|2012-01-05|   26|
|   100|2012-01-06|    0|
|   100|2012-01-07|   25|
|   100|2012-01-08|   28|
|   100|2012-01-09|   22|
|   100|2012-01-10|   29|
|   100|2012-01-11|   21|
|   100|2012-01-12|   32|
|   100|2012-01-13|   39|
|   110|2012-01-02|  290|
|   110|2012-01-03|  251|
|   110|2012-01-04|  242|
|   110|2012-01-05|  254|
|   110|2012-01-06|  254|
|   110|2012-01-07|  256|
|   110|2012-01-08|    0|
|   110|2012-01-09|  289|
+------+----------+-----+
only showing top 20 rows
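When I use the last function, the null values are not filled, as you can see:
from pyspark.sql import Window
from pyspark.sql import functions as F
my_window = Window.partitionBy('number').orderBy(df['date'])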
df.withColumn('count', F.when(df['count'].isNull(), F.last(df['count']).over(my_window)).otherwise(df['count'])).show()
+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   110|2012-01-02| null|
|   110|2012-01-03|  251|
|   110|2012-01-04|  242|
|   110|2012-01-05|  254|
|   110|2012-01-06|  254|
|   110|2012-01-07|  256|
|   110|2012-01-08|    0|
|   110|2012-01-09|  289|
|   110|2012-01-10|  276|
|   110|2012-01-11|  259|
|   110|2012-01-12|  280|
|   110|2012-01-13|  290|
|   100|2012-01-02| null|
|   100|2012-01-03|   26|
|   100|2012-01-04|   24|
|   100|2012-01-05|   26|
|   100|2012-01-06|    0|
|   100|2012-01-07|   25|
|   100|2012-01-08|   28|
|   100|2012-01-09|   22|
+------+----------+-----+
only showing top 20 rows
But when I change the ordering of my window to descending and use first, it works:
my_window = Window.partitionBy('number').orderBy(df['date'].desc())
df.withColumn('count', F.when(df['count'].isNull(), F.first(df['count']).over(my_window)).otherwise(df['count'])).show()
+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   110|2012-01-13|  290|
|   110|2012-01-12|  280|
|   110|2012-01-11|  259|
|   110|2012-01-10|  276|
|   110|2012-01-09|  289|
|   110|2012-01-08|    0|
|   110|2012-01-07|  256|
|   110|2012-01-06|  254|
|   110|2012-01-05|  254|
|   110|2012-01-04|  242|
|   110|2012-01-03|  251|
|   110|2012-01-02|  290|
|   100|2012-01-13|   39|
|   100|2012-01-12|   32|
|   100|2012-01-11|   21|
|   100|2012-01-10|   29|
|   100|2012-01-09|   22|
|   100|2012-01-08|   28|
|   100|2012-01-07|   25|
|   100|2012-01-06|    0|
+------+----------+-----+
only showing top 20 rows
Can anyone help me solve this problem?
1 Answer
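Change your window so that its frame covers the whole partition instead of stopping at the current row.
When a window has an orderBy and no explicit frame, Spark's default frame runs from the first row of the partition up to the current row. Within that frame the last row is the current row itself, so last simply returns the current (null) value. This is also why ordering by date descending and using first works: with that ordering, the first row of the frame is always the latest date in the partition.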