Problem using the last function in PySpark

Posted by 3zwtqj6y on 2021-05-27 in Spark

I want to fill the null values in each window with the last value of that window, but in some cases the last function does not seem to work. Here is my data:

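from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

l = [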
     ('100', '2012-01-02', None),
     ('110', '2012-01-02', None),
     ('100', '2012-01-03', 26),
     ('110', '2012-01-03', 251),
     ('100', '2012-01-04', 24),
     ('110', '2012-01-04', 242),
     ('100', '2012-01-05', 26),
     ('110', '2012-01-05', 254),
     ('100', '2012-01-06', 0),
     ('110', '2012-01-06', 254),
     ('100', '2012-01-07', 25),
     ('110', '2012-01-07', 256),
     ('100', '2012-01-08', 28),
     ('110', '2012-01-08', 0),
     ('100', '2012-01-09', 22),
     ('110', '2012-01-09', 289),
     ('100', '2012-01-10', 29),
     ('110', '2012-01-10', 276),
     ('100', '2012-01-11', 21),
     ('110', '2012-01-11', 259),
     ('100', '2012-01-12', 32),
     ('110', '2012-01-12', 280),
     ('100', '2012-01-13', 39),
     ('110', '2012-01-13', 290)
    ]
df = spark.createDataFrame(l, ['number', 'date', 'count'])

This is my expected output:

+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   100|2012-01-02|   39|
|   100|2012-01-03|   26|
|   100|2012-01-04|   24|
|   100|2012-01-05|   26|
|   100|2012-01-06|    0|
|   100|2012-01-07|   25|
|   100|2012-01-08|   28|
|   100|2012-01-09|   22|
|   100|2012-01-10|   29|
|   100|2012-01-11|   21|
|   100|2012-01-12|   32|
|   100|2012-01-13|   39|
|   110|2012-01-02|  290|
|   110|2012-01-03|  251|
|   110|2012-01-04|  242|
|   110|2012-01-05|  254|
|   110|2012-01-06|  254|
|   110|2012-01-07|  256|
|   110|2012-01-08|    0|
|   110|2012-01-09|  289|
+------+----------+-----+
only showing top 20 rows

When I use the last function, the null values are not filled, as you can see:

my_window = Window.partitionBy('number').orderBy(df['date'])
df.withColumn('count', F.when(df['count'].isNull(), F.last(df['count']).over(my_window)).otherwise(df['count'])).show()

+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   110|2012-01-02| null|
|   110|2012-01-03|  251|
|   110|2012-01-04|  242|
|   110|2012-01-05|  254|
|   110|2012-01-06|  254|
|   110|2012-01-07|  256|
|   110|2012-01-08|    0|
|   110|2012-01-09|  289|
|   110|2012-01-10|  276|
|   110|2012-01-11|  259|
|   110|2012-01-12|  280|
|   110|2012-01-13|  290|
|   100|2012-01-02| null|
|   100|2012-01-03|   26|
|   100|2012-01-04|   24|
|   100|2012-01-05|   26|
|   100|2012-01-06|    0|
|   100|2012-01-07|   25|
|   100|2012-01-08|   28|
|   100|2012-01-09|   22|
+------+----------+-----+
only showing top 20 rows

But when I change the ordering to descending and use first instead, it works:

my_window = Window.partitionBy('number').orderBy(df['date'].desc())
df.withColumn('count', F.when(df['count'].isNull(), F.first(df['count']).over(my_window)).otherwise(df['count'])).show()

+------+----------+-----+
|number|      date|count|
+------+----------+-----+
|   110|2012-01-13|  290|
|   110|2012-01-12|  280|
|   110|2012-01-11|  259|
|   110|2012-01-10|  276|
|   110|2012-01-09|  289|
|   110|2012-01-08|    0|
|   110|2012-01-07|  256|
|   110|2012-01-06|  254|
|   110|2012-01-05|  254|
|   110|2012-01-04|  242|
|   110|2012-01-03|  251|
|   110|2012-01-02|  290|
|   100|2012-01-13|   39|
|   100|2012-01-12|   32|
|   100|2012-01-11|   21|
|   100|2012-01-10|   29|
|   100|2012-01-09|   22|
|   100|2012-01-08|   28|
|   100|2012-01-07|   25|
|   100|2012-01-06|    0|
+------+----------+-----+
only showing top 20 rows

Can anyone help me solve this problem?

ghhaqwfi #1

Change your window to this:

my_window = Window.partitionBy('number').orderBy(df['date']).rowsBetween(Window.currentRow, Window.unboundedFollowing)

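When a window has an orderBy but no explicit frame, it covers the rows from the first row of the partition up to the current row. In that case the last row of the frame is the current row itself, so last simply returns the current (null) value. With first and a descending order, the first row of the frame is always the row with the latest date, which is why that version works.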
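To make the frame behaviour concrete, here is a small pure-Python sketch that imitates what last sees under each frame. It is not Spark code: the helper names (last_in_frame, fill_nulls) and the frame labels are made up for illustration, and the data is a shortened copy of partition '100' from the question. It also treats the default frame as row-based, which matches Spark's range-based default only because the dates here are unique.

```python
from typing import List, Optional, Tuple

# One partition, already sorted by date: (date, count)
Row = Tuple[str, Optional[int]]


def last_in_frame(rows: List[Row], i: int, frame: str) -> Optional[int]:
    """Return the count of the last row in the frame that belongs to row i."""
    if frame == "default":
        # orderBy without an explicit frame: first row .. current row
        end = i
    elif frame == "to_end":
        # current row .. last row of the partition
        end = len(rows) - 1
    else:
        raise ValueError(f"unknown frame: {frame}")
    return rows[end][1]


def fill_nulls(rows: List[Row], frame: str) -> List[Row]:
    """Replace a null count with last(count) over the chosen frame."""
    return [
        (date, count if count is not None else last_in_frame(rows, i, frame))
        for i, (date, count) in enumerate(rows)
    ]


# Shortened version of partition '100' from the question
partition_100: List[Row] = [
    ("2012-01-02", None),
    ("2012-01-03", 26),
    ("2012-01-04", 24),
    ("2012-01-13", 39),
]

print(fill_nulls(partition_100, "default")[0])  # the null row stays null
print(fill_nulls(partition_100, "to_end")[0])   # the null row gets the last count
```

In PySpark itself, the "to_end" frame corresponds to the rowsBetween(Window.currentRow, Window.unboundedFollowing) call shown above. If the final row of a partition could itself be null, passing ignorenulls=True to F.last may be worth considering, though that goes beyond what the question asks.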
