I want to perform a "filldown"-type operation on a dataframe to remove the nulls and make sure the last row per itemId acts as a kind of summary row, carrying the last known values based on the timestamp.
As I'm using Azure Synapse notebooks, the language can be Scala, PySpark, Spark SQL or even C#. The catch is that the real solution has up to millions of rows and hundreds of columns, so I need a dynamic approach that can take advantage of Spark. We can provision a big cluster and want to be sure we make the most of it.
Sample data:
// Assign sample data to dataframe
val df = Seq(
( 1, "10/01/2021", 1, "abc", null ),
( 2, "11/01/2021", 1, null, "bbb" ),
( 3, "12/01/2021", 1, "ccc", null ),
( 4, "13/01/2021", 1, null, "ddd" ),
( 5, "10/01/2021", 2, "eee", "fff" ),
( 6, "11/01/2021", 2, null, null ),
( 7, "12/01/2021", 2, null, null )
).toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")
df.show
Expected result, with rows 4 and 7 acting as the summary rows:
+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
| 1|10/01/2021| 1| abc| null|
| 2|11/01/2021| 1| abc| bbb|
| 3|12/01/2021| 1| ccc| bbb|
| 4|13/01/2021| 1| ccc| ddd|
| 5|10/01/2021| 2| eee| fff|
| 6|11/01/2021| 2| eee| fff|
| 7|12/01/2021| 2| eee| fff|
+-------+----------+------+-------+-------+
I have already looked at this option, but struggled to adapt it to my use case (my reading of it is sketched below):
Spark / Scala: forward fill with last observation
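As far as I can tell, that approach boils down to last() with ignoreNulls over a window per itemId. A rough sketch for just the two sample columns (the explicit window frame is my own assumption, not code from that answer):

import org.apache.spark.sql.functions.{col, last}
import org.apache.spark.sql.expressions.Window

// Last non-null value per itemId up to and including the current row.
// Note: timestamp is a dd/MM/yyyy string here; real data should be parsed
// to a date/timestamp so the ordering is actually chronological.
val w = Window.partitionBy("itemId").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val filled = df
  .withColumn("attrib1", last(col("attrib1"), ignoreNulls = true).over(w))
  .withColumn("attrib2", last(col("attrib2"), ignoreNulls = true).over(w))

filled.show

That works for a couple of columns, but it hard-codes every column name.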
I do have a Spark SQL solution that works, but for a very large number of columns it becomes extremely verbose, so I am hoping for something easier to maintain:
%%sql
WITH cte AS (
    SELECT
        eventId,
        timestamp,
        itemId,
        ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
        attrib1,
        attrib2
    FROM df
)
SELECT
    eventId,
    timestamp,
    itemId,
    CASE rn WHEN 1 THEN attrib1
        ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ORDER BY timestamp ) )
    END AS attrib1_xlast,
    CASE rn WHEN 1 THEN attrib2
        ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ORDER BY timestamp ) )
    END AS attrib2_xlast
FROM cte
ORDER BY eventId
1 Answer
For many columns you can create the expressions as shown below.
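A minimal sketch of that idea, assuming eventId, timestamp and itemId are the key columns and every other column should be forward-filled (keyCols and fillExprs are illustrative names, not from an existing API):

import org.apache.spark.sql.functions.{col, last}
import org.apache.spark.sql.expressions.Window

// Columns that identify a row and should never be filled (an assumption for this sketch)
val keyCols = Seq("eventId", "timestamp", "itemId")

// Forward-fill window: last non-null value per itemId, ordered by timestamp,
// from the start of the partition up to the current row
val w = Window.partitionBy("itemId").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Build one fill expression per remaining column, however many there are
val fillExprs = df.columns
  .filterNot(keyCols.contains)
  .map(c => last(col(c), ignoreNulls = true).over(w).as(c))

val filled = df.select(keyCols.map(col) ++ fillExprs: _*)
filled.show

Because every expression shares the same window specification, Spark should be able to evaluate them in a single window operator, so the query stays the same size no matter how many columns the real dataframe has.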
Result: the forward-filled dataframe, matching the expected output shown in the question.