How to avoid a re-shuffle with Spark windowing

ax6ht2ek posted on 2023-05-29 in Apache

I have a few questions about re-shuffling with Spark windowing:
1. If a DataFrame has already been repartitioned on a column (say "id"), and the same column is used in Window.partitionBy("id"), will a re-shuffle still happen? How can we avoid a re-shuffle here?
2. Suppose we have two windows, say Window.partitionBy("id", "name").orderBy("salary") and Window.partitionBy("id", "age").orderBy("salary"), where the first partition column is the same. In this second case I would hope there is no re-shuffle, but does the sort cover only the columns in partitionBy and orderBy, or will all columns of the DataFrame partitions be sorted again? (See the sketch below.)
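For concreteness, question 2 corresponds roughly to the following sketch (df and the column names id/name/age/salary are taken from the description above; row_number is only a placeholder window function):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Two windows sharing the first partition column ("id") but differing in the second,
// both ordered by "salary", as described in question 2.
val w1 = Window.partitionBy("id", "name").orderBy("salary")
val w2 = Window.partitionBy("id", "age").orderBy("salary")

df.withColumn("r1", row_number().over(w1))
  .withColumn("r2", row_number().over(w2))
  .explain()   // check the physical plan for extra Exchange (shuffle) and Sort nodes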


okxuctiv1#

Use .explain() and look at the physical plan.

Q1: No shuffle evident.
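For Q1, a minimal check could look like this (a sketch only, assuming a DataFrame df with "id" and "salary" columns); since df is already hash-partitioned on "id", the window's distribution requirement is already satisfied, so the plan should show only the explicit repartition Exchange plus a Sort:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Repartition on "id" first, then define a window on the same column.
val byId = df.repartition(col("id"))
val w    = Window.partitionBy("id").orderBy("salary")

// Expect: Exchange hashpartitioning(id, ...) from the repartition, then Sort + Window,
// and no additional Exchange inserted for the window itself.
byId.withColumn("rn", row_number().over(w)).explain()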
Q2: Shuffle evident. Stands to reason, I suspect, since it is too complex to split out and run in parallel.
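For reference, a minimal df2 consistent with the plan further down could be built like this (hypothetical data; the Exchange hashpartitioning(id, 20), REPARTITION_BY_NUM node suggests an explicit repartition by "id" into 20 partitions, and spark.implicits._ is assumed to be in scope, e.g. in spark-shell):

import org.apache.spark.sql.functions.col
import spark.implicits._

// Three columns matching the plan: id, line, xtra; repartitioned by "id" into 20 partitions.
val df2 = Seq(
  (1, "a1", "x1"),
  (1, "a2", "x2"),
  (2, "b1", "y1")
).toDF("id", "line", "xtra")
 .repartition(20, col("id"))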

Using this:

import org.apache.spark.sql.functions.lead

val w = org.apache.spark.sql.expressions.Window.partitionBy("id", "line").orderBy("xtra")
val w2 = org.apache.spark.sql.expressions.Window.partitionBy("id", "xtra").orderBy("line")

val df3 = df2.withColumn("next", lead("line", 1, null).over(w)).withColumn("next2", lead("line", 1, null).over(w2))
df3.explain(true)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [id#524, line#525, xtra#526, next#530, lead(line#525, 1, null) windowspecdefinition(id#524, xtra#526, line#525 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS next2#535], [id#524, xtra#526], [line#525 ASC NULLS FIRST]
   +- Sort [id#524 ASC NULLS FIRST, xtra#526 ASC NULLS FIRST, line#525 ASC NULLS FIRST], false, 0
      +- Window [id#524, line#525, xtra#526, lead(line#525, 1, null) windowspecdefinition(id#524, line#525, xtra#526 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS next#530], [id#524, line#525], [xtra#526 ASC NULLS FIRST]
         +- Sort [id#524 ASC NULLS FIRST, line#525 ASC NULLS FIRST, xtra#526 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#524, 20), REPARTITION_BY_NUM, [id=#955]
               +- LocalTableScan [id#524, line#525, xtra#526]
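If the text plan above is hard to read, Spark 3.x also offers a formatted explain mode that lists each operator and its details separately, which makes the Exchange and Sort nodes easier to spot:

df3.explain("formatted")   // other modes: "simple", "extended", "codegen", "cost"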
