I want to pre-partition my data; below is the code example I am using:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import sparkSession.implicits._

sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val students = List (
(1,"vasili"),
(2,"ivan")
)
val addresses = List(
(1,"UKR"),
(1,"SG"),
(2,"DE")
)
val departments = List(
(1,"CS"),
(1,"MATH"),
(2,"HISTORY")
)
val studentsDF = students.toDF("student_id", "name") // I want to hash-partition studentsDF beforehand, so it is not re-partitioned again for each join with departments and addresses
val departmentsDF = departments.toDF("student_id", "department")
val addressesDF = addresses.toDF("student_id", "address")
val frame: DataFrame = studentsDF.join(departmentsDF, studentsDF.col("student_id") equalTo departmentsDF.col("student_id"))
val frame2: DataFrame = studentsDF.join(addressesDF, studentsDF.col("student_id") equalTo addressesDF.col("student_id"))
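To see why each join triggers its own shuffle of studentsDF, it can help to print the physical plans. A minimal sketch, assuming the variables above are in scope (only the explain() calls are added here):

// Assumption: with broadcast joins disabled, each plan should show an
// Exchange hashpartitioning(student_id, ...) on the studentsDF side,
// i.e. studentsDF is shuffled once per join.
frame.explain()
frame2.explain()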
I am trying to pre-partition the DataFrame, but it is not working the way I expected:
val studentsDF = students.toDF("student_id", "name").repartition(col("student_id"))
The point is: I join both addressesDF and departmentsDF with studentsDF, and since studentsDF gets hash-partitioned for every join, I want to pre-partition it once on col("student_id").
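A minimal sketch of that pre-partitioning idea, under stated assumptions: the variable names prePartitionedStudents, byDept, and byAddr are illustrative, the partition count is matched to spark.sql.shuffle.partitions, and whether Spark actually reuses the existing partitioning (and drops the extra Exchange) depends on the Spark version, so explain() is used to verify:

// Repartition by the join key once and cache the result; the partition
// count is set to spark.sql.shuffle.partitions so the join's required
// distribution can, in principle, be satisfied without another shuffle.
val numShufflePartitions = sparkSession.conf.get("spark.sql.shuffle.partitions").toInt
val prePartitionedStudents = students.toDF("student_id", "name")
  .repartition(numShufflePartitions, col("student_id"))
  .cache()

val byDept = prePartitionedStudents.join(departmentsDF,
  prePartitionedStudents("student_id") === departmentsDF("student_id"))
val byAddr = prePartitionedStudents.join(addressesDF,
  prePartitionedStudents("student_id") === addressesDF("student_id"))

// Inspect the plans: ideally only the departmentsDF/addressesDF side shows
// an Exchange, while the students side reuses the cached partitioning.
byDept.explain()
byAddr.explain()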