用于连接的预分区Dataframe

mspsb9vt  于 2021-05-29  发布在  Spark
关注(0)|答案(0)|浏览(188)

我想对数据进行预分区,下面是我使用的代码示例:

sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
  val students = List (
    (1,"vasili"),
    (2,"ivan")
  )
  val addresses = List(
    (1,"UKR"),
    (1,"SG"),
    (2,"DE")
  )
  val departments = List(
    (1,"CS"),
    (1,"MATH"),
    (2,"HISTORY")
  )
   val studentsDF = students.toDF("student_id", "name") // I want to hash-partition is before hand,so that it dont re-partition for joining dept,address
   val departmentsDF = departments.toDF("student_id", "department")
   val addressesDF = addresses.toDF("student_id", "address")

   val frame: DataFrame = studentsDF.join(departmentsDF, studentsDF.col("student_id") equalTo departmentsDF.col("student_id"))
   val frame2: DataFrame = studentsDF.join(addressesDF, studentsDF.col("student_id") equalTo addressesDF.col("student_id"))

我正在尝试预先分区一个df,但没有像我期望的那样工作

val studentsDF = students.toDF("student_id", "name").repartition(col("student_id"))

重点是:我将addressdf和departmentdf都与studentdf合并,每次studentdf被hashpartitioned时,我都想用col(“student\u id”)对其进行预分区

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题