使用不同的Dataframe创建包含信息的新Dataframe(scala spark)

qaxu7uf2  于 2021-07-12  发布在  Spark
关注(0)|答案(1)|浏览(287)

我有一个带有游戏的数据框和三个来自不同评论的每个游戏的值,每个值都在另一个数据框中转换,如您所见:

  1. Df_reviews
  2. +--------+-------+-------+--------+
  3. |Game | rev_1 | rev_2 | rev_3 |
  4. +- ------+-------+-------+--------+
  5. |CA |XX+ | K2 | L1 |
  6. |FT |Z- | K1+ | L3 |
  7. Df_rev1
  8. +----------+-------------+
  9. | review_1 | Equivalence |
  10. +----------+-------------+
  11. |XX+ | 9 |
  12. |Y | 6 |
  13. |Z- | 3 |
  14. Df_rev2
  15. +----------+-------------+
  16. | review_2 | Equivalence |
  17. +----------+-------------+
  18. |K2 | 7 |
  19. |K1+ | 6 |
  20. |K3 | 10 |
  21. Df_rev3
  22. +----------+-------------+
  23. | review_3 | Equivalence |
  24. +----------+-------------+
  25. |L3 | 10 |
  26. |L2 | 9 |
  27. |L1 | 8 |

我必须在一个新的Dataframe中使用traduced的值,并添加一个具有第二个最佳值的列,例如:

  1. Df_output
  2. +--------+---------+---------+----------+-------------+
  3. |Game | rev_1_t | rev_2_t | rev_3_t | second_best |
  4. +--------+---------+---------+----------+-------------+
  5. |CA | 9 | 7 | 8 | 8 |
  6. |FT | 3 | 6 | 10 | 6 |

为了减少它,我尝试用左连接,但我太迷路了。我该怎么处理?

####### 第二部分####################如何将一个Dataframe中的列转换为另一个Dataframe中的列,将多个列?例如:

  1. Df_revuews
  2. +--------+-------+-------+--------+
  3. |Game | rev_1 | rev_2 | rev_3 |
  4. +- ------+-------+-------+--------+
  5. |CA |XX+ | K2 | L1 |
  6. |FT |Z- | K1+ | L3 |
  7. Df_equiv
  8. +--------+-------+
  9. |Valorat | num |
  10. +- ------+-------+
  11. |X |3 |
  12. |XX+ |5 |
  13. |Z |7 |
  14. |Z- |6 |
  15. |K1+ |6 |
  16. |K2 |4 |
  17. |L1 |5 |
  18. |L2 |6 |
  19. |L3 |7 |
  20. Output
  21. +--------+-------+-------+--------+
  22. |Game | rev_1 | rev_2 | rev_3 |
  23. +- ------+-------+-------+--------+
  24. |CA |5 | 4 | 5 |
  25. |FT |6 | 6 | 7 |

正如你所看到的,我正在这样做:

  1. val joined = df_reviews
  2. .join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
  3. && df_reviews("rev_3") === df_equiv("num"), "left")
  4. .select(df_reviews("Game"),
  5. df_equiv("num").as("rev_1_t"),
  6. df_equiv("num").as("rev_2_t"),
  7. df_equiv("num").as("rev_3_t")
  8. )

提前谢谢!

cotxawn7

cotxawn71#

您可以执行一些左连接,并使用 sort_array :

  1. val joined = df_reviews
  2. .join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
  3. .join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
  4. .join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
  5. .select(df_reviews("Game"),
  6. df_rev1("Equivalence").as("rev_1_t"),
  7. df_rev2("Equivalence").as("rev_2_t"),
  8. df_rev3("Equivalence").as("rev_3_t")
  9. )
  10. val array_sort_udf = udf((x: Seq[Int]) => x.sortBy(_ != null))
  11. val result = joined.withColumn(
  12. "second_best",
  13. coalesce(
  14. array_sort_udf(
  15. array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
  16. )(1),
  17. greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
  18. )
  19. )
  20. result.show
  21. +----+-------+-------+-------+-----------+
  22. |Game|rev_1_t|rev_2_t|rev_3_t|second_best|
  23. +----+-------+-------+-------+-----------+
  24. | CA| 9| 7| 8| 8|
  25. | FT| 3| 6| 10| 6|
  26. +----+-------+-------+-------+-----------+

关于第二个问题:

  1. val joined = df_reviews.as("r1")
  2. .join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
  3. .selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
  4. .as("r2")
  5. .join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
  6. .selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
  7. .as("r3")
  8. .join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
  9. .selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")
  10. joined.show
  11. +----+-----+-----+-----+
  12. |Game|rev_1|rev_2|rev_3|
  13. +----+-----+-----+-----+
  14. | CA| 5| 4| 5|
  15. | FT| 6| 6| 7|
  16. +----+-----+-----+-----+
展开查看全部

相关问题