Creating a new DataFrame with information from several different DataFrames (Scala Spark)

Posted by qaxu7uf2 on 2021-07-12 in Spark

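I have a DataFrame of games with three values per game, each coming from a different review. Each value has to be translated through its own lookup DataFrame, as you can see: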

Df_reviews
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |XX+    | K2    | L1     |
|FT      |Z-     | K1+   | L3     |

Df_rev1
+----------+-------------+ 
| review_1 | Equivalence | 
+----------+-------------+
|XX+       | 9           |      
|Y         | 6           |
|Z-        | 3           |   

Df_rev2
+----------+-------------+
| review_2 | Equivalence | 
+----------+-------------+
|K2        | 7           |      
|K1+       | 6           |
|K3        | 10          | 

Df_rev3
+----------+-------------+
| review_3 | Equivalence | 
+----------+-------------+
|L3        | 10          |      
|L2        | 9           |
|L1        | 8           |

I need to build a new DataFrame that holds the translated values, plus a column with the second-best value. For example:

Df_output
+--------+---------+---------+----------+-------------+
|Game    | rev_1_t | rev_2_t | rev_3_t  | second_best |
+--------+---------+---------+----------+-------------+
|CA      |  9      |  7      | 8        |     8       |
|FT      |  3      |  6      | 10       |     6       |

I tried to solve this with left joins, but I got completely lost. How should I approach it?

####### Part two ####################

How can I translate several columns of one DataFrame through a single column of another DataFrame? For example:

Df_reviews
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |XX+    | K2    | L1     |
|FT      |Z-     | K1+   | L3     |

Df_equiv
+--------+-------+
|Valorat | num   |
+--------+-------+
|X       |3      |
|XX+     |5      |
|Z       |7      |
|Z-      |6      |
|K1+     |6      |
|K2      |4      |
|L1      |5      |
|L2      |6      |
|L3      |7      |

Output
+--------+-------+-------+--------+
|Game    | rev_1 | rev_2 | rev_3  |
+--------+-------+-------+--------+
|CA      |5      | 4     | 5      |
|FT      |6      | 6     | 7      |

This is what I am currently trying:

val joined = df_reviews
.join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
&& df_reviews("rev_3") === df_equiv("num"), "left")
.select(df_reviews("Game"), 
        df_equiv("num").as("rev_1_t"), 
        df_equiv("num").as("rev_2_t"), 
        df_equiv("num").as("rev_3_t")
       )

Thanks in advance!

Answer 1 (cotxawn7)

You can do a few left joins and then use sort_array:

import org.apache.spark.sql.functions.{array, coalesce, col, greatest, sort_array}

// Translate each review code through its own lookup table.
val joined = df_reviews
    .join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
    .join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
    .join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
    .select(df_reviews("Game"), 
            df_rev1("Equivalence").as("rev_1_t"), 
            df_rev2("Equivalence").as("rev_2_t"), 
            df_rev3("Equivalence").as("rev_3_t")
           )

// The three translated scores, cast to int so they sort numerically.
val scores = Seq("rev_1_t", "rev_2_t", "rev_3_t").map(c => col(c).cast("int"))

val result = joined.withColumn(
    "second_best",
    coalesce(
        // An ascending sort puts nulls first, so index 1 is the middle of three
        // values, or the smaller of two when one lookup found no match.
        sort_array(array(scores: _*))(1),
        // Only one score matched: fall back to that single value.
        greatest(scores: _*)
    )
)

result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
|  CA|      9|      7|      8|          8|
|  FT|      3|      6|     10|          6|
+----+-------+-------+-------+-----------+
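
The rule behind the second_best column can also be written in plain Scala, without Spark, which makes the missing-value cases easier to see. This is only a sketch: `SecondBest` and `secondBest` are hypothetical names, and `None` stands in for a review code that has no match in its lookup table.

```scala
object SecondBest {
  // Hypothetical helper: returns the second-best score among those present.
  // None models a review code that found no match in its lookup table.
  def secondBest(scores: Seq[Option[Int]]): Option[Int] = {
    // Keep only the scores that exist and order them best first.
    val present = scores.flatten.sorted(Ordering[Int].reverse)
    // Take the second entry; if only one score exists, fall back to it.
    present.lift(1).orElse(present.headOption)
  }
}
```

With three scores this returns the middle one, with two it returns the smaller, with one it returns that score, and with none it returns `None`.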

For the second question:

val joined = df_reviews.as("r1")
    .join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
    .selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
    .as("r2")
    .join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
    .selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
    .as("r3")
    .join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
    .selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")

joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
|  CA|    5|    4|    5|
|  FT|    6|    6|    7|
+----+-----+-----+-----+
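
If the number of review columns grows, repeating the join by hand gets tedious. One possible generalization is to fold over the column names. This is a sketch rather than part of the original answer: `TranslateColumns` and `translate` are hypothetical names, and it assumes each `Valorat` value appears only once in the lookup table (duplicates would multiply rows).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object TranslateColumns {
  // Hypothetical helper: replaces every column listed in `cols` with the
  // matching `num` from `equiv`, using one left join per column.
  // Assumes `equiv` has the columns Valorat and num, with unique Valorat values.
  def translate(reviews: DataFrame, equiv: DataFrame, cols: Seq[String]): DataFrame = {
    val translated = cols.foldLeft(reviews) { (df, c) =>
      // Rename the lookup columns so the join key matches the review column.
      val lookup = equiv.select(col("Valorat").as(c), col("num").as(s"${c}_num"))
      df.join(lookup, Seq(c), "left")
        .drop(c)                           // drop the original review code
        .withColumnRenamed(s"${c}_num", c) // keep the number under the old name
    }
    // Restore the original column order, which the joins may have changed.
    translated.select(reviews.columns.map(col): _*)
  }
}
```

Calling `TranslateColumns.translate(df_reviews, df_equiv, Seq("rev_1", "rev_2", "rev_3"))` should give the same table as above. Codes with no match in the lookup table come out as null because of the left join.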
