Reshape a large DataFrame into a wide DataFrame

nsc4cvqm  posted on 2021-05-27  in  Spark

I want to convert this DataFrame:

+----+-------+---+---+
|col1|col2   |RC1|RC2|
+----+-------+---+---+
|A   |B      |  1|  0|
|C   |D      |  1|  1|
+----+-------+---+---+

into this:

+----+-------+------+
|col1|col2   |newCol|
+----+-------+------+
|A   |B      |  RC1 |
|C   |D      |  RC1 |
|C   |D      |  RC2 |
+----+-------+------+

tidyr seems to handle this well with its gather function. Is there no equivalent solution in Spark with Scala?

svgewumm1#

Use stack to unpivot the columns, as follows:

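// Assumes a SparkSession named spark, as in spark-shell.
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq(("A", "B", 1, 0), ("C", "D", 1, 1)).toDF("col1", "col2", "RC1", "RC2")
df.show()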

+----+----+---+---+
|col1|col2|RC1|RC2|
+----+----+---+---+
|   A|   B|  1|  0|
|   C|   D|  1|  1|
+----+----+---+---+

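// stack(2, ...) emits two rows per input row: (column name, column value).
// Rows whose value is 0 are then dropped, and the helper value column is removed.
df.select($"col1", $"col2", expr("stack(2, 'RC1', RC1, 'RC2', RC2) as (newCol, RC_VAL)"))
  .where($"RC_VAL" =!= 0)
  .drop("RC_VAL")
  .show()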

+----+----+------+
|col1|col2|newCol|
+----+----+------+
|   A|   B|   RC1|
|   C|   D|   RC1|
|   C|   D|   RC2|
+----+----+------+
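
With more than two flag columns, the stack expression above gets tedious to write by hand. A minimal sketch of building it from a list of column names follows. StackExpr and build are hypothetical names, not part of the answer or of Spark, and the sketch assumes the column names contain no quotes or backticks and that all value columns share one type.

```scala
object StackExpr {
  // Hypothetical helper (not from the answer): builds the SQL text for
  // stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (nameCol, valueCol).
  // Assumes column names contain no single quotes or backticks.
  def build(valueCols: Seq[String],
            nameCol: String = "newCol",
            valueCol: String = "RC_VAL"): String = {
    // One (label, column reference) pair per column to unpivot.
    val pairs = valueCols.map(c => s"'$c', `$c`").mkString(", ")
    s"stack(${valueCols.size}, $pairs) as ($nameCol, $valueCol)"
  }
}
```

The result can be passed to expr in place of the hand-written string, for example df.select($"col1", $"col2", expr(StackExpr.build(Seq("RC1", "RC2")))), followed by the same where and drop steps.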

hwamh0ep2#

Check the code below.

scala> df.show(false)
+----+----+---+---+
|col1|col2|rc1|rc2|
+----+----+---+---+
|A   |B   |1  |0  |
|C   |D   |1  |1  |
+----+----+---+---+

Build the expression.

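scala> val colExpr =
// Each branch maps one combination of flags to the array of labels to emit.
// Rows matching no branch (rc1 = 0 and rc2 = 0) yield null, and explode emits no row for them.
when($"rc1" === 1 && $"rc2" === 1, array(lit("RC1"), lit("RC2")))
.when($"rc1" === 1 && $"rc2" === 0, array(lit("RC1")))
.when($"rc1" === 0 && $"rc2" === 1, array(lit("RC2")))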

Apply the expression.

scala> 
spark.time { 
    df
    .select($"col1",$"col2",explode(colExpr).as("newcol"))
    .show(false) 
}

+----+----+------+
|col1|col2|newcol|
+----+----+------+
|A   |B   |RC1   |
|C   |D   |RC1   |
|C   |D   |RC2   |
+----+----+------+

Time taken: 914 ms
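
The when chain above needs one branch per combination of flags, so it grows quickly as columns are added. One possible variation, sketched here under assumptions, builds one label per flag column, explodes the array, and drops the nulls afterwards. ExplodeFlags and flagsToRows are hypothetical names, and the sketch assumes a flag is set when its value equals 1 and that the upper-cased column name is the wanted label.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

object ExplodeFlags {
  // Hypothetical helper (not from the answer).
  // idCols are carried through unchanged; flagCols are the 0/1 columns to unpivot.
  def flagsToRows(df: DataFrame, idCols: Seq[String], flagCols: Seq[String]): DataFrame = {
    // One label per flag column; when(...) without otherwise gives null if the flag is not 1.
    val labels: Seq[Column] = flagCols.map(c => when(col(c) === 1, lit(c.toUpperCase)))
    val selected: Seq[Column] = idCols.map(c => col(c)) :+ explode(array(labels: _*)).as("newcol")
    // explode emits one row per array element, nulls included, so filter them out.
    df.select(selected: _*).where(col("newcol").isNotNull)
  }
}
```

For the DataFrame above, ExplodeFlags.flagsToRows(df, Seq("col1", "col2"), Seq("rc1", "rc2")).show(false) should list the same three rows as the when-based version, though the row order is not guaranteed.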
