Spark: select multiple columns from an array of objects

lx0bsm1f, posted on 2021-05-29 in Spark

I have a dataset with the following schema:

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- subEntities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subEntityId: long (nullable = true)
 |    |    |-- subEntityName: string (nullable = true)

`dataset.select($"id", $"name", $"subEntities.subEntityId", $"subEntities.subEntityName")` puts `subEntityId` and `subEntityName` into separate arrays. How can I select multiple columns and put them into a single array?


68bkxrlz #1

Use `.withColumn("status", col("subEntities").getField("status"))` and `.withColumn("subEntityId", col("subEntities").getField("subEntityId"))` to extract the values from the array.
Here is a working example:

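import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeArrauy {

  def main(args: Array[String]): Unit = {

    // The original used a project-specific helper (Constant.getSparkSess);
    // a local session is assumed here so the example runs standalone.
    val spark = SparkSession.builder().master("local[*]").appName("ExplodeArrauy").getOrCreate()

    import spark.implicits._

    // Two rows, each holding an array of (status, subEntityId) structs.
    val df = List(
      bean57("1", Array(bean55("aaa", 2), bean55("aaa1", 21))),
      bean57("2", Array(bean55("bbb", 3), bean55("bbb3", 31)))
    ).toDF

    // getField on an array of structs returns one array per field.
    df
      .withColumn("status", col("subEntities").getField("status"))
      .withColumn("subEntityId", col("subEntities").getField("subEntityId"))
      .show()

  }

}

case class bean57(id: String, subEntities: Array[bean55])

case class bean55(status: String, subEntityId: Long)
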

sxpgvts3 #2

If you are on Spark >= 2.4, you can use the `transform` function to build an array that contains a subset of the original array's fields:

import org.apache.spark.sql.functions.expr

dataset.withColumn("newArray", expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))"))

// or with select
dataset.select(
        $"id", 
        $"name",
        expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))").as("newArray")
)
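
For readers on Spark 3.0 or later, the same idea can probably be written without a SQL string, using the Column-based `transform` from `org.apache.spark.sql.functions`. The following is only a sketch under assumptions: the case classes `Entity` and `SubEntity`, the object `SelectSubEntityFields`, and the helper `narrow` are hypothetical names introduced here to mirror the schema in the question, and the local `SparkSession` is just for illustration. The explicit `.as(...)` aliases are there so the struct field names do not depend on how a given Spark version names extracted fields.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, transform}

// Hypothetical case classes mirroring the schema shown in the question.
case class SubEntity(status: String, subEntityId: Long, subEntityName: String)
case class Entity(id: String, name: String, subEntities: Seq[SubEntity])

object SelectSubEntityFields {

  // Rebuilds each array element as a struct holding only
  // subEntityId and subEntityName (status is dropped).
  def narrow(df: DataFrame): DataFrame =
    df.select(
      col("id"),
      col("name"),
      transform(
        col("subEntities"),
        e => struct(
          e.getField("subEntityId").as("subEntityId"),
          e.getField("subEntityName").as("subEntityName")
        )
      ).as("newArray")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SelectSubEntityFields")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Entity("1", "first", Seq(SubEntity("active", 10L, "a"), SubEntity("inactive", 11L, "b")))
    ).toDF()

    // newArray should be a single array of (subEntityId, subEntityName) structs.
    narrow(df).printSchema()

    spark.stop()
  }
}
```

On Spark 2.4 the Column-based `transform` is not available in the Scala API, so the `expr("transform(...)")` form above remains the one to use there.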
