How do I sort schema columns alphabetically in Scala Spark, including nested arrays and structs?

plicqrtu posted on 2021-07-14 in Spark

I have a DataFrame with the schema below. I want all columns, including nested fields, sorted alphabetically, and I want to do this in Scala Spark.

root
 |-- metadata2: struct (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)

When I sort with schema.sortBy(_.name), I get the schema below (the fields nested inside the array and struct types are not sorted):

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)
 |-- metadata2: struct (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)

The schema I want is below (the fields inside metadata1 (ArrayType) and metadata2 (StructType) should be sorted as well):

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute1: string (nullable = true)
 |    |    |-- attribute2: string (nullable = true)
 |-- metadata2: struct (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |-- metadata3: string (nullable = true)

Thanks in advance.

7lrncoxx #1

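A version for struct types (it sorts the fields of top-level structs only; array columns are left as they are):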

import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Sample schema: a struct column, a string column, and an array column, deliberately out of order
val schema = StructType(Seq(
  StructField("metadata2", StructType(Seq(
    StructField("attribute2", StringType),
    StructField("attribute1", StringType)))),
  StructField("metadata3", StringType),
  StructField("metadata1", ArrayType(StringType))
))

schema.foreach(println _)
//  StructField(metadata2,StructType(StructField(attribute2,StringType,true), StructField(attribute1,StringType,true)),true)
//  StructField(metadata3,StringType,true)
//  StructField(metadata1,ArrayType(StringType,true),true)

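// Sort the top-level fields, then sort the fields of each top-level struct.
// c.copy keeps the field's original nullability and metadata.
val schemaResult = schema.sortBy(_.name).map { c =>
  c.dataType match {
    case structType: StructType => c.copy(dataType = StructType(structType.fields.sortBy(_.name)))
    case _ => c // arrays and other types are returned unchanged
  }
}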

schemaResult.foreach(println _)
//  StructField(metadata1,ArrayType(StringType,true),true)
//  StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true)
//  StructField(metadata3,StringType,true)
println(schemaResult)
//  List(StructField(metadata1,ArrayType(StringType,true),true), StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true), StructField(metadata3,StringType,true))
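
The version above sorts only one level of struct fields and does not look inside arrays, so an array of structs such as metadata1 in the question would keep its element fields unsorted. A minimal recursive sketch that also descends into arrays and maps might look like the following. The names sortDataType, sortSchema, and nested are hypothetical helpers introduced here for illustration and are not part of the Spark API; the sample schema assumes the shape described in the question (metadata1 as an array of structs, metadata2 as a struct).

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StringType, StructField, StructType}

// Hypothetical helper: recursively sort struct fields by name,
// descending into array elements and map keys/values.
def sortDataType(dt: DataType): DataType = dt match {
  case s: StructType =>
    // Sort this level, then recurse into each field's type;
    // copy keeps each field's nullability and metadata.
    StructType(s.fields.sortBy(_.name).map(f => f.copy(dataType = sortDataType(f.dataType))))
  case a: ArrayType =>
    a.copy(elementType = sortDataType(a.elementType))
  case m: MapType =>
    m.copy(keyType = sortDataType(m.keyType), valueType = sortDataType(m.valueType))
  case other => other // atomic types have nothing to sort
}

// Hypothetical convenience wrapper that keeps the StructType return type.
def sortSchema(schema: StructType): StructType =
  sortDataType(schema).asInstanceOf[StructType]

// Assumed sample schema, shaped like the one in the question.
val nested = StructType(Seq(
  StructField("metadata2", StructType(Seq(
    StructField("attribute2", StringType),
    StructField("attribute1", StringType)))),
  StructField("metadata3", StringType),
  StructField("metadata1", ArrayType(StructType(Seq(
    StructField("attribute2", StringType),
    StructField("attribute1", StringType)))))
))

val sorted = sortSchema(nested)
sorted.printTreeString()
```

Note that this only rewrites the schema object. The columns of the DataFrame itself would still need to be reselected or rebuilt in the sorted order, which is not shown here.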
