如何将配置单元中的array[struct[string,string]]列类型强制转换为array[map[string,string]]?

sigwle7e  于 2021-06-26  发布在  Hive
关注(0)|答案(1)|浏览(520)

我在配置单元表中有一列:
列名:过滤器
数据类型:

|-- filters: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

我想从这个列中得到对应名称的值。
到目前为止我所做的:

val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")

val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()

Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))

注意:强制转换为seq,因为wrappedarray到map的转换是不可能的。
下一步怎么办?

ecr0jaav

ecr0jaav1#

I want to get the value from this column by it's corresponding name.

如果您希望以简单可靠的方式按名称获取所有值,可以使用“分解和筛选”将表展平:

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])

val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
|       filters|
+--------------+
|[[a,b], [a,c]]|
|       [[b,c]]|
+--------------+

df.withColumn("filter", explode($"filters"))
  .select($"filter.name" as "name", $"filter.value" as "value")
  .where($"name" === "a")
  .show()
+----+-----+
|name|value|
+----+-----+
|   a|    b|
|   a|    c|
+----+-----+

您还可以按任何方式收集数据:

val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not  the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))

如果你想投 array[struct]map[string, string] 为了将来保存到某个存储中,情况就不同了,这个问题最好由udf来解决。不管怎样,你必须避免 collect() 尽可能保持代码的可伸缩性。

相关问题