Extracting a field from an array of structs in Spark SQL

aurhwmvo · published 2021-05-27 · in Spark
Answers (2) | Views (529)

I have a table with a field named xyz, which is an array containing a struct, like this:

array<struct<site_id:int,time:string,abc:array>>

The values in this field look like this:

[{"site_id":3,"time":"2020-07-26 05:48:21","abc":[{"to_bidder":"val1"}]}]

This is a simplified representation; in reality the field contains many more sub-fields. My task is to extract the field corresponding to the key "" without using inline/explode, if possible in Spark SQL, to avoid memory errors.
I tried array_contains(xyz, "") but it gave me an error:

data type mismatch: Arguments must be an array followed by a value of same type as the array members;

I tried @srinivas's code, but it gave me a type-mismatch error:

cannot resolve 'flatten(k.`xyz`.`abc`)' due to data type mismatch: The argument should be an array of arrays, but 'k.`xyz`.`abc`' is of array<map<string,string>>
sirbozc5 #1

It should be array_contains($col_name.{struct_field}, {value}).
In your case it would be something like array_contains(functions.col("xyz."), "value"). FYI: the value must match the type of the struct_field, otherwise an error is thrown.
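As a rough illustration of these semantics, here is a plain-Python sketch (no Spark; the data layout mirrors the sample value from the question, and the field names are taken from it):

```python
# Plain-Python sketch of array_contains over an extracted struct field.
# `xyz` mimics one row's "xyz" column: an array of structs.
xyz = [
    {"site_id": 3, "time": "2020-07-26 05:48:21", "abc": [{"to_bidder": "val1"}]},
]

# In Spark, xyz.site_id collects that field from every struct in the array:
site_ids = [s["site_id"] for s in xyz]   # -> [3]

# array_contains(xyz.site_id, 3) is then a membership test on that array:
contains = 3 in site_ids                 # -> True
print(site_ids, contains)
```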

wkyowqbh #2

The function array_contains only returns true or false.
To access a specific column inside an array of structs, use array_column.field_name, which returns an array of the field values. Check the code below.

df
.withColumn("column",$"data.abc") // Extract Column value
.withColumn("column_with_array_contains",array_contains($"data.abc","val1")) // It will return true or false.
.withColumn("column_with_concat",concat_ws(",",$"data.abc")) // It will concat column values.
.show(false)


+--------------------------------+------+--------------------------+------------------+
|data                            |column|column_with_array_contains|column_with_concat|
+--------------------------------+------+--------------------------+------------------+
|[[val1, 3, 2020-07-26 05:48:21]]|[val1]|true                      |val1              |
+--------------------------------+------+--------------------------+------------------+

Spark SQL

scala> spark.sql("select data, data.abc as column,array_contains(data.abc,'val1') as column_with_array_contains,concat_ws(',',data.abc) as column_with_concat  from sample").show(false)
+--------------------------------+------+--------------------------+------------------+
|data                            |column|column_with_array_contains|column_with_concat|
+--------------------------------+------+--------------------------+------------------+
|[[val1, 3, 2020-07-26 05:48:21]]|[val1]|true                      |val1              |
+--------------------------------+------+--------------------------+------------------+

Update: Spark version 3.0.0. The code below may not run on lower Spark versions; please check.

scala> spark.sql("select data, flatten(data.abc)['to_bidder'] as column, array_contains(flatten(data.abc)['to_bidder'],'val1') as column_with_array_contains,concat_ws(',',flatten(data.abc).to_bidder) as column_with_concat from sample").show(false)
+------------------------------------+------+--------------------------+------------------+
|data                                |column|column_with_array_contains|column_with_concat|
+------------------------------------+------+--------------------------+------------------+
|[[[[val1]], 3, 2020-07-26 05:48:21]]|[val1]|true                      |val1              |
+------------------------------------+------+--------------------------+------------------+
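To make the flatten step concrete, here is a plain-Python sketch of what flatten(data.abc)['to_bidder'] computes (no Spark; the data mirrors the question's sample value):

```python
# Plain-Python sketch of flatten(data.abc)['to_bidder'].
# One row: "data" is an array of structs, and each struct's "abc" field
# is itself an array of structs carrying a "to_bidder" field.
data = [
    {"site_id": 3, "time": "2020-07-26 05:48:21", "abc": [{"to_bidder": "val1"}]},
]

# data.abc -> an array of arrays (one inner array per struct in `data`):
nested = [row["abc"] for row in data]                 # [[{"to_bidder": "val1"}]]

# flatten(...) merges the inner arrays into a single array:
flat = [item for inner in nested for item in inner]   # [{"to_bidder": "val1"}]

# ...['to_bidder'] then pulls the field out of every struct:
column = [s["to_bidder"] for s in flat]               # ["val1"]
print(column)
```

This is also why the asker's follow-up error occurs: flatten requires an array of arrays, but in their data abc is an array of maps.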

Update 2: the solution below works only if your schema and data match the sample data and sample schema.
Sample data

[{"site_id":3,"time":"2020-07-26 05:48:21","abc":{"to_bidder":"val1"}}]

Sample schema

array<struct<site_id:int,time:string,abc:struct<to_bidder:string>>>
root
 |-- data: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- abc: struct (nullable = true)
 |    |    |    |-- to_bidder: string (nullable = true)
 |    |    |-- site_id: long (nullable = true)
 |    |    |-- time: string (nullable = true)

scala> spark.sql("select data, data.abc.to_bidder as column, array_contains(data.abc.to_bidder,'val1') as column_with_array_contains,concat_ws(',',data.abc.to_bidder) as column_with_concat from samplea").show(false)
+----------------------------------+------+--------------------------+------------------+
|data                              |column|column_with_array_contains|column_with_concat|
+----------------------------------+------+--------------------------+------------------+
|[[[val1], 3, 2020-07-26 05:48:21]]|[val1]|true                      |val1              |
+----------------------------------+------+--------------------------+------------------+
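For this struct (non-array) case, a plain-Python sketch of the data.abc.to_bidder path (no Spark; the data mirrors update 2's sample):

```python
# Plain-Python sketch of data.abc.to_bidder when "abc" is a struct,
# not an array, matching update 2's sample schema.
data = [
    {"site_id": 3, "time": "2020-07-26 05:48:21", "abc": {"to_bidder": "val1"}},
]

# data.abc.to_bidder walks the field path for every struct in the array:
column = [row["abc"]["to_bidder"] for row in data]   # ["val1"]

# concat_ws(',', ...) then joins the extracted values:
concat = ",".join(column)                            # "val1"
print(column, concat)
```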
