Is there a way to create a new struct from an array of structs in PySpark?

Posted by chhqkbe1 on 2021-05-27 in Spark


I currently have a schema that looks like this:

StructType(List(StructField(top_level,StructType(List(StructField(middle_level,StructType(List(StructField(lower_level,ArrayType(StructType(List(StructField(array_col1,StringType,true),StructField(array_col2,StringType,true), StructField(array_coln, StringType, true))))))))

The fields I want to query are buried in an ArrayType object inside the nested structs. So when I run

spark.sql("SELECT top_level.middle_level.lower_level.array_col1 FROM table_name")

I get one array per row, with as many null values as there are fields in the original JSON:

| array_col1            |
|-----------------------|
| [null, null]          |
| [null]                |
| [null, null, "value"] |

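I would like to restructure this object so that the fields inside the ArrayType object are combined into a single StructType object, which would make these columns easier to reference in a SELECT statement. Is this possible in PySpark? I have looked at array-oriented functions such as explode and collect, but they don't seem to work well with structs. I have also looked at the inline function, but it returns one row for each value in the array. Has anyone run into something similar?
Edit
There must be a simpler way than this, but I ended up converting the struct to JSON, removing every { and }, and replacing the outer [ and ] with a leading { and a trailing }. I then used from_json with a schema object that I built myself.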

from pyspark.sql.types import StructType
from pyspark.sql.functions import from_json, col

# Walk the schema JSON: top_level -> middle_level -> lower_level -> fields of the array's element struct
schema = spark.table('table_name').schema.jsonValue()['fields'][0]['type']['fields'][0]['type']['fields'][0]['type']['elementType']['fields']

new_schema = {}
new_schema["fields"] = schema
new_schema["type"] = "struct"
new_schema["nullable"] = True
new_schema["metadata"] = {}
new_schema["name"] = "fields"
schema_struct = StructType.fromJson(new_schema)

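# Serialize the array to JSON, strip every brace, then turn the outer [ ] into { }
df = spark.sql("""SELECT regexp_replace(regexp_replace(translate(translate(to_json(top_level.middle_level.lower_level), '{', ''), '}', ''), '^(.)', '{'), '(.)$', '}') as col FROM table_name""")

# display() is a Databricks notebook helper; outside Databricks, call .show() on the DataFrame instead
display(df.withColumn('col', from_json(col('col'), schema_struct)))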
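
To make the string manipulation in the query easier to follow, here is a minimal plain-Python sketch of the same three steps applied to one row. The helper name `flatten_json_array` and the sample string are hypothetical; the sample assumes that `to_json` has already dropped the null fields, so each array element only carries the fields that were set. It is an illustration of the idea, not a replacement for the Spark query.

```python
import json
import re


def flatten_json_array(s: str) -> str:
    """Mimic the translate/regexp_replace chain on a JSON array string."""
    # translate(..., '{', '') and translate(..., '}', ''): drop every brace
    no_braces = s.replace("{", "").replace("}", "")
    # regexp_replace(..., '^(.)', '{'): swap the first character '[' for '{'
    opened = re.sub(r"^(.)", "{", no_braces)
    # regexp_replace(..., '(.)$', '}'): swap the last character ']' for '}'
    return re.sub(r"(.)$", "}", opened)


# Hypothetical JSON for one row: an array of two structs
sample = '[{"array_col1":"a","array_col2":"b"},{"array_coln":"c"}]'
flat = flatten_json_array(sample)
print(flat)
print(json.loads(flat))
```

Two limitations follow directly from this approach. Any { or } inside a string value is removed as well, and if the same field is set in more than one array element the resulting object contains a repeated key. Python's `json.loads` keeps the last value in that case; how `from_json` resolves repeated keys should be checked separately.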

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/63124583/is-there-a-way-to-create-a-new-struct-from-an-array-of-structs-in-pyspark
