PySpark JSON: turning an array of objects into columns

iszxjhcz  posted on 2021-07-09  in  Spark

I am ingesting a JSON file into Spark and came across an object like the following in the file's nested JSON:

"data": {
  "key1": "v1",
  "key2": [
     {"nk1": "nv1"},
     {"nk2": "nv2"},
     {"nk3": "nv3"}
  ]
}

After reading it in Spark, it takes the following form:

"data": {
  "key1": "v1",
  "key2": [
     {"nk1": "nv1", "nk2": null, "nk3": null},
     {"nk1": null, "nk2": "nv2", "nk3": null},
     {"nk1": null, "nk2": null, "nk3": "nv3"}
  ]
}
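
The nulls appear because Spark's JSON schema inference gives every element of an array the same struct type, built from the union of the field names it has seen. The plain-Python sketch below only imitates that widening on the sample data; `widen_to_common_fields` is a hypothetical helper written for illustration, not a Spark API.

```python
def widen_to_common_fields(objects):
    """Give every object the same set of keys, filling missing ones with None."""
    # Union of all field names seen across the array elements.
    fields = sorted({name for obj in objects for name in obj})
    # Rebuild each object with the full field list; absent fields become None.
    return [{name: obj.get(name) for name in fields} for obj in objects]


key2 = [{"nk1": "nv1"}, {"nk2": "nv2"}, {"nk3": "nv3"}]
widened = widen_to_common_fields(key2)
for row in widened:
    print(row)
```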

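I need them as columns in a Spark DataFrame:

"key1" "nk1" "nk2" "nk3"
"v1"   "nv1" "nv2" "nv3"

Please help me solve this. I was thinking of converting it to a string and using a regular expression. Is there a better solution?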

5n0oy7gb  #1

You can explode the array and pivot on key2:

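import pyspark.sql.functions as F

# map_keys/map_values require MapType, so this assumes each element of
# data.key2 is a single-entry map (array<map<string,string>>).
df2 = df.select(
    F.col('data.key1').alias('key1'),
    F.explode('data.key2').alias('key2')      # one row per array element
).select(
    'key1',
    F.map_keys('key2')[0].alias('key'),       # the entry's only key
    F.map_values('key2')[0].alias('val')      # the entry's only value
).groupBy('key1').pivot('key').agg(F.first('val'))  # one column per key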

df2.show()
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
|  v1|nv1|nv2|nv3|
+----+---+---+---+