Reading a nested JSON structure in PySpark

nqwrtyyt · asked 2021-07-13 · in Spark

I am new to PySpark. I am trying to read the value of a nested column in some JSON data. Here is my JSON schema:

 |-- _index: string (nullable = true)
 |-- _score: string (nullable = true)
 |-- _source: struct (nullable = true)
 |    |-- layers: struct (nullable = true)
 |    |    |-- R1.TEST6: struct (nullable = true)
 |    |    |    |-- R1.TEST1: struct (nullable = true)
 |    |    |    |    |-- R1.TEST1.idx: string (nullable = true)
 |    |    |    |    |-- R1.TEST1.ide: string (nullable = true)
 |    |    |    |-- R1.TEST3: struct (nullable = true)
 |    |    |    |    |-- R1.TEST3.PDU: string (nullable = true)
 |    |    |    |    |-- R1.TEST3.pdu: string (nullable = true)
 |    |    |    |    |-- R1.TEST4: struct (nullable = true)
 |    |    |    |    |    |-- R1.TEST2: struct (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.agg: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.size: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.start: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.beam: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.startIndex: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.regType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.coreSetType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.cpType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column3: string (nullable = true)

Following https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark, I tried the following:

from pyspark.sql import functions as F

df2 = df.select(F.array(F.expr("_source.*")).alias("Source"))

Now I need to access the values under the R1.TEST6 tag, but the following code does not work:

df2.withColumn("source_data", F.explode(F.arrays_zip("Source"))).select("source_data.Source.R1.TEST6.R1.TEST1.idx").show()
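Two things likely break this select: the path skips the `layers` level, and the field names in this schema contain literal dots (`R1.TEST6`, `R1.TEST1.idx`), which Spark's column parser treats as struct separators unless each name is wrapped in backticks. A minimal sketch of building a correctly quoted column path (field names taken from the schema above; the commented `df.select` is an assumption, untested against the real data):

```python
def quoted_path(*parts):
    """Join struct field names into a Spark column path,
    backtick-quoting any name that contains a literal dot."""
    return ".".join(f"`{p}`" if "." in p else p for p in parts)

path = quoted_path("_source", "layers", "R1.TEST6", "R1.TEST1", "R1.TEST1.idx")
print(path)  # _source.layers.`R1.TEST6`.`R1.TEST1`.`R1.TEST1.idx`

# With a SparkSession and the real DataFrame, the value could then be
# read directly, with no array/explode step needed:
#   from pyspark.sql import functions as F
#   df.select(F.col(path).alias("idx")).show()
```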

Can someone help me access all the fields of this nested JSON and create a flat table? Since there are multiple levels of nesting under `_source.R1.TEST6` in this JSON, how do I use explode across multiple levels?
