I am new to PySpark. I am trying to read the values of a nested column in JSON data. Here is the schema of my JSON:
|-- _index: string (nullable = true)
|-- _score: string (nullable = true)
|-- _source: struct (nullable = true)
| |-- layers: struct (nullable = true)
| | |-- R1.TEST6: struct (nullable = true)
| | | |-- R1.TEST1: struct (nullable = true)
| | | | |-- R1.TEST1.idx: string (nullable = true)
| | | | |-- R1.TEST1.ide: string (nullable = true)
| | | |-- R1.TEST3: struct (nullable = true)
| | | | |-- R1.TEST3.PDU: string (nullable = true)
| | | | |-- R1.TEST3.pdu: string (nullable = true)
| | | | |-- R1.TEST4: struct (nullable = true)
| | | | | |-- R1.TEST2: struct (nullable = true)
| | | | | | |-- R1.TEST2.agg: string (nullable = true)
| | | | | | |-- R1.TEST2.size: string (nullable = true)
| | | | | | |-- R1.TEST2.start: string (nullable = true)
| | | | | | |-- R1.TEST2.beam: string (nullable = true)
| | | | | | |-- R1.TEST2.startIndex: string (nullable = true)
| | | | | | |-- R1.TEST2.regType: string (nullable = true)
| | | | | | |-- R1.TEST2.coreSetType: string (nullable = true)
| | | | | | |-- R1.TEST2.cpType: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column3: string (nullable = true)
Following the approach described in https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark, I tried the following:
df2 = df.select(F.array(F.expr("_source.*")).alias("Source"))
Now I need to access the values under the R1.TEST6 tag.
But the following code does not work:
df2.withColumn("source_data", F.explode(F.arrays_zip("Source"))).select("source_data.Source.R1.TEST6.R1.TEST1.idx").show()
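Can someone help me access all the fields of this nested JSON and create a table from them? Since there are multiple nesting levels under _source.R1.TEST6, how do I use explode across those levels?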
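For reference, here is a minimal sketch of one possible approach, not a confirmed answer. Two observations behind it: explode only applies to array or map columns, and the schema above contains only structs, so explode may not be needed at all. Also, the field names themselves contain dots (for example R1.TEST1.idx), so each name in a column path most likely has to be wrapped in backticks, otherwise Spark treats every dot as a nesting separator. The helpers below (leaf_paths, quoted, alias) are hypothetical names introduced for illustration; they walk the dict form of a schema, as returned by df.schema.jsonValue(), and build one backtick-quoted path per leaf field. SAMPLE is a hand-written fragment of the schema above, used only so the sketch runs without Spark.

```python
def leaf_paths(struct, prefix=()):
    """Yield a tuple of field names for every non-struct leaf in a schema dict."""
    for field in struct["fields"]:
        path = prefix + (field["name"],)
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: recurse, carrying the path walked so far.
            yield from leaf_paths(ftype, path)
        else:
            yield path


def quoted(path):
    """Join field names with dots, wrapping each name in backticks."""
    return ".".join("`" + name + "`" for name in path)


def alias(path):
    """Output column name: the leaf name with dots replaced by underscores."""
    return path[-1].replace(".", "_")


# Small helpers to hand-write a schema dict in the df.schema.jsonValue() layout.
def _struct(*fields):
    return {"type": "struct", "fields": list(fields)}


def _field(name, ftype="string"):
    return {"name": name, "type": ftype, "nullable": True, "metadata": {}}


# Assumption: a fragment of the schema from the question, not the full thing.
SAMPLE = _struct(
    _field("_index"),
    _field("_source", _struct(
        _field("layers", _struct(
            _field("R1.TEST6", _struct(
                _field("R1.TEST1", _struct(
                    _field("R1.TEST1.idx"),
                    _field("R1.TEST1.ide"),
                )),
            )),
        )),
    )),
)

paths = list(leaf_paths(SAMPLE))
for p in paths:
    print(quoted(p), "->", alias(p))

# With a real DataFrame (requires pyspark, not executed in this sketch):
# from pyspark.sql import functions as F
# flat = df.select([F.col(quoted(p)).alias(alias(p))
#                   for p in leaf_paths(df.schema.jsonValue())])
# flat.show()
```

Note that the alias here is just the leaf name, so leaves that share a name (such as the repeated R1.TEST2.column1 entries) would produce duplicate column names; joining the whole path into the alias is one way around that. If any level turned out to be an array rather than a struct, that column would need an explode before its fields could be selected this way.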