I'm having trouble reading multi-line JSON in PySpark. Example:
{
"_index": "kl.service-log.2021.04.06",
"_type": "_doc",
"_id": "hZ3SpHgBhp2ht1Q8n8ym",
"_version": 1,
"_score": null,
"_source": {
"publishTime": "2021-04-06T01:36:09.422Z",
"client_ips": "2601:247:c580:3337:45c0:dd63:35e0:9247",
"body": {
"events": "[{\"key\":\"Key Launched\",\"count\":1,\"timestamp\":1617672914673,\"sum\":0},{\"key\":\"Viewed Screen\",\"count\":1,\"timestamp\":1617672969301,\"sum\":0}]",
"sdk_name": "java-native-android",
"tz": "-300"
}
}
}
The schema is as follows:
root
 |-- _id: string (nullable = true)
 |-- _index: string (nullable = true)
 |-- _score: string (nullable = true)
 |-- _source: struct (nullable = true)
 |    |-- body: struct (nullable = true)
 |    |    |-- events: string (nullable = true)
 |    |    |-- sdk_name: string (nullable = true)
 |    |    |-- tz: string (nullable = true)
 |    |-- client_ips: string (nullable = true)
 |    |-- publishTime: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- _version: long (nullable = true)
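Under _source.body.events, I see the data type is string, but the value is really a list of dictionaries holding two different records. I want to split these into separate rows with specific columns.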
1 Answer
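You can use from_json to parse the events string into an array of structs, and rebuild the _source column around it. If you want to split the array into separate rows, you can then explode it on the resulting df2.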