Trouble converting JSON into a Spark DataFrame

byqmnocz · posted 2021-07-09 in Spark

I have been trying to load JSON into a PySpark DataFrame, but I'm having some trouble.
This is what I have tried so far (with and without the multiLine option):

import json
newJson = json.dumps(testjson)
newdf = spark.read.json(sc.parallelize([newJson]))
newdf.display()

The JSON data (testjson):

testjson = [
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
 ('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',), 
]

When I try to display the DataFrame, I just get a "_corrupt_record" column. What am I doing wrong?

xienkqul 1#

Try converting it to a list of strings; Spark cannot make sense of a list of string tuples. Also, json.dumps is unnecessary, since Spark can parse the JSON input directly.

df = spark.read.json(sc.parallelize([i[0] for i in testjson]))

df.show(truncate=False)
+--------------------------------------------------------------------+---+
|address                                                             |id |
+--------------------------------------------------------------------+---+
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
|[432.432.432.432, 432.432.432.432, 432.432.432.432, 432.432.432.432]|434|
+--------------------------------------------------------------------+---+
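If you would rather not rely on schema inference, one further option (a minimal sketch building on the fix above, not part of the original answer) is to supply an explicit schema; with an explicit schema, records that fail to parse show up as null rows instead of a _corrupt_record column:

from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Schema mirroring the sample records: an integer id and an array of address strings
schema = StructType([
    StructField("id", LongType()),
    StructField("address", ArrayType(StringType())),
])

# Same list-of-strings input as above, but with the schema supplied up front
df = spark.read.schema(schema).json(sc.parallelize([i[0] for i in testjson]))
df.show(truncate=False)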
