I have been trying to load JSON into a PySpark DataFrame, but I'm having some difficulty here.
This is what I have tried so far (with and without multiline):
import json
newJson = json.dumps(testjson)
newdf = spark.read.json(sc.parallelize([newJson]))
newdf.display()
The JSON file:
testjson = [
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
('{"id":434, "address" : ["432.432.432.432", "432.432.432.432", "432.432.432.432", "432.432.432.432"]}',),
]
When I try to display the DataFrame, all I get is _corrupt_record. What am I doing wrong?
1 Answer
xienkqul1#
Try converting it to a list of strings; Spark cannot make sense of a list of string tuples. Also, json.dumps is unnecessary, since Spark can parse the JSON input directly. A sketch of the fix is shown below.
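For illustration, here is a minimal sketch of that fix. It assumes the testjson list exactly as defined in the question; the SparkSession setup is only needed outside environments (such as Databricks notebooks) where spark and sc already exist:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# testjson as defined in the question: a list of one-element tuples,
# each holding a JSON string.

# Unwrap each one-element tuple so Spark receives plain JSON strings.
json_strings = [row[0] for row in testjson]

# spark.read.json accepts an RDD of JSON strings directly,
# so no json.dumps call is needed.
newdf = spark.read.json(sc.parallelize(json_strings))
newdf.show(truncate=False)

With the tuples unwrapped, Spark should infer the schema from the JSON itself (address as array<string>, id as bigint) instead of producing a _corrupt_record column.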