Reading JSON into a DataFrame with PySpark

Asked by zzoitvuj on 2021-07-09, in Spark

I am trying to read a JSON document that looks like this:

{"id":100, "name":"anna", "hometown":"chicago"}  [{"id":200, "name":"beth", "hometown":"indiana"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":700, "name":"anna", "hometown":"dudley"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1300, "name":"sarah", "hometown":"hoboken"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1100, "name":"don", "hometown":"santa monica"}  [{"id":100, "name":"anna", "hometown":"chicago"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1200, "name":"jane", "hometown":"freemont"},{"id":1600, "name":"john", "hometown":"downtown"},{"id":1500, "name":"glenn", "hometown":"uptown"}]
{"id":1400, "name":"steve", "hometown":"newtown"}  [{"id":100, "name":"anna", "hometown":"chicago"},{"id":600, "name":"john", "hometown":"san jose"},{"id":900, "name":"james", "hometown":"aurora"},{"id":1000, "name":"peter", "hometown":"elgin"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1500, "name":"glenn", "hometown":"uptown"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1500, "name":"glenn", "hometown":"uptown"}  [{"id":200, "name":"beth", "hometown":"indiana"},{"id":300, "name":"frank", "hometown":"new york"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1100, "name":"don", "hometown":"santa monica"}]

The key and the value are separated by whitespace (two spaces), and the value is a list of JSON objects.
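To make the line format concrete, here is a minimal pure-Python sketch (using only the standard `json` module) that splits one such line on the first double space and parses each half; the sample line is shortened from the data above:

```python
import json

# One line of the file: a JSON object, two spaces, then a JSON array.
line = '{"id":100, "name":"anna", "hometown":"chicago"}  [{"id":200, "name":"beth", "hometown":"indiana"}]'

head_text, tail_text = line.split("  ", 1)  # split only on the first double space
head = json.loads(head_text)  # the "key": a single JSON object
tail = json.loads(tail_text)  # the "value": a list of JSON objects

print(head["name"], len(tail))  # → anna 1
```

Each half is valid JSON on its own; it is only the line as a whole that is not.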
The code I tried:

data = spark\
.read\
.format("json")\
.load("/Users/sahilnagpal/Desktop/dataworld.json")

data.show()

The result I get:

+------------+----+-----+
|    hometown|  id| name|
+------------+----+-----+
|     chicago| 100| anna|
|santa monica|1100|  don|
|     newtown|1400|steve|
|      uptown|1500|glenn|
+------------+----+-----+

The result I want:

+------------+----+-----+
|    hometown|  id| name| 
+------------+----+-----+
|     chicago| 100| anna| -- all the other ID,name,hometown corresponding to this ID and Name
|santa monica|1100|  don| -- all the other ID,name,hometown corresponding to this ID and Name
|     newtown|1400|steve| -- all the other ID,name,hometown corresponding to this ID and Name
|      uptown|1500|glenn| -- all the other ID,name,hometown corresponding to this ID and Name
+------------+----+-----+

Answer 1 (zynd9foi):

I think that instead of reading it as a JSON file, you should try reading it as a text file, because the lines are not valid JSON on their own.
Here is the code you should try to get the expected output:

from pyspark.sql.functions import col, from_json, split
from pyspark.sql.types import StructType, StructField, StringType

data1 = spark.read.text("/Users/sahilnagpal/Desktop/dataworld.json")

schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("hometown", StringType(), True),
    ]
)

data2 = (
    data1
    # Everything before the first "[" is the header object ...
    .withColumn("JsonKey", split(col("value"), "\\[")[0])
    # ... and everything after it is the array text (note: without its leading "[").
    .withColumn("JsonValue", split(col("value"), "\\[")[1])
    .withColumn("data", from_json("JsonKey", schema))
    .select(col("data.*"), "JsonValue")
)

This gives the parsed header columns plus the raw array text in the JsonValue column.


Answer 2 (ep6jt1vc):

You can read the input as a CSV file using two spaces as the separator/delimiter, then use from_json with an appropriate schema.

from pyspark.sql import functions as F

df = spark.read.csv('/Users/sahilnagpal/Desktop/dataworld.json', sep='  ').toDF('json1', 'json2')

df2 = df.withColumn(
    'json1', 
    F.from_json('json1', 'struct<id:int, name:string, hometown:string>')
).withColumn(
    'json2', 
    F.from_json('json2', 'array<struct<id:int, name:string, hometown:string>>')
).select('json1.*', 'json2')

df2.show(truncate=False)
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id  |name |hometown    |json2                                                                                                                                                                       |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |anna |chicago     |[[200, beth, indiana], [400, pete, new jersey], [500, emily, san fransisco], [700, anna, dudley], [1100, don, santa monica], [1300, sarah, hoboken], [1600, john, downtown]]|
|1100|don  |santa monica|[[100, anna, chicago], [400, pete, new jersey], [500, emily, san fransisco], [1200, jane, freemont], [1600, john, downtown], [1500, glenn, uptown]]                         |
|1400|steve|newtown     |[[100, anna, chicago], [600, john, san jose], [900, james, aurora], [1000, peter, elgin], [1100, don, santa monica], [1500, glenn, uptown], [1600, john, downtown]]         |
|1500|glenn|uptown      |[[200, beth, indiana], [300, frank, new york], [400, pete, new jersey], [500, emily, san fransisco], [1100, don, santa monica]]                                             |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
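If you want one output row per related person rather than an array column, the usual next step in PySpark would be `F.explode('json2')`. As a plain-Python illustration of that flattening (the sample lines are shortened and illustrative, not the full data from the question):

```python
import json

# Two sample lines in the file's format: header object, two spaces, JSON array.
lines = [
    '{"id":100, "name":"anna"}  [{"id":200, "name":"beth"}, {"id":400, "name":"pete"}]',
    '{"id":1500, "name":"glenn"}  [{"id":300, "name":"frank"}]',
]

rows = []
for line in lines:
    head_text, tail_text = line.split("  ", 1)
    head = json.loads(head_text)
    for friend in json.loads(tail_text):
        # One output row per (header, related-person) pair, like F.explode.
        rows.append((head["id"], head["name"], friend["id"], friend["name"]))

print(rows)
```

Here `rows` contains three tuples: anna paired with beth and pete, and glenn paired with frank.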
