PySpark: reading from Excel where only one column is in JSON format

pjngdqdw posted on 2021-05-27 in Spark

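I have data stored in an Excel file, but only one column is in JSON format. I want to flatten this column, and below is what I have tried so far.
First, here are the input data and the expected output: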

Input data

[Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"100\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"200\\",\\"band\\":\\"Melody\\"}}","{\\"data\\":{\\"pop\\":\\"300\\",\\"band\\":\\"Race\\"}}"]', id='1abc'),  
 Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"400\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Jazz\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Loops\\"}}"]', id='2cde')]

Expected output

id = 1abc, state = IL, pop = None, band = None
id = 1abc, state = CA, pop = None, band = None
id = 1abc, state = None, pop = 100, band = Rock
id = 1abc, state = None, pop = 200, band = Melody
id = 1abc, state = None, pop = 300, band = Race
id = 2cde, state = IL, pop = None, band = None
id = 2cde, state = CA, pop = None, band = None
id = 2cde, state = None, pop = 400, band = Rock
id = 2cde, state = None, pop = 500, band = Jazz
id = 2cde, state = None, pop = 500, band = Loops

My code so far:


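import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Read the Excel file with pandas
pd_df = pd.read_excel('test.xlsx')

# Convert to a Spark DataFrame
schema = StructType([StructField("point", StringType(), True),
                     StructField("id", StringType(), True)])
df = spark.createDataFrame(pd_df, schema=schema)

# Clean up: strip the backslashes and the quotes wrapping each array element
a = df.withColumn('point', regexp_replace(col('point'), "\\\\", ""))
b = a.withColumn('point', regexp_replace(col('point'), '}","', '},'))
c = b.withColumn('point', regexp_replace(col('point'), '\\["', '['))
d = c.withColumn('point', regexp_replace(col('point'), '\\"]', ']'))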

# After cleaning
d.take(2)

[Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"100","band":"Rock"}},{"data":{"pop":"200","band":"Melody"}},{"data":{"pop":"300","band":"Race"}}]', id='1'),
 Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"400","band":"Rock"}},{"data":{"pop":"500","band":"Jazz"}},{"data":{"pop":"500","band":"Loops"}}]', id='2cde')]
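
For context, the backslashes and extra quotes in the raw point value appear because it is a JSON array whose elements are themselves JSON-encoded strings. The following is a minimal sketch in plain Python (standard json module only, outside Spark) that decodes one such value twice instead of cleaning it with regular expressions. The helper name flatten_point and the shortened sample string are assumptions made for illustration, not part of the original code.

```python
import json

# One raw `point` value, shortened from the input data above.
# The raw-string literal keeps the backslashes exactly as they are stored.
raw_point = r'["{\"data\":{\"state\":\"IL\"}}","{\"data\":{\"pop\":\"100\",\"band\":\"Rock\"}}"]'


def flatten_point(row_id, point):
    """Decode the outer array, then each element, and build one flat dict per element.

    Hypothetical helper for illustration only.
    """
    rows = []
    for element in json.loads(point):        # outer layer: array of JSON strings
        data = json.loads(element)["data"]   # inner layer: the actual object
        rows.append({
            "id": row_id,
            "state": data.get("state"),      # missing keys become None
            "pop": data.get("pop"),
            "band": data.get("band"),
        })
    return rows


rows = flatten_point("1abc", raw_point)
for row in rows:
    print(row)
```

This only illustrates the shape of the data; the regex cleaning above reaches the same single-encoded array by a different route.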

# Flatten the point column

point_schema = score_schema = StructType([StructField("state", StringType(), True),
            StructField("band", StringType(), True),
            StructField("pop", IntegerType(), True)
           ])

final_df = d.withColumn('point', from_json('point', point_schema))

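Even though I specified point_schema, the point column in the resulting DataFrame final_df is always None. I don't understand why nothing is returned. Any help would be appreciated.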

xtupzzrd #1

Use this (with lit imported from pyspark.sql.functions):

final_df = d.withColumn('point', from_json('point', lit('array<struct<data:struct<band:string,pop:string,state:string>>>')))

Alternatively, you can change your schema as follows:

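from pyspark.sql.types import ArrayType

# The JSON is an array of objects, each wrapping its fields in a "data" object
point_schema = ArrayType(StructType([
    StructField("data", StructType([
        StructField("state", StringType(), True),
        StructField("band", StringType(), True),
        StructField("pop", StringType(), True)
    ]), True)
]))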

final_df = d.withColumn('point', from_json('point', point_schema))

Note: do not change pop to int in the schema. Doing so makes the from_json(...) expression return null, because the pop field is given as a string in the JSON text.
