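I have data stored in Excel, but only one column is in JSON format. I want to flatten this column, and so far I have tried the following.
First, here is the input data and the expected output:
Input data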
[Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"100\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"200\\",\\"band\\":\\"Melody\\"}}","{\\"data\\":{\\"pop\\":\\"300\\",\\"band\\":\\"Race\\"}}"]', id='1abc'),
Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"400\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Jazz\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Loops\\"}}"]', id='2cde')]
Expected output data
id = 1abc, state = IL, pop = None, band = None
id = 1abc, state = CA, pop = None, band = None
id = 1abc, state = None, pop = 100, band = Rock
id = 1abc, state = None, pop = 200, band = Melody
id = 1abc, state = None, pop = 300, band = Race
id = 2cde, state = IL, pop = None, band = None
id = 2cde, state = CA, pop = None, band = None
id = 2cde, state = None, pop = 400, band = Rock
id = 2cde, state = None, pop = 500, band = Jazz
id = 2cde, state = None, pop = 500, band = Loops
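Code so far:
# Imports needed by the snippets below
import pandas as pd
from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
# Read as pandas
pd_df = pd.read_excel('test.xlsx')
# Convert to spark df (assumes an existing SparkSession named `spark`)
schema = StructType([StructField("point", StringType(), True),
                     StructField("id", StringType(), True)
                     ])
df = spark.createDataFrame(pd_df, schema = schema)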
# Do some cleaning to remove \\ and quotes
a = df.withColumn('point', regexp_replace(col('point'), "\\\\", ""))
b = a.withColumn('point', regexp_replace(col('point'), '}","', '},'))
c = b.withColumn('point', regexp_replace(col('point'), '\\["', '['))
d = c.withColumn('point', regexp_replace(col('point'), '\\"]', ']'))
# after cleaning
d.take(2)
```
[Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"100","band":"Rock"}},{"data":{"pop":"200","band":"Melody"}},{"data":{"pop":"300","band":"Race"}}]', id='1'),
Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"400","band":"Rock"}},{"data":{"pop":"500","band":"Jazz"}},{"data":{"pop":"500","band":"Loops"}}]', id='2cde')]
```
# Flatten the point column
point_schema = score_schema = StructType([StructField("state", StringType(), True),
                                          StructField("band", StringType(), True),
                                          StructField("pop", IntegerType(), True)
                                          ])
final_df = d.withColumn('point', from_json('point', point_schema))
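Even though I specify point_schema, the point column of the resulting DataFrame final_df is always None. I don't understand why nothing is returned. Any help would be appreciated.
1 Answer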
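You can get this working by changing your schema.
Please note: do not change pop to int in the schema. That results in null for every from_json(...) expression, because the value of the pop field is given as a string in the JSON string.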