Extract JSON values and concatenate them using PySpark

o2g1uqev  posted on 2023-02-21  in  Spark

I have a JSON array that looks like this.

id  address

1   [{street: 11 Summit Ave, city: null, postal_code: 07306, state: NJ , country: null}, {street: 11 Sum Ave , city: null , postal_code: null, state: NJ, country: US}, {street: 12 Oliver Avenue, city: Seattle , postal_code: 98121, state: WA, country: US}]

Here is the schema:

root
 |-- id: string (nullable = true)
 |-- addresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- street: string (nullable = true)
 |    |    |-- postalCode: string (nullable = true)
 |    |    |-- country: string (nullable = true)

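I want to build a single address string per id that skips null values and separates the individual addresses with a delimiter (say ;). The output should look like this: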

id  addresses

1   11 Summit Ave 07306 NJ ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US

How can I do this in PySpark? In case it matters, my original address column is a string; I converted it to the schema shown above with from_json.
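To pin down the expected behaviour before moving to Spark, here is a minimal plain-Python sketch of the transformation for a single row. The function name join_addresses and the constant FIELD_ORDER are illustrative only; the field order is assumed from the expected output above (street, city, postalCode, state, country), and stripping whitespace around each value is also an assumption.

```python
from typing import Optional

# Field order assumed from the expected output in the question, not from the schema.
FIELD_ORDER = ["street", "city", "postalCode", "state", "country"]


def join_addresses(addresses: list[dict[str, Optional[str]]], sep: str = "; ") -> str:
    """Join each address's non-null fields with spaces, then join addresses with sep."""
    parts = []
    for addr in addresses:
        values = [addr.get(field) for field in FIELD_ORDER]
        # Skip nulls and strip stray whitespace around each remaining value.
        parts.append(" ".join(v.strip() for v in values if v is not None))
    return sep.join(parts)


row = [
    {"street": "11 Summit Ave", "city": None, "postalCode": "07306", "state": "NJ", "country": None},
    {"street": "11 Sum Ave", "city": None, "postalCode": None, "state": "NJ", "country": "US"},
    {"street": "12 Oliver Avenue", "city": "Seattle", "postalCode": "98121", "state": "WA", "country": "US"},
]

print(join_addresses(row))
# 11 Summit Ave 07306 NJ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US
```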

f4t66c6m  #1

This works:

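from pyspark.sql import functions as F

# One row per address; concat_ws skips null fields when joining them.
df.withColumn("allAdd", F.explode("addresses"))\
.withColumn("asString", F.expr("concat_ws(' ', allAdd.*)"))\
.groupBy("id")\
.agg(F.concat_ws("; ", F.collect_list("asString")).alias("asString"))\
.show(truncate=False)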
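Two caveats about the snippet above: allAdd.* expands the struct fields in schema order (city, state, street, postalCode, country), which differs from the order in the expected output, and collect_list does not guarantee element order after the shuffle caused by groupBy. A possible alternative, sketched here under assumptions, is to stay inside the array with the Spark SQL higher-order function transform (available since Spark 2.4) and list the fields explicitly. The helper names address_expr and with_address_string are hypothetical, and wrapping each field in trim is an assumption prompted by the trailing spaces in the sample data.

```python
def address_expr(column: str, fields: list[str], sep: str = "; ") -> str:
    """Build a Spark SQL expression that flattens an array of address structs.

    Each struct becomes its non-null fields joined by a space (concat_ws skips
    nulls); the resulting array of strings is then joined with `sep`.
    """
    field_refs = ", ".join(f"trim(a.{name})" for name in fields)
    return f"concat_ws('{sep}', transform({column}, a -> concat_ws(' ', {field_refs})))"


def with_address_string(
    df,
    column: str = "addresses",
    fields: tuple = ("street", "city", "postalCode", "state", "country"),
):
    """Return df with an extra 'asString' column; no explode or groupBy needed."""
    # Imported lazily so address_expr can be used without a Spark installation.
    from pyspark.sql import functions as F

    return df.withColumn("asString", F.expr(address_expr(column, list(fields))))
```

Usage would be along the lines of with_address_string(df).select("id", "asString").show(truncate=False). Because the array is never exploded, the addresses keep their original order within each row. Empty strings are not nulls, so a field that is blank rather than null would still contribute an extra space.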
