I have a JSON array, shown below.
id address
1 [{street: 11 Summit Ave, city: null, postal_code: 07306, state: NJ , country: null}, {street: 11 Sum Ave , city: null , postal_code: null, state: NJ, country: US}, {street: 12 Oliver Avenue, city: Seattle , postal_code: 98121, state: WA, country: US}]
Here is the schema:
root
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- street: string (nullable = true)
| | |-- postalCode: string (nullable = true)
| | |-- country: string (nullable = true)
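I want to build a single address string per row that skips null values and separates the individual addresses with a delimiter (say, ;). The output should look like this: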
id addresses
1 11 Summit Ave 07306 NJ ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US
How can I do this in PySpark? In case it matters, my original address column is a string; I converted it to the schema shown above using from_json.
1 Answer
f4t66c6m1#
This works:
Input:
Output:
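A minimal sketch of one way to get the requested string using only Spark's built-in functions, not necessarily the approach this answer originally used. It assumes Spark 2.4 or later (for transform and array_join) and the column and field names from the printed schema (addresses, postalCode). The helper names build_address_expr and with_address_string are hypothetical, introduced here for illustration. The key point is that concat_ws skips null arguments, so no explicit null check is needed:

```python
# Field order follows the desired output in the question:
# street, city, postal code, state, country.
ADDRESS_FIELDS = ("street", "city", "postalCode", "state", "country")


def build_address_expr(col="addresses", fields=ADDRESS_FIELDS, sep="; "):
    """Return a Spark SQL expression that flattens an array of
    address structs into a single delimited string."""
    # concat_ws skips NULL arguments, so missing fields leave no gap.
    per_address = "concat_ws(' ', " + ", ".join(f"a.{f}" for f in fields) + ")"
    # transform maps each struct to its string; array_join glues
    # the per-address strings together with the separator.
    return f"array_join(transform({col}, a -> {per_address}), '{sep}')"


def with_address_string(df, col="addresses"):
    """Replace the array column `col` with the joined address string."""
    # Imported lazily so build_address_expr can be inspected
    # without a Spark installation.
    from pyspark.sql import functions as F

    return df.withColumn(col, F.expr(build_address_expr(col)))
```

With a DataFrame df that has the schema above, with_address_string(df).show(truncate=False) should display one string per id. Note that concat_ws only drops nulls: empty strings and stray leading or trailing spaces inside a field (such as "NJ " in the sample data) are kept unless you wrap each field in trim.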