pyspark,如何解析格式为dict的字符串并将一些键作为新列附加

8fq7wneg  于 2021-07-09  发布在  Spark
关注(0)|答案(1)|浏览(338)

我在读取带有奇怪列的csv文件时遇到了一个问题。
架构

root
 |-- Id: integer (nullable = true)
 |-- Lon_tower: double (nullable = true)
 |-- Lat_tower: double (nullable = true)
 |-- Compagny: string (nullable = true)
 |-- Address_tower: string (nullable = true)
 |-- Assigned_band_1: string (nullable = true)
 |-- Assigned_band_2: string (nullable = true)
 |-- Assigned_band_3: string (nullable = true)
 |-- Assigned_band_4: string (nullable = true)
 |-- Assigned_band_5: string (nullable = true)
 |-- raw_geocode: string (nullable = true)

原始地理代码示例

[{'road': 'Calle el Topo', 'residential': 'Los Sauces', 'hamlet': 'El Cardal', 'village': 'Los Sauces', 'city': 'San Andrés y Sauces', 'county': 'Santa Cruz de Tenerife', 'archipelago': 'Canarias', 'postcode': '38720', 'country': 'España', 'country_code': 'es'}]

我想获取作为标题的键,并用值填充sparkdataframe,如果此行不存在该键,则使用null。我不想要所有的钥匙,只想要清单上的一些。我删除了[']
一个更好理解的例子:

myList = ['road', 'tourism', 'country_code']
|Id   |...|raw_geocode |
|1    |...|{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}
|2    |...|{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}

期望结果

|ID  |...|road          |tourism                   |country_code|
|1   |...|Calle el Topo |NULL                      |es
|2   |...|Null          |Mirador Montaña El Molino |es
uqcuzwp8

uqcuzwp81#

你可以用 regexp_extract 要提取所需的值:

myList = ['road', 'tourism', 'country_code']

for i in myList:
    df = df.withColumn(
        i, 
        F.when(
            F.regexp_extract('raw_geocode', i+': ([^,}]+)', 1) != "",
            F.regexp_extract('raw_geocode', i+': ([^,}]+)', 1)
        )
    )

df.show(truncate=False)
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|Id |raw_geocode                                                                                                 |road                     |tourism                  |country_code|
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|1  |{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}                             |Calle el Topo            |null                     |es          |
|2  |{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}|Mirador Montaña El Molino|Mirador Montaña El Molino|es          |
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+

相关问题