我在读取带有奇怪列的csv文件时遇到了一个问题。
架构
root
|-- Id: integer (nullable = true)
|-- Lon_tower: double (nullable = true)
|-- Lat_tower: double (nullable = true)
|-- Compagny: string (nullable = true)
|-- Address_tower: string (nullable = true)
|-- Assigned_band_1: string (nullable = true)
|-- Assigned_band_2: string (nullable = true)
|-- Assigned_band_3: string (nullable = true)
|-- Assigned_band_4: string (nullable = true)
|-- Assigned_band_5: string (nullable = true)
|-- raw_geocode: string (nullable = true)
原始地理代码示例
[{'road': 'Calle el Topo', 'residential': 'Los Sauces', 'hamlet': 'El Cardal', 'village': 'Los Sauces', 'city': 'San Andrés y Sauces', 'county': 'Santa Cruz de Tenerife', 'archipelago': 'Canarias', 'postcode': '38720', 'country': 'España', 'country_code': 'es'}]
我想获取作为标题的键,并用值填充sparkdataframe,如果此行不存在该键,则使用null。我不想要所有的钥匙,只想要清单上的一些。我删除了[']
一个更好理解的例子:
myList = ['road', 'tourism', 'country_code']
|Id |...|raw_geocode |
|1 |...|{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}
|2 |...|{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}
期望结果
|ID |...|road |tourism |country_code|
|1 |...|Calle el Topo |NULL |es
|2 |...|Null |Mirador Montaña El Molino |es
1条答案
按热度按时间uqcuzwp81#
你可以用
regexp_extract
要提取所需的值: