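I have a Spark DataFrame with one particular column, location_string. I want to decompose it into three columns called country, region, and city. Then I want to combine these with the existing country, region, and city columns so that null values get filled in. In other words, I want to apply my function wherever city, region, or country is null, falling back on location_string.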
Example dataset:
+--------------------+-----------------+------+-------+
| location_string| city|region|country|
+--------------------+-----------------+------+-------+
|Jonesboro, AR, US...| NULL| AR| NULL|
|Lake Village, AR,...| Lake Village| AR| USA|
|Little Rock, AR, ...| Little Rock| AR| USA|
|Little Rock, AR, ...| Little Rock| AR| USA|
|Malvern, AR, US, ...| Malvern| NULL| USA|
|Malvern, AR, US, ...| Malvern| AR| USA|
|Morrilton, AR, US...| Morrilton| AR| USA|
|Morrilton, AR, US...| Morrilton| AR| USA|
|N. Little Rock, A...|North Little Rock| AR| USA|
|N. Little Rock, A...|North Little Rock| AR| USA|
|Ozark, AR, US, 72949| Ozark| AR| USA|
|Ozark, AR, US, 72949| Ozark| AR| USA|
|Palestine, AR, US...| NULL| AR| USA|
|Pine Bluff, AR, U...| Pine Bluff| AR| NULL|
|Pine Bluff, AR, U...| Pine Bluff| AR| USA|
|Prescott, AR, US,...| Prescott| AR| USA|
|Prescott, AR, US,...| Prescott| AR| USA|
|Searcy, AR, US, 7...| Searcy| AR| USA|
|Searcy, AR, US, 7...| Searcy| AR| USA|
|West Memphis, AR,...| West Memphis| NULL| USA|
+--------------------+-----------------+------+-------+
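Example function that decomposes the location string:
import geocoder

def geocoder_decompose_location(location_string):
    # Nothing to geocode: return an all-null result.
    if not location_string:
        return {'country': None, 'state': None, 'city': None}
    GOOGLE_GEOCODE_API_KEY = "<API KEY HERE>"
    result = geocoder.google(location_string, key=GOOGLE_GEOCODE_API_KEY)
    # Note: the 'state' key corresponds to the DataFrame's region column.
    return {'country': result.country, 'state': result.state, 'city': result.city}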
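For strings shaped like the ones in the sample data ("Ozark, AR, US, 72949"), a plain comma split may be enough to test the null-filling logic without calling the API. The sketch below is only an illustration: `split_location` is a hypothetical helper, and it assumes the order city, state, country, postal code. It returns the same dictionary shape as the function above, but the country comes back as written in the string ("US"), not normalized to "USA".

```python
def split_location(location_string):
    """Split 'city, state, country[, zip]' into a dict (assumed field order)."""
    if not location_string:
        return {'country': None, 'state': None, 'city': None}
    parts = [p.strip() for p in location_string.split(',')]
    if len(parts) < 3:
        # Not enough fields to trust the assumed order; return all nulls.
        return {'country': None, 'state': None, 'city': None}
    city, state, country = parts[0], parts[1], parts[2]
    # Treat empty fields as missing.
    return {'country': country or None, 'state': state or None, 'city': city or None}


print(split_location("Ozark, AR, US, 72949"))
```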
1 Answer
vtwuwzda #1
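Scala pseudocode
First, we need to remove all duplicates from the df (this reduces the number of API calls to the Google service).
We can also run orderBy(desc("city"), desc("country"), desc("state")) before dropping duplicates, so that when duplicates exist, the rows with null values are the ones dropped (a descending sort places nulls last).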
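The idea of geocoding each distinct string only once and then filling the gaps could be sketched in PySpark roughly as follows. This is a minimal sketch, not the answerer's code: `build_lookup`, `fill_missing_location`, and the `geo_*` column names are hypothetical, the `state` key is assumed to map to the `region` column, and the distinct strings that need geocoding are assumed to be few enough to collect on the driver.

```python
def build_lookup(location_strings, decompose):
    """Call decompose once per distinct non-empty string; return row tuples."""
    lookup = []
    for s in sorted({s for s in location_strings if s}):
        parts = decompose(s)
        lookup.append((s, parts['city'], parts['state'], parts['country']))
    return lookup


def fill_missing_location(spark, df, decompose):
    """Fill null city/region/country from location_string (sketch)."""
    # Imported here so build_lookup stays usable without Spark installed.
    from pyspark.sql import functions as F

    # Only rows with at least one missing field need a geocoding call.
    needs = df.filter(
        F.col("city").isNull() | F.col("region").isNull() | F.col("country").isNull()
    )
    strings = [
        r["location_string"]
        for r in needs.select("location_string").distinct().collect()
    ]
    lookup_df = spark.createDataFrame(
        build_lookup(strings, decompose),
        schema="location_string string, geo_city string, "
               "geo_region string, geo_country string",
    )
    joined = df.join(lookup_df, on="location_string", how="left")
    # Keep existing values; use the geocoded value only where the column is null.
    return (
        joined
        .withColumn("city", F.coalesce("city", "geo_city"))
        .withColumn("region", F.coalesce("region", "geo_region"))
        .withColumn("country", F.coalesce("country", "geo_country"))
        .drop("geo_city", "geo_region", "geo_country")
    )
```

It would be called as `fill_missing_location(spark, df, geocoder_decompose_location)`. Running the geocoder on the driver keeps the API key and the rate limiting in one place, at the cost of not scaling to a very large number of distinct strings.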