Getting only road and country from libpostal (pypostal) in PySpark

zf9nrax1 posted on 2021-07-14 in Spark

I am using libpostal (pypostal), but I only need `road` and `country`, returned as an array such as ["franklin ave","usa"], ["leonard st","united kingdom"]. How can I achieve this?
The return type is net.razorvine.pickle.objects.ClassDictConstructor.

```
from pyspark.sql.functions import udf

LIBPOSTAL_LOADED = False

@udf("string")
def parse(address):
    # Imported inside the UDF so the import runs on the worker processes
    from postal.parser import parse_address

    address_parsed = parse_address(address)

    return str(address_parsed)

# Assumes an active SparkSession named `spark`
addresses = [
    '781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA',
    'The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom',
]
spark.createDataFrame(addresses, "string").toDF("address").select(parse("address")).show(truncate=False)
```

![](https://i.stack.imgur.com/D9ZXB.png)
Update, as requested by @mck:

```
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address

    # Keep only the values labeled road or country
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

    return address_parsed
```

```
+------------------+
|[franklin ave,usa]|
+------------------+
```

This is as expected.

```
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address

    # Keep only the values labeled road or country
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

    return address_parsed[0]
```

```
+-----+
|null |
+-----+
```

This is not as expected. I want the first element of `address_parsed`, which is `franklin ave`.
pgccezyw #1

You can try a list comprehension to filter the parsed address before returning it:

```
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address

    # parse_address returns (value, label) pairs; keep the road and country values
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

    return address_parsed
```
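
The filtering step can be tried on its own, without Spark or libpostal installed. The sketch below is only an illustration: `select_components` is a hypothetical helper name, and `SAMPLE_PARSED` is a hand-written list in the `(value, label)` shape that `parse_address` returns, not captured libpostal output.

```python
# Hand-written sample in the (value, label) shape returned by
# postal.parser.parse_address; illustrative only, not captured output.
SAMPLE_PARSED = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("crown heights", "suburb"),
    ("brooklyn", "city_district"),
    ("nyc", "city"),
    ("ny", "state"),
    ("11216", "postcode"),
    ("usa", "country"),
]


def select_components(parsed, wanted=("road", "country")):
    """Return the values whose label is in `wanted`, in parse order."""
    return [value for value, label in parsed if label in wanted]


print(select_components(SAMPLE_PARSED))
```

Inside the UDF, the same helper could be applied as `select_components(parse_address(address))`, assuming the helper is importable on the workers or defined in the same script.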

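On the follow-up in the question: returning `address_parsed[0]` hands Spark a plain string while the UDF is still declared with an array return type, and Spark typically yields null when the returned Python value does not match the declared type. A minimal sketch of one possible way around it, assuming the UDF is declared with `@udf("string")` instead. `first_component` and `make_first_component_udf` are hypothetical names introduced here for illustration.

```python
def first_component(parsed, wanted=("road", "country")):
    """Return the first value whose label is in `wanted`, or None if absent."""
    for value, label in parsed:
        if label in wanted:
            return value
    return None  # Spark displays None as null


def make_first_component_udf():
    """Build the Spark UDF. Imports are deferred so that first_component
    can be used without pyspark or pypostal installed."""
    from pyspark.sql.functions import udf

    @udf("string")  # a single string is returned, so the declared type is string
    def parse_first(address):
        from postal.parser import parse_address

        return first_component(parse_address(address))

    return parse_first
```

With pyspark and pypostal available, it might be used as `df.select(make_first_component_udf()("address"))`. Addresses for which libpostal finds neither label would come back as null.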