Looking up dictionary values for an array column in PySpark

7qhs6swi  posted on 2024-01-06 in Spark

I have a DataFrame like this in PySpark:

    from pyspark.sql import functions as F
    # assumes an active SparkSession named `spark`
    data = [("definitely somewhere",), ("Las Vegas",), ("其他",), (None,), ("",), ("Pucela Madrid Langreo, España",), ("Trenches, With Egbon Adugbo",)]
    df = spark.createDataFrame(data, ["address"])
    city_country = {
        'las vegas': 'US',
        'lagos': 'NG',
        'España': 'ES'
    }
    cities_name_to_code = spark.sparkContext.broadcast(city_country)
    # lower-case the address, then split it on ', ' into an array of candidate city names
    df_with_codes = df.withColumn('cities_array', F.lower(F.col('address'))) \
        .withColumn('cities_array', F.split(F.col('cities_array'), ', '))

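For every element of cities_array, I want to look up the matching key in cities_name_to_code and get back an array of the corresponding values. The catch is that I don't want to use a UDF.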

kuhbmx9i  #1

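For this use case you can use the transform higher-order function and pass a CASE WHEN expression as the function it applies to each array element.
Here is an example: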

    from functools import reduce
    from pyspark.sql import functions as func
    # create case when builder function: one WHEN branch per dict entry,
    # with keys lower-cased to match the lower-cased array elements
    case_whens = lambda c: reduce(lambda x, y: x.when(c == y[0].lower(), y[1]), city_country.items(), func)
    # test case when builder
    # case_whens(func.lit('bork'))
    # Column<'CASE WHEN (bork = las vegas) THEN US WHEN (bork = lagos) THEN NG WHEN (bork = españa) THEN ES END'>
    # use case when inside the `transform` (`df` and `city_country` come from the question)
    df_with_codes_sdf = df. \
        withColumn('cities_array', func.lower(func.col('address'))). \
        withColumn('cities_array', func.split(func.col('cities_array'), ', ')). \
        withColumn('city_codes_array', func.transform('cities_array', lambda a: case_whens(a)))
    df_with_codes_sdf.show(truncate=False)
    # +-----------------------------+-------------------------------+----------------+
    # |address                      |cities_array                   |city_codes_array|
    # +-----------------------------+-------------------------------+----------------+
    # |definitely somewhere         |[definitely somewhere]         |[null]          |
    # |Las Vegas                    |[las vegas]                    |[US]            |
    # |其他                         |[其他]                         |[null]          |
    # |null                         |null                           |null            |
    # |                             |[]                             |[null]          |
    # |Pucela Madrid Langreo, España|[pucela madrid langreo, españa]|[null, ES]      |
    # |Trenches, With Egbon Adugbo  |[trenches, with egbon adugbo]  |[null, null]    |
    # +-----------------------------+-------------------------------+----------------+
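If the reduce call looks opaque, the following Spark-free sketch may help show how it folds the dictionary into one chained expression. CaseBuilder is a hypothetical stand-in I made up for illustration, not a PySpark class. It only records the branches that successive when() calls add, roughly mirroring how each when() on a Spark Column returns a new Column with one more branch.

```python
from functools import reduce


class CaseBuilder:
    """Hypothetical stand-in for a Spark Column; it just records when() branches."""

    def __init__(self, branches=()):
        self.branches = tuple(branches)

    def when(self, condition, value):
        # Return a new builder with one more branch, leaving the old one untouched.
        return CaseBuilder(self.branches + ((condition, value),))

    def __str__(self):
        body = " ".join(f"WHEN ({c}) THEN {v}" for c, v in self.branches)
        return f"CASE {body} END"


city_country = {'las vegas': 'US', 'lagos': 'NG', 'España': 'ES'}


def case_whens(col_name):
    # Fold over the dict items: each step appends one WHEN branch to the accumulator.
    return reduce(
        lambda acc, kv: acc.when(f"{col_name} = {kv[0].lower()}", kv[1]),
        city_country.items(),
        CaseBuilder(),  # plays the role of `func` as the initial value above
    )


expr = str(case_whens("bork"))
print(expr)
```

In the real code the initial value is the functions module itself, so the first step calls func.when(...) and every later step calls .when(...) on the Column returned by the previous one. An element that matches no branch falls through to null, which is why unmatched cities show up as null in city_codes_array.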
