PySpark: check whether a column's values match keys in a dict

erhoui1w · posted 2021-07-09 in Spark

I have a dictionary of keywords, and I want to check a column in a PySpark DataFrame to see whether any of the keywords appear in it; if so, return the matching dictionary value in a new column.
The setup looks like this:

myDict = {
'price': 'Pricing Issue',
'support': 'Support Issue',
'android': 'Left for Competitor'
}

df = sc.parallelize([('1','Needed better Support'),('2','Better value from android'),('3','Price was to expensive'),('4','Support problems')]).toDF(['id','reason'])

+---+-------------------------+
|id |reason                   |
+---+-------------------------+
|1  |Needed better Support    |
|2  |Better value from android|
|3  |Price was to expensive   |
|4  |Support problems         |
+---+-------------------------+

The final result I want is:

+---+-------------------------+-------------------+
|id |reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better Support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+

What is the best way to build an efficient function for this in PySpark?

y1aodyip1#

You can use when() to check whether the reason column matches the dict keys. The chained when expressions can be generated dynamically with Python's functools.reduce, iterating over myDict.keys():

from functools import reduce
from pyspark.sql import functions as F

df2 = df.withColumn(
    "new_reason",
    reduce(
        # chain one when() per keyword; \b keeps matches on word boundaries
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F  # start from the functions module, so the first call is F.when(...)
    )
)

df2.show(truncate=False)

# +---+-------------------------+-------------------+
# |id |reason                   |new_reason         |
# +---+-------------------------+-------------------+
# |1  |Needed better Support    |Support Issue      |
# |2  |Better value from android|Left for Competitor|
# |3  |Price was to expensive   |Pricing Issue      |
# |4  |Support problems         |Support Issue      |
# +---+-------------------------+-------------------+
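
For readability, the reduce call simply unrolls into one chained when expression per keyword. A minimal hand-written sketch of the same chain, with an optional otherwise() fallback added (the 'Other' label is my own placeholder, not part of the original answer; without otherwise, rows matching no keyword get null in new_reason):

df2 = df.withColumn(
    "new_reason",
    F.when(F.lower(F.col("reason")).rlike(r"\bprice\b"), "Pricing Issue")
     .when(F.lower(F.col("reason")).rlike(r"\bsupport\b"), "Support Issue")
     .when(F.lower(F.col("reason")).rlike(r"\bandroid\b"), "Left for Competitor")
     .otherwise("Other")  # optional default for unmatched rows
)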

uidvcgyl2#

You can create a keywords DataFrame and join on an rlike condition. I added \b before and after the keyword (written as '\\\\b' in the Python string, which is unescaped once by Python and once by the SQL string literal) so that only whole words between word boundaries match and there are no partial-word matches (e.g. "pineapple" matching "apple").

import pyspark.sql.functions as F

# one row per (keyword, label) pair
keywords = spark.createDataFrame([[k, v] for (k, v) in myDict.items()]).toDF('key', 'new_reason')

result = df.join(
    keywords,
    # '\\\\b' unescapes to \\b in the SQL literal and then to the regex word boundary \b
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')

result.show(truncate=False)
+---+-------------------------+-------------------+
|id |reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better Support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+
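
One caveat with the join approach (my own note, not part of the original answer): if a reason matches several keywords, the left join returns one row per matching key, duplicating the id. A minimal sketch of one way to collapse those back to a single row, assuming any one matching label is acceptable:

result_dedup = (
    result
    .groupBy('id', 'reason')                         # collapse duplicate matches
    .agg(F.first('new_reason').alias('new_reason'))  # keep an arbitrary label
)
result_dedup.show(truncate=False)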
