I want to take my dictionary, which contains keywords, and check a column in a pyspark df to see whether a keyword is present; if so, return the corresponding dictionary value in a new column.
The problem looks like this:
myDict = {
'price': 'Pricing Issue',
'support': 'Support Issue',
'android': 'Left for Competitor'
}
df = sc.parallelize([('1','Needed better support'),('2','Better value from android'),('3','Price was to expensive'),('4','Support problems')]).toDF(['id','reason'])
+---+-------------------------+
| id|reason                   |
+---+-------------------------+
|1  |Needed better support    |
|2  |Better value from android|
|3  |Price was to expensive   |
|4  |Support problems         |
+---+-------------------------+
The final result I want is:
+---+-------------------------+-------------------+
| id|reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+
What is the best way to build an efficient function for this in pyspark?
2 Answers

y1aodyip1#
You can use when() to check whether the column reason matches the dict keys, and generate the chained when() expressions dynamically with Python's functools.reduce by passing it myDict.keys():
uidvcgyl2#
You can create a keyword dataframe and join it on an rlike condition. I added "\\b" before and after the keyword so that only whole words between word boundaries match and there are no partial-word matches (e.g. "pineapple" matching "apple").