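I have two columns that I want to partially match against each other. For example:
A: Birmingham Hoover
B: Hoover Birmingham Area
Both columns are supposed to describe the same area, but the contains function does not catch this pair. Is there a function I can use to partially match these two columns? Thanks.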
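For illustration, here is a minimal plain-Python sketch (not Spark code) of why a literal substring test misses this pair. It assumes `contains` behaves like an ordinary substring check; the variable names are made up for the example.

```python
a = "Birmingham Hoover"
b = "Hoover Birmingham Area"

# A literal substring test fails in both directions,
# because the shared words appear in a different order.
a_in_b = a in b
b_in_a = b in a

# At the word level, however, every word of A also occurs in B.
a_words_subset_of_b = set(a.split()) <= set(b.split())

print(a_in_b, b_in_a, a_words_subset_of_b)
```

So the comparison has to work on individual words rather than on the whole string.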
1 Answer
ki0zmccv1#
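Try the `.rlike` function. Split the value of `column B` on `" "`, join the pieces with `|`, and then match column A against the result with `rlike`. Any row in which A contains at least one of the words from B passes the filter; all other rows are dropped.
Example: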
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, expr, split

# Reuse the active session (already available as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('Birmingham Hoover', 'Hoover Birmingham Area'), ('ABCD', 'Z Y Z U')],
    ['A', 'B'],
)
# truncate=False so the 22-character value in B is printed in full
df.show(truncate=False)
# +-----------------+----------------------+
# |A                |B                     |
# +-----------------+----------------------+
# |Birmingham Hoover|Hoover Birmingham Area|
# |ABCD             |Z Y Z U               |
# +-----------------+----------------------+

# Split column B on " ", join the words with "|", then match A against that pattern with rlike
(
    df.withColumn("B", concat_ws("|", split(col("B"), " ")))
    .filter(expr("A rlike B"))
    .show(10, False)
)
# +-----------------+----------------------+
# |A                |B                     |
# +-----------------+----------------------+
# |Birmingham Hoover|Hoover|Birmingham|Area|
# +-----------------+----------------------+
```
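To make the per-row logic explicit, here is a rough plain-Python sketch of what the `rlike` filter does, next to a stricter whole-word variant. This is an illustration under assumptions, not Spark code: `rlike_any_word` and `shares_whole_word` are hypothetical helper names, and Python's `re` stands in for the Java regex engine that Spark uses, which should behave the same for a plain alternation of literal words.

```python
import re


def rlike_any_word(a: str, b: str) -> bool:
    """Roughly what `A rlike concat_ws('|', split(B, ' '))` checks for one row."""
    pattern = "|".join(b.split(" "))  # e.g. "Hoover|Birmingham|Area"
    return re.search(pattern, a) is not None  # partial match anywhere in A


def shares_whole_word(a: str, b: str) -> bool:
    """Stricter variant: A and B must have at least one whole word in common."""
    return bool(set(a.split()) & set(b.split()))


rows = [("Birmingham Hoover", "Hoover Birmingham Area"), ("ABCD", "Z Y Z U")]
regex_matches = [a for a, b in rows if rlike_any_word(a, b)]
word_matches = [a for a, b in rows if shares_whole_word(a, b)]
print(regex_matches, word_matches)
```

Two things to keep in mind with the regex approach: each word of B is matched as a substring of A, so a short token such as `A` would match `ABCD`, and any regex metacharacters in B are interpreted as pattern syntax rather than literal text. If only whole-word overlap should count, comparing the two split arrays (for example with `array_intersect` from `pyspark.sql.functions`) may be a better fit.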