How to use regexp_extract in PySpark when a matching pattern exists anywhere in the string

Posted by ltskdhd1 on 2021-07-14 in Spark

I am trying to get a better understanding of regexp_extract in PySpark and am experimenting with the example below.
Here is my DataFrame:

data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),        
('3983', 'Verified|for Stacy|by John')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+-----------------------------------------------------+
|  ID|               Notes                                 |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal                            |
|2398|Checked|by John|for kamal                            |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John                           |
+----+-----------------------------------------------------+

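What I am trying to determine is whether an ID was Checked or Verified by John.
With the help of SO members I worked out how to use regexp_extract and came up with the following solution: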

from pyspark.sql.functions import col, regexp_extract

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))

result.show()

+----+----------------------------------------------------+--------+
|  ID|Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           | Checked|
|2398|Checked|by John|for kamal                           | Checked|
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+----------------------------------------------------+--------+

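This gives the right result for the first few IDs, but for the last ID it does not print Verified. Can someone tell me whether the regular expression needs anything else?
My feeling is that (Checked|Verified)(\\|by John) only matches when the two values are adjacent. I tried * and $, but it still prints nothing for ID 3983.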
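
The adjacency hunch can be checked outside Spark. The sketch below uses Python's built-in re module as a stand-in for regexp_extract (Spark evaluates patterns with Java's regex engine, and the assumption here is that the two agree for a pattern this simple). The helper name extract_group is hypothetical; it mimics regexp_extract by returning an empty string when nothing matches.

```python
import re

# Hypothetical helper: like regexp_extract, returns '' when there is no match.
def extract_group(pattern, text, idx=1):
    m = re.search(pattern, text)
    return m.group(idx) if m else ''

# The pattern from the question: "|by John" must come right after the status word.
original = r'(Checked|Verified)(\|by John)'

notes = {
    '2345': 'Checked|by John|for kamal',
    '2398': 'Checked|by John|for kamal ',
    '2328': 'Verified|by Srinivas|for kamal than some random text',
    '3983': 'Verified|for Stacy|by John',
}

results = {id_: extract_group(original, note) for id_, note in notes.items()}
for id_, value in results.items():
    print(id_, repr(value))
```

ID 3983 comes back empty because "|for Stacy" sits between "Verified" and "|by John", so the two groups are never adjacent in that row.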


Answer #1 (slwdgvem)

I would phrase the regex like this:

(Checked|Verified)\b.*\bby John


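This pattern matches Checked or Verified, followed by "by John", with any amount of text between the two. Note that I rely only on word boundaries here, not on the pipes.
Updated code: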

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'\b(Checked|Verified)\b.*\bby John', 1))
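
A quick check of the word-boundary pattern against the sample rows, again using Python's re module as a stand-in for Spark's Java regex engine (an assumption that should hold for \b and .* here). One detail worth noting when the pattern is written in Python source: in an ordinary string literal '\b' is a backspace character, so the pattern needs a raw string (or doubled backslashes) for the word boundary to reach the regex engine. The helper name first_group is hypothetical.

```python
import re

# Hypothetical helper: returns group 1, or '' when there is no match.
def first_group(pattern, text):
    m = re.search(pattern, text)
    return m.group(1) if m else ''

# Raw string: \b reaches the regex engine as a word boundary.
raw_pattern = r'\b(Checked|Verified)\b.*\bby John'
# Ordinary string: Python converts each \b into a backspace character (\x08).
plain_pattern = '\b(Checked|Verified)\b.*\bby John'

notes = [
    'Checked|by John|for kamal',
    'Checked|by John|for kamal ',
    'Verified|by Srinivas|for kamal than some random text',
    'Verified|for Stacy|by John',
]

with_raw = [first_group(raw_pattern, n) for n in notes]
with_plain = [first_group(plain_pattern, n) for n in notes]
print(with_raw)
print(with_plain)
```

With the raw string the last row yields Verified; with the ordinary string nothing matches, because the pattern then looks for literal backspace characters that the notes do not contain.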

Answer #2 (ifmq2ha2)

You can try this regex:

import pyspark.sql.functions as F

result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))

result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+
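
One possible caveat, offered only as a sketch: because ".*by John" accepts any text before and after the name, the pattern above would also accept a note such as "by Johnson". If "by John" is meant to be a complete pipe-delimited field (an assumption, not something the question states), a stricter variant could look like the one below. It is tested here with Python's re module rather than Spark; the helper extract_group, the pattern name strict, and the "Johnson" row are all made up for illustration.

```python
import re

# Hypothetical helper: like regexp_extract, returns '' when there is no match.
def extract_group(pattern, text, idx=1):
    m = re.search(pattern, text)
    return m.group(idx) if m else ''

# Pattern from the answer above (written as a raw string).
loose = r'(Checked|Verified)\|.*by John'
# Assumed stricter variant: "by John" must be a whole field, i.e. it is
# preceded by a pipe and followed by a pipe or the end of the string.
strict = r'(Checked|Verified)\|(?:.*\|)?by John(?:\||$)'

samples = [
    'Checked|by John|for kamal',
    'Verified|for Stacy|by John',
    'Checked|by Johnson|for kamal',  # made-up row, not in the original data
]

loose_out = [extract_group(loose, s) for s in samples]
strict_out = [extract_group(strict, s) for s in samples]
print(loose_out)
print(strict_out)
```

Both patterns agree on the original rows; they differ only on the made-up "Johnson" row, which the stricter variant rejects. A note with trailing whitespace after a final "by John" would need extra handling that this sketch leaves out.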

Answer #3 (j8ag8udp)

Another approach is to check whether the Notes column contains the string by John:

df.withColumn('Employee', F.when(F.col('Notes').like('%Checked|by John%'), 'Checked').when(F.col('Notes').like('%by John'), 'Verified').otherwise(" ")).show(truncate=False)

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+
