Get the position of a substring after a specific position in PySpark

Posted by lsmd5eda on 2021-07-12 in Spark

I have a table like this:

+---+-------------------+
| id|               word|
+---+-------------------+
|  1|today is a nice day|
|  2|        hello world|
|  3|         he is good|
|  4|     is it raining?|
+---+-------------------+

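I want to get the position of a substring (is) in the word column, but only when it occurs after the 3rd position. Otherwise the result should be 0: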

+---+-------------------+---------------+
| id|               word|substr_position|
+---+-------------------+---------------+
|  1|today is a nice day|              7|
|  2|        hello world|              0|
|  3|         he is good|              4|
|  4|     is it raining?|              0|
+---+-------------------+---------------+

Any help would be appreciated.


2skhul33 (answer #1)

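You can use the locate function from pyspark.sql.functions.
It returns the position of the first occurrence of a substring in a string column, searching from a given position. Positions are 1-based, and the result is 0 when the substring is not found.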

from pyspark.sql.functions import locate, col
# pos is 1-based; locate returns 0 when "is" is not found from that position on
df.withColumn("substr_position", locate("is", col("word"), pos=3)).show()

+---+-------------------+---------------+
| id|               word|substr_position|
+---+-------------------+---------------+
|  1|today is a nice day|              7|
|  2|        hello world|              0|
|  3|         he is good|              4|
|  4|     is it raining?|              0|
+---+-------------------+---------------+
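
To make the 1-based indexing concrete, here is a rough plain-Python sketch of what locate computes for each row. The helper name locate_py is made up for illustration and is not part of PySpark. It assumes non-null strings and that the search starts at position pos itself, which is consistent with the output above.

```python
def locate_py(substr: str, s: str, pos: int = 1) -> int:
    """Hypothetical stand-in for locate(): 1-based position of the first
    occurrence of substr in s, searching from 1-based position pos.
    Returns 0 when there is no such occurrence."""
    # str.find is 0-based and returns -1 on failure, so shifting both
    # the start index and the result by one gives the 1-based behaviour.
    return s.find(substr, pos - 1) + 1


rows = [
    (1, "today is a nice day"),
    (2, "hello world"),
    (3, "he is good"),
    (4, "is it raining?"),
]

for row_id, word in rows:
    # Prints 7, 0, 4, 0 for ids 1 to 4, matching the substr_position column.
    print(row_id, word, locate_py("is", word, pos=3))
```

Row 4 is the interesting one: "is" sits at position 1, which is before the start position 3, so it is skipped and the result is 0.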
