pyspark split does not work on a string DataFrame column

Asked by sshcrbum on 2021-07-13 · Spark

I have a PySpark DataFrame with a string column (URL), and all records look like this:

ID                                   URL
1          https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796
3          https://app.drift.com/inboxes/136636/conversations/2938419189

I want to use a regex to extract the number that comes after conversations/ in the URL column into a new column.
I tried the code below, but it did not give me the result I wanted.

from pyspark.sql.functions import split

df1 = df.withColumn('CONV_ID', split(df['URL'], '(?<=conversations/).*').getItem(0))

Expected:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       2938419189

Actual result:

ID                                   URL                                         CONV_ID
1          https://app.xyz.com/inboxes/136636/conversations/2686735685         https://app.xyz.com/inboxes/136636/conversations/2686735685
2          https://app.xyz.com/inboxes/136636/conversations/2938415796         https://app.xyz.com/inboxes/136636/conversations/2938415796     
3          https://app.drift.com/inboxes/136636/conversations/2938419189       https://app.drift.com/inboxes/136636/conversations/2938419189

I don't know what is going on here. I tried the regex in several online regex testers and it highlights exactly the part I want, but it never works in PySpark. I also tried other PySpark functions such as F.split, regexp_extract, and regexp_replace, but none of them worked.

vsdwdz23 · 1#

You can use regexp_extract instead:

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.regexp_extract('URL', 'conversations/(.*)', 1)
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
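
If the conversation ID is always numeric (which the sample data suggests), a slightly tighter pattern that captures only digits also works; a minimal sketch:

import pyspark.sql.functions as F

# Capture only the digits that follow "conversations/"; anything after the ID
# (for example a query string) is ignored. Assumes the ID is always numeric.
df1 = df.withColumn(
    'CONV_ID',
    F.regexp_extract('URL', r'conversations/(\d+)', 1)
)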

Or, if you want to use split, you don't need the .* part. You only need to specify the pattern to split on.

import pyspark.sql.functions as F

df1 = df.withColumn(
    'CONV_ID',
    F.split('URL', '(?<=conversations/)')[1]    # just using 'conversations/' should also be enough
)

df1.show()
+---+--------------------+----------+
| ID|                 URL|   CONV_ID|
+---+--------------------+----------+
|  1|https://app.xyz.c...|2686735685|
|  2|https://app.xyz.c...|2938415796|
|  3|https://app.drift...|2938419189|
+---+--------------------+----------+
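
As the inline comment above suggests, splitting on the literal string also works, since 'conversations/' contains no regex metacharacters; a minimal sketch:

import pyspark.sql.functions as F

# Split on the literal "conversations/": element 0 is the part before it,
# element 1 is the part after it (the ID in the sample URLs).
df1 = df.withColumn('CONV_ID', F.split('URL', 'conversations/')[1])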

tvmytwxo · 2#

If your URLs always have that form, you can actually just use substring_index to get the last path element:

import pyspark.sql.functions as F

df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))

df1.show(truncate=False)

# +---+-------------------------------------------------------------+----------+
# |ID |URL                                                          |CONV_ID   |
# +---+-------------------------------------------------------------+----------+
# |1  |https://app.xyz.com/inboxes/136636/conversations/2686735685  |2686735685|
# |2  |https://app.xyz.com/inboxes/136636/conversations/2938415796  |2938415796|
# |3  |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
# +---+-------------------------------------------------------------+----------+
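
If CONV_ID will be joined or sorted as a number downstream, the extracted string can be cast; a minimal sketch, assuming the URL always ends with the numeric conversation ID and has no trailing slash or query string:

import pyspark.sql.functions as F

# Take the last path element and cast it to a long for numeric use.
# Assumes the last path element is always a valid integer.
df1 = df.withColumn(
    "CONV_ID",
    F.substring_index("URL", "/", -1).cast("long")
)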
