Remove a substring and all characters before it from a PySpark column

Posted by pjngdqdw on 2022-11-01 in Spark

I have a column in a PySpark dataframe (df) that looks like this:

|      'A'              |
-------------------------
| field 1 - order - one |
| field 2 - sell        |
|     order             |
|     sell              |

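I want to remove the first occurrence of '- ' and all the characters before it, using regexp_replace or another SQL function, but I am having some trouble with this case. Here is the desired output: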

|      'A'        |
-------------------
|   order - one   |
|     sell        |
|     order       |
|     sell        |

g9icjywg1#

This should work:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        ("field 1 - order", "None"),
        ("field 2 - sell", "None"),
        ("order", "None"),
        ("sell", "None"),
    ],
    ["A", "B"],
)
df.show()

df = df.withColumn(
    "A",
    # Remove everything up to and including the first "-",
    # plus any whitespace that follows it.
    F.regexp_replace("A", r"^[^-]*-\s*", ""),
)

df.show()

Output:

+---------------+----+
|              A|   B|
+---------------+----+
|field 1 - order|None|
| field 2 - sell|None|
|          order|None|
|           sell|None|
+---------------+----+

+-----+----+
|    A|   B|
+-----+----+
|order|None|
| sell|None|
|order|None|
| sell|None|
+-----+----+
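
The sample data in this answer does not include the question's first row, `field 1 - order - one`, which contains two hyphens. One quick way to check a pattern against that row without starting a Spark session is Python's built-in `re` module. This is only a sketch: it assumes that a simple pattern like this one behaves the same in Python's `re` as in the Java regex engine Spark uses, and the helper name `strip_prefix` is hypothetical.

```python
import re

# Everything up to and including the first "-", plus any whitespace after it.
PATTERN = re.compile(r"^[^-]*-\s*")


def strip_prefix(value: str) -> str:
    """Hypothetical helper: drop the text through the first '-'."""
    return PATTERN.sub("", value, count=1)


rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]
cleaned = [strip_prefix(row) for row in rows]
print(cleaned)  # ['order - one', 'sell', 'order', 'sell']
```

Because `[^-]*` cannot cross a hyphen, only the first `-` is consumed and later ones (as in `order - one`) are left alone; rows without a hyphen do not match and pass through unchanged.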

mzaanser2#

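Another approach is to split column A on '-', slice the resulting array, and take its first element. Note that slice(..., -1, 1) keeps only the last piece, so a value such as 'field 1 - order - one' becomes 'one' rather than 'order - one'. The trim call removes the space left over after the hyphen.

from pyspark.sql import functions as F

df.withColumn('A', F.trim(F.slice(F.split('A', '-'), -1, 1)[0])).show()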
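
If later hyphens should be kept, the split has to stop after the first one. The plain-Python sketch below illustrates that logic with `str.split` and `maxsplit=1`; the helper name `after_first_hyphen` is hypothetical. In Spark 3.0 and later, `F.split` accepts a `limit` argument that plays a similar role, though the sketch here does not exercise Spark itself.

```python
def after_first_hyphen(value: str) -> str:
    """Hypothetical helper: keep the text after the first '-'."""
    # maxsplit=1 yields at most two parts; the last part is the remainder,
    # or the whole string when there is no hyphen at all.
    return value.split("-", 1)[-1].strip()


rows = ["field 1 - order - one", "field 2 - sell", "order", "sell"]
result = [after_first_hyphen(row) for row in rows]
print(result)  # ['order - one', 'sell', 'order', 'sell']
```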
