Python: Can I create a new column in PySpark using a variable number of characters from the right of an existing column?

Posted by eqzww0vc on 2024-01-05 in Python


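I have a PySpark DataFrame that basically looks like the table below:
| Product | Name |
| -- | -- |
| ABCD - 12 | ABCD |
| xyz - 123543 | xyz |
I want to create a new column (UPC) that contains only the digits to the right of the hyphen in the Product column.
I know that in Excel I could use the RIGHT function together with LEN and FIND, but as far as I know these have no equivalents in Python.
I tried creating two new columns, LastHyphen (because the product column may contain more than one hyphen) and ProductLength. I then wanted to plug them into the substring function, but I keep getting a "Column is not iterable" error.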

df4 = df3.withColumn("LastHyphen",length(col("PRODUCT"))-locate('-',reverse(col("PRODUCT"))))
df4 = df4.withColumn("ProductLength",length(col("PRODUCT")))
df4 = df4.withColumn("UPC", substring("PRODUCT", df4.LastHyphen, df4.ProductLength - df4.LastHyphen))
TypeError: Column is not iterable

I am hoping for output like this:
| Product | UPC |
| -- | -- |
| ABCD - 12 | 12 |
| xyz - 123543 | 123543 |


Source: https://stackoverflow.com/questions/77753468/can-i-create-a-new-column-using-a-variable-amount-of-characters-from-the-right-o

1 Answer


ssgvzors1#

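There is a similar question here, whose answer involves a regexp split.
In your particular case, using a regular expression to extract the UPC from the string is probably the simplest approach.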

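from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, regexp_extract

# Reuse the active session (e.g. in a notebook or shell) or start one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        Row(product="abcd - 12", name="abcd"),
        Row(product="xyz - 123543", name="xyz"),
        Row(product="xyz - abc - 123456", name="xyz - abc"),
    ]
)
# The greedy ".*" consumes everything up to the last " - ", so group 1
# captures only the digits after the final hyphen.
df.withColumn("UPC", regexp_extract(col("product"), r".* - ([0-9]{1,})", 1)).show()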
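To check what the pattern captures without starting Spark, here is a minimal sketch using Python's standard `re` module. Spark evaluates the pattern with Java's regex engine, so this assumes the two engines agree for a pattern this simple. The helper name `extract_upc` is hypothetical; it returns an empty string when nothing matches, which is what `regexp_extract` does as well.

```python
import re

# Same pattern as in the answer above.
UPC_PATTERN = re.compile(r".* - ([0-9]{1,})")


def extract_upc(product: str) -> str:
    """Return capture group 1, or "" when the pattern does not match."""
    match = UPC_PATTERN.search(product)
    return match.group(1) if match else ""


samples = ["abcd - 12", "xyz - 123543", "xyz - abc - 123456"]
results = {sample: extract_upc(sample) for sample in samples}

for product, upc in results.items():
    print(f"{product!r} -> {upc!r}")
```

Because `.*` is greedy, the third sample yields the digits after the last hyphen rather than stopping at the first one.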

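As for the `Column is not iterable` error in the question: in many PySpark releases, `substring` from `pyspark.sql.functions` expects plain Python integers for the position and length, so passing `Column` objects fails. That is my reading of the error, not something the answer states. One way around it, sketched below, is `substring_index` with a count of `-1`, which keeps whatever follows the last delimiter, combined with `trim` to drop the surrounding spaces. The names `add_upc` and `right_of_last_hyphen` are hypothetical. The latter is only a plain-Python reference for what the Spark expression is expected to compute, and the PySpark import is deferred so the reference can run without Spark installed.

```python
def right_of_last_hyphen(text: str) -> str:
    """Plain-Python reference: the text after the last '-', spaces trimmed."""
    return text.rsplit("-", 1)[-1].strip(" ")


def add_upc(df, source: str = "product", target: str = "UPC"):
    """Return df with a new column holding the text after the last hyphen."""
    # Deferred import (an illustrative choice) so the reference helper
    # above can be used on a machine without PySpark.
    from pyspark.sql.functions import col, substring_index, trim

    # A count of -1 keeps everything to the right of the final "-".
    return df.withColumn(target, trim(substring_index(col(source), "-", -1)))


examples = ["ABCD - 12", "xyz - 123543", "xyz - abc - 123456"]
expected = [right_of_last_hyphen(example) for example in examples]
```

With a DataFrame shaped like the one in the answer, `add_upc(df).show()` should add the UPC column. Unlike the regex, this approach does not check that the extracted text is numeric. If the position arithmetic from the question is preferred, `col("PRODUCT").substr(start, length)` is another option, since `Column.substr` accepts `Column` arguments as long as both arguments are columns.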


Answered on 2024-01-05
