How to customize a regex pattern in PySpark

Posted by kfgdxczn on 2021-05-27 in Spark

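I have implemented a regex pattern that filters specific characters out of text in a Jupyter notebook.
Now I am trying to do the same thing in PySpark using RegexTokenizer. Can anyone give me some input on customizing the regex in PySpark?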

import re

# `i` is one raw text string
# Replace every run of non-word characters with a space, then lowercase
cleaned = re.sub(r'\W+', ' ', i).lower()

# Remove single characters surrounded by whitespace
cleaned = re.sub(r'\s+[a-zA-Z]\s+', ' ', cleaned)

# Remove a single character at the start (anchor with ^, not a literal \^)
cleaned = re.sub(r'^[a-zA-Z]\s+', ' ', cleaned)

# Collapse multiple whitespace characters into a single space
cleaned = re.sub(r'\s+', ' ', cleaned)

# Remove a leading 'b' prefix
cleaned = re.sub(r'^b\s+', '', cleaned)
cleaned = cleaned.strip()

apache-spark pyspark regex

Source: https://stackoverflow.com/questions/62718050/how-to-customize-regex-pattern-in-pyspark

1 Answer

qxgroojn #1

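Nothing special. With a Spark DataFrame, create a new column based on the condition you need and chain together as many regexes as you like (although it looks like your regexes could be improved considerably).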

from pyspark.sql.functions import col, lower, regexp_replace

# regexp_replace is a function, not a Column method, so nest the calls
df = df.withColumn(
    "cleaned",
    lower(regexp_replace(col("raw"), r"\W+", " "))
)
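
To make "chain together as many regexes as you like" concrete, here is a minimal sketch that ports every step from the question. The names CLEANING_STEPS, clean_text and clean_column are hypothetical helpers introduced for illustration, and "raw" is the input column name assumed in the answer above. Note that Spark's regexp_replace uses Java regular expressions, so classes such as \W may treat non-ASCII text differently from Python's re module.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the re.sub calls in the question.
CLEANING_STEPS = [
    (r"\W+", " "),             # runs of non-word characters -> one space
    (r"\s+[a-zA-Z]\s+", " "),  # single letters surrounded by whitespace
    (r"^[a-zA-Z]\s+", " "),    # single letter at the start of the string
    (r"\s+", " "),             # collapse repeated whitespace
    (r"^b\s+", ""),            # leading 'b' prefix
]


def clean_text(text):
    """Pure-Python reference: apply each step in order with re.sub."""
    cleaned = text.lower()
    for pattern, replacement in CLEANING_STEPS:
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned.strip()


def clean_column(column_name):
    """Build the same chain as a single PySpark Column expression."""
    # Imported lazily so clean_text stays usable without Spark installed.
    from pyspark.sql.functions import col, lower, regexp_replace, trim

    cleaned = lower(col(column_name))
    for pattern, replacement in CLEANING_STEPS:
        # Each call wraps the previous expression, forming the chain.
        cleaned = regexp_replace(cleaned, pattern, replacement)
    return trim(cleaned)
```

Assuming a DataFrame df with a string column "raw", the expression can be attached with df.withColumn("cleaned", clean_column("raw")). Keeping the patterns in one list makes it easy to check the Spark output against clean_text on a few sample strings.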

Answered 2021-05-27
