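I have a "text" column that stores arrays of tokens. How can I filter each of these arrays so that only tokens at least three letters long remain?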
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()
# I tried this but got TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
# in col('text')], ''))
# df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
2 Answers
ni65a41a1#
This will do it. Whether to exclude rows is up to you; I added an extra column and filtered on it, but the choice is yours:
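The code for this answer is not preserved here, so what follows is only a minimal sketch of the approach it describes, assuming a Python UDF does the per-token filtering. The names `keep_long_tokens`, `text_clean`, and `demo_udf` are made up for illustration and are not from the original answer.

```python
def keep_long_tokens(tokens, min_len=3):
    """Return only the tokens that are at least min_len characters long."""
    if tokens is None:  # a null array column reaches the UDF as None
        return None
    return [t for t in tokens if len(t) >= min_len]


def demo_udf():
    # Imported lazily so keep_long_tokens can be tested without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, size, udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, ['I', 'am', 'good']), (2, ['You', 'are', 'ok'])],
        ['id', 'text'],
    )

    # Wrap the plain Python function as a UDF that returns array<string>.
    keep_long_udf = udf(keep_long_tokens, ArrayType(StringType()))

    # Extra column holding the filtered tokens (hypothetical name).
    df_clean = df.withColumn('text_clean', keep_long_udf(col('text')))

    # Optional: drop rows whose filtered array ended up empty.
    df_clean = df_clean.filter(size(col('text_clean')) > 0)
    df_clean.show()
```

Calling `demo_udf()` runs this against a local Spark session. The `TypeError` in the question comes from iterating over `col('text')` in a Python comprehension: a `Column` is an expression, not the row data, so the per-token logic has to run inside a UDF or a Spark SQL function.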
Returns:
xzv2uavs2#
This is the solution:
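The code for this answer is also missing. One way to do it without a UDF, assuming Spark 2.4 or later, is the built-in higher-order SQL function `filter` called through `expr`. This is a sketch rather than the answerer's actual code; the helper `min_length_filter_sql` and the function `demo_expr` are hypothetical names.

```python
def min_length_filter_sql(column, min_len=3):
    """Build a Spark SQL `filter` expression that keeps long-enough tokens."""
    return f"filter({column}, x -> length(x) >= {int(min_len)})"


def demo_expr():
    # Imported lazily so the helper above can be checked without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, ['I', 'am', 'good']), (2, ['You', 'are', 'ok'])],
        ['id', 'text'],
    )

    # Replace the array column with its filtered version; the filtering
    # runs inside Spark's SQL engine, so no Python UDF is involved.
    df_clean = df.withColumn('text', expr(min_length_filter_sql('text')))
    df_clean.show()
```

Calling `demo_expr()` overwrites `text` in place, which matches the shape of the expected output in the question. Keeping the work inside Spark SQL usually avoids the serialization overhead of a Python UDF.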