pandas: Modify stopword-removal code to remove numbers as well

Posted by tmb3ates on 2023-01-19 in Other


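I have tokenized text in a df column. My code for removing stopwords from it works, but I would also like to remove punctuation, numbers, and special characters without having to spell each of them out. In particular, I want to make sure it also removes larger numbers that were tokenized as a single token.
My current code is: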

from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
punctuation = ['.', ',', ';', ':', '!']  # and so on
complete_stopwords = punctuation + eng_stopwords
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])

pandas

Source: https://stackoverflow.com/questions/75148129/modify-stopword-removal-code-to-remove-numbers-as-well

1 Answer


u3r8eeie1#

You can get the punctuation characters from the string module:

import string
print(string.punctuation)

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

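from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
punctuation = list(string.punctuation)  # each punctuation character becomes its own entry
complete_stopwords = set(punctuation + eng_stopwords)  # a set makes the membership test fast
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])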
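The snippet above handles single punctuation characters, but numeric tokens such as 1000000 or 3.14 and multi-character tokens such as ... would still pass through. One possible way to cover them, sketched here under some assumptions, is to keep only tokens made up entirely of letters via str.isalpha(). The helper names keep_token and remove_noise are hypothetical, the small eng_stopwords set stands in for stopwords.words('english') so the example runs without NLTK, and lowercasing before the stopword check is an added choice that is not part of the original code.

```python
import string

# Stand-in for stopwords.words('english') so the sketch runs without NLTK (assumption).
eng_stopwords = {"the", "a", "is", "in", "and"}
punctuation = set(string.punctuation)
complete_stopwords = eng_stopwords | punctuation


def keep_token(token):
    """Return True for tokens that are not stopwords and consist only of letters."""
    # Lowercasing is an added choice so that 'The' matches the lowercase stopword 'the'.
    if token.lower() in complete_stopwords:
        return False
    # str.isalpha() is False for '2023', '3.14', '...', '$' and mixed tokens like 'covid19'.
    return token.isalpha()


def remove_noise(tokens):
    """Filter one tokenized document (a list of strings)."""
    return [t for t in tokens if keep_token(t)]


tokens = ["The", "price", "is", "1000000", "dollars", "in", "2023", "...", "!", "3.14"]
cleaned = remove_noise(tokens)
print(cleaned)
```

On the DataFrame from the question, this would be applied as df['removed'] = df['tokenized_text'].apply(remove_noise). Note the trade-off: isalpha() also drops tokens such as don't or well-known. If those should survive, a looser test like any(ch.isalpha() for ch in token) keeps every token that contains at least one letter, at the cost of also keeping mixed tokens such as covid19.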

Answered 2023-01-19
