python-3.x: How to make CountVectorizer() ignore stop words regardless of case

Posted by sdnqo3pr on 2023-05-23 in Python


I am using sklearn's CountVectorizer() like this:

vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=ngram_range,
)

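I don't want to convert my text to lowercase, but I do want to remove all stop words regardless of their case. The code above filters out the, but not The or THE. I want to filter out the, THE, and The. Can this be done with CountVectorizer() without changing the case of the text?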
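The behaviour described above can be reproduced with a short sketch. The sample sentence and `ngram_range=(1, 1)` are illustrative assumptions, not part of the original question; the point is only that the built-in English stop list is matched case-sensitively when `lowercase=False`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as in the question; ngram_range=(1, 1) is an assumed value.
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=(1, 1),
)

# Hypothetical sample text containing three casings of one stop word.
vectorizer.fit(["the The THE quick Fox"])

# Only the lowercase "the" is dropped; "The" and "THE" stay in the vocabulary.
remaining = sorted(vectorizer.vocabulary_)
print(remaining)
```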

python-3.x

Source: https://stackoverflow.com/questions/76291592/how-to-make-countvectorizer-ignore-stopwords-irrespective-of-the-case

1 Answer


bqujaahr #1

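I don't think there is a simple way to override stop-word removal, and only stop-word removal. However, if you pass in a custom analyzer, you can supply your own stop-word removal.
This is the smallest thing I could come up with that doesn't remove any functionality from the analyzer: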

from sklearn.feature_extraction.text import CountVectorizer


class CaseInsensitiveStopWordsAnalyzer:
    """Custom analyzer that drops stop words regardless of their case."""

    def set_cv(self, cv):
        # Keep a reference to the vectorizer so its own settings can be reused.
        self.cv = cv

    def remove_stop_words(self, stop_words, doc):
        # Compare in lowercase, but return the tokens with their case intact.
        stop_words = set(w.lower() for w in stop_words)
        return [w for w in doc if w.lower() not in stop_words]

    def __call__(self, doc):
        # Rebuild the usual pipeline steps from the vectorizer's parameters.
        preprocessor = self.cv.build_preprocessor()
        tokenizer = self.cv.build_tokenizer()
        stop_words = self.cv.get_stop_words()
        ngrams = self.cv._word_ngrams  # private scikit-learn helper
        if preprocessor is not None:
            doc = preprocessor(doc)
        if tokenizer is not None:
            doc = tokenizer(doc)
        if stop_words is not None:
            doc = self.remove_stop_words(stop_words, doc)
        if ngrams is not None:
            doc = ngrams(doc)
        return doc


analyzer = CaseInsensitiveStopWordsAnalyzer()
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=(1, 1),
    analyzer=analyzer,
)
# The analyzer needs the vectorizer, so link them after construction.
analyzer.set_cv(vectorizer)

documents = ['foo Bar bar the The']
vectorizer.fit_transform(documents)
print(vectorizer.vocabulary_)

Output:

{'foo': 2, 'Bar': 0, 'bar': 1}

This removes both the and The without changing the case of the rest of the vocabulary.
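A different, simpler route is sketched below for comparison. It is not the custom-analyzer method from this answer: it keeps the default analyzer and passes an explicit stop-word list expanded with lowercase, UPPERCASE, and Capitalized forms. The helper name `expand_case_variants`, the sample document, and `ngram_range=(1, 2)` are assumptions made for illustration. Mixed casings such as tHe are not covered by this approach.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS


def expand_case_variants(words):
    """Hypothetical helper: return each word in lower, UPPER and Capitalized form."""
    variants = set()
    for w in words:
        variants.update({w.lower(), w.upper(), w.capitalize()})
    # CountVectorizer expects stop_words to be a list (or "english" / None).
    return sorted(variants)


vectorizer = CountVectorizer(
    stop_words=expand_case_variants(ENGLISH_STOP_WORDS),
    lowercase=False,
    ngram_range=(1, 2),  # assumed value, to show n-grams still work
)

# Hypothetical sample document with three casings of one stop word.
vectorizer.fit(["foo Bar bar the The THE"])
vocab = vectorizer.vocabulary_
print(sorted(vocab))
```

Because the default analyzer stays in place, n-gram generation works without touching private methods such as `_word_ngrams`. The trade-off is that only the three listed casings are filtered.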


Answered 2023-05-23
