Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Posted by 8zzbczxx on 2022-11-07 in Lucene


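I have a module based on Apache Lucene 5.5 / 6.0 that retrieves keywords. Everything works fine except for one thing: Lucene doesn't filter stop words.
I tried to enable stop word filtering with two different approaches.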

Approach 1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach 2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available at:
https://stackoverflow.com/a/36237769/462347

My questions:

1. Why doesn't Lucene filter stop words?
2. How can I enable stop word filtering in Lucene 5.5 / 6.0?

lucene

Source: https://stackoverflow.com/questions/36241051/apache-lucene-doesnt-filter-stop-words-despite-the-usage-of-stopanalyzer-and-st

2 answers

dauxcl2d1#

I just tested both approach 1 and approach 2, and both seem to filter out stop words just fine. Here is how I tested it:

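// Imports use the package names from Lucene 5.5 / 6.0.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopFilterTest {
    public static void main(String[] args) throws IOException {
        StandardTokenizer stdToken = new StandardTokenizer();
        stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
        TokenStream tokenStream;

        // Your code starts here
        tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
        tokenStream.reset();
        // And ends here

        // Print every token that survives the filter chain
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(token.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }
}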

Results:
some
stuff
need
analysis
It eliminated the four stop words in my sample (that, is, in, of).
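
For anyone who wants to check this result without Lucene on the classpath, here is a rough plain-Java sketch of the lowercase and stop-filter steps. It is not Lucene code: the class name `StopWordSketch` and the `filter` method are made up for illustration, the tokenizing is a simple regex split rather than `StandardTokenizer`, and the word list is a hand-copied version of the default English stop set as shipped in Lucene 5.x / 6.x.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the lowercase + stop-filter steps; not Lucene code.
class StopWordSketch {

    // Hand-copied from Lucene's default English stop word list (33 entries).
    static final Set<String> DEFAULT_ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Lowercases, splits on anything that is not a letter or digit, drops stop words.
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !DEFAULT_ENGLISH_STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [some, stuff, need, analysis]
        System.out.println(filter("Some stuff that is in need of analysis"));
    }
}
```

Note that words such as "some" or "need" are not in this list, so they pass through untouched.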


2q5ifsrm2#

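The pitfall was Lucene's default stop word list: I expected it to be much broader.
Here is the code that tries to load a customized stop word list first and falls back to the standard list if that fails: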

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
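
The load-or-fall-back idea can also be sketched in plain Java, without Lucene or Commons IO. This is only an illustration under some assumptions: `StopWordLoaderSketch`, `load`, and the tiny `FALLBACK` set are hypothetical names, the file is assumed to hold one stop word per line in UTF-8, and, unlike the snippet above, the sketch catches `IOException` in general so that any read failure (not just a missing file) triggers the fallback.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the load-or-fall-back logic; not Lucene code.
class StopWordLoaderSketch {

    // Hypothetical stand-in for the library default; deliberately tiny.
    static final Set<String> FALLBACK = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Reads one stop word per line; returns a copy of FALLBACK if the file cannot be read.
    static Set<String> load(Path customList) {
        try {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(customList, StandardCharsets.UTF_8)) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return words;
        } catch (IOException e) {
            // Covers a missing file as well as any other read failure.
            return new HashSet<>(FALLBACK);
        }
    }

    public static void main(String[] args) {
        // "stopwords.txt" is a made-up file name for illustration.
        System.out.println(load(Paths.get("stopwords.txt")));
    }
}
```

Returning a copy of the fallback set keeps callers from accidentally modifying the shared default, which is the same reason the answer wraps the standard set in `CharArraySet.copy`.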

