Fuzzy text search

Posted by siotufzp on 2021-06-15 in ElasticSearch


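I would like to know whether there is a Python library that can do fuzzy text search. For example:
I have three keywords: "letter", "stamp", and "mail".
I would like a function that checks whether these three words occur in the same paragraph (or within a certain distance, such as one page).
In addition, the words have to stay in the same order. It is fine for other words to appear between them.
I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find a suitable function...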

elasticsearch python full-text-search fuzzy-search whoosh

Source: https://stackoverflow.com/questions/30449452/fuzzy-text-search-in-python

1 answer


vuv7lop31#

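{1} You can do this with Whoosh 2.7. It supports fuzzy search through the whoosh.qparser.FuzzyTermPlugin plugin: whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that do not have to match exactly. A fuzzy term matches any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions; this is called the "Damerau-Levenshtein edit distance").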
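
To make "edit distance" concrete, here is a minimal pure-Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is only an illustration, not Whoosh's implementation; the function name edit_distance is made up for this example, and it also counts single-character substitutions as one edit, which is the usual definition.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of two adjacent characters each cost 1.
    Illustrative only; Whoosh uses its own implementation."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this definition, "mial" is one edit away from "mail" (a transposition) and "letters" is one edit away from "letter" (an insertion), so both fall within a maximum distance of 1 or 2.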
Add the fuzzy plugin:

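from whoosh import qparser

# my_index is an already opened Whoosh index; "fieldname" is the field to search.
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())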

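Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you do not specify an edit distance, the default is 1.
For example, the following "fuzzy" term queries: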

letter~
letter~2
letter~2/3
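
In the third form, the number after the slash is a prefix length: that many leading characters of the term must match exactly. As a rough illustration of how the three forms read, here is a small sketch that splits a term into its parts. The helper describe_fuzzy_term and its regular expression are hypothetical and are not Whoosh's parser; the prefix default of 0 when no slash is given is an assumption about the plugin's behaviour.

```python
import re

# Hypothetical pattern: term text, then an optional "~" with an optional
# maximum distance and an optional "/prefix length".
FUZZY_TERM = re.compile(
    r"^(?P<text>[^~\s]+)(?:~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?)?$"
)

def describe_fuzzy_term(term):
    """Return the parts of a single query term as a dict."""
    m = FUZZY_TERM.match(term)
    if m is None:
        raise ValueError("not a single term: %r" % term)
    if "~" not in term:
        return {"text": m.group("text"), "fuzzy": False}
    return {
        "text": m.group("text"),
        "fuzzy": True,
        # No number after "~" means a maximum edit distance of 1.
        "maxdist": int(m.group("maxdist") or 1),
        # Assumed default: no required exact prefix.
        "prefixlength": int(m.group("prefix") or 0),
    }
```

So letter~ reads as "letter within 1 edit", letter~2 as "within 2 edits", and letter~2/3 as "within 2 edits, with the first 3 characters (let) matching exactly".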

{2} To keep the words in order, use the whoosh.query.Phrase query, but you should replace the Phrase plugin with whoosh.qparser.SequencePlugin, which lets you use fuzzy terms inside a phrase:

"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words in between, set the slop argument of the Phrase query to a larger number:

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

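slop: the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.
You can also define the slop in the query, like this: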

"letter~ stamp~ mail~"~10

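To show what "in order, within a slop" means, here is a minimal pure-Python sketch. It is not how Whoosh evaluates phrases; the helper names tokenize and in_order_within are made up, matching is exact rather than fuzzy, and slop is assumed to be the maximum difference in word position between consecutive keywords (so 1 means adjacent, in line with the description above).

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def in_order_within(text, keywords, slop):
    """True if the keywords occur in the given order, with each keyword at
    most `slop` positions after the previous one (assumed semantics)."""
    if not keywords:
        return True
    tokens = tokenize(text)
    # Positions at which the sequence matched so far can end.
    ends = [i for i, tok in enumerate(tokens) if tok == keywords[0]]
    for kw in keywords[1:]:
        ends = [
            i for i, tok in enumerate(tokens)
            if tok == kw and any(0 < i - prev <= slop for prev in ends)
        ]
        if not ends:
            return False
    return bool(ends)
```

For "letter first, stamp second, mail third" the keywords sit at positions 0, 2 and 4, so the check passes with a slop of 2 or 10 but fails with a slop of 1.
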
{4} Overall solution:
{4.a} The indexer would look like this:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

# create_in() expects the index directory to exist already.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# title is stored so it can be shown in the hits; content is only indexed.
schema = Schema(title=TEXT(stored=True), content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first,  mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()

{4.b} The searcher would look like this:

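from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # Enable the word~n fuzzy syntax.
    parser.add_plugin(FuzzyTermPlugin())
    # Swap the default phrase plugin for the sequence plugin so that
    # fuzzy terms are allowed inside the quoted phrase.
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    # Each term may differ by up to 2 edits; the phrase uses a slop of 10.
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    # These print calls work on both Python 2 and Python 3.
    print("nb of results = %d" % len(results))
    for r in results:
        print(r)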

The result is (shown as printed by Python 2; Python 3 omits the u prefix):

nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>

{5} If you want fuzzy search to be the default, without using the word~n syntax on every word of the query, you can initialize QueryParser like this:

from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. If you want a larger edit distance, customize the class:

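class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # Same as FuzzyTerm, but with the default maxdist raised to 2.
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
        # On Python 3 this can be shortened to super().__init__(...).

# Pass termclass=MyFuzzyTerm to QueryParser to use it.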

References:
whoosh.query.Phrase
Adding fuzzy term queries
Allowing complex phrase queries
class whoosh.query.FuzzyTerm
qparser module
