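Using tokenizers, fuzziness, and edge n-grams, I have three documents:
"Star Trek I"
"Star Trekian"
"Star Trakian: A Star Trek Documentary"
A fuzzy search for "Star Trek" gives "Star Trekian" a higher score than "Star Trek I", because an additional edge n-gram token of "Trekian" also fuzzily matches "Trek". Is the best way to counter this to favor matches with less or no fuzziness?
In addition, "Star Trakian: A Star Trek Documentary" gets an even higher score, because it matches on both "Trek" and "Trak". Is there a way to match only the best token, or any other way to give it the same score as "Star Trek I" (since both contain "Star Trek")?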
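To make the effect concrete, here is a rough Python sketch of what I think is happening. It is not Elasticsearch code: `edge_ngrams`, `edit_distance`, and `fuzzy_hits` are hypothetical helpers, the tokenization is a crude stand-in for the standard tokenizer, and plain Levenshtein distance is used as a simplification of the fuzzy matching. It only lists which indexed tokens fall within one edit of a query term and does not reproduce the actual scoring.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the prefixes of `token` between min_gram and max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_hits(title, query_term, max_edits=1):
    """List the distinct edge n-gram tokens of `title` within max_edits of query_term."""
    # Assumption: lowercasing and splitting on whitespace and ":" is close
    # enough to the real analyzer for these three titles.
    words = title.lower().replace(":", " ").split()
    hits = set()
    for word in words:
        for gram in edge_ngrams(word):
            if edit_distance(gram, query_term) <= max_edits:
                hits.add(gram)
    return sorted(hits)


if __name__ == "__main__":
    titles = [
        "Star Trek I",
        "Star Trekian",
        "Star Trakian: A Star Trek Documentary",
    ]
    for title in titles:
        print(title, "->", fuzzy_hits(title, "trek"))
```

Under these assumptions, "Star Trek I" has two candidate tokens for "trek" ("tre", "trek"), while the other two titles each have three ("treki" or "trak" in addition), which would be consistent with the extra score they receive.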
Edit:
Mappings and settings:
PUT /stackoverflow
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "edge_n_gram": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "50"
        }
      },
      "analyzer": {
        "autocomplete": {
          "filter": [
            "lowercase",
            "asciifolding",
            "edge_n_gram"
          ],
          "type": "custom",
          "tokenizer": "autocomplete"
        },
        "autocomplete_search": {
          "filter": [
            "lowercase",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "char_group"
        },
        "full_word": {
          "filter": [
            "lowercase",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "char_group"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "standard"
        },
        "char_group": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            "-",
            "."
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "search_field_full": {
        "type": "text",
        "similarity": "boolean",
        "fields": {
          "raw": {
            "type": "text",
            "similarity": "boolean",
            "analyzer": "full_word",
            "search_analyzer": "autocomplete_search"
          }
        },
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
Documents:
POST stackoverflow/_doc/
{
  "search_field_full": "Star Trek I"
}
POST stackoverflow/_doc/
{
  "search_field_full": "Star Trakian: A Star Trek Documentary"
}
POST stackoverflow/_doc/
{
  "search_field_full": "Star Trekian"
}
Query:
GET stackoverflow/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": [
              "search_field_full"
            ],
            "fuzziness": "AUTO:4,7",
            "max_expansions": 500,
            "minimum_should_match": 2,
            "operator": "or",
            "query": "Star Trek",
            "type": "best_fields"
          }
        }
      ],
      "should": [
        {
          "multi_match": {
            "fields": [
              "search_field_full.raw^30"
            ],
            "fuzziness": 0,
            "operator": "or",
            "query": "Star Trek",
            "type": "best_fields"
          }
        },
        {
          "multi_match": {
            "fields": [
              "search_field_full.raw^20"
            ],
            "fuzziness": 1,
            "operator": "or",
            "query": "Star Trek",
            "type": "best_fields"
          }
        }
      ]
    }
  }
}