Elasticsearch: how to aggregate the matched terms in a query_string search?

Posted by 6jjcrrmo on 2023-01-16 in ElasticSearch

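I want to search a list of nested dicts for wildcard terms, then get back the matching terms together with their uuid, grouped by the wildcard that matched.
My index has the following mapping: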

"mappings": {
    "properties": {
        "uuid": {
            "type": "keyword"
        },
        "urls": {
            "type": "nested",
            "properties": {
                "url": {
                    "type": "keyword"
                },
                "is_visited": {
                    "type": "boolean"
                }
            }
        }           
    }
}

and a lot of data, such as:

{
    "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
    "urls": [
        {
            "is_visited": true,
            "url": "https://www.google.com"
        },
        {
            "is_visited": false,
            "url": "https://www.facebook.com"
        },
        {
            "is_visited": true,
            "url": "https://www.twitter.com"
        }
    ]
},
{
    "uuid":"4a1c695d-756b-4d9d-b3a0-cf524d955884",
    "urls": [
        {
            "is_visited": true,
            "url": "https://www.stackoverflow.com"
        },
        {
            "is_visited": false,
            "url": "https://www.facebook.com"
        },
        {
            "is_visited": false,
            "url": "https://drive.google.com"
        },
        {
            "is_visited": false,
            "url": "https://maps.google.com"
        }
    ]
}
...

I want to search with the wildcards "*google.com OR *twitter.com" and get something like this:

"hits": [
    "*google.com": [
        {
            "uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
            "_source": {
                "is_visited": false,
                "url": "https://drive.google.com"
            }
        },
        {
            "uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
            "_source": {
                "is_visited": false,
                "url": "https://maps.google.com"
            }
        },
        {
            "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
            "_source": {
                "is_visited": true,
                "url": "https://www.google.com"
            }
        }
    ]
    "*twitter.com": [
        {
            "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
            "_source": {
                "is_visited": true,
                "url": "https://www.twitter.com"
            },  
        },
    ]
]

This is my (Python) search query:

body = {
  #"_source": False,
  "size": 100,
  "query": {
        "nested": {
            "path": "urls",
            "query":{
                "query_string":{
                    "query": f"urls.url:{urlToSearch}",
                }
            },
            "inner_hits": {
                "size": 100  # return up to 100 matching nested objects per document
            }
        }
    }
}

But it returns a separate hit for each match instead of gathering them into lists like the ones I want.

Edit: here are my settings and mapping:

{
    "settings": {
        "analysis": {
            "char_filter": {
                "my_filter": {
                    "type": "mapping",
                    "mappings": [
                        "- => _",
                    ]
                },
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "my_filter"
                    ],
                    "filter": [
                        "lowercase",
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "uuid": {
                "type": "keyword"
            },
            "urls": {
                "type": "nested",
                "properties": {
                    "url": {
                        "type": "keyword"
                    },
                    "is_visited": {
                        "type": "boolean"
                    }
                }
            }           
        }
    }
}

atmip9wb #1

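Elasticsearch will not give you that output from the query as you have set it up. This scenario calls for an aggregation. My suggestion is to apply the nested query and run aggregations on the result.
A note on the wildcard query:
Avoid patterns that begin with * or ?. They increase the number of iterations needed to find matching terms and slow down search performance.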

{
  "size": 0,
  "query": {
    "nested": {
      "path": "urls",
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "urls.url": {
                  "value": "*google.com"
                }
              }
            },
            {
              "wildcard": {
                "urls.url": {
                  "value": "*twitter.com"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "agg_providers": {
      "nested": {
        "path": "urls"
      },
      "aggs": {
        "google.com": {
          "terms": {
            "field": "urls.url",
            "include": ".*google.com",
            "size": 10
          }
        },
        "twitter.com": {
          "terms": {
            "field": "urls.url",
            "include": ".*twitter.com",
            "size": 10
          }
        }
      }
    }
  }
}
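
With more than two patterns, the same request body could be generated rather than written by hand. The following is only a sketch of that idea in Python (the language used in the question). The helpers `wildcard_to_regex` and `build_body` are hypothetical names, not part of any Elasticsearch client. Naming each aggregation after the pattern with its leading `*` removed simply mirrors the names used above. Unlike the hand-written query, the sketch escapes the literal dots in the `include` regular expression, on the assumption that a dot should only match a dot.

```python
import json

# Characters treated as reserved in Lucene-style regular expressions,
# which is what the terms aggregation's "include" parameter expects.
_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern (* and ?) into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")       # any run of characters
        elif ch == "?":
            parts.append(".")        # exactly one character
        elif ch in _REGEX_RESERVED:
            parts.append("\\" + ch)  # escape literal reserved characters
        else:
            parts.append(ch)
    return "".join(parts)


def build_body(patterns, path="urls", field="urls.url", bucket_size=10):
    """Build the nested query plus one terms aggregation per pattern."""
    should = [{"wildcard": {field: {"value": p}}} for p in patterns]
    per_pattern = {
        p.lstrip("*"): {  # aggregation name, e.g. "google.com"
            "terms": {
                "field": field,
                "include": wildcard_to_regex(p),
                "size": bucket_size,
            }
        }
        for p in patterns
    }
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {"nested": {"path": path, "query": {"bool": {"should": should}}}},
        "aggs": {"agg_providers": {"nested": {"path": path}, "aggs": per_pattern}},
    }


body = build_body(["*google.com", "*twitter.com"])
print(json.dumps(body, indent=2))
```

The resulting `body` has the same shape as the request above and can be passed to whatever search call is already in use.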

Result:

"aggregations": {
    "agg_providers": {
      "doc_count": 7,
      "twitter.com": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "https://www.twitter.com",
            "doc_count": 1
          }
        ]
      },
      "google.com": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "https://drive.google.com",
            "doc_count": 1
          },
          {
            "key": "https://maps.google.com",
            "doc_count": 1
          },
          {
            "key": "https://www.google.com",
            "doc_count": 1
          }
        ]
      }
    }
  }
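
The buckets above list the matching URLs but not the uuid of the document each one came from. If the uuid is needed, one possible alternative is to keep the `inner_hits` query from the question and group the response on the client. The sketch below assumes `_source` is enabled, so each parent hit carries its `uuid`, and that the inner hits are returned under the default name `urls` (the nested path). `group_inner_hits` is a hypothetical helper, and `sample` is a hand-written stand-in trimmed to the fields the helper reads, not real Elasticsearch output. It uses `fnmatch` for the matching, which treats `*` and `?` like the wildcard query does but also gives `[` a special meaning, so it only suits patterns without brackets.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def group_inner_hits(response, patterns, inner_name="urls"):
    """Group nested inner hits by the wildcard pattern their url matches."""
    grouped = defaultdict(list)
    for hit in response["hits"]["hits"]:
        uuid = hit["_source"]["uuid"]  # requires _source on the parent hit
        for inner in hit["inner_hits"][inner_name]["hits"]["hits"]:
            nested = inner["_source"]
            for pattern in patterns:
                if fnmatchcase(nested["url"], pattern):
                    grouped[pattern].append({"uuid": uuid, "_source": nested})
    return dict(grouped)


# Hand-written stand-in for a search response (assumption, trimmed).
sample = {"hits": {"hits": [
    {
        "_source": {"uuid": "afa9ac03-0723-4d66-ae18-08a51e2973bd"},
        "inner_hits": {"urls": {"hits": {"hits": [
            {"_source": {"is_visited": True, "url": "https://www.google.com"}},
            {"_source": {"is_visited": True, "url": "https://www.twitter.com"}},
        ]}}},
    },
    {
        "_source": {"uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884"},
        "inner_hits": {"urls": {"hits": {"hits": [
            {"_source": {"is_visited": False, "url": "https://drive.google.com"}},
            {"_source": {"is_visited": False, "url": "https://maps.google.com"}},
        ]}}},
    },
]}}

grouped = group_inner_hits(sample, ["*google.com", "*twitter.com"])
for pattern, items in grouped.items():
    print(pattern, [item["_source"]["url"] for item in items])
```

This trades a server-side aggregation for a pass over the returned hits, so it is bounded by the `size` values of the query and of `inner_hits`.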
