Analyzing the individual tags of each post in Elasticsearch DSL

Posted by k5ifujac on 2021-06-10 in ElasticSearch

I have an ES instance running with data from travel.stackexchange.


# Example Data

first = ["This was one of our definition questions, but also one that interests me personally: "
          "How can I find a guide that will take me safely through the Amazon jungle? I'd love "
          "to explore the Amazon but would not attempt it without a guide, at least not the first "
          "time. I'd prefer a guide that wasn't going to ambush me or anything.I don't want to go "
          "anywhere touristy.  Start and end points are open, but the trip should take me places "
          "where I am not likely to see other travelers/tourists and where I will definitely "
          "require a good guide in order to be safe.", # content
          '2011-06-21T20:22:33.760', # date of creation
          '39', # votes
          '2799', # views
          '8', # answers
          '4', # comments
          'How can I find a guide that will take me safely through the Amazon jungle?', # title
          '"guides", "extreme-tourism", "amazon-river", "amazon-jungle"'] # TAGS

I connect using

from elasticsearch_dsl import Search, connections
connections.create_connection(alias='es', hosts=['localhost'], timeout=60)

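As you can see, this post has several tags ("guides", "amazon-river", ...). When I fed the data into ES, I formatted the tags as a single string.
Now I query the index (with a larger dataset, of course)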

s = Search(using="es", index=current_index)

and count how many times each tag is mentioned.

s.aggs.bucket("per_tag", "terms", field="tags", size=5)
r = s.execute()

However, when I look at the results, they look like

r.aggregations.per_tag.buckets
>>> [{'key': 'no tags', 'doc_count': 70672},
>>>  {'key': '"visas", "uk"', 'doc_count': 330}, 
>>>  {'key': '"visas", "schengen"', 'doc_count': 264}, 
>>>  {'key': '"visas"', 'doc_count': 253},
>>>  {'key': '"air-travel"', 'doc_count': 182}]

That is fine, but not what I want. As you can see, the tag "visas" shows up in three separate buckets instead of one. What I want is

>>> [{'key': 'no tags', 'doc_count': 70672},
>>>  {'key': 'visas', 'doc_count': XXX}, 
>>>  {'key': 'uk', 'doc_count': YYY}, 
>>>  {'key': 'schengen', 'doc_count': ZZZ},
>>>  {'key': 'air-travel', 'doc_count': AAA}]
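
For illustration only, the counts asked for above can be approximated on the client by splitting each bucket key and summing the doc_count values. This is a minimal sketch, not a fix for the aggregation itself: the helper names split_tag_key and merge_buckets are made up here, and the result is only as complete as the buckets returned (with size=5, most tag combinations are missing).

```python
from collections import Counter

def split_tag_key(key):
    """Split a bucket key such as '"visas", "uk"' into ['visas', 'uk']."""
    parts = [part.strip().strip('"') for part in key.split(",")]
    return [part for part in parts if part]

def merge_buckets(buckets):
    """Sum doc_count per individual tag across combined-string buckets."""
    totals = Counter()
    for bucket in buckets:
        for tag in split_tag_key(bucket["key"]):
            totals[tag] += bucket["doc_count"]
    return totals

# The five buckets shown in the question, copied as plain dicts.
buckets = [
    {"key": "no tags", "doc_count": 70672},
    {"key": '"visas", "uk"', "doc_count": 330},
    {"key": '"visas", "schengen"', "doc_count": 264},
    {"key": '"visas"', "doc_count": 253},
    {"key": '"air-travel"', "doc_count": 182},
]

per_tag = merge_buckets(buckets)
print(per_tag.most_common())
```

With only these five buckets, "visas" sums to 330 + 264 + 253 = 847, which undercounts the real total because every other combination containing "visas" was cut off by size=5.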

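So far I have tried feeding in the tags in different ways: once with "" and once without, leaving out the , and separating the tags with spaces only. However, I suspect I need to define the aggregation itself more carefully, rather than change the input. Any help would be appreciated.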
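
One input-side variant not listed above is to index the tags as a list of separate values instead of one string. Elasticsearch fields can hold multiple values, and a terms aggregation on a keyword field creates one bucket per distinct value, so a list would likely give the per-tag counts directly. The sketch below only prepares the document; parse_tags, to_document, and the field names are hypothetical, and it assumes the tags field is (or would be) mapped as keyword.

```python
def parse_tags(raw):
    """Turn '"guides", "amazon-river"' into ['guides', 'amazon-river']."""
    parts = [part.strip().strip('"') for part in raw.split(",")]
    return [part for part in parts if part]

def to_document(row):
    """Build an indexable dict from a row laid out like `first` above.

    The field names are assumptions; the original mapping is not shown.
    """
    content, created, votes, views, answers, comments, title, raw_tags = row
    return {
        "content": content,
        "created": created,
        "votes": int(votes),
        "views": int(views),
        "answers": int(answers),
        "comments": int(comments),
        "title": title,
        "tags": parse_tags(raw_tags),  # a list, one entry per tag
    }

# Shortened stand-in for the example row (content abbreviated for brevity).
row = [
    "How can I find a guide ...",
    "2011-06-21T20:22:33.760",
    "39", "2799", "8", "4",
    "How can I find a guide that will take me safely through the Amazon jungle?",
    '"guides", "extreme-tourism", "amazon-river", "amazon-jungle"',
]

doc = to_document(row)
print(doc["tags"])
```

The existing s.aggs.bucket("per_tag", "terms", field="tags", size=5) call would stay unchanged under this assumption; only the indexed value of tags differs.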
