Extracting data from documents stored in HDFS to index in Elasticsearch

Posted by 3bygqnnd on 2021-06-02 in Hadoop


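I have an HDFS archive that stores a variety of documents, such as PDF, MS Word, PPT, and CSV files. I want to build a platform that uses Elasticsearch to search these files by name or by text content. I know I can use the ES-Hadoop plugin to index data from HDFS into ES. What is the best way to extract text data from the documents stored in HDFS and index it?
Any help would be appreciated.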

hadoop elasticsearch full-text-search elasticsearch-hadoop

Source: https://stackoverflow.com/questions/36419608/extracting-data-from-documents-stored-in-hdfs-to-index-in-elasticsearch

2 Answers


oug3syen1#

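I did a lot of searching, and this is the list of approaches I have found so far.
Here is the overall integrations/plugins page: https://www.elastic.co/guide/en/elasticsearch/plugins/master/integrations.html
Here is the new replacement for the mapper attachments plugin, the ingest attachment plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html A post on how to use it: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper A discussion of the pros and cons of ingest vs. FSCrawler (dadoonet is an Elastic developer): https://discuss.elastic.co/t/mapper-attachment-plugin-vs-pre-parsing-and-extracting-content-from-binary-files/73764/10
Here is the file system crawler (FSCrawler) plugin: https://github.com/dadoonet/fscrawler
Here is the Ambar document search system. They have a community GitHub repository with open source code: https://ambar.cloud/ https://github.com/rd17/ambar https://blog.ambar.cloud/ingesting-documents-pdf-word-txt-etc-into-elasticsearch/ They seem to use two database server types (MongoDB and Redis), but I don't know why yet.
Here is Apache Tika, which both ingest and Ambar use (it also provides OCR through Tesseract, which I have heard ingest does not support): http://tika.apache.org/1.16/
Also, when ingest uses Tika, only a subset of file types is supported: https://discuss.elastic.co/t/full-list-of-supported-document-formats-by-es/81149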
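To make the ingest attachment option more concrete, here is a minimal Python sketch of what the plugin expects: a pipeline with a single attachment processor, and documents that carry the file bytes as base64 in the field the processor reads. The node URL, the pipeline name `attachment`, the index name `documents`, and the helper names are all assumptions for illustration. The `_doc` endpoint assumes a recent Elasticsearch version (older releases use a type name in its place), and the ingest-attachment plugin must already be installed on the node.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node
PIPELINE_ID = "attachment"        # hypothetical pipeline name
INDEX = "documents"               # hypothetical index name


def pipeline_definition():
    """Body for PUT _ingest/pipeline/<id>: one attachment processor reading 'data'."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": "data"}}],
    }


def document_body(raw_bytes, hdfs_path):
    """Document to index: the original path plus the file content as base64."""
    return {
        "path": hdfs_path,
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


def put_json(url, body):
    """Send a JSON body with HTTP PUT and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def index_file(raw_bytes, hdfs_path, doc_id):
    """Create (or overwrite) the pipeline, then index one file through it."""
    put_json(f"{ES_URL}/_ingest/pipeline/{PIPELINE_ID}", pipeline_definition())
    url = f"{ES_URL}/{INDEX}/_doc/{doc_id}?pipeline={PIPELINE_ID}"
    return put_json(url, document_body(raw_bytes, hdfs_path))
```

Calling `index_file(file_bytes, "/archive/report.pdf", "1")` against a running node should leave the extracted text under the `attachment.content` field of the indexed document. Re-creating the pipeline on every call is only for brevity; in practice it would be created once.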
I hope the above saves other developers some time. If people find more, they can comment below.
Thanks!


dtcbnfnu2#

You can use the Elasticsearch mapper attachments plugin. It uses Apache Tika to ingest almost all well-known document types and makes them searchable through Elasticsearch. Hope this helps.
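Because the attachment plugins delegate parsing to Tika, the same extraction can also be done before the data reaches Elasticsearch. The following is a rough Python sketch of that pre-parsing route, not a tested pipeline: it reads a file over the WebHDFS REST API, sends the bytes to a standalone Tika server, and indexes the returned plain text. The host names, ports, index name, and function names are assumptions. WebHDFS must be enabled on the cluster, and the client must be able to resolve the datanode address that the `OPEN` call redirects to.

```python
import json
import urllib.parse
import urllib.request

WEBHDFS_URL = "http://namenode:9870/webhdfs/v1"  # assumption: Hadoop 3 default port
TIKA_URL = "http://localhost:9998/tika"          # assumption: local tika-server
ES_URL = "http://localhost:9200"                 # assumption: local Elasticsearch


def webhdfs_open_url(hdfs_path, base=WEBHDFS_URL):
    """Build the WebHDFS URL that streams a file's content (op=OPEN)."""
    return f"{base}{urllib.parse.quote(hdfs_path)}?op=OPEN"


def read_hdfs_file(hdfs_path):
    """Download the raw bytes; urllib follows the redirect to the datanode."""
    with urllib.request.urlopen(webhdfs_open_url(hdfs_path)) as response:
        return response.read()


def extract_text(raw_bytes):
    """Ask the Tika server for a plain-text rendering of the document."""
    request = urllib.request.Request(
        TIKA_URL,
        data=raw_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def search_document(hdfs_path, text):
    """Shape of the document stored in Elasticsearch (hypothetical field names)."""
    return {"path": hdfs_path, "content": text}


def index_document(hdfs_path, index="documents"):
    """Read from HDFS, extract text with Tika, and index the result."""
    text = extract_text(read_hdfs_file(hdfs_path))
    request = urllib.request.Request(
        f"{ES_URL}/{index}/_doc",
        data=json.dumps(search_document(hdfs_path, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

This version handles one file at a time. For a large archive, the same three steps would more likely run inside a distributed job, with ES-Hadoop writing the extracted text to Elasticsearch.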
