How to store custom meta tags from a website in an Elasticsearch index using StormCrawler

baubqpgj  posted on 2024-01-04 in Apache

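I am using StormCrawler (v2.10) to crawl intranet websites and store the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages carry custom meta tags (Article_Description, Article_Heading, Article_PublishedDate, Article_Type, Article_Year). The index mapping is as follows: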

{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}

I added the following to jsoupfilters.json:

"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"

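To sanity-check what these expressions should yield for a given page, independently of StormCrawler, the named meta tags can be pulled out with Python's standard `html.parser`. This is only a rough stand-in for the JSoup/XPath extraction, not StormCrawler's own code, and the sample HTML is hypothetical because the real intranet markup is not shown in the question. The comparison on the `name` value is case-sensitive here, as it is for a plain XPath string comparison, so a tag written as `article_type` would not match `Article_Type`.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects the content attribute of <meta name="..."> tags."""

    def __init__(self, wanted_names):
        super().__init__()
        self.wanted_names = set(wanted_names)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name in self.wanted_names and "content" in attributes:
            self.found[name] = attributes["content"]


def extract_meta(html, wanted_names):
    collector = MetaTagCollector(wanted_names)
    collector.feed(html)
    return collector.found


# Hypothetical sample page; replace with the source of a real intranet page.
SAMPLE_HTML = """
<html><head>
<meta name="Article_Heading" content="Quarterly update">
<meta name="Article_Year" content="2023">
<meta name="article_type" content="news">
</head><body></body></html>
"""

WANTED = ["Article_Description", "Article_Heading", "Article_PublishedDate",
          "Article_Type", "Article_Year"]

extracted = extract_meta(SAMPLE_HTML, WANTED)
print(extracted)
```

If a tag that is expected on the page is missing from the result, the spelling and case of its `name` attribute in the page source are the first things to compare against the expressions above.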

I added the following to crawler-conf.yaml:

indexer.md.mapping:
  - parse.title=title
  - parse.search=search
  - parse.keywords=keywords
  - parse.description=description
  - parse.article_description=metatag.article_description
  - parse.article_heading=metatag.article_heading
  - parse.article_publisheddate=metatag.article_publisheddate
  - parse.article_type=metatag.article_type
  - parse.article_year=metatag.article_year
  - domain
  - format
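The intent of this mapping can be sketched in a few lines of Python. The sketch assumes that each `source=target` entry copies the metadata value stored under `source` into the document field `target`, that an entry without `=` keeps its own name, and that Elasticsearch treats a dotted field name such as `metatag.article_heading` as a path into the `metatag` object. The function names and the sample metadata are hypothetical, and real StormCrawler metadata values are lists of strings rather than single strings.

```python
def parse_md_mapping(entries):
    """Turn 'source=target' entries into a dict; a bare 'key' maps to itself."""
    mapping = {}
    for entry in entries:
        source, sep, target = entry.partition("=")
        mapping[source.strip()] = target.strip() if sep else source.strip()
    return mapping


def build_document(metadata, mapping):
    """Copy mapped metadata keys into a nested document, splitting targets on dots."""
    doc = {}
    for source, target in mapping.items():
        if source not in metadata:
            continue  # nothing was extracted for this key, so no field is written
        *parents, leaf = target.split(".")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = metadata[source]
    return doc


entries = [
    "parse.title=title",
    "parse.article_heading=metatag.article_heading",
    "parse.article_year=metatag.article_year",
    "domain",
]
# Hypothetical metadata for one page; parse.article_year was not extracted.
metadata = {
    "parse.title": "Home",
    "parse.article_heading": "Quarterly update",
    "domain": "example.org",
}

doc = build_document(metadata, parse_md_mapping(entries))
print(doc)
```

Note that a key which never makes it into the metadata (here `parse.article_year`) simply produces no field, so an empty `metatag` object in the index points at the extraction step rather than at the mapping.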

Answer #1 (muk1a3rh)

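I can't see anything obviously wrong with your setup. You can run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a URL to check the extraction. It is also useful to test the output of the protocol on the command line; see our recent blog for an example.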
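Beyond checking the extraction itself, it can help to confirm whether any indexed document actually contains the custom fields. The following is a minimal sketch that builds a standard Elasticsearch `exists` query for one of the `metatag` fields. The helper names are hypothetical, and the Elasticsearch URL and index name used with `search` are assumptions that must be adapted to the actual deployment.

```python
import json
import urllib.request


def build_exists_query(field, size=5):
    """Build a search body matching documents that have a value for `field`."""
    return {
        "size": size,
        "_source": ["url", "metatag.*"],
        "query": {"exists": {"field": field}},
    }


def search(es_url, index, body):
    """POST the search body to <es_url>/<index>/_search and return the parsed JSON."""
    request = urllib.request.Request(
        f"{es_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_exists_query("metatag.article_heading")
print(json.dumps(query, indent=2))
```

Calling, for example, `search("http://localhost:9200", "content", query)` and inspecting the `hits` section of the response shows whether any crawled page was indexed with the field; both the URL and the index name `content` are assumed here.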
