I am crawling an intranet site with StormCrawler (v2.10) and storing the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags; my Elasticsearch index mapping is as follows:
{
"settings": {
"index": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "5s",
"default_pipeline": "timestamp"
}
},
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"content": {
"type": "text"
},
"description": {
"type": "text"
},
"domain": {
"type": "keyword"
},
"format": {
"type": "keyword"
},
"keywords": {
"type": "keyword"
},
"host": {
"type": "keyword"
},
"title": {
"type": "text"
},
"url": {
"type": "keyword"
},
"timestamp": {
"type": "date",
"format": "date_optional_time"
},
"metatag": {
"properties": {
"article_description": {
"type": "text"
},
"article_heading": {
"type": "text"
},
"article_publisheddate": {
"type": "date"
},
"article_type": {
"type": "text"
},
"article_year": {
"type": "text"
}
}
}
}
}
}
In jsoupfilters.json I added:
"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"
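As a quick offline sanity check of these XPath expressions (outside StormCrawler, which evaluates them through its JSoup-based filter), one can run a similar extraction against a sample page with the Python standard library. This is only an illustration; the HTML snippet and variable names are made up:

```python
# Offline illustration of the meta-tag extraction configured in
# jsoupfilters.json, using only the Python standard library.
import xml.etree.ElementTree as ET

# Hypothetical intranet page fragment with the custom meta tags
sample = """
<html><head>
  <META name="Article_Description" content="An intranet article" />
  <META name="Article_Year" content="2023" />
</head><body /></html>
"""

root = ET.fromstring(sample)
# ElementTree supports only a subset of XPath, so the @content value is
# read with .get() instead of a /@content step
description = [m.get("content") for m in root.iter("META")
               if m.get("name") == "Article_Description"]
year = [m.get("content") for m in root.iter("META")
        if m.get("name") == "Article_Year"]
print(description)  # ['An intranet article']
print(year)         # ['2023']
```

If a field comes back empty here (or in StormCrawler's debug output), the usual suspects are a mismatch in the meta tag's `name` attribute or in tag-name casing.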
And in crawler-conf.yaml:
indexer.md.mapping:
- parse.title=title
- parse.search=search
- parse.keywords=keywords
- parse.description=description
- parse.article_description=metatag.article_description
- parse.article_heading=metatag.article_heading
- parse.article_publisheddate=metatag.article_publisheddate
- parse.article_type=metatag.article_type
- parse.article_year=metatag.article_year
- domain
- format
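For reference, each `indexer.md.mapping` entry of the form `source=target` renames a parse-metadata key to an index field name, while a bare entry keeps its name unchanged. A minimal Python sketch of that renaming semantics (a hypothetical helper for illustration, not StormCrawler code):

```python
# Hypothetical sketch of the indexer.md.mapping renaming semantics;
# StormCrawler's indexer bolt does this internally.
def apply_md_mapping(metadata, mapping):
    doc = {}
    for entry in mapping:
        # "source=target" renames the key; a bare "name" keeps it as-is
        source, _, target = entry.partition("=")
        target = target or source
        if source in metadata:
            doc[target] = metadata[source]
    return doc

mapping = [
    "parse.title=title",
    "parse.article_year=metatag.article_year",
    "domain",
]
metadata = {
    "parse.title": "Quarterly report",
    "parse.article_year": "2023",
    "domain": "example.corp",
}
print(apply_md_mapping(metadata, mapping))
# {'title': 'Quarterly report', 'metatag.article_year': '2023',
#  'domain': 'example.corp'}
```

Note that the target names here (e.g. `metatag.article_year`) must line up with the field paths declared in the index mapping above, or the documents will not land in the mapped fields.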
1 Answer
I can't see anything obviously wrong in your setup. You could run a class such as https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a URL to check the extraction. Testing the output of the protocol on the command line is also useful; see our recent blog for an example.