How to crawl specific data from a website using StormCrawler

Posted by bpzcxfmw on 2021-06-24 in Storm


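I am crawling news websites with StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler configuration file is the StormCrawler file. I use Kibana for visualization. My issues are:
When crawling a news website, I only want the URLs of article content, but I am also getting the URLs of ads and of other tabs on the site. What do I have to change, and where? [Kibana link]
If I only need specific things from a URL (for example, only the title or only the content), how can I do that?
Edit: I wanted to add a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh and crawler-conf.yaml. The XPath I added is correct. I added "parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content" in parsefilter.json, parse.pubDate=PublishDate in crawler-conf.yaml, and "PublishDate": { "type": "text", "index": false, "store": true} in the properties of ES_IndexInit.sh. But I still do not get any field named PublishDate in Kibana or Elasticsearch. The mapping in ES_IndexInit.sh is as follows: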

{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

apache-storm data-extraction web-crawler stormcrawler

Source: https://stackoverflow.com/questions/62456731/how-to-crawl-specific-data-from-a-website-using-stormcrawler

1 Answer


#1 h43kikqp

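One approach to indexing only the news pages of a site is to rely on sitemaps, but not all sites provide them.
Alternatively, you need a mechanism as part of the parsing, perhaps in a ParseFilter, that determines whether a page is a news item, and then filter during indexing based on the presence of a key/value pair in the metadata.
In the news crawl dataset from CommonCrawl, this is handled by using sitemaps or RSS feeds as the seed URLs.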
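The key/value filtering mentioned above could look roughly like the fragment below. This is only a sketch: it assumes your StormCrawler version supports the `indexer.md.filter` setting, and the metadata key `isNews` is a made-up name that a ParseFilter of your own would have to set on pages it recognises as articles.

```yaml
# crawler-conf.yaml (fragment)
# Only index documents whose metadata contains this key/value pair.
# "isNews" is a hypothetical key; nothing sets it unless you add a
# ParseFilter that writes it into the metadata during parsing.
indexer.md.filter: "isNews=true"
```

Pages without that key/value pair would still be fetched and parsed, so their outlinks are still followed; they would simply not be sent to the index.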
To avoid indexing the content, simply comment out

indexer.text.fieldname: "content"

in the configuration.

Answered 2021-06-24
