I want to scrape all of a company's press releases using Scrapy's CrawlSpider.
For example, for BP, every press release link contains /press-releases/, for example: Here
However, the code below does not produce any output. How should I change the LinkExtractor rules? Or is this a case of the company's website restricting crawling of its press release pages?
Thanks in advance!
1. Spider code
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']

    rules = [
        Rule(LinkExtractor(allow='^(/investors/)'), callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        print(response.url)
        title = response.css('h1::text').extract_first()
        url = response.url
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        lastUpdated = lastUpdated.replace('This page was last edited on ', '')
        print('Title is: {} '.format(title))
        print('title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('weblink is {}'.format(url))
2. Current spider output
2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/robots.txt> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] INFO: Closing spider (finished)
1 answer
You need to change def parse_items to def parse: you need to override the parse method, because it is the spider's default callback.
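For reference, a minimal sketch of the rename the answer describes, keeping the rest of the spider as posted in the question; the allow pattern is also adjusted as an assumption based on the /press-releases/ URL pattern mentioned above, since LinkExtractor allow regexes are matched against absolute URLs and a pattern anchored at ^/ never matches:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']

    rules = [
        # '/press-releases/' is an assumed pattern taken from the question text;
        # the callback name is 'parse', as the answer suggests.
        Rule(LinkExtractor(allow='/press-releases/'), callback='parse', follow=True),
    ]

    def parse(self, response):
        # Print the title and URL of each crawled press release page.
        print('Title is: {}'.format(response.css('h1::text').extract_first()))
        print('weblink is {}'.format(response.url))

One caveat on this design choice: the Scrapy documentation advises against using parse as a rule callback in a CrawlSpider, because CrawlSpider uses the parse method internally to implement its rule logic, so it is worth verifying that the rules are still followed after the rename.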