Scrapy: scraping press releases from a company website with CrawlSpider?

Asked by z4iuyo4d on 2023-10-20

I want to scrape all of a company's press releases using Scrapy's CrawlSpider.
For example, for BP, every press-release link contains /press-releases/, e.g.: Here
However, the code below does not produce any output. How should I change the LinkExtractor rule? Or is this a case where the company's website blocks crawling of its press-release pages?
Thanks a lot!
1. Spider code

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # follow links whose URL matches the allow pattern and parse them with parse_items
        Rule(LinkExtractor(allow='^(/investors/)'), callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        print(response.url)
        title = response.css('h1::text').extract_first()
        url = response.url
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        lastUpdated = lastUpdated.replace('This page was last edited on ', '')
        print('Title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('weblink is {}'.format(url))

2. Current spider output

2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/robots.txt> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] INFO: Closing spider (finished)
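
A note that may help diagnose this: Scrapy's LinkExtractor matches its allow patterns against the absolute URL of each extracted link, so a pattern anchored to the path with ^, such as ^(/investors/), can never match. Below is a minimal sketch of a rule aimed at the /press-releases/ links instead, keeping the parse_items callback from the code above; the exact pattern is an assumption based on the URL fragment mentioned in the question.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PressReleaseSketchSpider(CrawlSpider):
    # hypothetical spider name, same start page as above
    name = 'pressreleasesketch'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # allow patterns are tested against absolute URLs, so do not anchor them with ^
        Rule(LinkExtractor(allow=r'/press-releases/'), callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        print(response.url)

The pattern can also be checked interactively by opening the start page in scrapy shell and calling LinkExtractor(allow=r'/press-releases/').extract_links(response).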

7lrncoxx #1

You need to change def parse_items to def parse:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # the callback name must match the method defined below
        Rule(LinkExtractor(allow='^(/investors/)'), callback='parse', follow=True),
    ]

    def parse(self, response):
        print(response.url)
        title = response.css('h1::text').extract_first()
        url = response.url
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        # fall back to '' so replace() does not fail when the selector finds nothing
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first() or ''
        lastUpdated = lastUpdated.replace('This page was last edited on ', '')
        print('Title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('weblink is {}'.format(url))

You need to override the parse method, because it is the spider's default callback.
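
As a minimal sketch of the default-callback behaviour this answer relies on (the class name and log message are illustrative, not from the original post): in a plain scrapy.Spider, responses for start_urls are delivered to parse when no explicit callback is set.

import scrapy

class DefaultCallbackDemo(scrapy.Spider):
    # hypothetical spider, only to illustrate the default callback
    name = 'default_callback_demo'
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']

    def parse(self, response):
        # invoked for each start_urls response because no callback was given
        self.logger.info('default callback received %s', response.url)

Note, however, that the Scrapy documentation advises against using parse as a rule callback in a CrawlSpider, since CrawlSpider uses the parse method itself to implement its crawling logic.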
