Scrapy: scraping press releases from a company website with CrawlSpider?

Asked by z4iuyo4d on 2023-10-20

I want to scrape all of a company's press releases using Scrapy's CrawlSpider.
For example, for BP, every press-release link contains /press-releases/, e.g.: Here
However, the code below does not produce any output. How should I change the LinkExtractor rule? Or is this a case where the company's website blocks crawling of its press-release pages?
Thanks a lot!
1. Spider code

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # follow links whose URL matches the allow pattern and parse them with parse_items
        Rule(LinkExtractor(allow='^(/investors/)'), callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        print(response.url)
        title = response.css('h1::text').extract_first()
        url = response.url
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first()
        lastUpdated = lastUpdated.replace('This page was last edited on ', '')
        print('Title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('weblink is {}'.format(url))

2. Current spider output

2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/robots.txt> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html> (referer: None)
2020-07-26 15:55:04 [scrapy.core.engine] INFO: Closing spider (finished)
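
A note that may help diagnose this: Scrapy's LinkExtractor matches its allow patterns against the absolute URL of each extracted link, so a pattern anchored to the path with ^, such as ^(/investors/), can never match. Below is a minimal sketch of a rule aimed at the /press-releases/ links instead, keeping the parse_items callback from the code above; the exact pattern is an assumption based on the URL fragment mentioned in the question.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PressReleaseSketchSpider(CrawlSpider):
    # hypothetical spider name, same start page as above
    name = 'pressreleasesketch'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # allow patterns are tested against absolute URLs, so do not anchor them with ^
        Rule(LinkExtractor(allow=r'/press-releases/'), callback='parse_items', follow=True),
    ]

    def parse_items(self, response):
        print(response.url)

The pattern can also be checked interactively by opening the start page in scrapy shell and calling LinkExtractor(allow=r'/press-releases/').extract_links(response).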

7lrncoxx #1

You need to change def parse_items to def parse:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CompanyRulesSpider2(CrawlSpider):
    name = 'companymorerules2'
    allowed_domains = ['bp.com']
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']
    rules = [
        # the callback name must match the method defined below
        Rule(LinkExtractor(allow='^(/investors/)'), callback='parse', follow=True),
    ]

    def parse(self, response):
        print(response.url)
        title = response.css('h1::text').extract_first()
        url = response.url
        text = response.xpath('//div[@id="mw-content-text"]//text()').extract()
        # fall back to '' so replace() does not fail when the selector finds nothing
        lastUpdated = response.css('li#footer-info-lastmod::text').extract_first() or ''
        lastUpdated = lastUpdated.replace('This page was last edited on ', '')
        print('Title is: {} '.format(title))
        print('text is: {}'.format(text))
        print('weblink is {}'.format(url))

You need to override the parse method, because it is the spider's default callback.
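
As a minimal sketch of the default-callback behaviour this answer relies on (the class name and log message are illustrative, not from the original post): in a plain scrapy.Spider, responses for start_urls are delivered to parse when no explicit callback is set.

import scrapy

class DefaultCallbackDemo(scrapy.Spider):
    # hypothetical spider, only to illustrate the default callback
    name = 'default_callback_demo'
    start_urls = ['https://www.bp.com/en/global/corporate/news-and-insights/press-releases.html']

    def parse(self, response):
        # invoked for each start_urls response because no callback was given
        self.logger.info('default callback received %s', response.url)

Note, however, that the Scrapy documentation advises against using parse as a rule callback in a CrawlSpider, since CrawlSpider uses the parse method itself to implement its crawling logic.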
