Pause Scrapy after a number of requests or scraped items

Posted by yrdbyhpb on 2022-11-09 in Other


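I tried adding item_count = 0 in the __init__ method and then self.item_count += 1 before each yielded item.
After that, I added if self.item_count == x: time.sleep(y).
But this does not seem to work.
I want to add this because the site I am trying to scrape has an anti-scraping policy, and I cannot get past 150,000 items. So I think pausing for 5-10 minutes every 50,000 items would help me get around the problem.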
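
For reference, here is a minimal sketch of the counting logic described above, written as a plain Python helper so it can be checked without Scrapy. PauseSchedule and its fields are hypothetical names, not part of Scrapy. It uses a modulo test rather than an equality test, on the assumption that the pause should repeat at every multiple of the threshold; a check like item_count == x can only match once.

```python
from dataclasses import dataclass


@dataclass
class PauseSchedule:
    """Hypothetical helper: decides when a crawl should take a break."""

    every: int = 50_000     # pause after this many items
    seconds: float = 300.0  # pause length in seconds (5 minutes)
    count: int = 0          # items seen so far

    def record_item(self) -> float:
        """Count one item and return how long to pause (0.0 means keep going)."""
        self.count += 1
        # Modulo matches at every multiple of `every`; `==` would match only once.
        return self.seconds if self.count % self.every == 0 else 0.0


if __name__ == "__main__":
    # Small numbers so the pattern is visible: pause 2 s after every 3rd item.
    schedule = PauseSchedule(every=3, seconds=2.0)
    print([schedule.record_item() for _ in range(7)])
    # prints [0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0]
```

A spider could call record_item() once per scraped item and pause only when the returned value is greater than zero.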

scrapy

Source: https://stackoverflow.com/questions/70914132/pause-scrapy-after-a-number-of-requests-or-scraped-items

2 Answers


xlpyo6sf1#

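You can use the from_crawler class method to connect the item_scraped signal to a spider method. In that method, check whether item_count is divisible by 50000 and, if it is, call crawler.engine.pause() to pause the engine for as long as you need. Afterwards, call crawler.engine.unpause() to resume crawling.
In the sample code below, I pause for 10 seconds after every 5 items. Modify it to suit your needs (for example, pause for 5 minutes after every 50000 items).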

import time

import scrapy
from scrapy import signals

class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = ['http://quotes.toscrape.com/page/1/']
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call spider.item_scraped each time an item has passed through the pipelines.
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        spider.crawler = crawler
        return spider

    def item_scraped(self, item):
        # Count the item, then pause on every 5th one.
        self.item_count += 1
        if self.item_count % 5 == 0:
            self.logger.info(f"Pausing scrape job...item count = {self.item_count}")
            self.crawler.engine.pause()
            # Note: time.sleep() blocks the whole Scrapy process for these 10 seconds.
            time.sleep(10)
            self.crawler.engine.unpause()
            self.logger.info("Resuming crawl...")

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
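
One caveat with the sample above: time.sleep() runs on the same thread as Twisted's reactor, so nothing else in the crawl can run while it sleeps, including handling of responses that are already in flight. A possible non-blocking variant, sketched below, pauses the engine and asks the reactor to unpause it later. pause_engine_for and FakeEngine are hypothetical names for illustration; reactor.callLater is Twisted's delayed-call API, and I am assuming the engine stops picking up new requests while paused.

```python
def pause_engine_for(engine, seconds, call_later=None):
    """Pause `engine` now and schedule `engine.unpause` after `seconds`.

    `call_later` defaults to Twisted's reactor.callLater; a stand-in can be
    passed instead so the helper can be tried without a running reactor.
    """
    if call_later is None:
        # Imported lazily so the helper itself does not require Twisted.
        from twisted.internet import reactor
        call_later = reactor.callLater
    engine.pause()
    # Returns whatever the scheduler returns (a cancellable call with Twisted).
    return call_later(seconds, engine.unpause)


class FakeEngine:
    """Stand-in with the same pause()/unpause() methods, for demonstration only."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True

    def unpause(self):
        self.paused = False


if __name__ == "__main__":
    scheduled = []  # collects (delay, callback) pairs instead of really waiting
    engine = FakeEngine()
    pause_engine_for(engine, 10, call_later=lambda d, fn: scheduled.append((d, fn)))
    print(engine.paused)   # engine is paused immediately
    scheduled[0][1]()      # simulate the timer firing
    print(engine.paused)   # engine is running again
```

Inside item_scraped, the pause/sleep/unpause lines could then be replaced by a single call such as pause_engine_for(self.crawler.engine, 10).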

2022-11-09

z5btuh9x2#

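Believe me, I have tried everything. The only thing that worked was waiting more than 2 minutes per page. Since we have 50 pages and 1500 items, I think a different tool would be better suited to this case.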
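
If someone wants to try the 2-minutes-per-page approach in Scrapy itself, a rough sketch using the built-in throttling settings could look like this. The setting names are real Scrapy settings; the values and the PAGES constant are assumptions taken from the numbers in this answer, not a tested recipe for any particular site.

```python
# Assumed numbers from the answer above: about 50 pages, 2 minutes between pages.
PAGES = 50
DELAY_SECONDS = 120

# Could be placed on a spider as the `custom_settings` class attribute.
custom_settings = {
    # Seconds to wait between consecutive requests to the same site.
    "DOWNLOAD_DELAY": DELAY_SECONDS,
    # By default Scrapy waits a random 0.5x to 1.5x of DOWNLOAD_DELAY;
    # turning this off keeps every wait at the full delay.
    "RANDOMIZE_DOWNLOAD_DELAY": False,
    # Only one request in flight per domain, so pages are fetched one by one.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}

# Rough lower bound on total crawl time, ignoring download and parsing time.
estimated_minutes = PAGES * DELAY_SECONDS / 60

if __name__ == "__main__":
    print(f"about {estimated_minutes:.0f} minutes for {PAGES} pages")
    # prints "about 100 minutes for 50 pages"
```

At that rate the 50 pages take well over an hour and a half, which may be part of why another tool looks attractive here.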

2022-11-09