Scrapy: pausing the crawler from the start of the crawl until a later datetime

6pp0gazn · posted 2022-11-09 in Other

I want to pause all requests for a while once my crawler has been running for more than 20 seconds past its start time, and then let it continue as usual until it finishes.
I've found that my current script never actually pauses and the spider just carries on as normal. How can I make this work properly?

import scrapy
from scrapy import signals
import datetime
from twisted.internet import defer
class TestSpider(scrapy.Spider):
    name = 'signal'

    start_urls = [f'https://www.meadowhall.co.uk/eatdrinkshop?page={i}' for i in range(1, 15)]

    custom_settings = {
        'DOWNLOAD_DELAY':2
    }

    def __init__(self):
        self.timer = 0
        self.datetime = datetime.datetime.now()

    @classmethod
    def from_crawler(cls, crawler, *args,**kwargs):
        spider = super(TestSpider, cls).from_crawler(crawler, *args,**kwargs)
        crawler.signals.connect(spider.schedule_request, signal=signals.request_scheduled)
        crawler.signals.connect(spider.close_spider, signal=signals.spider_closed)
        return spider

    def schedule_request(self):
        self.timer += 1
        if self.datetime == (self.datetime + datetime.timedelta(seconds = 20)).time():
            deferred = defer.Deferred()
            deferred.pause(self.timer)
            deferred.unpause()

    def close_spider(self):
        print(f"The current time: {datetime.datetime.now().time()}")

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                cb_kwargs = {
                    'pg':url
                }
            )

    def parse(self, response, pg):
        content_page = response.xpath("//div[@class='view-content']//div")
        url_split = pg.split('?')[-1]
        for cnt in content_page:
            image_url = cnt.xpath(".//img//@src").get()
            if image_url is not None:
                yield {
                    'image_urls':image_url,
                    'url_page':url_split
                }

mqkwyuun · Answer 1

This turned out to be much simpler than I expected; I was definitely overcomplicating the process. The Scrapy documentation for the telnet console gives an example of using engine.pause. Although time.sleep is normally discouraged inside a Scrapy spider, here the engine has already been paused directly through Scrapy's own method, so we only need to sleep before calling engine.unpause to start it up again.
So something like this works:

import datetime
import time

import scrapy
from scrapy import signals


class TestSpider(scrapy.Spider):
    # start_urls, start_requests and parse stay the same as in the question
    name = 'signal'

    end_time = datetime.timedelta(seconds=3)

    def __init__(self, stats, pause):
        self.stats = stats
        self.pause = pause  # keep a reference to the crawler so we can reach its engine

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        stat = cls(crawler.stats, crawler)
        crawler.signals.connect(stat.spider_opened, signals.spider_opened)
        crawler.signals.connect(stat.spider_closed, signals.spider_closed)
        return stat

    def spider_opened(self):
        start_time = self.stats.get_stats()['start_time']
        if datetime.datetime.now().time() >= (start_time + self.end_time).time():
            print(f"The time final: {datetime.datetime.now()}")
            self.pause.engine.pause()    # pause the engine, as the telnet console does
            time.sleep(5)                # blocking sleep while everything is stopped
            self.pause.engine.unpause()  # resume crawling

    def spider_closed(self):
        print(f"The current time: {datetime.datetime.now().time()}")
So we can pause based on how long the crawl has been running, when we want it to pause, and for how long. It should also be possible to wrap this in a loop, so that we pause every few seconds and, once the scrape is finished, break out of the loop and close the spider, as sketched below.
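
A rough, untested sketch of that loop idea might look like the following; the class name PausingSpider and the pause_every / pause_for attributes are just made up for the example, and twisted's task.LoopingCall is used to check the elapsed time every second and call engine.pause() / engine.unpause():

import datetime
import time

import scrapy
from scrapy import signals
from twisted.internet import task


class PausingSpider(scrapy.Spider):
    # Hypothetical example; pause_every and pause_for are invented knobs.
    name = 'pausing'
    start_urls = [f'https://www.meadowhall.co.uk/eatdrinkshop?page={i}' for i in range(1, 15)]

    pause_every = 20  # run for this many seconds between pauses
    pause_for = 5     # stay paused this long

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_opened(self):
        # Check once a second whether it is time to pause again.
        self.last_pause = datetime.datetime.now()
        self.checker = task.LoopingCall(self.maybe_pause)
        self.checker.start(1.0)

    def spider_closed(self):
        if self.checker.running:
            self.checker.stop()

    def maybe_pause(self):
        elapsed = (datetime.datetime.now() - self.last_pause).total_seconds()
        if elapsed >= self.pause_every:
            self.crawler.engine.pause()   # same call the telnet console uses
            time.sleep(self.pause_for)    # blocking sleep while the engine is paused
            self.crawler.engine.unpause()
            self.last_pause = datetime.datetime.now()

    def parse(self, response):
        yield {'url_page': response.url}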
