Scrapy: increasing robustness while scraping

yptwkmov posted on 2022-10-22 in Python
Follow (0) | Answers (1) | Views (108)

I'm trying to find a setting for my Scrapy spider that covers the following situations:
1. There is a power outage during my crawl
2. My ISP goes down
The behaviour I'm after is that Scrapy should not give up. Instead, it should wait indefinitely for power/connectivity to come back and, after a short pause or a 10-second interval, retry the request and carry on scraping.
This is the error message I get when the internet connection is down:

https://example.com/1.html
 2022-10-21 17:44:14 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com/1.html> (failed 1 times): An error occurred while connecting: 10065: A socket operation was attempted to an unreachable host..

The message just repeats.
My worry is that by the time the blip is over, Scrapy will have given up on 1.html and moved on to some other URL, say 99.html.
My question is: when the socket operation against the unreachable host fails, how can I make Scrapy wait and retry the same URL, https://www.example.com/1.html?
Thanks in advance.


rhfm7lfc1#

There is no built-in setting that does this, but it can still be implemented fairly easily.
In my opinion the most straightforward way is to listen for the response_received signal in your spider and check for the specific error code you get when your ISP blips out. When that happens you can pause the Scrapy engine, wait as long as you like, then retry the same request until it succeeds.
For example:

import time

from scrapy import Spider
from scrapy.signals import response_received

class MySpider(Spider):
    ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # listen for the response_received signal and call check_response
        crawler.signals.connect(spider.check_response, signal=response_received)
        return spider

    def check_response(self, response, request, spider):
        engine = spider.crawler.engine
        if response.status == 404:          # <- your error code goes here
            engine.pause()
            time.sleep(600)                 # <- wait 10 minutes
            request.dont_filter = True      # <- tell the scheduler not to filter the duplicate
            engine.unpause()
            engine.crawl(request.copy())    # <- resend the request

UPDATE
Since this is not an HTTP error code (a dropped connection raises an exception and never produces a response at all, so response_received never fires), the next-best solution is to create a custom DownloaderMiddleware that catches the exception and then does the same thing as in the first example.
In your middlewares.py file:

import time

from twisted.internet.error import (ConnectError, ConnectionDone,
                                    ConnectionLost, ConnectionRefusedError,
                                    TimeoutError, DNSLookupError)

class ConnectionLostPauseDownloadMiddleware:

    def __init__(self, settings, crawler):
        self.crawler = crawler
        # connection-related exceptions that should trigger a pause-and-retry
        self.exceptions = (ConnectionRefusedError, ConnectionDone, ConnectError, ConnectionLost)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.exceptions):
            new_request = request.copy()
            new_request.dont_filter = True
            self.crawler.engine.pause()
            time.sleep(60 * 10)     # wait 10 minutes
            self.crawler.engine.unpause()
            # returning a Request tells Scrapy to reschedule it
            return new_request

Then in your settings.py:

DOWNLOADER_MIDDLEWARES = {
   'MyProjectName.middlewares.ConnectionLostPauseDownloadMiddleware': 543,
}
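
You may also want to make Scrapy's built-in RetryMiddleware (the one producing the DEBUG message in the question) more persistent, so it doesn't give up on the URL after its default two attempts. RETRY_ENABLED and RETRY_TIMES are standard Scrapy settings; the values below are just a sketch, pick whatever suits your outages:

RETRY_ENABLED = True   # enabled by default
RETRY_TIMES = 10       # default is 2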
