scrapy: Custom Scrapy proxy rotation middleware with specific retry conditions

ax6ht2ek  posted on 2023-02-22  in Other

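For a rotating-proxy middleware in Scrapy, I need to implement a couple of conditions:
1. If the response status is not 200, retry the request with another random proxy from the list.
2. I have two proxy lists. I want to start crawling with the first list and retry about 10 times with it; after that, as a last resort, I want to try the second list.
I tried writing the middleware, but it does not work as expected: it does not rotate proxies, and it never falls back to the second proxy list. Here is the code: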

import random


class SFAProxyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):

        self.packetstream_proxies = [
            settings.get("PS_PROXY_USA"),
            settings.get("PS_PROXY_CA"),
            settings.get("PS_PROXY_IT"),
            settings.get("PS_PROXY_GLOBAL"),
        ]

        self.unlimited_proxies = [
            settings.get("UNLIMITED_PROXY_1"),
            settings.get("UNLIMITED_PROXY_2"),
            settings.get("UNLIMITED_PROXY_3"),
            settings.get("UNLIMITED_PROXY_4"),
            settings.get("UNLIMITED_PROXY_5"),
            settings.get("UNLIMITED_PROXY_6"),
        ]

    def add_proxy(self, request, host):
        request.meta["proxy"] = host

    def process_request(self, request, spider):
        retries = request.meta.get("retry_times", 0)
        if "proxy" in request.meta.keys():
            return None
        if retries <= 10:
            self.add_proxy(request, random.choice(self.unlimited_proxies))
        else:
            self.add_proxy(request, random.choice(self.packetstream_proxies))

Am I doing something wrong in how I implemented the middleware? Thanks.

ubbxdtey  #1

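Given the conditions at the start of your question, I think you also need to process the response and check its status code: if it is not 200, increment the retry count and send the request back to the scheduler.
You will probably need to set dont_filter to True on the request so the duplicate filter does not drop it, and you should probably also set a maximum number of retries.
For example: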

import random

from scrapy.exceptions import IgnoreRequest

# Upper bound on retries per request before it is dropped.
MAX_RETRY = 20


class SFAProxyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):

        self.packetstream_proxies = [
            settings.get("PS_PROXY_USA"),
            settings.get("PS_PROXY_CA"),
            settings.get("PS_PROXY_IT"),
            settings.get("PS_PROXY_GLOBAL"),
        ]

        self.unlimited_proxies = [
            settings.get("UNLIMITED_PROXY_1"),
            settings.get("UNLIMITED_PROXY_2"),
            settings.get("UNLIMITED_PROXY_3"),
            settings.get("UNLIMITED_PROXY_4"),
            settings.get("UNLIMITED_PROXY_5"),
            settings.get("UNLIMITED_PROXY_6"),
        ]

    def add_proxy(self, request, host):
        request.meta["proxy"] = host

    def process_request(self, request, spider):
        retries = request.meta.get("retry_times", 0)
        if "proxy" in request.meta.keys():
            return None
        if retries <= 10:
            self.add_proxy(request, random.choice(self.unlimited_proxies))
        else:
            self.add_proxy(request, random.choice(self.packetstream_proxies))

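    def process_response(self, request, response, spider):
        # Scrapy passes the request as well; the status lives in response.status.
        if response.status != 200:
            retries = request.meta.get("retry_times", 0) + 1
            if retries > MAX_RETRY:
                raise IgnoreRequest("giving up after %d retries" % MAX_RETRY)
            # Copy the request and bypass the duplicate filter when rescheduling.
            retry_request = request.replace(dont_filter=True)
            retry_request.meta["retry_times"] = retries
            # Drop the old proxy so process_request picks a new one on the retry.
            retry_request.meta.pop("proxy", None)
            return retry_request
        return response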
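
The list-selection rule from the question (one list for roughly the first 10 attempts, the other list afterwards) can also be pulled out into a small function that is easy to check outside Scrapy. The following is only a minimal sketch: the helper name `pick_proxy`, its `threshold` and `exclude` parameters, and the example proxy URLs are assumptions for illustration and are not part of the original code.

```python
import random


def pick_proxy(retries, primary, fallback, threshold=10, exclude=None):
    """Pick a random proxy for the current attempt (hypothetical helper).

    Uses `primary` while retries <= threshold, then switches to `fallback`.
    `exclude` is the proxy used by the previous attempt; it is skipped when
    another candidate exists, so a retry goes through a different proxy.
    """
    pool = primary if retries <= threshold else fallback
    # settings.get() returns None for a missing setting, so drop empty entries.
    candidates = [p for p in pool if p]
    if not candidates:
        raise ValueError("no proxy configured for this retry stage")
    others = [p for p in candidates if p != exclude]
    # Fall back to the full candidate list if excluding left nothing.
    return random.choice(others or candidates)


if __name__ == "__main__":
    # Made-up proxy URLs, for illustration only.
    first = ["http://proxy-a.example:8000", "http://proxy-b.example:8000", None]
    last_resort = ["http://proxy-z.example:8000"]
    print(pick_proxy(0, first, last_resort))
    print(pick_proxy(11, first, last_resort))
```

Inside the middleware, `process_request` could call such a helper with `request.meta.get("retry_times", 0)` and the two lists built in `__init__`. Whether to also exclude the previously used proxy is a design choice; with short lists, plain `random.choice` will often pick the same proxy again.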
