Scrapy: running scrapy-splash with rotating proxies

zlwx9yxi  posted on 2022-11-09 in Other

I am trying to use Scrapy with Splash and rotating proxies. settings.py:

ROBOTSTXT_OBEY = False
BOT_NAME = 'mybot'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
LOG_LEVEL = 'INFO'
USER_AGENT = 'Mozilla/5.0'

# JSON file pretty formatting

FEED_EXPORT_INDENT = 4

# Suppress dataloss warning messages of scrapy downloader

DOWNLOAD_FAIL_ON_DATALOSS = False   
DOWNLOAD_DELAY = 1.25  

# Enable or disable spider middlewares

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

# Splash settings

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = 'http://localhost:8050'

I set ROTATING_PROXY_LIST in the spider:

proxy_list = re.findall(r'(\d*\.\d*\.\d*\.\d*\:\d*)\b',
             requests.get("https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list.txt").text)     
custom_settings = {'ROTATING_PROXY_LIST': proxy_list}
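
The parsing step can be pulled into a small function so it can be checked without any network access. This is only a sketch: `extract_proxies` and `PROXY_RE` are hypothetical names, not part of Scrapy or scrapy-rotating-proxies, and the pattern is a stricter variant of the one above (the original `\d*` also accepts empty groups).

```python
import re

# Hypothetical, stricter pattern: four groups of 1-3 digits, then a port.
PROXY_RE = re.compile(r'\b(\d{1,3}(?:\.\d{1,3}){3}:\d{2,5})\b')


def extract_proxies(text):
    """Return unique ip:port strings in the order they first appear."""
    seen = set()
    proxies = []
    for match in PROXY_RE.findall(text):
        if match not in seen:
            seen.add(match)
            proxies.append(match)
    return proxies


sample = "1.2.3.4:8080 US-N\n1.2.3.4:8080 duplicate\nnot a proxy ...:\n10.0.0.1:3128"
proxy_list = extract_proxies(sample)
```

In the spider, the text returned by `requests.get(...)` would take the place of `sample`, and the result would be passed as `ROTATING_PROXY_LIST` exactly as before.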

I started Splash with docker run -p 8050:8050 scrapinghub/splash. Here is the method that issues the Splash requests:

def start_requests(self):
    urls =  [ 'http://example-com/page_1.html', 'http://example-com/page_1.html']
    for url in urls:
        yield SplashRequest(url, 
                            self.parse_url, 
                            headers={'User-Agent': self.user_agent }, 
                            args = {'render_all': 1, 'wait': 0.5}
                            )

However, when I run the crawler, I don't see any requests going through Splash. How can I fix this?
Thanks, Xin

Answer 1 (t5fffqht)

I don't think scrapy-rotating-proxies can be used with Splash. If you want to use a proxy with Splash, try this:

yield SplashRequest(
            'https://ipv4.icanhazip.com/',
            self.parse_response,
            endpoint='execute',
            args={
                'lua_source': self.lua_script,
                'http_method': 'POST',
                'timeout': 60,
                'proxy': 'http://use:pass@Ip:Port'
            },
            errback=self.errback_httpbin)
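
The snippet above hard-codes one proxy and does not show `self.lua_script`. As a rough sketch of how a different proxy could be chosen for each Splash request, the helper below builds the `args` dict from a list. `splash_args_with_proxy` and `LUA_SCRIPT` are hypothetical names; the Lua body is an assumed minimal script, not the one used by the answer's author.

```python
import random

# Assumed minimal Splash script: load the page, wait briefly, return the HTML.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return splash:html()
end
"""


def splash_args_with_proxy(proxies, lua_source, rng=random):
    """Build SplashRequest args with one proxy picked at random."""
    if not proxies:
        raise ValueError("proxy list is empty")
    proxy = rng.choice(proxies)
    # Assumption: bare ip:port entries are plain HTTP proxies.
    if '://' not in proxy:
        proxy = 'http://' + proxy
    return {'lua_source': lua_source, 'timeout': 60, 'proxy': proxy}


args = splash_args_with_proxy(['1.2.3.4:8080'], LUA_SCRIPT)
```

The returned dict would be passed as `args=` to `SplashRequest(..., endpoint='execute')`. Note that this only picks a proxy at random; it does not reproduce the ban detection that scrapy-rotating-proxies performs.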

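If you want to use scrapy-rotating-proxies for regular Scrapy requests alongside Splash requests, add another middleware that excludes the Splash requests from proxy rotation.
settings.py: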

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware':
    810,
    'scrapping_tool.middlewares.ProxiesMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

And the proxy middleware:

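from scrapy.http import FormRequest


class ProxiesMiddleware:
    """Opt every request that is not a FormRequest out of proxy rotation."""

    def __init__(self, settings):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # SplashRequest is not a FormRequest, so it gets an explicit proxy of
        # None here and RotatingProxyMiddleware (priority 610) leaves it alone.
        if not isinstance(request, FormRequest):
            request.meta['proxy'] = None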
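
Note that the check above also switches rotation off for plain `scrapy.Request` objects, since only `FormRequest` instances pass it. A possible alternative, sketched here under the assumption that `SplashRequest` stores its options under `request.meta['splash']`, is to key the decision on that entry instead. `SkipSplashProxyMiddleware` and `is_splash_request` are hypothetical names.

```python
def is_splash_request(meta):
    """Assumption: scrapy-splash requests carry a 'splash' key in meta."""
    return 'splash' in meta


class SkipSplashProxyMiddleware:
    """Sketch of a downloader middleware that exempts only Splash requests."""

    def process_request(self, request, spider):
        if is_splash_request(request.meta):
            # An explicit proxy entry tells the rotating middleware to skip it.
            request.meta['proxy'] = None
        return None  # let the request continue through the other middlewares
```

It would be registered in DOWNLOADER_MIDDLEWARES with a priority below 610, as in the settings above, so that it runs before RotatingProxyMiddleware.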
