I am trying to parse data from a website using Scrapy, but the site is protected by Cloudflare. I found a solution using cloudscraper, and cloudscraper does bypass the protection, but I don't understand how to use it together with Scrapy.
I want to write something like this:
import scrapy
from scrapy.xlib.pydispatch import dispatcher
import cloudscraper
import requests
from scrapy.http import Request, FormRequest

class PycoderSpider(scrapy.Spider):
    name = 'armata_exper'
    start_urls = ['https://arma-models.ru/catalog/sbornye_modeli/?limit=48']

    def start_requests(self):
        url = "https://arma-models.ru/catalog/sbornye_modeli/?limit=48"
        scraper = cloudscraper.CloudScraper()
        cookie_value, user_agent = scraper.get_tokens(url)
        yield scrapy.Request(url, cookies=cookie_value, headers={'User-Agent': user_agent})

    def parse(self, response):
        ...
but I get this error:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/usr/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'Response' object has no attribute 'meta'

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/lib/python3.6/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'Response' object has no attribute 'dont_filter'
Please tell me the right way to do this.
1 Answer
I have successfully integrated Scrapy and cloudscraper using Scrapy downloader middlewares. The middleware I came up with works as follows.
I use the process_response middleware method. If I detect that the response status is 403 or 503, I perform the same request with cloudscraper; otherwise, I just continue down the normal pipeline. (For simplicity, you could also drop this if and always use cloudscraper, or define a more precise condition for when to fall back to it.) Also, since requests responses are not the same as Scrapy responses, we need to convert them into Scrapy responses. Finally, you have to enable the middleware in your spider; I like to do this by defining the custom_settings class variable.
(The exact path to the middleware will depend on your project structure.)
You can find my complete example here.