How do I fix a 403 error when scraping with Scrapy?

Asked by wwwo4jvm on 2022-11-09

I keep getting a 403 error with Scrapy even though I have appropriate headers set. The site I am trying to scrape is https://steamdb.info/graph/.
My code:

def start_requests(self):
    headers = {
        "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Mobile Safari/537.36",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,en-GB;q=0.8,ar;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "referer": "https://steamdb.info/graph/",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "x-requested-with": "XMLHttpRequest",
    }

    yield scrapy.Request(url='https://steamdb.info/graph', method='GET', headers=headers, callback=self.parse)

def parse(self, response):
    # stuff to do
    pass

The error:

2022-07-08 20:20:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://steamdb.info/graph> (referer: https://steamdb.info/graph/)
2022-07-08 20:20:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://steamdb.info/graph>: HTTP status code is not handled or not allowed

ut6juiuv1#

The site is protected by Cloudflare.

https://steamdb.info/graph/ is using Cloudflare CDN/Proxy!

https://steamdb.info/graph/ is using Cloudflare SSL!

It works with cloudscraper, a requests-like module that can handle Cloudflare protection.

import cloudscraper

# create_scraper returns a requests-like session that solves Cloudflare's challenge
scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0'})
url = 'https://steamdb.info/graph/'
req = scraper.get(url)
print(req)

Output:

<Response [200]>
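
Not part of the original answer, just a minimal sketch of how this could plug back into a scraping flow: since cloudscraper behaves like a requests session, the HTML it returns can be parsed with Scrapy's standalone Selector. The h1::text selector below is only an assumed example, not something taken from the site.

import cloudscraper
from scrapy.selector import Selector

# fetch the Cloudflare-protected page with cloudscraper ...
scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0'})
html = scraper.get('https://steamdb.info/graph/').text

# ... then parse the returned HTML with Scrapy's Selector, outside of a crawl
sel = Selector(text=html)
print(sel.css('h1::text').get())  # 'h1::text' is an assumed example selector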

tzcvj98z2#

This is because that page does not exist - https:steamdb.info/graphs/ goes to a 404.
Thanks


n9vozmp43#

I solved this problem. If a website is behind Cloudflare, you can use undetected-chromedriver and plug it in as a Scrapy downloader middleware.
Add this to middlewares.py:

import undetected_chromedriver as uc
from scrapy.http import HtmlResponse


class SeleniumMiddleWare(object):

    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"  # path to your local chromedriver
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        # disable image loading to speed the page load up
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except Exception:
            pass
        content = self.driver.page_source
        # quits the browser after serving this request; fine for a single-URL crawl
        self.driver.quit()

        # hand the rendered page back to Scrapy as an HtmlResponse
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response

settings.py:

DOWNLOADER_MIDDLEWARES = {
    'my_scraper.middlewares.SeleniumMiddleWare': 491,  # change my_scraper to your project's name
}

And the spider, my_scraper.py:

import scrapy


class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        # response here is the HtmlResponse built by the middleware
        yield {"title": response.css("h1::text").get()}
