Scrapy cannot get a response from a website, but `requests` can

hfyxw5xn  posted on 2023-10-20  in Other

I am using Scrapy to scrape this page,
but for some reason Scrapy cannot get a response from the site. When I run the spider I get an HTTP 500 error.
Here is my basic spider:

import scrapy

class SavingsGov(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }

Here is the error I get when I run it (I also increased the retry count to 10 in settings.py):

2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)

However, I can easily get a response with Python's requests module. Here is the Python code:

import requests

response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)

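I don't know why this happens; I assume the problem lies with scrapy.Request.
Is there a way to make the request with requests and pass the response to Scrapy? A better option, though, would be to debug scrapy.Request somehow.
I am new to Scrapy, so please let me know if I have misunderstood the problem.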

xmq68pz9


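This is most likely because the server rejects requests that carry Scrapy's default user agent.
Try setting a custom USER_AGENT in the spider's custom_settings, and also set ROBOTSTXT_OBEY to False.
For example: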

import scrapy

class SavingsGov(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/'
    ]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }

Partial output:

2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draw-list/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-200-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-15000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-7500-draws/'}
