I am using scrapy to scrape this page, but for some reason scrapy cannot get a response from this website. When I run the crawler I get an HTTP 500 error.
Below is my basic spider:
import scrapy


class SavingsGov(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
Here is the error I get when I run it (I also increased the retry count to 10 in settings.py):
2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)
However, I can easily get a response using Python's requests module. Here is the Python code for it:
import requests
response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)
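I don't know why this is happening; I assume the problem lies with scrapy.Request.
Is there a way to perform the request with requests and pass the response to scrapy? A better option, though, would be to debug scrapy.Request somehow.
I am new to scrapy, so if I have misunderstood the problem, please let me know.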
1 Answer
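This is most likely because the server rejects requests that carry scrapy's default user agent.
Try setting a custom user agent in the spider's custom_settings. Also set ROBOTSTXT_OBEY to False.
For example: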
Partial output: