Scrapy cannot fetch a page that curl or a browser can retrieve

Posted by 00jrzges on 2024-01-09 in Other

I have a simple spider:

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = [
            'https://api.ipify.org?format=json',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info('================Request: %s, IP address: %s' % (response.request, response.text))

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ScraperSpider)
    process.start()

However, the request fails with a 400 response:

2023-12-18 23:56:34 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://api.ipify.org?format=json> (referer: None)
2023-12-18 23:56:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://api.ipify.org?format=json>: HTTP status code is not handled or not allowed
2023-12-18 23:56:34 [scrapy.core.engine] INFO: Closing spider (finished)


Yet the same URL can be fetched with curl or a browser.

Answer #1, by 8gsdolmq

Add a / before the ? in the URL, so that the URL has an explicit path:

import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = [
            'https://api.ipify.org/?format=json',
        ]
        for url in urls:
            yield scrapy.Request(url=url)

    def parse(self, response):
        self.logger.info('================Request: %s, IP address: %s' % (response.request, response.json().get('ip')))

Output:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.ipify.org/?format=json> (referer: None)
[scraper] INFO: ================Request: <GET https://api.ipify.org/?format=json>, IP address: X.X.X.X
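
If there are many start URLs, the same fix could be applied programmatically instead of editing each one by hand. The sketch below uses only the standard library. The helper name `ensure_root_path` is hypothetical and is not part of Scrapy. It rests on the assumption, which the thread does not confirm, that the 400 comes from the empty path; curl and browsers treat an empty path as `/`.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_root_path(url: str) -> str:
    """Return `url` with an explicit "/" path when its path is empty.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    parts = urlsplit(url)
    # Only touch absolute URLs (those with a host) whose path is missing.
    if parts.netloc and not parts.path:
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    # The URL from the question gains a "/" before the "?".
    print(ensure_root_path("https://api.ipify.org?format=json"))
    # A URL that already has a path is returned unchanged.
    print(ensure_root_path("https://api.ipify.org/?format=json"))
```

In the spider above, this would mean yielding `scrapy.Request(url=ensure_root_path(url))` inside `start_requests`.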
