Response 405 from Scrapy

6rqinv9w  posted on 2023-01-05  in Other

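I'm trying to scrape author data from http://quotes.toscrape.com/, but unfortunately the author pages return 405 when I run the spider, whereas in a browser, or when fetching the same URL in the Scrapy shell, they return 200.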

import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name','birth_date','birth_location','description',) 
    }

    def parse(self, response):
        for _ in response.xpath("//div[@class='quote']"):
            author_page = response.xpath("//a[text()='(about)']/@href").get()
            yield response.follow(author_page,
                                method="GET",
                                callback=self.parse_author)

        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }

Here is part of the output I get when I run scrapy crawl authors:

2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/Suzanne-Collins/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: http://quotes.toscrape.com/page/8/)
2023-01-02 10:53:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <NONE http://quotes.toscrape.com/author/John-Lennon/> from <GET http://quotes.toscrape.com/author/John-Lennon>
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/W-C-Fields/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Alfred-Tennyson/> (referer: http://quotes.toscrape.com/page/8/)
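
One way to make the pattern in this log easier to see is to tally the status code and HTTP method of each "Crawled" line. The following is only a minimal sketch using the standard library; the names `CRAWLED`, `summarize_crawled`, and `SAMPLE_LOG` are hypothetical helpers made up for illustration, and the sample text is copied from the lines above.

```python
import re
from collections import Counter

# Matches engine lines such as:
#   ... DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: ...)
CRAWLED = re.compile(r"Crawled \((?P<status>\d{3})\) <(?P<method>\w+) (?P<url>\S+)>")


def summarize_crawled(log_text):
    """Count (status, method) pairs across the 'Crawled' lines of a Scrapy log."""
    counts = Counter()
    for match in CRAWLED.finditer(log_text):
        counts[(int(match["status"]), match["method"])] += 1
    return counts


# Three lines taken verbatim from the log in the question.
SAMPLE_LOG = """\
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: http://quotes.toscrape.com/page/8/)
"""

counts = summarize_crawled(SAMPLE_LOG)
print(counts)
```

In this excerpt every 405 response belongs to a request whose method is logged as NONE rather than GET, and the redirect line shows such a request being created by a 308 redirect from the URL without the trailing slash.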
2j4z5cfb  #1

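response.follow() does accept a callback, so either API can send the author page to parse_author. An alternative worth trying is to build that request explicitly with scrapy.Request() instead of response.follow(). Keep in mind that scrapy.Request() only accepts absolute URLs, so a relative href has to be resolved with response.urljoin() first. If you want to pass the author page URL to parse_author this way, your code would look like this: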

import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name','birth_date','birth_location','description',) 
    }

    def parse(self, response):
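        for quote in response.xpath("//div[@class='quote']"):
            # Relative XPath (leading dot) so each quote yields its own author link
            # instead of always returning the first link on the page.
            author_page = quote.xpath(".//a[text()='(about)']/@href").get()
            # scrapy.Request() needs an absolute URL, so resolve the href first.
            yield scrapy.Request(response.urljoin(author_page),
                                 method="GET",
                                 callback=self.parse_author)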

        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }
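
To illustrate why the href has to be resolved before it is handed to scrapy.Request(), here is a small sketch using only the standard library. It uses urllib.parse.urljoin as a stand-in for response.urljoin(), which, as far as I know, behaves the same way apart from also honoring an HTML base tag. The helper name `absolute_author_url` is hypothetical, and the href value is an assumption based on the non-trailing-slash URL in the redirect line of the log.

```python
from urllib.parse import urljoin


def absolute_author_url(page_url, href):
    """Resolve an author href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Assumed example values: a listing page and a root-relative author link.
page_url = "http://quotes.toscrape.com/page/7/"
href = "/author/Suzanne-Collins"

print(absolute_author_url(page_url, href))
```

A root-relative href replaces the whole path of the page URL, while an href without a leading slash is resolved relative to the current page directory, so it is worth checking which form the site actually uses.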

If you have any further questions, reply to this answer. Happy learning!
