Scrapy and Python parsing

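I'm learning Scrapy, using the website http://quotes.toscrape.com as an example. I created a simple spider (scrapy genspider quotes). I want to parse the quotes and also go to each author's page and parse their date of birth. I tried to do it like this, but nothing works.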

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        
        quotes=response.xpath('//div[@class="quote"]') 
        
        item={}

        for quote in quotes: 
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item) 
            

        new_page=response.xpath('//li[@class="next"]/a/@href').get() 

        if new_page is not None: 

            yield response.follow(new_page,self.parse) 
            
    def parse_additional_page(self, response, item): 
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get() 
        yield item

The code without the birth date (works correctly):

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            }
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

Question: how do I go to the author's page for each quote and parse the date of birth?

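You are really close to having it correct. There are just a couple of things you are missing and one thing that needs to be moved.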

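1. response.follow returns a Request object, so unless you yield that request, it will never be scheduled by the Scrapy engine.
2. When passing objects from one callback function to another, you should use the cb_kwargs parameter. Using the meta dictionary also works, but the Scrapy docs recommend cb_kwargs. Simply passing the object as a positional argument will not work, because the third positional parameter of response.follow is the HTTP method, not callback data.
3. A dict is mutable, and that includes when it is used as a Scrapy item. So each item you create should be a unique object. Otherwise, updating the item later can change items that were already yielded.

Here is an example that uses your code but implements the three points above.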

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop
            # means it will be unique for each item
            item = {}

            item['name'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['tags'] = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url = quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow
            yield response.follow(url, self.parse_additional_page, cb_kwargs={"item": item})
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            # with no callback given, Scrapy falls back to the spider's parse method
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        # item arrives as a keyword argument through cb_kwargs
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item
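
The point about mutable dicts can be reproduced without Scrapy. The following is a minimal sketch in plain Python; the function names shared_item and fresh_item are made up for illustration and are not part of the spider. It compares reusing one dict across iterations with creating a new dict on each iteration.

```python
def shared_item(names):
    """Yield the same dict object on every iteration (the problematic pattern)."""
    item = {}  # created once, outside the loop
    for name in names:
        item["name"] = name  # overwrites the value held by earlier consumers
        yield item


def fresh_item(names):
    """Yield a new dict on every iteration (the pattern used in the answer)."""
    for name in names:
        item = {}  # created inside the loop, so each item is independent
        item["name"] = name
        yield item


names = ["first", "second", "third"]
shared = list(shared_item(names))
fresh = list(fresh_item(names))

# Every entry in `shared` is the same object, so all show the last value.
print([d["name"] for d in shared])
# Entries in `fresh` are separate objects and keep their own values.
print([d["name"] for d in fresh])
```

In the spider the effect would be similar: the author-page callbacks run after the loop over the quotes has finished, so a dict shared by all requests would carry the data of the last quote on the page.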

Partial output:

2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': 'September 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}

For more information, see "Passing additional data to callback functions" and Response.follow in the Scrapy docs.
