How to scrape multiple pages for the same item with Scrapy

a7qyws3x  posted on 2023-01-13 in Other

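I want to scrape multiple pages for the same item, but every time I yield, the sub-items come out one by one as separate items instead of being collected in the list that belongs to the same item.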

class GdSpider(scrapy.Spider):
    name = 'pcs'
    start_urls = [...]

    def parse(self, response):
        PC= dict()
        PC['Name'] = response.css('h2::text').get()
        components_urls = response.css('a::attr(href)').get()
        components = []
        for url in components_urls:
            req = yield scrapy.Request(response.urljoin(url), self.parse_component)
            components.append(parse_component(req))
        PC['components'] = components
        yield PC

    def parse_component(self, response):
        component_name = response.css('h1::text')
        component_tag = response.css('div[class="tag"]::text').get()
        yield {"component_name": component_name, "component_tag": component_tag}

My output should look like this:

{"Name": "HP 15", "components": [.....]}

But everything is scraped separately:

{"Name": "HP 15", "components":  [<generator object GdSpider.parse_part_component at 0x000001B8A7405230>]

{component1}
{component2}

How can I return a single item that contains all the components, for example by using the @inline_requests decorator?

bakd9h0s  #1

**Option 1:** Use async/await

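import scrapy
from scrapy.utils.defer import maybe_deferred_to_future  # assumes Scrapy 2.6 or later


class GdSpider(scrapy.Spider):
    name = 'pcs'
    start_urls = [...]

    async def parse(self, response):
        PC = dict()
        PC['Name'] = response.css('h2::text').get()
        # getall() returns every href; get() would return only the first one
        components_urls = response.css('a::attr(href)').getall()
        components = []
        for url in components_urls:
            req = scrapy.Request(response.urljoin(url))
            # Download the component page inline and wait for its response.
            # Newer Scrapy versions take only the request here; the original
            # answer passed the spider as a second argument.
            res = await maybe_deferred_to_future(self.crawler.engine.download(req))
            components.append(self.parse_component(res))
        PC['components'] = components
        # A single item that already holds every component
        yield PC

    def parse_component(self, response):
        component_name = response.css('h1::text').get()
        component_tag = response.css('div[class="tag"]::text').get()
        # Return a plain dict; with yield, the call above would store a generator object
        return {"component_name": component_name, "component_tag": component_tag}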
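The `<generator object ...>` entry in the question's output appears because a function that contains `yield` returns a generator when it is called, and its body does not run until something iterates over it. A minimal sketch without Scrapy, where `component_as_generator` and `component_as_dict` are hypothetical stand-ins for the component callback:

```python
def component_as_generator(name, tag):
    # Contains a yield, so calling it only creates a generator object.
    yield {"component_name": name, "component_tag": tag}


def component_as_dict(name, tag):
    # Returns the dict directly.
    return {"component_name": name, "component_tag": tag}


components = []
# append() stores the generator object itself, not the dict it would produce.
components.append(component_as_generator("CPU", "x"))
# extend() iterates the generator, so the dict it yields is stored.
components.extend(component_as_generator("RAM", "y"))
# A function that returns the dict can be appended directly.
components.append(component_as_dict("SSD", "z"))

print(components)
```

A callback that is called directly from inside `parse` therefore has to either return its dict or be iterated, for example with `extend()` or `list()`.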

**Option 2:** Use a class member variable.

(Note that CONCURRENT_REQUESTS is set to 1.)

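import scrapy


class GdSpider(scrapy.Spider):
    name = 'pcs'
    start_urls = [...]
    # Class-level list, shared by every item this spider produces
    components = []

    # Process one request at a time so components arrive in order
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def parse(self, response):
        PC = dict()
        PC['Name'] = response.css('h2::text').get()
        # getall() returns every href; get() would return only the first one
        components_urls = response.css('a::attr(href)').getall()

        for url in components_urls:
            yield scrapy.Request(response.urljoin(url), self.parse_component)

        # PC keeps a reference to the shared list; parse_component fills it in
        # later, so it may still be empty when this item is exported
        PC['components'] = self.components
        yield PC

    def parse_component(self, response):
        component_name = response.css('h1::text').get()
        component_tag = response.css('div[class="tag"]::text').get()
        self.components.append({"component_name": component_name, "component_tag": component_tag})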
