Reformatting the JSON output with Scrapy

wz8daaqr, posted on 2023-04-30 in Other

I export my scraped data as JSON. By default, the Scrapy exporter writes a JSON list of dicts, one per item. The items look like this:

[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]

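However, I want to export the data in a specific format, shown below. I would set the shop name, location, and contact manually in a variable, then take the data I crawl and place it in an array under the "Products" key.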

{
"Shop Name":"Shop 1",
"Location":"XXXXXXXXX",
"Contact":"XXXX-XXXXX",
"Products":
[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]
}

Below is my code, showing how I get the scraped data.

def parse(self, response):
        for products in response.css('div.single_product'):
            yield {
                'name': products.css('h4.product_name::text').get(),
                'price': products.css('span.current_price::text').get(),
                'code': products.css('div.single_product').attrib['data-itemcode'],
                'url' : urljoin("https://xxxx", products.css('a.image-popup-no-margins').attrib['data-image'] )
            }

Answer 1 (uwopmtnx)

Build the dict in the shape you want and yield it once, instead of yielding each item separately. For example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/list_basic/']
    custom_settings = {
        'FEEDS': {
            'items.json': {
                'format': 'json',
                'encoding': 'utf8',
                'indent': 4,
            }
        }
    }

    def parse(self, response, **kwargs):
        # Collect every product on this page into one list
        products = []
        for product in response.xpath('//div[@class="card"]'):
            products.append(
                {'title': product.xpath('.//h4/a/text()').get(default=''),
                 'price': product.xpath('.//h5//text()').get(default='')}
            )

        # Yield a single wrapper dict for the page instead of one item per product
        item = {
            'URL': response.url,
            'Products': products,
        }
        yield item

        # Follow pagination; each followed page yields its own wrapper dict
        next_page = response.xpath('//ul[@class="pagination"]//li[last()]/a/@href').get()
        if next_page:
            yield response.follow(url=next_page)
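
Note that the spider above yields one wrapper dict per page it visits, so a paginated crawl still produces a list of wrappers. If a single object for the whole crawl is wanted, as in the question, one possible approach is to keep yielding one dict per product and let an item pipeline collect them and write the wrapped file when the spider closes. The following is only a minimal sketch under assumptions: the class name `ShopExportPipeline`, the output file name `shop.json`, and the hard-coded shop details are all hypothetical. The method names (`open_spider`, `process_item`, `close_spider`) follow Scrapy's item pipeline interface, which only requires a plain Python class.

```python
import json


class ShopExportPipeline:
    """Collect every scraped item and write one wrapped JSON object at the end."""

    # Hypothetical shop details, set by hand as the question describes
    shop_info = {
        "Shop Name": "Shop 1",
        "Location": "XXXXXXXXX",
        "Contact": "XXXX-XXXXX",
    }
    output_path = "shop.json"  # assumed output file name

    def open_spider(self, spider):
        # Start with an empty product list for this crawl
        self.products = []

    def process_item(self, item, spider):
        # Keep a plain-dict copy and pass the item on unchanged
        self.products.append(dict(item))
        return item

    def close_spider(self, spider):
        # Wrap the collected products in the shop envelope and write the file once
        data = {**self.shop_info, "Products": self.products}
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4, ensure_ascii=False)
```

To try it, the class would be registered in the `ITEM_PIPELINES` setting, for example `{'myproject.pipelines.ShopExportPipeline': 300}`, where the module path is an assumption about the project layout. Because the pipeline holds all products in memory until the crawl ends, it suits modest result sets better than very large ones.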
