Scrapy does not scrape the links collected from pagination

quhf5bfb · asked 2022-11-09

I am trying to scrape the products of an e-commerce site, but the problem I am currently facing is that not all of the paginated pages get visited. The links themselves are valid and reachable, not non-existent.
My spider code:

import scrapy
import json
from pbl.items import ShopCard

class SpidermaximaSpider(scrapy.Spider):
    name = 'spiderMaxima'
    allowed_domains = ['www.trobos.lt']
    start_urls = ['https://trobos.lt/prekes?vendor=MAXIMA']
    item = []
    list = [{
        'sid': 10,
        'name': 'Maxima',
        'domain': 'https://www.maxima.lt/',
        'imageurl': 'https://upload.wikimedia.org/wikipedia/commons/c/c1/Maxima_logo.svg',
        'product': item
        }]

    def __init__(self):
        self.declare_xpath()

    def declare_xpath(self):
        self.getAllItemsXpath =  '//*[@id="category"]/div/div[1]/div/div[3]/div[4]/div/div/div/div/div/a/@href'
        self.TitleXpath  = '//*[@id="product"]/section[1]/div[3]/section/div[2]/h1/text()'    
        self.PriceXpath = '//*[@id="product"]/section[1]/div[3]/section/div[2]/div[1]/div/div[1]/div/div[1]/span/text()'

    def parse(self, response):
        for href in response.xpath(self.getAllItemsXpath):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url=url,callback=self.parse_main_item, dont_filter=True)

        next_page = [response.url + '&page='+str(x) for x in range(1,193)]
        for page in next_page:
            print('-'* 100)
            print(page)
            print('-'* 100)
            url = page
            yield scrapy.Request(url, callback=self.parse)

    def parse_main_item(self,response): 
        shop = ShopCard()
        Title = response.xpath(self.TitleXpath).extract_first()
        Link = response.url
        Image = 'https://upload.wikimedia.org/wikipedia/commons/c/c1/Maxima_logo.svg'
        Price = response.xpath(self.PriceXpath).extract_first()
        Price = Price.replace(',', '.')
        Price = float(Price.split(' ')[0])

        shop['item'] = {
                'title': Title,
                'link': Link,
                'image': Image,
                'price': Price
            }

        self.item.append(shop['item'])

    def closed(self, reason):
        with open("spiderMaxima.json", "w") as final:
            json.dump(self.list, final, indent=2, ensure_ascii=False)

I used a list built with the range() function because in the response (from view(response) in scrapy shell) the pagination buttons are wired to a script. I also tried several of the links in scrapy shell, and the xpath output works fine, but the pages are still not being scraped. What could the problem be? Is there another way to handle pagination?
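
A check along these lines is what I mean (a sketch rather than a verbatim transcript; the page number is arbitrary):

scrapy shell "https://trobos.lt/prekes?vendor=MAXIMA&page=5"
>>> view(response)   # opens the downloaded page in a browser
>>> response.xpath('//*[@id="category"]/div/div[1]/div/div[3]/div[4]/div/div/div/div/div/a/@href').getall()
# prints the product hrefs, so the xpath itself does return links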


qvtsj1bj · answer #1

Your code has quite a few errors, and there are other things that can be improved. Please read the documentation carefully.
1. You do not actually need to create the xpath attributes.
2. You can write the xpaths much shorter.
3. You can build the whole start_urls list up front.
4. You can let the item exporter take care of the JSON.
Here is an example; change it to fit your needs.

import scrapy

class ShopCard(scrapy.Item):
    item = scrapy.Field()

class SpidermaximaSpider(scrapy.Spider):
    name = 'spiderMaxima'
    allowed_domains = ['trobos.lt']
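    # the site's pagination buttons are script-driven, so every catalogue
    # page URL is generated up front instead of being discovered while crawling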
    start_urls = [f'https://trobos.lt/prekes?vendor=MAXIMA&page={i}' for i in range(1, 190)]
    items = []

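    # throttle requests a little and let Scrapy's feed exporter
    # write the JSON file instead of doing it by hand in closed()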
    custom_settings = {
        'DOWNLOAD_DELAY': 0.4,
        'FEEDS': {
            'spiderMaxima.json': {
                'format': 'json',
                'indent': 2,
            },
        },
    }

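    # listing page: collect every product link and follow it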
    def parse(self, response):
        for url in response.xpath('//div[@class="card small"]//a[contains(@class, "shrink")]/@href').getall():
            yield response.follow(url=url, callback=self.parse_main_item)

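    # product page: extract the fields and yield the item to the exporter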
    def parse_main_item(self, response):
        shop = ShopCard()
        Title = response.xpath('//h1/text()').get()
        Link = response.url
        Image = 'https://upload.wikimedia.org/wikipedia/commons/c/c1/Maxima_logo.svg'
        Price = response.xpath('//div[@class="price"]//span/text()').get()
        Price = Price.replace(',', '.')
        Price = float(Price.split(' ')[0])

        shop['item'] = {
            'title': Title,
            'link': Link,
            'image': Image,
            'price': Price
        }

        yield shop
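
To try it, run the spider from inside a Scrapy project as usual; with the FEEDS setting above, Scrapy writes spiderMaxima.json on its own when the crawl finishes:

scrapy crawl spiderMaxima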
