Scrapy: scraping a dynamic Amazon page that requires scrolling

r9f1avp5 posted on 2024-01-09 in Other

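I'm trying to scrape Amazon's top 100 best sellers for a specific category. For example:
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
The 100 products are split across two pages, with 50 products per page.
Previously the page was static and all 50 products were present in the page. Now the page is dynamic, and I need to scroll down before all 50 products on the page are loaded.
I was scraping the page with Scrapy before this change. I'd really appreciate it if you could help me with this. Thanks!
My code is below: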

import scrapy
from scrapy_splash import SplashRequest


class BsrNewSpider(scrapy.Spider):
    name = 'bsr_new'
    allowed_domains = ['www.amazon.in']
    # start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']

    # Lua script executed by Splash: load the page, wait briefly, return the rendered HTML.
    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            url = args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return splash:html()
        end
    '''

    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
        yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
            'lua_source': self.script
        })

    def parse(self, response):
        # One grid item per product that is present in the rendered HTML.
        for rev in response.xpath("//div[@id='gridItemRoot']"):
            yield {
                'Segment': "Home",  # Enter name of the segment here
                # 'Sub-segment': segment,
                'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
            }

        # Follow the pagination link once per page, after all items have been yielded.
        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            url = response.urljoin(next_page)
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })

Regards,
Sreejan

gmxoilav #1

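Here is an alternative that doesn't require Splash.
The ASINs of all 50 products are hidden in the markup of the first page. You can extract those ASINs and build the URLs for all 50 products.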

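import json

import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    custom_settings = {
        # Important: clear Scrapy's default request headers. An empty dict is used
        # here instead of the original empty string, since this setting is a dict.
        'DEFAULT_REQUEST_HEADERS': {},
    }
    start_urls = ['https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']

    def parse(self, response):
        # The product list is stored as JSON in the data-client-recs-list attribute.
        raw_data = response.css('[data-client-recs-list]::attr(data-client-recs-list)').get()
        if raw_data is None:
            return  # nothing to parse if the attribute is missing
        data = json.loads(raw_data)
        for item in data:
            # Each entry's 'id' is the ASIN; build the product page URL from it.
            url = 'https://www.amazon.com/dp/{}'.format(item['id'])
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        ...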
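
To see how the attribute extraction works without hitting Amazon, here is a minimal standalone sketch using only the standard library. It assumes the attribute holds an HTML-escaped JSON array of objects with an `id` key, which is what the spider's `item['id']` lookup relies on. The sample markup and ASINs are made up, and `RecsListParser` and `product_urls` are hypothetical helper names, not part of Scrapy.

```python
import json
from html.parser import HTMLParser


class RecsListParser(HTMLParser):
    """Remembers the first data-client-recs-list attribute value it sees."""

    def __init__(self):
        super().__init__()
        self.raw_recs = None

    def handle_starttag(self, tag, attrs):
        if self.raw_recs is None:
            # attrs is a list of (name, value) pairs; entities such as
            # &quot; are already unescaped by HTMLParser.
            self.raw_recs = dict(attrs).get("data-client-recs-list")


def product_urls(html, template="https://www.amazon.com/dp/{}"):
    """Return one product URL per entry in the embedded JSON list."""
    parser = RecsListParser()
    parser.feed(html)
    if parser.raw_recs is None:
        return []
    return [template.format(item["id"]) for item in json.loads(parser.raw_recs)]


# Made-up sample markup; the ASINs are placeholders, not real products.
SAMPLE_HTML = (
    '<div data-client-recs-list="'
    '[{&quot;id&quot;:&quot;B000000001&quot;},'
    '{&quot;id&quot;:&quot;B000000002&quot;}]'
    '"></div>'
)

urls = product_urls(SAMPLE_HTML)
print(urls)
```

In the spider itself, `response.css(...).get()` plays the role of the parser class here; the rest (`json.loads`, then formatting each `id` into a `/dp/` URL) is the same.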
