Scraping data from an infinite scroll page with Scrapy

gg58donl · asked 2022-11-09

I am new to web scraping, and I want to scrape the information for all products from this website:
https://www.trendyol.com/
I have written some sample code to scrape the data, shown below:

import datetime
import json
import re

import scrapy
from bs4 import BeautifulSoup

def start_requests(self):
    urls = [
        'https://www.trendyol.com/camasir-deterjani-x-c108713',
        'https://www.trendyol.com/yumusaticilar-x-c103814',
        'https://www.trendyol.com/camasir-suyu-x-c103812',
        'https://www.trendyol.com/camasir-leke-cikaricilar-x-c103810',
        'https://www.trendyol.com/camasir-yan-urun-x-c105534',
        'https://www.trendyol.com/kirec-onleyici-x-c103806',
        'https://www.trendyol.com/makine-kirec-onleyici-ve-temizleyici-x-c144512'
    ]

    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')  # note: the parsed tree is never actually used below
    # the product list is embedded in the page as a JSON blob
    data = re.search(r"__SEARCH_APP_INITIAL_STATE__=(.*?});", response.text)
    data = json.loads(data.group(1))

    for p in data["products"]:
        item = TeknosaItem()  # TeknosaItem is defined in the project's items.py

        item['rowid'] = hash(str(datetime.datetime.now()) + str(p["id"]))
        item['date'] = str(datetime.datetime.now())
        item['listing_id'] = p["id"]
        item['product_id'] = p["id"]
        item['product_name'] = p["name"]
        item['price'] = p["price"]["sellingPrice"]
        item['url'] = p["url"]
        yield item

The code I wrote can scrape the data for all products listed on the first page, but when you scroll down, the page dynamically loads more data via AJAX GET requests, and my spider cannot scrape those. I have watched some videos and read some articles, but I could not figure out how to scrape data that is generated dynamically on scroll. Any help would be appreciated.
I found an example of an infinite-scroll page on the target site:
web site link


gywdnpxw #1

I don't use Scrapy, but you can adapt the following example of how to get all products from a category (using their AJAX API):

import requests

categories = [
    "camasir-deterjani-x-c108713",
    "yumusaticilar-x-c103814",
    "camasir-suyu-x-c103812",
    "camasir-leke-cikaricilar-x-c103810",
    "camasir-yan-urun-x-c105534",
    "kirec-onleyici-x-c103806",
    "makine-kirec-onleyici-ve-temizleyici-x-c144512",
]

# iterate over categories to construct api_url

# here I will only get products from first category:

api_url = (
    "https://public.trendyol.com/discovery-web-searchgw-service/v2/api/infinite-scroll/"
    + categories[0]
)

payload = {
    "pi": 1,
    "culture": "tr-TR",
    "userGenderId": "1",
    "pId": "0",
    "scoringAlgorithmId": "2",
    "categoryRelevancyEnabled": "false",
    "isLegalRequirementConfirmed": "false",
    "searchStrategyType": "DEFAULT",
    "productStampType": "TypeA",
    "fixSlotProductAdsIncluded": "false",
}

page = 1
while True:  # page through the API until it returns no more products
    payload["pi"] = page
    data = requests.get(api_url, params=payload).json()

    if not data["result"]["products"]:
        break

    for p in data["result"]["products"]:
        name = p["name"]
        id_ = p["id"]
        price = p["price"]["sellingPrice"]
        u = p["url"]
        print("{:<10} {:<50} {:<10} {}".format(id_, name[:49], price, u[:60]))

    page += 1

This fetches all the products in the category:

...

237119563  Organik Sertifikalı Çamaşır Deterjanı              63         /eya-clean/organik-sertifikali-camasir-deterjani-p-237119563
90066873   Toz Deterjan Sık Yıkananlar                        179        /bingo/toz-deterjan-sik-yikananlar-p-90066873
89751820   Sıvı Çamaşır Deterjanı 2 x3L (100 Yıkama) Renkli   144.9      /perwoll/sivi-camasir-deterjani-2-x3l-100-yikama-renkli-siya
112627101  Sıvı Çamaşır Deterjanı (95 Yıkama) 3L Renkli + 2,  144.9      /perwoll/sivi-camasir-deterjani-95-yikama-3l-renkli-2-7l-cic
95398460   Toz Çamaşır Deterjanı Active Beyazlar Ve Renklile  180.99     /omo/toz-camasir-deterjani-active-beyazlar-ve-renkliler-10-k

...
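Since the question uses Scrapy, the same API pagination can also be expressed as a spider. Below is a minimal, untested sketch under the assumption that the endpoint behaves exactly as in the requests example above and that the culture/pi query parameters are enough; the spider name and the yielded fields are placeholders:

import json
import scrapy

class TrendyolApiSpider(scrapy.Spider):
    # hypothetical spider that pages through the infinite-scroll API
    name = 'trendyol_api'

    categories = [
        'camasir-deterjani-x-c108713',
        'yumusaticilar-x-c103814',
    ]
    api_base = 'https://public.trendyol.com/discovery-web-searchgw-service/v2/api/infinite-scroll/'

    def start_requests(self):
        for cat in self.categories:
            yield scrapy.Request(f'{self.api_base}{cat}?pi=1&culture=tr-TR',
                                 callback=self.parse_api,
                                 cb_kwargs={'category': cat, 'page': 1})

    def parse_api(self, response, category, page):
        data = json.loads(response.text)
        products = data['result']['products']
        if not products:  # empty page -> no more results for this category
            return
        for p in products:
            yield {
                'product_id': p['id'],
                'product_name': p['name'],
                'price': p['price']['sellingPrice'],
                'url': p['url'],
            }
        # request the next page of the same category
        page += 1
        yield scrapy.Request(f'{self.api_base}{category}?pi={page}&culture=tr-TR',
                             callback=self.parse_api,
                             cb_kwargs={'category': category, 'page': page})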

qlfbtfca #2

So honestly, I think the best approach is to get the information from the API, but I wanted to answer your question about the pagination.
When you scroll, you can see the URL change (?pi=pagenumber), so we can loop through the pages, and when we reach a page that doesn't exist (404 status), we handle the status code and break out of the loop.

import scrapy
import logging
import json
import datetime

class ExampleSpider(scrapy.Spider):
    name = 'ExampleSpider'

    start_urls = [
        'https://www.trendyol.com/camasir-deterjani-x-c108713',
        'https://www.trendyol.com/yumusaticilar-x-c103814',
        'https://www.trendyol.com/camasir-suyu-x-c103812',
        'https://www.trendyol.com/camasir-leke-cikaricilar-x-c103810',
        'https://www.trendyol.com/camasir-yan-urun-x-c105534',
        'https://www.trendyol.com/kirec-onleyici-x-c103806',
        'https://www.trendyol.com/makine-kirec-onleyici-ve-temizleyici-x-c144512'
    ]

    handle_httpstatus_list = [404]
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, cb_kwargs={'base_url': url, 'page_number': 0}, callback=self.parse_page)

    def parse_page(self, response, base_url, page_number):
        # last page
        if response.status == 404:
            logging.log(logging.INFO, f'Finished scraping {base_url}')
            return

        # You don't need to use BeautifulSoup; you can apply the regex directly
        all_data = response.xpath('//script[@type="application/javascript"]/text()').re(r'__SEARCH_APP_INITIAL_STATE__=(.*?});')

        for data in all_data:   # supposed to be only one element, but still...
            data = json.loads(data)

            for p in data["products"]:
                # item=TeknosaItem()
                item = dict()
                item['rowid'] = hash(str(datetime.datetime.now()) + str(p["id"]))
                item['date'] = str(datetime.datetime.now())
                item['listing_id'] = p["id"]
                item['product_id'] = p["id"]
                item['product_name'] = p["name"]
                item['price'] = p["price"]["sellingPrice"]
                item['url'] = p["url"]
                yield item

        # go to the next page
        page_number += 1
        yield scrapy.Request(url=base_url+f'?pi={str(page_number)}', cb_kwargs={'base_url': base_url, 'page_number': page_number}, callback=self.parse_page)
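
As a side note, if you want to try this spider without scaffolding a full Scrapy project, one option is to run it from a plain script with CrawlerProcess; the FEEDS setting and the output filename below are just an illustration:

from scrapy.crawler import CrawlerProcess

# run the spider from a script and export the items to a JSON file
process = CrawlerProcess(settings={
    'FEEDS': {'products.json': {'format': 'json'}},
})
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl is finished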
