I want to scrape data from this online store. Previously, I was able to scrape all the data I wanted **except** the category, sub-category, and sub-sub-category information, which can be found here.
However, the website seems to have changed recently: I now get a DNSError for the allowed-domain URL, and running the previous version of the code also raises the following error:
if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
KeyError: 'data'
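For reference, the crash itself (though not the underlying problem that the response has changed) can be avoided with defensive lookups. A minimal sketch, assuming the old data-API response was parsed into a Python dict called data:

# Sketch only: chained .get() calls return None/{} instead of raising KeyError
# when the 'data' key is missing from the (assumed) JSON response.
items = data.get('data', {}).get('products', {}).get('items')
if items:
    for product in items:
        ...  # parse each product as before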
According to a comment from another user, something seems to be wrong with my start request URL, but I cannot figure out what, even after inspecting the site with the Network tab of the Google developer tools. I therefore wrote a new scraper that includes error handling (*the full script is at the bottom of this post*), and it caught the three errors/bugs below:
Error/Bug 1:
DEBUG: Rule at line 1 without any user agent to enforce it on.
Error/Bug 2:
File "src\lxml\etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
Error/Bug 3:
ValueError: XPath error: Invalid expression in .//a[contains(@class, 'MuiBox-root css-1efcy4n')]/text())
Also, since I am no longer using the data API, I am having trouble with the start request URLs and with pagination. I want to scrape the products of the different categories, which I have put in a dict; for example, 6011 is electrical goods and 24175 is groceries. Since the site appears to be built with JavaScript, I am also struggling to scrape the data on the next pages. Do I need Selenium? Splash? Please advise. (One tentative idea is sketched after the two snippets below.)
categories = {
    "6011": {"pages": 60, "name": "Цахилгаан бараа"},
    "24175": {"pages": 70, "name": "Хүнс"},
    "24273": {"pages": 40, "name": "Гэр ахуй"},
    "21297": {"pages": 70, "name": "Гоо сайхан"},
    "19653": {"pages": 30, "name": "Гутал, хувцас"}
}
def start_requests(self):
    yield scrapy.Request(url="https://e-shop.nomin.mn/t/6011", errback=self.parse_error)

# handling pagination
next_page = response.xpath(
    "//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
    print(f'Scraped {next_page}')
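One idea I am considering is to generate the start requests directly from the categories dict instead of hard-coding a single URL. This is only a sketch and rests on an unverified assumption: that the listing pages accept a page query parameter of the form https://e-shop.nomin.mn/t/<category_id>?page=<n>. If the next pages are loaded purely client-side by JavaScript, this will not help:

def start_requests(self):
    # Sketch: one request per category page, using the categories dict above.
    # ASSUMPTION: the URL pattern /t/<id>?page=<n> is a guess; it needs to be
    # checked against the real requests shown in the browser's Network tab.
    for cat_id, info in categories.items():
        for page in range(1, info["pages"] + 1):
            url = f"https://e-shop.nomin.mn/t/{cat_id}?page={page}"
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                errback=self.parse_error,
                cb_kwargs={"category_name": info["name"]},
            )

Is a pattern like that even valid for this site, or do I have to find the underlying XHR/JSON request in the Network tab instead?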
Full code:
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from twisted.internet.error import DNSLookupError
dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' E-CPI Nomin'
categories = {
    "6011": {"pages": 60, "name": "Цахилгаан бараа"},
    "24175": {"pages": 70, "name": "Хүнс"},
    "24273": {"pages": 40, "name": "Гэр ахуй"},
    "21297": {"pages": 70, "name": "Гоо сайхан"},
    "19653": {"pages": 30, "name": "Гутал, хувцас"},
    "19451": {"pages": 10, "name": "Авто бараа"},
    "19518": {"pages": 40, "name": "Барилгын материал"},
    "19853": {"pages": 10, "name": "Аялал, Спорт бараа"},
    "19487": {"pages": 50, "name": "Ном"},
    "19767": {"pages": 20, "name": "Бичиг хэрэг"},
    "19469": {"pages": 10, "name": "Эрүүл мэнд"},
    "19545": {"pages": 20, "name": "Хүүхдийн бараа"},
}
# create Spider class
class ecpiNominSpider(scrapy.Spider):
name = "cpi_nomin"
allowed_domains = "www.e-shop.nomin.mn"
custom_settings = {
"USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
"FEEDS": {
f'{filename}.csv': {
'format': 'csv',
'overwrite': True}}
}
    def start_requests(self):
        yield scrapy.Request(url="https://e-shop.nomin.mn/t/6011", errback=self.parse_error)

    def parse_error(self, failure):
        if failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            yield {
                'URL': request.url,
                'Status': failure.value
            }
    def parse(self, response, **kwargs):
        cards = response.xpath("//*[contains(@class,'MuiBox-root css-1kmsi46')]")
        # parse details
        for card in cards:
            name = card.xpath(".//a[contains(@class, 'MuiBox-root css-1efcy4n')]/text())").extract_first()
            price = card.xpath(".//*[contains(@class, 'MuiBox-root css-qr51gz')]").extract_first().strip()
            link = card.xpath(".//a[contains(@href)]/@href").get()
            item = {'name': name,
                    'price': price,
                    'link': 'https://e-shop.nomin.mn/p/' + link
                    }
            # follow absolute link to scrape deeper level
            yield response.follow(link, callback=self.parse_item, meta={'item': item})
        # handling pagination
        next_page = response.xpath(
            "//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')
    def parse_item(self, response):
        # retrieve previously scraped item between callbacks
        item = response.meta['item']
        # parse additional details
        list_li = response.xpath(".//*[contains(@class, 'MuiBreadcrumbs-ol css-nhb8h9')]/text()").extract()
        # get next layer data
        cat1 = list_li[0].strip()
        cat2 = list_li[1].strip()
        cat3 = list_li[2].strip()
        cat4 = list_li[3].strip()
        skp = response.xpath(".//*[contains(@class, ' MuiBox-root css-jyp6ua')/text()").extract()
        # update item with next layer data
        item.update({
            'category': cat1,
            'sub_category': cat2,
            'sub_sub_category': cat3,
            'productName': cat4,
            'productCode': skp
        })
        yield item
# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ecpiNominSpider)
    process.start()
I am not sure whether I should try to somehow fix the earlier version of the code from my previous post, or fix the current version posted above. I would really appreciate any help in scraping the data from all the pages, including the cat 1 (category) and cat 2 (sub-category) information. Sorry, I am getting frustrated with myself: I have asked several questions on Stack Overflow over the past few months and cannot seem to make progress. Thanks again for your help!
1 Answer
I would suggest using Selenium and its WebDriver implementation, which has exactly the functionality you are looking for. For example, you can use
element.click()
and define methods that always navigate to a specific sub-category. It does take some effort to get used to it, though. The Selenium website has a good "Getting Started" section that covers all the important aspects of working with Selenium.
Introduction: https://www.selenium.dev/documentation/webdriver/getting_started/
Selenium documentation: https://www.selenium.dev/documentation/webdriver/
Python-specific documentation: https://selenium-python.readthedocs.io/
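To make this more concrete, here is a minimal sketch of that approach. The CSS selectors are taken from the XPaths in your question and may well have changed, and the next-page selector is only a guess, so treat all of them as placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://e-shop.nomin.mn/t/6011")

# wait until the product cards have been rendered by JavaScript
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".MuiBox-root.css-1kmsi46")))

# collect name and price from each product card (selectors taken from the question)
for card in driver.find_elements(By.CSS_SELECTOR, ".MuiBox-root.css-1kmsi46"):
    name = card.find_element(By.CSS_SELECTOR, "a.css-1efcy4n").text
    price = card.find_element(By.CSS_SELECTOR, ".css-qr51gz").text
    print(name, price)

# move to the next page by clicking the pagination control (selector is a guess)
driver.find_element(By.CSS_SELECTOR, "button[aria-label='Go to next page']").click()

driver.quit()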