"Crawled 0 pages" error: Amazon reviews with Python Scrapy

b09cbbtk posted on 2023-03-18 in Python

As a personal exercise, I am building an Amazon scraper with Scrapy to retrieve product reviews. I am mostly following the ScrapeOps Amazon reviews tutorial, with a few modifications.
More specifically, I want the scraper to do the following:
1. Start from an Amazon product detail page URL.
2. Find the "see-all-reviews-link-foot" link and follow it to the product reviews page.
3. Scrape the review data from the product reviews page and move on to the next review page, until all review information has been stored.
4. Save the review information to a CSV.
I created a standard Scrapy project and added the code below to my "spiders" folder: start_requests starts from a single Amazon product detail page, the parse_see_all_link callback is supposed to find the see-all-reviews-link-foot link and invoke the parse_reviews callback on the product reviews page, and parse_reviews is supposed to scrape the review information and move on to the next review page.
Unfortunately, all I get is a "Crawled 0 pages, scraped 0 items" result.
Here is my code:

import scrapy
from urllib.parse import urljoin

class AmazonReviewsSpider(scrapy.Spider):
    name = "amazon_reviews"

    def start_requests(self):
        url_list = ["https://www.amazon.it/Caff%C3%A8-Toraldo-Miscela-Cremosa-Cialde/dp/B08BPLN57Q/ref=lp_6377867031_1_7?sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D"]
        for product_url in url_list:
            yield scrapy.Request(url=product_url, callback=self.parse_see_all_link, meta={'retry_count': 0})

    def parse_see_all_link(self, response):
        # go to see all reviews link
        see_all_reviews_link = response.css("a[data-hook=see-all-reviews-link-foot]").attrib['href']
        see_all_reviews_url = "https://www.amazon.it" + see_all_reviews_link
        yield scrapy.Request(url=see_all_reviews_url, callback=self.parse_reviews, meta={'retry_count': 0})

    def parse_reviews(self, response):
        retry_count = response.meta['retry_count']

        ## Get Next Page Url
        next_page_relative_url = response.css(".a-pagination .a-last>a::attr(href)").get()
        if next_page_relative_url is not None:
            retry_count = 0
            next_page = urljoin('https://www.amazon.it/', next_page_relative_url)
            yield scrapy.Request(url=next_page, callback=self.parse_reviews, meta={'retry_count': retry_count})
        
        ## Adding this retry_count here to bypass any amazon js rendered review pages
        elif retry_count < 3:
            retry_count = retry_count+1
            yield scrapy.Request(url=response.url, callback=self.parse_reviews, dont_filter=True, meta={'retry_count': retry_count})

        ## Parse Product Reviews
        review_elements = response.css("#cm_cr-review_list div.review")
        for review_element in review_elements:
            yield {
                    "text": "".join(review_element.css("span[data-hook=review-body] ::text").getall()).strip(),
                    "title": review_element.css("*[data-hook=review-title]>span::text").get(),
                    "location_and_date": review_element.css("span[data-hook=review-date] ::text").get(),
                    "verified": bool(review_element.css("span[data-hook=avp-badge] ::text").get()),
                    "rating": review_element.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
                    }
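One quick way to check whether the very first request even returns a parseable page, and whether the review-link selector matches anything on it, is Scrapy's interactive shell. A minimal sketch, using the same product URL and selector as in the spider:

scrapy shell "https://www.amazon.it/Caff%C3%A8-Toraldo-Miscela-Cremosa-Cialde/dp/B08BPLN57Q/ref=lp_6377867031_1_7?sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D"
>>> response.status   # anything other than 200 means the fetch itself went wrong
>>> response.css("a[data-hook=see-all-reviews-link-foot]").attrib.get("href")
>>> # None here means the selector matched nothing on the page that was returned

If the shell cannot fetch the page at all, no response ever reaches the callbacks, which would fit a "Crawled 0 pages" run; if the fetch works but the selector comes back empty, the chain stops in parse_see_all_link instead.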

qnakjoqk · 1#

I am not entirely sure what you are trying to do with the duplicate filter (the dont_filter=True re-requests of the same URL), but it isn't necessary.
Try this:

import scrapy

class AmazonReviewsSpider(scrapy.Spider):
    name = "amazon_reviews"

    def start_requests(self):
        url_list = ["https://www.amazon.it/Caff%C3%A8-Toraldo-Miscela-Cremosa-Cialde/dp/B08BPLN57Q/ref=lp_6377867031_1_7?sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D"]
        for product_url in url_list:
            yield scrapy.Request(url=product_url, callback=self.parse_see_all_link)

    def parse_see_all_link(self, response):
        # Find the "see all reviews" link and resolve its relative href
        # against the response URL instead of hard-coding the domain.
        see_all_reviews_link = response.xpath("//a[@data-hook='see-all-reviews-link-foot']/@href").get()
        url = response.urljoin(see_all_reviews_link)
        yield scrapy.Request(url=url, callback=self.parse_reviews)

    def parse_reviews(self, response):
        # Queue the next review page first, if the pagination offers one.
        next_page = response.css(".a-pagination .a-last>a::attr(href)").get()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse_reviews)
        # Then yield every review on the current page.
        review_elements = response.css("#cm_cr-review_list div.review")
        for review_element in review_elements:
            yield {
                    "text": "".join(review_element.css("span[data-hook=review-body] ::text").getall()).strip(),
                    "title": review_element.css("*[data-hook=review-title]>span::text").get(),
                    "location_and_date": review_element.css("span[data-hook=review-date] ::text").get(),
                    "verified": bool(review_element.css("span[data-hook=avp-badge] ::text").get()),
                    # re_first returns the first regex match as a string (or None)
                    "rating": review_element.css("*[data-hook*=review-star-rating] ::text").re_first(r"(\d+\.*\d*) out"),
                    }

Output

{'downloader/request_bytes': 106007,
 'downloader/request_count': 109,
 'downloader/request_method_count/GET': 109,
 'downloader/response_bytes': 11126825,
 'downloader/response_count': 109,
 'downloader/response_status_count/200': 108,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 64.951797,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 3, 16, 19, 50, 49, 168484),
 'httpcompression/response_bytes': 40477192,
 'httpcompression/response_count': 109,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/404': 1,
 'item_scraped_count': 1070,  # number of items.
 'log_count/DEBUG': 1182,
 'log_count/INFO': 12,
 'request_depth_max': 108,
 'response_received_count': 109,
 'scheduler/dequeued': 109,
 'scheduler/dequeued/memory': 109,
 'scheduler/enqueued': 109,
 'scheduler/enqueued/memory': 109,
 'start_time': datetime.datetime(2023, 3, 16, 19, 49, 44, 216687)}
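
The last step from the question, saving the reviews to a CSV, needs no extra spider code: Scrapy's built-in feed exports serialize the yielded dicts directly. A minimal run command (the output filename is just an example):

scrapy crawl amazon_reviews -O reviews.csv

The -O flag overwrites the file on every run; use -o instead to append to an existing file.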
