Scrapy: problem stopping my spider when crawling pages

Posted by 2wnc66cl on 2023-03-08 in Other


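I'm really new to the Scrapy module in Python, and I have a question about my code.
The website I want to scrape contains some data I need. To get it, my spider crawls every page and retrieves some data from each one.
My problem is how to make it stop. Once the last page (page 75) has been loaded, my spider changes the URL to go to page 76, but the website doesn't return an error or anything similar; it just displays page 75 again and again. For now I stop the spider with a hard-coded check that fires when it is about to crawl page 76. That isn't reliable, though, because the data can change and the website may contain more or fewer pages over time, so the last page will not necessarily be 75.
Can you help me with this? I'd really appreciate it :)
Here's my code: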

import scrapy
from scrapy.exceptions import CloseSpider

class TowardsSustainabilitySpider(scrapy.Spider):
    name = "towards_sustainability"
    allowed_domains = ["towardssustainability.be"]
    start_urls = ["https://towardssustainability.be/products?page=1"]
    page_number = 1

    def parse(self, response):
        rows = response.xpath('//a[@class="Product-item"]')
        for row in rows:
            fund_name = row.xpath('./div/h2/text()').get()
            yield {
                'fund_name':fund_name
            }

        # Go to the next page
        self.page_number += 1
        next_page = f'https://towardssustainability.be/products?page={self.page_number}'
        # Hard-coded stop: page 76 does not exist, the site just serves page 75 again
        if next_page == 'https://towardssustainability.be/products?page=76':
            raise CloseSpider
        yield response.follow(next_page, callback=self.parse)

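I tried a few things:

There is a box on the first page showing the number of results. Since each page contains 10 results, all I have to do is divide that number by 10 and round it to get the number of the last page. It didn't work out, and I'm not quite sure why.
I tried what feels like 100 different ways to make it stop in time: stopping when there are duplicates in my CSV file, trying to match the results of the previous page against the current page, ... Nothing made it stop at the right time.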
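One possible reason the page-count approach failed is the rounding step: rounding to the nearest integer drops a partially filled last page, whereas rounding up keeps it. The sketch below illustrates the difference. The total of 743 is a made-up value, `last_page` is a hypothetical helper, and the 10 results per page come from the question; whether this was the actual cause is not known.

```python
import math

RESULTS_PER_PAGE = 10  # assumption taken from the question


def last_page(total_results: int, per_page: int = RESULTS_PER_PAGE) -> int:
    """Return the number of the last page, counting a partially filled page."""
    return max(1, math.ceil(total_results / per_page))


# Hypothetical total; the real value would be read from the results box.
total = 743
print(last_page(total))              # 75
print(round(total / RESULTS_PER_PAGE))  # 74: plain rounding drops the partial last page
```

With a page count computed this way, the spider could compare `self.page_number` against it instead of against a hard-coded 76.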

scrapy

Source: https://stackoverflow.com/questions/75546750/problem-stopping-my-spider-when-crawling-pages

1 answer

ohtdti5x (#1)

The page itself (the HTTP response) contains a link to the next page. Try using that instead of building the URL yourself.

....
# Read the "next" link from the pagination; stop when there is none
next_page = response.css(".Nav-item--next a::attr(href)").get()
if not next_page:
    raise CloseSpider
...

Answered 2023-03-08
