I want to limit the crawl to 5 pages with the code below, even though the site has 50 pages. I'm using Scrapy's CrawlSpider. How can I do this?
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True),
    )

    def parse_item(self, response):
        product_info = response.xpath('//table[contains(@class, "table-striped")]')
        name = response.xpath('//h1/text()').get()
        upc = product_info.xpath('(./tr/td)[1]/text()').get()
        price = product_info.xpath('(./tr/td)[3]/text()').get()
        availability = product_info.xpath('(./tr/td)[6]/text()').get()
        yield {'Name': name, 'UPC': upc, 'Availability': availability, 'Price': price}
1 Answer
Use the `deny` parameter of the pagination LinkExtractor and the spider will stop following links after the fifth page. See https://docs.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
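For example, a minimal sketch of the spider with the second rule changed. It assumes the pagination URLs follow the catalogue/page-N.html pattern that books.toscrape.com uses; the deny regex shown here is an illustrative choice, not taken from the original answer:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # Product links: unchanged from the question.
        Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True),
        # Pagination links: deny page-6.html and beyond, so only the start
        # page plus pages 2-5 are visited (5 pages in total). The regex is
        # an assumption based on the site's "page-N.html" URL scheme.
        Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a',
                           deny=r'page-(?:[6-9]|[1-9]\d)\.html'),
             follow=True),
    )

    def parse_item(self, response):
        # Same item extraction as in the question.
        product_info = response.xpath('//table[contains(@class, "table-striped")]')
        name = response.xpath('//h1/text()').get()
        upc = product_info.xpath('(./tr/td)[1]/text()').get()
        price = product_info.xpath('(./tr/td)[3]/text()').get()
        availability = product_info.xpath('(./tr/td)[6]/text()').get()
        yield {'Name': name, 'UPC': upc, 'Availability': availability, 'Price': price}

The deny pattern is matched against the absolute URL of each extracted link, so any "next" link pointing at page 6 or later is dropped and the crawl stops after page 5.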