Scrapy: why does my scraping script print the first item multiple times before it works correctly?

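Beginner here, trying to learn web scraping and Python in general.
After going through some basic web scraping tutorials, I have been building a spider that scrapes the product names ("special_offers") and URLs from this site: https://m.alibaba.com/sitemap/showroom/showroom-A.html. The code is below: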

import scrapy

class SpecialOffersSpider(scrapy.Spider):
    name = "special_offers"
    allowed_domains = ["m.alibaba.com"]
    start_urls = ["https://m.alibaba.com/sitemap/showroom/showroom-A.html"]

    def parse(self, response):
        for product in response.xpath("//ul[@class='link-container']/li"):
            yield{
                'title': product.xpath(".//li/a/text()").get(),
                'url': response.urljoin(product.xpath(".//li/a/@href").get())
            }

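The spider runs without errors when debugging, but I only get results for the titles, and the URLs are all wrong.
I tried changing the XPath expressions (for example removing the ., or starting with //a), but I could not produce any correct output.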

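First you search for .../li and then, inside the loop, for .//li/a/..., which together gives .../li//li/a/... That is one li too many: you are searching for an li inside an li, and that is the problem.
Inside the loop you should search for .//a/... instead of .//li/a/...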
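
The effect of the doubled li can be reproduced offline. What follows is only a minimal sketch: the HTML fragment is a made-up, well-formed stand-in for the sitemap list (the real page may be structured differently), and the standard library's xml.etree.ElementTree, which supports a subset of XPath, is used in place of Scrapy's selectors so that the snippet runs without a network connection.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may differ.
HTML = """
<div>
  <ul class="link-container">
    <li><a href="/showroom/abc.html">abc</a></li>
    <li><a href="/showroom/abd.html">abd</a></li>
  </ul>
</div>
"""

root = ET.fromstring(HTML)
items = root.findall(".//ul[@class='link-container']/li")

# Relative to each li, ".//li/a" looks for ANOTHER li nested below it.
nested = [li.find(".//li/a") for li in items]

# ".//a" looks for the link anywhere below the current li.
direct = [li.find(".//a") for li in items]

print([a is not None for a in nested])  # [False, False]
print([a.text for a in direct])         # ['abc', 'abd']
```

On this fragment the nested query finds nothing at all, while the shorter query returns one link per li.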
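I found another problem: there are empty li elements between the other elements to create spacing. This can be handled in one of three ways:

check whether title is not None
select only the li elements that contain an a, for example .../li[a]
select .../li/a directly and then use .//text() and .//@href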
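
The three options above can be compared side by side. This is a sketch under assumptions: the markup (including the empty spacer li) is invented for illustration, the helper function names are hypothetical, and xml.etree.ElementTree stands in for Scrapy's selectors so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with an empty spacer <li>; the real page may differ.
HTML = """
<div>
  <ul class="link-container">
    <li><a href="/showroom/abc.html">abc</a></li>
    <li></li>
    <li><a href="/showroom/abd.html">abd</a></li>
  </ul>
</div>
"""

root = ET.fromstring(HTML)
LIST = ".//ul[@class='link-container']"


def skip_none():
    # 1. Select every li, then skip the ones that have no link.
    rows = []
    for li in root.findall(f"{LIST}/li"):
        a = li.find(".//a")
        if a is not None:
            rows.append((a.text, a.get("href")))
    return rows


def only_li_with_a():
    # 2. Select only the li elements that have an a child: li[a].
    return [(li.find("a").text, li.find("a").get("href"))
            for li in root.findall(f"{LIST}/li[a]")]


def select_a_directly():
    # 3. Select the a elements themselves: li/a.
    return [(a.text, a.get("href")) for a in root.findall(f"{LIST}/li/a")]


print(skip_none())
print(only_li_with_a() == skip_none() == select_a_directly())
```

All three return the same two rows here; the second and third push the filtering into the XPath expression, so the loop body needs no None check.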

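I also added code that looks for the Next button in order to load the next page, because there is another problem there: the class attribute of this button has a space at the end. For Scrapy's XPath this space matters: a test such as @class='...' compares the whole attribute string, including the space, and treats several classes as one single string (in BeautifulSoup, or with a CSS class selector, you would use the class name without this space, and multiple classes are treated as separate classes).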
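
The difference between comparing the whole class string and comparing individual class names can be seen in a small offline sketch. The markup and the href below are assumptions made for illustration, and xml.etree.ElementTree is again used instead of Scrapy. In Scrapy itself, a CSS selector along the lines of response.css("a.btn-pagination-next::attr(href)") should match by class name and so avoid the trailing-space issue, though that has not been checked against the live page here.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; note the space before the closing quote.
HTML = """
<div>
  <a class="btn-pagination-next " href="/sitemap/showroom/showroom-A_2.html">Next</a>
</div>
"""

root = ET.fromstring(HTML)

# Exact string comparison: the trailing space has to be part of the value.
with_space = root.find(".//a[@class='btn-pagination-next ']")
without_space = root.find(".//a[@class='btn-pagination-next']")

# Token comparison: split the attribute into class names first.
by_token = [a for a in root.iter("a")
            if "btn-pagination-next" in a.get("class", "").split()]

print(with_space is not None)     # True
print(without_space is not None)  # False
print(len(by_token))              # 1
```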
Full working code, using .../li[a].
You can put all the code in a single file script.py and run it with python script.py, without creating a project.

#!/usr/bin/env python3

import scrapy

class SpecialOffersSpider(scrapy.Spider):

    name = "special_offers"
    
    allowed_domains = ["m.alibaba.com"]
    start_urls = ["https://m.alibaba.com/sitemap/showroom/showroom-A.html"]

    def parse(self, response):
        for product in response.xpath("//ul[@class='link-container']/li[a]"):
            yield {
                'title': product.xpath(".//a/text()").get(),
                'url': response.urljoin(product.xpath(".//a/@href").get())
            }

        # find the `Next` button to get the URL of the next page
        # warning: this button's class has a space at the end

        next_page = response.xpath("//a[@class='btn-pagination-next ']/@href").get()
        print(f'NEXT PAGE: {next_page}')

        if next_page:
            yield response.follow(next_page)  # sends a `Request` for the next page; the callback defaults to `parse`
            
# --- run without creating project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})

c.crawl(SpecialOffersSpider)
c.start()
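
Both response.urljoin() and response.follow() turn a scraped href into an absolute URL by resolving it against the URL of the current page. As a rough offline illustration of that kind of resolution, the sketch below uses urllib.parse.urljoin from the standard library; the href values are hypothetical examples, not taken from the real page.

```python
from urllib.parse import urljoin

page_url = "https://m.alibaba.com/sitemap/showroom/showroom-A.html"

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    "/showroom/abc.html",                       # root-relative
    "showroom-A_2.html",                        # relative to the current directory
    "//m.alibaba.com/showroom/abd.html",        # scheme-relative
    "https://m.alibaba.com/showroom/abe.html",  # already absolute
]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Every result is a full https://m.alibaba.com/... URL, which is why the spider can yield the joined value directly.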

Version using .../li/a:

    def parse(self, response):
        for product in response.xpath("//ul[@class='link-container']/li/a"):
            yield {
                'title': product.xpath(".//text()").get(),
                'url': response.urljoin(product.xpath(".//@href").get())
            }


EDIT:

You can use a list comprehension to generate the start_urls list.

class SpecialOffersSpider(scrapy.Spider):
    # every URL needs a scheme (https://), otherwise Scrapy rejects it
    start_urls = [f"https://m.alibaba.com/sitemap/showroom/showroom-A_{x}.html" for x in range(1, 4)]

You can also add two lists together to create one list of URLs.

class SpecialOffersSpider(scrapy.Spider):

    # every URL needs a scheme (https://), otherwise Scrapy rejects it
    start_urls = ["https://m.alibaba.com/sitemap/showroom/showroom-A.html"] + [f"https://m.alibaba.com/sitemap/showroom/showroom-A_{x}.html" for x in range(1, 4)]
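
To check what such an expression produces before handing it to the spider, the list can be built and printed on its own. This is a small sketch: BASE, first_page and numbered_pages are names introduced here for readability, and whether the site really serves pages with these numbered names is taken from the answer, not verified.

```python
BASE = "https://m.alibaba.com/sitemap/showroom"

first_page = [f"{BASE}/showroom-A.html"]

# range(1, 4) yields 1, 2 and 3; the upper bound is excluded.
numbered_pages = [f"{BASE}/showroom-A_{x}.html" for x in range(1, 4)]

start_urls = first_page + numbered_pages
for url in start_urls:
    print(url)
```

The result is four URLs: the unnumbered page followed by the pages numbered 1 to 3.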

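Finally, you can use the start_requests() method to generate the list of requests yourself. It lets you pass extra arguments to each Request (for example, use SeleniumRequest, add dont_filter=True, or send extra data in meta=...).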

class SpecialOffersSpider(scrapy.Spider):

    def start_requests(self):
        # URLs need an explicit scheme, otherwise Scrapy raises "Missing scheme in request url"
        requests = [scrapy.Request("https://m.alibaba.com/sitemap/showroom/showroom-A.html")]

        for x in range(1, 4):
            requests.append(
                scrapy.Request(f"https://m.alibaba.com/sitemap/showroom/showroom-A_{x}.html")
            )

        return requests
