Scrapy: yielded Request is not going through

5q4ezhmt  posted on 2022-11-09  in  Other

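The crawl appears to ignore and/or never execute the line yield scrapy.Request(property_file, callback=self.parse_property). The first scrapy.Request, in def start_requests, goes through and its callback runs correctly, but the one in def parse_navpage does not, as shown below.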

import scrapy

class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")

        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url) #Works

        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        print("BEFORE YIELD")
        yield scrapy.Request(property_file, callback=self.parse_property) #Not going through
        print("AFTER YIELD")

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")

Running scrapy crawl scrape_zoopla on the command line returns:

2022-09-10 20:38:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html> (referer: None)
BEFORE YIELD
AFTER YIELD
2022-09-10 20:38:24 [scrapy.core.engine] INFO: Closing spider (finished)

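Both requests are for local files, and only the first one works. The files exist and display the pages correctly; if one of them did not exist, the crawler would return a "No such file or directory" error and would likely be interrupted. Here the crawler seems to pass right by the request without ever processing it, and it returns no error. What is the error here?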


xoefb8l8  #1

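This is a complete shot in the dark, but you could try sending both requests from your start_requests method. Honestly, I don't see why this would work, but it might be worth a try.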

import scrapy

class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")

        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url) #Works

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")

Update

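It suddenly dawned on me why this happens. You have set the allowed_domains attribute, but the requests you are making go to your local file system, which naturally does not match any of the allowed domains.
Scrapy assumes that all initial URLs sent from start_requests are permitted, so it performs no verification on them, but every request yielded from a subsequent parse method is checked against the allowed_domains attribute. A request that fails the check is dropped without raising an error, which is why the callback never runs.
Just remove the allowed_domains line from the top of the spider class and the original structure should work fine.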
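To illustrate why a file:// URL fails this kind of check, here is a minimal sketch that uses only the standard library. It is a simplified stand-in for a domain filter, not Scrapy's actual code: the helper names build_host_regex and is_allowed are hypothetical, and the https URL is made up for the example.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Hypothetical helper: regex matching each allowed domain and its subdomains."""
    if not allowed_domains:
        # No restriction configured: the empty pattern matches every host.
        return re.compile("")
    domains = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(rf"^(.*\.)?({domains})$")


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host passes the domain check."""
    # file:// URLs carry no network location, so hostname is None here.
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


property_file = "file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"

# The local file has an empty host, so it cannot match 'zoopla.co.uk'.
print(is_allowed(property_file, ["zoopla.co.uk"]))
# With no allowed domains configured, nothing is filtered.
print(is_allowed(property_file, []))
# An illustrative on-site URL (assumed for this example) passes the check.
print(is_allowed("https://www.zoopla.co.uk/for-sale/", ["zoopla.co.uk"]))
```

If you want to keep allowed_domains for the real crawl, another option that should work is passing dont_filter=True to the scrapy.Request for the local file, since the offsite check skips requests carrying that flag. Be aware that this also disables duplicate filtering for that request.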
