scrapy-playwright: Execute a CrawlSpider using scrapy-playwright

Posted by mi7gmzs6 on 2022-11-09 in Other


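Is it possible to execute a CrawlSpider using the Playwright integration for Scrapy (scrapy-playwright)? I am trying the following script to run a CrawlSpider, but it does not scrape anything. It does not show any error either!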

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }

scrapy

Source: https://stackoverflow.com/questions/71459599/scrapy-playwright-execute-crawlspider-using-scrapy-playwright

2 Answers


Answer #1 (b91juud3)

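Requests extracted by the rule do not carry the playwright=True meta key, which is a problem if those pages need to be rendered by the browser to have useful content. You can solve that by using Rule.process_request, something like: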

def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )
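
The hook itself is plain Python, so its effect can be checked without running a crawl. The following is a minimal sketch under stated assumptions: `fake_request` and `fake_response` are hypothetical stand-ins built with `SimpleNamespace` (not real Scrapy objects), and the example.com URLs are placeholders.

```python
from types import SimpleNamespace


def set_playwright_true(request, response):
    # Flag the extracted request so that scrapy-playwright handles it.
    request.meta["playwright"] = True
    return request


# Hypothetical stand-ins: any object with a `meta` dict is enough for this check.
fake_request = SimpleNamespace(url="https://example.com/item/1", meta={})
fake_response = SimpleNamespace(url="https://example.com/list")

processed = set_playwright_true(fake_request, fake_response)
print(processed.meta)
```

The function mutates the request in place and returns the same object, which is what lets the rule pass the flagged request on for download.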

Update after comments

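1. Make sure your URL is correct; I get no results for that particular one (remove /page?).
2. Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser.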

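Unless explicitly marked otherwise (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as an implicit first argument. The convention is to call it self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either by passing the bound method (where self is in scope) or by passing the method name as a string: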
Rule(..., process_request=self.set_playwright_true)
Rule(..., process_request="set_playwright_true")

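From the docs: process_request is a callable (or a string, in which case a method from the spider object with that name will be used).
My original example defines the processing function outside of the spider, so it is not an instance method.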
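
To illustrate the "callable or string" behaviour quoted from the docs, here is a small sketch that only mimics the idea. `FakeRule`, `FakeSpider` and `resolve` are hypothetical names invented for this illustration and are not Scrapy's actual implementation; requests are modelled as plain dicts.

```python
class FakeRule:
    """Hypothetical stand-in for a rule that stores a process_request hook."""

    def __init__(self, process_request):
        self.process_request = process_request

    def resolve(self, spider):
        # A string is looked up as a method name on the spider;
        # anything else is assumed to already be callable.
        hook = self.process_request
        if isinstance(hook, str):
            hook = getattr(spider, hook)
        return hook


class FakeSpider:
    def set_playwright_true(self, request, response):
        # Instance method: `self` is supplied automatically once bound.
        request["meta"]["playwright"] = True
        return request


def set_playwright_true(request, response):
    # Module-level function: no `self` parameter.
    request["meta"]["playwright"] = True
    return request


spider = FakeSpider()
by_name = FakeRule("set_playwright_true").resolve(spider)
by_function = FakeRule(set_playwright_true).resolve(spider)

request_a = by_name({"meta": {}}, None)
request_b = by_function({"meta": {}}, None)
print(request_a, request_b)
```

Both forms end up being called with two arguments (request, response). The string form works for instance methods because the lookup on the spider returns a bound method.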


Answer #2 (g52tjvyc)

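In addition to what elacuesta suggested, I would only add changing the "parse_item" definition from async def to a standard def:
def parse_item(self, response):
This goes against everything I have read, but it got me through.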
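
One possible explanation, offered as an assumption rather than a confirmed diagnosis: an async def that contains yield returns an async generator, which cannot be consumed by a plain for loop or list(). If the Scrapy version in use iterated the rule callback's output synchronously, an async parse_item would not produce items that way. The sketch below shows only the Python-level difference; `sync_callback`, `async_callback` and `collect` are hypothetical names, and a string stands in for the response.

```python
import asyncio
import inspect


def sync_callback(response):
    # A regular generator: usable with a plain for loop or list().
    yield {"Links": response}


async def async_callback(response):
    # An async generator: must be consumed with `async for`.
    yield {"Links": response}


sync_result = sync_callback("https://example.com/item/1")
async_result = async_callback("https://example.com/item/1")

is_sync_generator = inspect.isgenerator(sync_result)
is_async_generator = inspect.isasyncgen(async_result)

sync_items = list(sync_result)

try:
    list(async_result)  # plain iteration is not supported
    plain_iteration_failed = False
except TypeError:
    plain_iteration_failed = True


async def collect(agen):
    # Consume an async generator into a list.
    return [item async for item in agen]


async_items = asyncio.run(collect(async_callback("https://example.com/item/1")))
print(sync_items, async_items, plain_iteration_failed)
```

Both callbacks describe the same item; the difference lies only in how the caller must iterate over them.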
