Scrapy callback not executed when using Playwright for JavaScript rendering

Posted by von4xj4u on 2023-04-21 in Java

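I'm using Scrapy with the Playwright plugin (scrapy-playwright) to scrape a website that depends on JavaScript rendering. My spider has two async callbacks, parse_categories and parse_product_page.
parse_categories looks for categories on the page and sends each one back to the parse_categories callback, until it reaches a page with no categories, which should be a product page. When no categories are found, it should send a request to the parse_product_page callback instead.
However, when execution reaches the else block in parse_categories, the request to parse_product_page never seems to be made. I have confirmed that the code enters the else block, but the print statement in parse_product_page is never reached.
Here is my reprex (in it, parse_urls plays the role of parse_categories and parse plays the role of parse_product_page):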

import scrapy
from scrapy_playwright.page import PageMethod


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/js/",
            callback=self.parse_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "body > div > nav > ul > li > a"),
                ],
            ),
        )

    async def parse_urls(self, response):
        # Close the Playwright page once the rendered response is available.
        page = response.meta["playwright_page"]
        await page.close()

        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()

        if next_page_url:
            print("Inside if block")
            url = "https://quotes.toscrape.com" + next_page_url
            # Follow the pagination link with the same callback.
            yield scrapy.Request(
                url=url,
                callback=self.parse_urls,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "body > div > div.quote"),
                    ],
                ),
            )
        else:
            print("Next page link not found")
            # Request the current (last) page again, this time with a different callback.
            yield scrapy.Request(
                url=response.request.url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "body > div > div.quote"),
                    ],
                ),
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        print("Function has been called, because next page link not found")

Here is the log from the reprex:

Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Inside if block
Next page link not found
2023-04-11 09:47:04 [root] WARNING: spider quotes finished crawling

Answer #1 (okxuctiv)

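This was fixed by adding the argument dont_filter=True to the scrapy.Request yielded in the else block. That request targets response.request.url, a URL the spider has already requested, so Scrapy's duplicate filter drops it before it is ever downloaded and the callback never runs. Setting dont_filter=True lets the request bypass that filter.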

else:
    yield scrapy.Request(
        url=response.request.url,
        callback=self.parse,
        dont_filter=True,  # bypass the duplicate filter for this already-requested URL
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod("wait_for_selector", "body > div > div.quote"),
            ],
        ),
    )
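
To illustrate why the original request disappeared, here is a minimal standard-library sketch of how a duplicate filter of this kind behaves. It is a simplified model, not Scrapy's actual implementation: the class SimpleDupeFilter, its methods, and the last-page URL are hypothetical names chosen for illustration. The assumption it encodes is that the request fingerprint is derived from the method, URL, and body, and that the callback plays no part in it.

```python
import hashlib


class SimpleDupeFilter:
    """Toy duplicate filter (hypothetical; a simplified model, not Scrapy's class)."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url, body=b""):
        # The callback is deliberately NOT part of the fingerprint:
        # two requests for the same URL look identical to the filter.
        digest = hashlib.sha1()
        digest.update(method.encode("utf-8"))
        digest.update(url.encode("utf-8"))
        digest.update(body)
        return digest.hexdigest()

    def should_schedule(self, method, url, dont_filter=False):
        # Requests flagged dont_filter skip the duplicate check entirely.
        if dont_filter:
            return True
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return False  # duplicate: dropped, so its callback never runs
        self.seen.add(fp)
        return True


dupefilter = SimpleDupeFilter()
# Illustrative URL standing in for the last page reached by parse_urls.
last_page = "https://quotes.toscrape.com/js/page/10/"

first = dupefilter.should_schedule("GET", last_page)    # request handled by parse_urls
second = dupefilter.should_schedule("GET", last_page)   # same URL, callback=self.parse
third = dupefilter.should_schedule("GET", last_page, dont_filter=True)

print(first, second, third)  # True False True
```

In this model the second request is rejected even though it names a different callback, which matches the log above: the spider finishes right after "Next page link not found". Only the request flagged with dont_filter gets through.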
