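I'm trying to use scrapy-playwright to extract some data from a website that loads its content dynamically with JavaScript, but I'm stuck at the very beginning.
The trouble I'm running into is in my settings.py file, as follows:
# Playwright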
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
#TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
#ASYNCIO_EVENT_LOOP = 'uvloop.Loop'
When I inject the following scrapy-playwright handler:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
then I get:
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor
(twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)
When I inject TWISTED_REACTOR:
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
then I get:
raise TypeError(
TypeError: SelectorEventLoop required, instead got: <ProactorEventLoop running=False closed=False debug=False>
After that, when I inject ASYNCIO_EVENT_LOOP,
then I get:
ModuleNotFoundError: No module named 'uvloop'
Finally, I was unable to install 'uvloop':
pip install uvloop
Script
import scrapy
from scrapy_playwright.page import PageCoroutine


class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        pass
        # parses content
2 Answers
f3temu5u1#
The developers of scrapy_playwright recommend setting DOWNLOAD_HANDLERS and TWISTED_REACTOR in your script. A similar comment is provided here.
Below is a working script that does this:
We get the following output:
{"product": "Oxford Loafers"}
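For reference, here is a minimal sketch of what setting both values on the spider itself could look like. This is not the answerer's script. `playwright_custom_settings` is a hypothetical helper name chosen for illustration; the dict it returns is meant to be assigned to the spider's `custom_settings` class attribute, and it assumes scrapy and scrapy-playwright are installed when the spider actually runs.

```python
def playwright_custom_settings():
    """Build per-spider settings that route requests through scrapy-playwright.

    Hypothetical helper: the name is illustrative, not part of any library.
    """
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        # Same handler for both schemes, as in the question's settings.py.
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        # The handler needs the asyncio-based reactor rather than the default one.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }


# Intended use inside the spider from the question (requires scrapy installed):
#
#     class ProductSpider(scrapy.Spider):
#         name = "product"
#         custom_settings = playwright_custom_settings()

if __name__ == "__main__":
    # Print the settings so they can be compared against settings.py.
    for key, value in playwright_custom_settings().items():
        print(key, "=", value)
```

Keeping the two settings in one place makes it harder to enable the download handler without the matching reactor, which is the combination that produced the first error in the question.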
ckocjqey2#
If you are on Windows, your problem is that scrapy-playwright does not support Windows. See https://github.com/scrapy-plugins/scrapy-playwright/issues/154
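The TypeError in the question is consistent with this: on Windows, Python 3.8 and later create a ProactorEventLoop by default, while the asyncio reactor asks for a SelectorEventLoop. Here is a quick diagnostic sketch, using only the standard library, to see which kind of loop your interpreter creates by default. `probe_default_loop` is an illustrative name, not part of Scrapy or Playwright.

```python
import asyncio
import sys


def probe_default_loop():
    """Return (class name, is_selector_loop) for asyncio's default event loop."""
    loop = asyncio.new_event_loop()
    try:
        name = type(loop).__name__
        is_selector = isinstance(loop, asyncio.SelectorEventLoop)
        return name, is_selector
    finally:
        loop.close()  # the loop is only a probe, so close it right away


if __name__ == "__main__":
    name, is_selector = probe_default_loop()
    print("platform:", sys.platform)
    print("default event loop:", name)
    print("selector-based:", is_selector)
```

If this reports a ProactorEventLoop, you are most likely in the situation the linked issue describes.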