I'm trying to scrape a page, and I want to wait until a string is detected in a script element before returning the page's HTML.
Here is my MRE scraper:
from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod


class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_page_methods=[
                    PageMethod(
                        method="wait_for_selector",
                        selector="//script[contains(text(), 'WKM03Vff')]",
                        timeout=5000,
                    ),
                ],
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print("I've loaded the page ready to parse!!!")


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()
This results in the following error:
playwright._impl._api_types.TimeoutError: Timeout 5000ms exceeded.
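My understanding is that this happens because the script element contains multiple text nodes and my XPath only selects the first one. Since the string I'm looking for is in a later node, I get the TimeoutError.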
This answer provides a neat solution, but Scrapy does not support XPath 2.0, so when I use:
"string-join(//script/text()[normalize-space()], ' ')"
I get the following error:
playwright._impl._api_types.Error: Unexpected token "string-join(" while parsing selector "string-join(//script/text()[normalize-space()], ' ')"
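An alternative is given in the comments on that answer, but I'm worried that the number of text nodes may vary.
After some fairly intensive googling, I don't think there is a robust XPath solution. Is there a CSS equivalent, though? I tried: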
"script:has-text('WKM03Vff')"
But this again results in a Timeout exception.
1 Answer
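As I mentioned in the comments, script tags usually don't require any waiting, because they don't need to be rendered.
You should be able to access their contents directly from the parse method.
For example: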
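As a minimal sketch of that idea (not the answerer's original code): the helper below, `scripts_containing`, is a hypothetical name, and it uses only the standard library's `html.parser` so that it runs without a browser or network access. It assumes the string is already present in the HTML that Scrapy hands to `parse`. Inside a spider, the more usual route would probably be a selector such as `response.xpath("//script[contains(., 'WKM03Vff')]/text()").get()`, where `.` tests the element's whole string value rather than a single text node.

```python
from html.parser import HTMLParser


class _ScriptCollector(HTMLParser):
    """Collect the full text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self._chunks = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces; keep all of them.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def scripts_containing(html, needle):
    """Return the text of each <script> whose content contains `needle`.

    Hypothetical helper for illustration; in a spider it could be called
    as scripts_containing(response.text, "WKM03Vff") inside parse().
    """
    collector = _ScriptCollector()
    collector.feed(html)
    collector.close()
    return [text for text in collector.scripts if needle in text]
```

Because the pieces of each script are joined before the substring check, the result does not depend on how many text chunks the script was split into. An empty list would suggest the string is injected later by JavaScript, in which case some form of waiting may still be needed.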
Partial output