scrapy: finding a string in the text of a script element

Asked by uqdfh47h on 2023-10-20

I am trying to scrape a page, and I want to wait until a string is detected in a script element before returning the page's HTML.
Here is my minimal reproducible example (MRE) spider:

from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod

class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_page_methods=[
                    PageMethod(
                        method="wait_for_selector",
                        selector="//script[contains(text(), 'WKM03Vff')]",
                        timeout=5000,
                    ),
                ],
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print("I've loaded the page ready to parse!!!")

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()

This results in the following error:

playwright._impl._api_types.TimeoutError: Timeout 5000ms exceeded.

My understanding is that this happens because the script element contains multiple text nodes, and contains(text(), …) only tests the first one. Since the string I am looking for is in a later node, I get the TimeoutError.
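The first-text-node behaviour described above can be demonstrated with lxml (the XPath engine Scrapy itself builds on). In XPath 1.0, passing a node-set like text() to contains() converts only the first node to a string, whereas contains(., …) uses the element's full string value. A minimal sketch with synthetic markup (the comment splits the script contents into two text nodes):

```python
from lxml import etree

# A <script> whose contents are split into two text nodes by a comment.
doc = etree.fromstring("<root><script>header<!--split-->WKM03Vff</script></root>")

# contains(text(), …) only checks the FIRST text node ("header") -> no match.
print(doc.xpath("//script[contains(text(), 'WKM03Vff')]"))  # []

# contains(., …) checks the element's full string value -> matches.
print(doc.xpath("//script[contains(., 'WKM03Vff')]"))
```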
This answer offers a neat solution, but Scrapy doesn't support XPath 2.0, so when I use:

"string-join(//script/text()[normalize-space()], ' ')"

I get the following error:

playwright._impl._api_types.Error: Unexpected token "string-join(" while parsing selector "string-join(//script/text()[normalize-space()], ' ')"

A comment on that answer gives an alternative, but I am worried that the number of text nodes may change.
After some fairly intensive googling, I don't think there is a robust solution. Is there a CSS equivalent, though? I tried:

"script:has-text('WKM03Vff')"

but that also results in a Timeout exception.
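[Editor's note, not from the original thread: if a wait really is required, Playwright's wait_for_function polls a JavaScript predicate, which sidesteps the XPath/CSS selector engines entirely. A hedged sketch of a PageMethod that could replace the wait_for_selector call above, untested against the live site:]

```python
from scrapy_playwright.page import PageMethod

# Hypothetical alternative: poll a JS predicate over all <script> elements
# instead of matching an XPath/CSS selector against individual text nodes.
wait_for_string = PageMethod(
    "wait_for_function",
    "() => [...document.scripts].some(s => s.textContent.includes('WKM03Vff'))",
    timeout=5000,
)
```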

mcdcgff0

As I mentioned in the comments, script tags generally don't require any waiting, because they don't need to be rendered.
You should be able to access their contents directly from the parse method.
For example:

from scrapy import Request, Spider
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod

class FlashscoreSpider(Spider):
    name = "flashscore"
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    def start_requests(self):
        yield Request(
            url="https://www.flashscore.com/match/WKM03Vff/#/match-summary/match-summary",
            meta=dict(
                dont_redirect=True,
                playwright=True,
                playwright_include_page=True),
            callback=self.parse,
        )

    def parse(self, response):
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]"))
        print(response.xpath("//script[contains(text(), 'WKM03Vff')]/text()").get())
        print("I've loaded the page ready to parse!!!")

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FlashscoreSpider)
    process.start()

Partial output:

2023-09-13 00:07:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET 
https://cdn.cookielaw.org/scripttemplates/202210.1.0/assets/otCommonStyles.css> 
(resource type: fetch, referrer: https://www.flashscore.com/)
[<Selector query="//script[contains(text(), 'WKM03Vff')]" 
data='<script>\n\t\t\twindow.environment = {"ev...'>]

                        window.environment = {"event_id_c":"WKM03Vff",
"eventStageTranslations":{"1":"&nbsp;","45":"To finish","42":"Awaiting 
updates","2":"Live","17": "Set 1","18":"Set 2","19":"Set 3","20":"Set 
4","21":"Set 5","47":"Set 1 - Tiebreak","48":"Set 2 - Tiebreak","49":"Set 3 - 
Tiebreak","50":"Set 4 - Tiebreak","51":"Set 5 - Tiebreak","46":"Break 
Time","3":"Finished",....p10:100","port":443,"sslEnabled":true,"namespace":"\/f
s\/fs3_","projectId":2,"enabled":false},"project_id":2};

I've loaded the page ready to parse!!!
2023-09-13 00:07:02 [scrapy.core.engine] INFO: Closing spider (finished)
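Once the script text is in hand, the window.environment payload can be pulled out with a regular expression and parsed as JSON. A minimal sketch using a synthetic excerpt of the output above (the event_id_c and project_id fields appear in the real output; everything else here is trimmed for illustration):

```python
import json
import re

# Synthetic excerpt of the <script> text shown in the output above.
script_text = 'window.environment = {"event_id_c":"WKM03Vff","project_id":2};'

# Capture the object literal assigned to window.environment, then parse it.
match = re.search(r"window\.environment\s*=\s*(\{.*\})\s*;", script_text, re.DOTALL)
env = json.loads(match.group(1))
print(env["event_id_c"])  # WKM03Vff
```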
