Scrape data from a gtag function in a script tag using Scrapy

Posted by brjng4g3 on 2022-11-09 in Other

I am scraping a website whose script tag contains the following code:

<script type="text/javascript">
        window.dataLayer = window.dataLayer || [];
          function gtag(){dataLayer.push(arguments);}
          gtag('js', new Date());

          gtag('set', 'content_group1', 'World');
          gtag('set', 'content_group2', 'AFP');
          gtag('config', 'UA-40396753-1', {
            'custom_map': {"dimension6":"Id","dimension1":"Category","dimension3":"Author","dimension5":"PublishedDate"}
          });              
          gtag('event', 'custom', {"Id":"news\/1696246","Category":"World","Categories":"World","Author":"AFP-119","Authors":"AFP","PublishedDate":"2022-06-23 07:08:42"});
</script>

I need to scrape the value "PublishedDate":"2022-06-23 07:08:42". How can I do this with Scrapy? This is what I have tried:

time = response.xpath('//script[@type="text/javascript"]/text()').re(r"gtag\('event', 'custom', ({.*})\);")
json_data = json.loads(time, strict=False)

print('dawn time::', json_data['PublishedDate'])

However, I am not getting any result.

scrapy

Source: https://stackoverflow.com/questions/72759756/scrape-a-data-from-a-gtag-function-in-a-script-tag-using-scrapy

2 Answers

oknwwptz1#

I solved this in a simpler way:

time = response.xpath('//meta[@property="article:published_time"]/@content')[0].extract()

This works because the field I need is also exposed in a corresponding meta tag.
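Note that the `[0]` index raises `IndexError` when the page has no such tag; in Scrapy, calling `.get()` on the selector list returns `None` instead. To illustrate the same lookup without a running spider, here is a rough sketch using the standard library `html.parser`. The sample HTML and its timestamp value are made up for illustration, and the names `MetaPropertyParser` and `find_meta_content` are hypothetical:

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the real page's timestamp format may differ.
SAMPLE_HTML = """
<html><head>
<meta property="og:type" content="article">
<meta property="article:published_time" content="2022-06-23T07:08:42+05:00">
</head><body></body></html>
"""


class MetaPropertyParser(HTMLParser):
    """Collect the content attribute of the first matching <meta property=...>."""

    def __init__(self, wanted_property):
        super().__init__()
        self.wanted_property = wanted_property
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Ignore non-meta tags and stop looking once a match was found.
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == self.wanted_property:
            self.content = attributes.get("content")


def find_meta_content(html, wanted_property):
    """Return the meta tag's content string, or None when the tag is absent."""
    parser = MetaPropertyParser(wanted_property)
    parser.feed(html)
    return parser.content


print(find_meta_content(SAMPLE_HTML, "article:published_time"))
print(find_meta_content(SAMPLE_HTML, "article:modified_time"))  # tag absent
```

Returning `None` for a missing tag lets the caller decide whether to skip the item or fall back to another source, such as the gtag payload in the script tag.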

Posted 2022-11-09

azpvetkf2#

Use a regex on the selector to extract the data, then parse each match with json.loads().

import scrapy
import json

class ExampleSpider(scrapy.Spider):
    name = "example"

    start_urls = ['file:///PathToFile/temp.html']

    def parse(self, response):
        all_data = response.xpath('//script[@type="text/javascript"]/text()').re(r"gtag\('event', 'custom', ({.*})\);")
        for data in all_data:
            data = json.loads(data)
            yield {'PublishedDate': data['PublishedDate']}

Output:

[scrapy.core.scraper] DEBUG: Scraped from <200 file:///PathToFile/temp.html>
{'PublishedDate': '2022-06-23 07:08:42'}
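
The attempt in the question most likely fails because `.re()` returns a list of strings, while `json.loads()` expects a single string; looping over the matches, as in the spider above, avoids that. Below is a minimal sketch of the same regex-plus-JSON idea using only the standard library `re` and `json` modules, so it can be run without Scrapy. The helper name `extract_gtag_event` and the shortened `SCRIPT_TEXT` sample are hypothetical and only for illustration:

```python
import json
import re

# Shortened copy of the script body from the question (illustrative sample).
SCRIPT_TEXT = r"""
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('event', 'custom', {"Id":"news\/1696246","Category":"World","Author":"AFP-119","PublishedDate":"2022-06-23 07:08:42"});
"""

# Non-greedy capture so the match stops at the first "});" after the payload.
GTAG_EVENT_RE = re.compile(r"gtag\('event',\s*'custom',\s*(\{.*?\})\);")


def extract_gtag_event(script_text):
    """Return the JSON payload of the gtag 'custom' event as a dict, or None."""
    match = GTAG_EVENT_RE.search(script_text)
    if match is None:
        return None
    # match.group(1) is a single string, which is what json.loads() expects.
    return json.loads(match.group(1))


payload = extract_gtag_event(SCRIPT_TEXT)
if payload is not None:
    print(payload["PublishedDate"])
```

Inside a spider, `response.xpath(...).re_first(pattern)` returns the first match as a string (or `None` when nothing matches), which can be passed to `json.loads()` in the same way.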

Posted 2022-11-09
