Scrapy: how to scrape all visible text but exclude text on hyperlinks?

Posted by lndjwyie on 2022-11-09 in Other


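I am interested in all the visible text of a website.
The one catch: I want to exclude hyperlink text. That way I can drop the text in the menu bars, since menu entries are usually links. In the screenshot, everything in the menu bar (for example "Wohnen & Bauen") could be excluded.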
Example page: https://www.gross-gerau.de/B%C3%BCrger-Service/Ver-und-Entsorgung/Abfallinformationen/index.php?object=tx,2289.12976.1&NavID=3411.60&La=1
Anyway, my spider looks like this:

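import time

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# deny_list_sm (URL patterns to skip) is defined elsewhere and not shown here


class MySpider(CrawlSpider):
    name = 'my_spider'

    start_urls = ['https://www.gross-gerau.de/B%C3%BCrger-Service/Wohnen-Bauen/']

    rules = (
        Rule(LinkExtractor(allow="B%C3%BCrger-Service", deny=deny_list_sm),
             callback='parse_item', follow=True),
    )

    # Named parse_item rather than parse: the Scrapy docs advise against
    # using parse as a rule callback in a CrawlSpider.
    def parse_item(self, response):
        item = {}
        item['scrape_date'] = int(time.time())
        item['response_url'] = response.url

        # old approach: every text node on the page
        # item["text"] = " ".join([x.strip() for x in response.xpath("//text()").getall()]).strip()
        # exclude at least JavaScript snippets and the contents of <head>
        item["text"] = " ".join([x.strip() for x in response.xpath("//*[name(.)!='head' and name(.)!='script']/text()").getall()]).strip()

        yield item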

The solution should work for other websites as well. Does anyone know how to solve this? Any ideas are welcome!

scrapy

Source: https://stackoverflow.com/questions/71967820/how-to-scrape-all-visible-text-but-exlude-text-written-on-hyperlinks

1 Answer


5f0d552i1#

You can extend the predicate to

[name()!='head' and name()!='script' and name()!='a']

Answered on 2022-11-09
