Get all link text and href in a page using Scrapy

Posted by q5iwbnjs on 2022-11-09 in Other


class LinkSpider(scrapy.Spider):
    name = "link"
    def start_requests(self):
        urlBasang = "https://bloomberg.com"
        yield scrapy.Request(url = urlBasang, callback = self.parse)
    def parse(self, response):
        newCsv = open('data_information/link.csv', 'a')
        for j in response.xpath('//a'):

            title_to_save = j.xpath('/text()').extract_first()
            href_to_save= j.xpath('/@href').extract_first()

            print("test")

            print(title_to_save)
            print(href_to_save)

            newCsv.write(title_to_save+ "\n")
        newCsv.close()

This is my code, but title_to_save and href_to_save both return None.
I want to get the text of every "a" tag together with its href.

scrapy

Source: https://stackoverflow.com/questions/58021794/get-all-link-text-and-href-in-a-page-using-scrapy

1 Answer


8yoxcaq71#

You want:

title_to_save = j.xpath('./text()').get()
href_to_save= j.xpath('./@href').get()

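Note the dot at the start of each path. Without it, '/text()' and '/@href' are absolute paths evaluated from the document root rather than from the current "a" element, so they match nothing and return None. (I also use get() instead of extract_first(); it is the newer name for the same method.)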
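As for the output CSV: you may already be aware of this, but you should probably yield the information you want to write out and then run the spider with the -o data_information/link.csv option. That is a bit more flexible than opening a file for appending inside the parse method.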

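import scrapy


class LinkSpider(scrapy.Spider):
    name = "link"
    # No need for start_requests: requesting start_urls is the default behaviour
    start_urls = ["https://bloomberg.com"]

    def parse(self, response):
        for j in response.xpath('//a'):
            # The leading dot makes each XPath relative to the current <a> element
            title_to_save = j.xpath('./text()').get()
            href_to_save = j.xpath('./@href').get()

            print("test")
            print(title_to_save)
            print(href_to_save)

            # Yield both fields so the feed exporter can write them to the CSV
            yield {'title': title_to_save, 'href': href_to_save}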

