Extract pages from start_urls and find the PDF link on each extracted page with Scrapy

jv2fixgn, posted on 2022-11-09 in Other

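I am trying to extract some fields from the start URL, and I want to add a PDF link field taken from each of the book URLs found on that page. I tried Scrapy but had no luck adding the PDF field. Here is my code: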

import scrapy

class MybookSpider(scrapy.Spider):
    name = 'mybooks'
    allowed_domains = ['gln.kemdikbud.go.id']
    start_urls = ['https://gln.kemdikbud.go.id/glnsite/category/modul-gls/page/1/']

    def parse(self, response):
        #pass
        # gathering all links
        book_urls = response.xpath("//div[@class='td-module-thumb']//a/@href").getall()
        total_url = len(book_urls)
        i = 0
        for a in range(total_url):
            title = response.xpath("//h3[@class='entry-title td-module-title']//a/text()")[i].extract()
            url_source = response.xpath("//div[@class='td-module-thumb']//a/@href")[i].get()
            thumbnail = response.xpath('//*[@class="td-block-span4"]//*[has-class("entry-thumb")]//@src')[i].extract()
            pdf = scrapy.Request(book_urls[i], self.find_details)
            yield {
                'Book Title': title,
                'URL': url_source,
                'Mini IMG': thumbnail,
                'PDF Link': pdf
            }

            i+=1 

    def find_details(self, response):
        # find PDF link
        pdf = response.xpath("//div[@class='td-post-content']//a/@href").get()
        return pdf

How do I add the PDF link field correctly so that it appears when I export to CSV?

fbcarpbf #1

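In your code, `pdf` is the `scrapy.Request` object itself, not the value returned by `find_details`, so the 'PDF Link' field ends up holding a request instead of a link.
Scrapy is asynchronous, so you cannot get a return value back from a callback this way. Instead, yield the request and pass the partially filled item to the callback with `cb_kwargs`; the callback adds the PDF link and yields the finished item.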

import scrapy

class MybookSpider(scrapy.Spider):
    name = 'mybooks'
    allowed_domains = ['gln.kemdikbud.go.id']
    start_urls = ['https://gln.kemdikbud.go.id/glnsite/category/modul-gls/page/1/']

    def parse(self, response):
        # gathering all links
        book_urls = response.xpath("//div[@class='td-module-thumb']//a/@href").getall()
        total_url = len(book_urls)

        for i in range(total_url):
            item = dict()
            item['title'] = response.xpath("//h3[@class='entry-title td-module-title']//a/text()")[i].extract()
            item['url_source'] = response.xpath("//div[@class='td-module-thumb']//a/@href")[i].get()
            item['thumbnail'] = response.xpath('//*[@class="td-block-span4"]//*[has-class("entry-thumb")]//@src')[i].extract()
            yield scrapy.Request(url=book_urls[i], callback=self.find_details, cb_kwargs={'item': item})

    def find_details(self, response, item):
        # find PDF link
        item['pdf'] = response.xpath("//div[@class='td-post-content']//a/@href").get()
        yield item
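
To see why the original code put a request into the 'PDF Link' field, here is a toy model of the scheduling loop. It uses only the standard library and is not Scrapy's real engine: `FakeRequest`, `FAKE_PAGES`, `run` and the `example.com` URLs are all hypothetical names made up for this sketch, and the callbacks receive the PDF link directly instead of a response object.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: it only describes work to do."""
    url: str
    callback: Callable
    cb_kwargs: dict = field(default_factory=dict)


# Hypothetical "site": detail-page URL -> PDF link found on that page.
FAKE_PAGES = {
    "https://example.com/book-1": "https://example.com/files/book-1.pdf",
    "https://example.com/book-2": "https://example.com/files/book-2.pdf",
}


def parse_broken(urls):
    # Mirrors the question: the request object itself lands in the item,
    # because constructing a request never runs its callback.
    for url in urls:
        yield {"url": url, "pdf": FakeRequest(url, find_details_broken)}


def find_details_broken(pdf_link):
    return pdf_link  # never called: the request was never handed to the engine


def parse_fixed(urls):
    # Mirrors the answer: yield the request and let the callback finish the item.
    for url in urls:
        yield FakeRequest(url, find_details, cb_kwargs={"item": {"url": url}})


def find_details(pdf_link, item):
    item["pdf"] = pdf_link
    yield item


def run(start):
    """Toy engine: schedule yielded requests, collect yielded dicts as items."""
    queue, items = deque(start), []
    while queue:
        out = queue.popleft()
        if isinstance(out, FakeRequest):
            # "Download" the page, then pass the result to the callback.
            queue.extend(out.callback(FAKE_PAGES[out.url], **out.cb_kwargs))
        else:
            items.append(out)
    return items


broken_items = run(parse_broken(FAKE_PAGES))
fixed_items = run(parse_fixed(FAKE_PAGES))
```

In this sketch, `broken_items[0]["pdf"]` is a `FakeRequest` instance, while `fixed_items[0]["pdf"]` is the string `https://example.com/files/book-1.pdf`. The difference is the same as in the spider above: only requests that are yielded to the engine get their callback run, and the item has to be completed and yielded inside that callback.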
