Scrapy output is missing a row from the page

Posted by gab6jxml on 2023-06-23 in Other


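The page has 10 quotes. When I collect them into a list, all 10 show up.
But when I run the code to scrape the page, one quote is missing from the output, so I only get 9 rows of data.
(Note) I noticed that the missing quote is by an author who already appears elsewhere on the page; I am not sure whether that is related.
Page being scraped: https://quotes.toscrape.com/page/4
The same thing happens on other pages.
I have 2 functions: the first scrapes the author URL and some basic information about each quote, then follows that URL; the second scrapes the author's data and builds a dict from it.
Code: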

def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url

        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())

        # items (no trailing commas: a trailing comma would wrap each value in a 1-tuple)
        n['quote'] = q.css('.text ::text').get()
        n['tag'] = t
        n['author'] = q.css('span .author ::text').get()
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})


def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location':  q.css('p .author-born-location ::text').get(),
    }

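I also tried using Items (Scrapy fields), with the same result.
I tried debugging and printing the data in the first function. The missing row shows up there, but it is never passed to the second function.
So I tried a different way of passing the data from the first function to the second, using cb_kwargs: yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})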

scrapy

Source: https://stackoverflow.com/questions/76457021/scrapy-output-missing-row-in-from-the-page

1 Answer


v9tzhpje1#

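Scrapy has a built-in duplicate filter that automatically ignores requests for URLs it has already seen. When two quotes on a page are by the same author, both point to the same author-details URL, so when the second occurrence of that URL is reached the request is dropped, its callback never runs, and that item never reaches the output feed.
You can fix this by setting the dont_filter parameter to True on the request.
For example: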

def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)


def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
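
To make the effect of the filter concrete, here is a minimal standard-library sketch of what happens to the follow-up requests. It is only a simplified model: Scrapy's real filter (RFPDupeFilter by default) works on request fingerprints rather than a plain set of URLs, and the sample quotes, author paths and the schedule helper below are hypothetical names made up for the illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://quotes.toscrape.com/page/4/"

# Hypothetical sample data: three quotes, two of which share an author page.
QUOTES = [
    {"quote": "Quote A", "author_href": "/author/Author-One"},
    {"quote": "Quote B", "author_href": "/author/Author-Two"},
    {"quote": "Quote C", "author_href": "/author/Author-One"},
]


def schedule(quotes, dont_filter=False):
    """Return the quotes whose follow-up request would actually be sent.

    Simplified stand-in for a duplicate filter: a request is dropped when
    its absolute URL was already seen, unless dont_filter is set.
    """
    seen = set()
    sent = []
    for q in quotes:
        url = urljoin(BASE_URL, q["author_href"])
        if not dont_filter and url in seen:
            continue  # dropped: the callback never runs for this quote
        seen.add(url)
        sent.append(q["quote"])
    return sent


filtered = schedule(QUOTES)
unfiltered = schedule(QUOTES, dont_filter=True)
print(filtered)    # the second quote by the repeated author is missing
print(unfiltered)  # every quote gets its own follow-up request
```

In this model the quote that shares an author page with an earlier quote is the one that disappears, which matches the 9-of-10 rows described in the question. Note that dont_filter=True means the same author page is downloaded once per quote.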

Answered 2023-06-23
