Removing duplicate values with Scrapy

vd8tlhqk asked on 2022-11-09

The page lists 695 records, but the spider returns 954, so there are duplicate values. How can I remove the duplicates so that I get only the 695 records? This is the page link: http://www.palatakd.ru/list/

import scrapy

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number = 1

    def parse(self, response):
        details = response.xpath("//p[@class='detail_block']")
        for detail in details:
            # each label sits in a <span>; the value is in its following sibling <span>
            registration = detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address = detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone = detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax = detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield {
                'Телефон': phone,
                'Факс': fax,
                'Регистрационный номер адвоката в реестре': registration,
                'Адрес': address
            }

            # pagination request, yielded from inside the item loop
            next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)
            if PushpaSpider.page_number <= 3:
                PushpaSpider.page_number += 1
                yield response.follow(next_page, callback=self.parse)
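For reference, the extra records are consistent with the pagination above: page_number starts at 1 and the follow request is yielded from inside the item loop, so the first followed URL, http://www.palatakd.ru/list/?PAGEN_1=1, most likely re-requests the same listing as the start URL. Below is a minimal sketch of the same spider with the pagination yielded once per response; the assumption that PAGEN_1=1 is the first page (hence the counter starting at 2) is mine, not something stated in the question.

import scrapy

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number = 2  # assumption: the start URL already serves page 1

    def parse(self, response):
        for detail in response.xpath("//p[@class='detail_block']"):
            # only one field is extracted here for brevity;
            # the remaining fields follow the same pattern as above
            yield {
                'Регистрационный номер адвоката в реестре': detail.xpath(
                    ".//span[contains(.,'Регистрационный номер адвоката в реестре')]"
                    "//following-sibling::span//text()").get(),
            }
        # one pagination request per response, outside the item loop
        if self.page_number <= 3:
            next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(self.page_number)
            PushpaSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)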

mwkjh3gx #1

You can enable an item pipeline to filter out the duplicates.
For example:
In the settings.py file, enable (uncomment) ITEM_PIPELINES:

ITEM_PIPELINES = {
    # the number sets the pipeline order (0-1000, lower values run first)
    'project.pipelines.ProjectPipeline': 300,
}

Then filter out the duplicate items in the pipelines.py file:

from scrapy.exceptions import DropItem

class ProjectPipeline:
    # items already seen during this crawl
    itemlist = []

    def process_item(self, item, spider):
        # drop the item if an identical one was yielded before
        if item in self.itemlist:
            raise DropItem('Duplicate item found')
        self.itemlist.append(item)
        return item
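
If each record can be identified by a single field, a leaner variant (a sketch, not part of the original answer) keeps only a set of seen keys instead of whole items; it assumes the registration number is a usable unique key and that the items are plain dicts, as in the spider above.

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        # keys of the items already seen during this crawl
        self.seen = set()

    def process_item(self, item, spider):
        # assumption: the registration number uniquely identifies a record
        key = item.get('Регистрационный номер адвоката в реестре')
        if key in self.seen:
            raise DropItem(f'Duplicate record: {key}')
        self.seen.add(key)
        return item

To use this class instead, point the ITEM_PIPELINES entry at 'project.pipelines.DuplicatesPipeline'.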

Either way, no adjustments to the spider are needed.
