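The page lists 695 records, but my spider returns 954, so some of them are duplicated. How can I remove the duplicates so that I end up with only the 695 records? This is the page link: http://www.palatakd.ru/list/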
import scrapy
from scrapy.http import Request


class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number = 1

    def parse(self, response):
        details = response.xpath("//p[@class='detail_block']")
        for detail in details:
            registration = detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address = detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone = detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax = detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield {
                'Телефон': phone,
                'Факс': fax,
                'Регистрационный номер адвоката в реестре': registration,
                'Адрес': address
            }

        next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)
        if PushpaSpider.page_number <= 3:
            PushpaSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
1 Answer
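You can enable an item pipeline to filter out the duplicates.
For example:
In settings.py, enable (uncomment) the
ITEM_PIPELINES
setting.
In pipelines.py, filter out the duplicate items.
No changes to the spider are needed.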
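The two steps above could look roughly like the sketch below. It is not the answerer's actual code. The class name DuplicatesPipeline and the project package name "myproject" are placeholders. It assumes that two records count as duplicates only when all four scraped fields match, so adjust the key if the registration number alone is enough to identify a record.

```python
# pipelines.py: a minimal sketch under the assumptions stated above.
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be exercised without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:  # hypothetical name
    """Drop any item whose fields all match an item already seen in this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # The spider yields plain dicts; sorting the (field, value) pairs
        # makes the key independent of field order.
        key = tuple(sorted(dict(item).items()))
        if key in self.seen:
            raise DropItem(f"Duplicate item: {item!r}")
        self.seen.add(key)
        return item


# settings.py ("myproject" is a placeholder for your project package):
# ITEM_PIPELINES = {
#     "myproject.pipelines.DuplicatesPipeline": 300,
# }
```

With the pipeline enabled, Scrapy logs each dropped item and only the first copy of each record reaches the output feed.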