I am crawling the www.cdw.com website. For a given URL, there are around 17 pages. The script that I have written is able to fetch data from page 1 and page 2; the spider then closes on its own after returning results for the first 2 pages. Please let me know how I can fetch the data for the remaining 15 pages.
Thanks in advance.
import scrapy
from cdwfinal.items import CdwfinalItem
from scrapy.selector import Selector
import datetime
import pandas as pd
import time


class CdwSpider(scrapy.Spider):
    name = 'cdw'
    allowed_domains = ['www.cdw.com']
    start_urls = ['http://www.cdw.com/']
    base_url = 'http://www.cdw.com'

    def start_requests(self):
        yield scrapy.Request(url='https://www.cdw.com/search/?key=axiom', callback=self.parse)

    def parse(self, response):
        item = []
        hxs = Selector(response)
        item = CdwfinalItem()
        abc = hxs.xpath('//*[@id="main"]//*[@class="grid-row"]')
        for i in range(len(abc)):
            try:
                item['mpn'] = hxs.xpath("//div[contains(@class,'search-results')]/div[contains(@class,'search-result')][" + str(i+1) + "]//*[@class='mfg-code']/text()").extract()
            except:
                item['mpn'] = 'NA'
            try:
                item['part_no'] = hxs.xpath("//div[contains(@class,'search-results')]/div[contains(@class,'search-result')][" + str(i+1) + "]//*[@class='cdw-code']/text()").extract()
            except:
                item['part_no'] = 'NA'
            yield item

        next_page = hxs.xpath('//*[@id="main"]//*[@class="no-hover" and @aria-label="Next Page"]').extract()
        if next_page:
            new_page_href = hxs.xpath('//*[@id="main"]//*[@class="no-hover" and @aria-label="Next Page"]/@href').extract_first()
            new_page_url = response.urljoin(new_page_href)
            yield scrapy.Request(new_page_url, callback=self.parse, meta={"searchword": '123'})
Log:

2023-02-11 15:39:55 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36
2023-02-11 15:39:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdw.com/search/?key=axiom&pcurrent=3> (referer: https://www.cdw.com/search/?key=axiom&pcurrent=2) ['cached']
2023-02-11 15:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-11 15:39:55 [scrapy.extensions.feedexport] INFO: Stored csv feed (48 items) in: Test5.csv
2023-02-11 15:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2178,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 68059,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 1.30903,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 2, 11, 10, 9, 55, 327740),
 'httpcache/hit': 3,
 'httpcompression/response_bytes': 384267,
 'httpcompression/response_count': 3,
 'item_scraped_count': 48,
 'log_count/DEBUG': 62,
 'log_count/INFO': 11,
 'log_count/WARNING': 45,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2023, 2, 11, 10, 9, 54, 18710)}
1 Answer
Your `next_page` selector is failing to extract the link to the next page. In general, your selectors are also more complicated than they need to be; for example, you should use relative XPath expressions inside your for loop. Below is an example that, apart from using simpler selectors, replicates the same behaviour as your spider and successfully extracts results from all pages.
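One minimal sketch of such a simplified spider could look like the following. This is an illustrative reconstruction rather than a verified, tested spider: it reuses the row selector, the mfg-code / cdw-code class names and the aria-label="Next Page" attribute from the question's own XPath, yields plain dicts instead of CdwfinalItem, and the next-page selector may still need adjusting against the live markup.

import scrapy


class CdwSpider(scrapy.Spider):
    name = 'cdw'
    allowed_domains = ['www.cdw.com']

    def start_requests(self):
        yield scrapy.Request('https://www.cdw.com/search/?key=axiom', callback=self.parse)

    def parse(self, response):
        # Iterate over each result row once, then use relative XPath (".//") inside
        # the loop so every field is read from the current row, not the whole page.
        for row in response.xpath("//div[contains(@class,'search-results')]/div[contains(@class,'search-result')]"):
            yield {
                'mpn': row.xpath(".//*[@class='mfg-code']/text()").get(default='NA'),
                'part_no': row.xpath(".//*[@class='cdw-code']/text()").get(default='NA'),
            }

        # Follow the next-page link if one exists; response.follow() resolves the
        # relative href and schedules another request into this same callback.
        next_href = response.xpath('//a[@aria-label="Next Page"]/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)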
Partial output: