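I'm scraping Google Scholar author profile pages. I ran into a problem when scraping each author's titles: some authors have more than 500 titles, which are shown through a "load more" button. I already have the link for the load-more pagination.
The problem is that I want to count the total number of titles an author has, but I don't get the correct total. When I scrape only 2 authors it returns the correct value, but when I scrape all the authors on a page (10 authors per page), I get wrong totals.
My code is below. Where is my logic wrong?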
def parse(self, response):
    for author_sel in response.xpath('.//div[@class="gsc_1usr"]'):  # loop over all the authors on a page
        link = author_sel.xpath(".//h3[@class='gs_ai_name']/a/@href").extract_first()
        url = response.urljoin(link)
        yield scrapy.Request(url, callback=self.parse_url_to_crawl)

def parse_url_to_crawl(self, response):
    url = response.url
    yield scrapy.Request(url+'&cstart=0&pagesize=100', callback=self.parse_profile_content)

def parse_profile_content(self, response):
    url = response.url
    idx = url.find("user")
    _id = url[idx+5:idx+17]
    name = response.xpath("//div[@id='gsc_prf_in']/text()").extract()[0]
    tmp = response.xpath('//tbody[@id="gsc_a_b"]/tr[@class="gsc_a_tr"]/td[@class="gsc_a_t"]/a/text()').extract()  # extracts the titles
    item = GooglescholarItem()
    n = len(tmp)
    titles = []
    if tmp:
        offset = 0; d = 0
        idx = url.find('cstart=')
        idx += 7
        while url[idx].isdigit():
            offset = offset*10 + int(url[idx])
            idx += 1
            d += 1
        self.n += len(tmp)
        titles.append(self.n)
        self.totaltitle = titles[-1]
        logging.info('inside if URL is: %s', url[:idx-d] + str(offset+100) + '&pagesize=100')
        yield scrapy.Request(url[:idx-d] + str(offset+100) + '&pagesize=100', self.parse_profile_content)
    else:
        item = GooglescholarItem()
        item['name'] = name
        item['totaltitle'] = self.totaltitle
        self.n = 0
        self.totaltitle = 0
        yield item
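This is the result, but the total title values are wrong. Klaus-Robert Müller has 837 titles in total and Tom Mitchell has 264. See the attached image for the log. I know there is a problem in my logic.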
[
{"name": "Carl Edward Rasmussen", "totaltitle": 1684},
{"name": "Carlos Guestrin", "totaltitle": 365},
{"name": "Chris Williams", "totaltitle": 1072},
{"name": "Ruslan Salakhutdinov", "totaltitle": 208},
{"name": "Sepp Hochreiter", "totaltitle": 399},
{"name": "Tom Mitchell", "totaltitle": 282},
{"name": "Johannes Brandstetter", "totaltitle": 1821},
{"name": "Klaus-Robert Müller", "totaltitle": 549},
{"name": "Ajith Abraham", "totaltitle": 1259},
{"name": "Amit kumar", "totaltitle": 1127}
]
1 Answer
I think you're overcomplicating it. I recommend using request.meta to carry your offset and the running article count:
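A minimal sketch of that idea, not a drop-in spider. Scrapy schedules the requests for different authors concurrently, so a counter stored on the spider (self.n, self.totaltitle) is shared by all ten authors and their page counts get mixed together. Keeping the offset and the running total in a dict that travels with each author's request chain avoids that. The helper names page_url and handle_page and the keys "offset" and "total" are made up for illustration; the plain dict stands in for what Scrapy would hand back as response.meta. The stop condition (an empty page) is the same one the question uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

PAGE_SIZE = 100  # same page size the question uses


def page_url(profile_url, cstart, pagesize=PAGE_SIZE):
    """Return profile_url with cstart/pagesize set, replacing any old values."""
    parts = urlsplit(profile_url)
    query = parse_qs(parts.query)
    query["cstart"] = [str(cstart)]
    query["pagesize"] = [str(pagesize)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def handle_page(url, meta, name, titles):
    """Process one page of one author's profile.

    `meta` plays the role of response.meta (hypothetical keys "offset" and
    "total"): it belongs to this author's request chain only, so concurrent
    authors never share a counter.
    Returns ("request", next_url, next_meta) or ("item", item_dict).
    """
    offset = meta.get("offset", 0)
    total = meta.get("total", 0) + len(titles)
    if titles:
        # More titles may follow: ask for the next page and pass the state on.
        next_offset = offset + PAGE_SIZE
        next_meta = {"offset": next_offset, "total": total}
        return ("request", page_url(url, next_offset), next_meta)
    # Empty page: this author is finished, emit the final count.
    return ("item", {"name": name, "totaltitle": total})
```

Inside the spider, the "request" branch would become something like yield scrapy.Request(next_url, callback=self.parse_profile_content, meta=next_meta), with parse_profile_content reading response.meta instead of self.n, and the "item" branch would fill a GooglescholarItem. Building the URL with urllib.parse also replaces the manual digit-by-digit parsing of cstart.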