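I am trying to use a Scrapy CrawlSpider to follow links on a website with infinite scroll, scrape information from the URLs it follows, and then keep following links and scraping. I have found advice on doing this with Scrapy in general, but not much for CrawlSpider specifically. Here is what I have tried so far: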
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

class ItsySpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['citizen.digital']
    start_urls = ['https://www.citizen.digital/search?query=the']
    rules = (
        Rule(follow="True"),
    )

    def parse(self, response):
        base = 'http://cms.citizen.digital/api/v2/search?page={}'
        data = response.json
        current_page = data["current_page"]
        for page in range(2, 10):
            next_page_url=base.format(current_page+page)
            yield scrapy.Request(next_page_url, callback=self.parse_next)

    def parse_next(self, response):
        yield {
            'url': response.url,
            'date': response.xpath('//script[@type="application/ld+json"]/text()').re(r'(?i)(?<=datepublished": ")..........'),
        }
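As you can see, I want to load 10 pages of the infinite-scroll site and follow the links on those pages. Then I want to extract the URL and date from each URL it follows, and continue following links and extracting information.
I have no experience with JSON, so I wonder whether I have made a mistake there. Here is an example response from loading the second page of the infinite-scroll site: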
{
    "data": [
        {
            "id": 186903,
            "slug": "there-are-plans-to-harm-me-but-i-will-not-be-intimidated-a-defiant-nyoro-says-275851",
            "thumbnail": "https:\/\/images.citizen.digital\/wp-content\/uploads\/2019\/09\/ndindi-nyoro-main-e1568106330665.jpg",
            "description": " ",
            "type": "news",
            "title": "\u2018There are plans to harm me but I will not be intimidated,\u2019 a defiant Nyoro says",
            "date": "12.05pm, September 10, 2019(EAT)",
            "menu": {
                "id": 14,
                "slug": "news"
            },
            "author": "Wangui Ngechu"
        },
        {
            "id": 106999,
            "slug": "mwalala-lashes-out-at-intimidated-referees-after-leopards-defeat-243224",
            "thumbnail": null,
            "description": " ",
            "type": "news",
            "title": "Mwalala lashes out at \u2018intimidated referees\u2019 after Leopards defeat",
            "date": "12.20pm, April 29, 2019(EAT)",
            "menu": {
                "id": 7,
                "slug": "sports"
            },
            "author": "Geoffrey Mwamburi"
        },
        {
            "id": 271435,
            "slug": "why-men-are-intimidated-by-successful-women-133180",
            "thumbnail": "http:\/\/images.citizen.digital\/wp-content\/uploads\/2018\/08\/Men.jpg",
            "description": " ",
            "type": "news",
            "title": "Why men are intimidated by successful women",
            "date": "05.11pm, August 29, 2018(EAT)",
            "menu": {
                "id": 4,
                "slug": "entertainment"
            },
            "author": "Sheila Jerotich"
        },
        {
            "id": 271671,
            "slug": "besides-my-wife-these-are-the-only-people-who-can-intimidate-me-duale-132744",
            "thumbnail": null,
            "description": " ",
            "type": "news",
            "title": "Besides my wife, these are the only people who can intimidate me \u2013 Duale",
            "date": "05.13pm, August 02, 2018(EAT)",
            "menu": {
                "id": 4,
                "slug": "entertainment"
            },
            "author": "eDaily Reporter"
        },
        {
            "id": 209728,
            "slug": "nys-boss-richard-ndubai-will-intimidate-witnesses-if-freed-dpp-203602",
            "thumbnail": "https:\/\/images.citizen.digital\/wp-content\/uploads\/2018\/06\/ndubai.png",
            "description": " ",
            "type": "news",
            "title": "NYS boss Richard Ndubai will intimidate witnesses if freed: DPP",
            "date": "06.15pm, June 11, 2018(EAT)",
            "menu": {
                "id": 14,
                "slug": "news"
            },
            "author": "Dzuya Walter"
        }
    ],
    "meta": {
        "pagination": {
            "total": 15,
            "count": 5,
            "per_page": 5,
            "current_page": 2,
            "total_pages": 3,
            "links": {
                "previous": "http:\/\/cms.citizen.digital\/api\/v2\/search?page=1",
                "next": "http:\/\/cms.citizen.digital\/api\/v2\/search?page=3"
            }
        }
    }
}
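For reference, here is a minimal sketch of reading a payload shaped like the one above with the standard `json` module. It shows that `current_page` sits under `meta` → `pagination` rather than at the top level, and that the response already carries a `next` link. The helper name `read_page` and the trimmed `SAMPLE` string are illustrative, not part of the site's API; inside a Scrapy callback the parsed dictionary would come from calling `response.json()` (a method, available in Scrapy 2.2 and later) instead of `json.loads`.

```python
import json

# Trimmed copy of the sample response; only the fields used below are kept.
SAMPLE = r"""
{
    "data": [
        {"id": 186903, "slug": "there-are-plans-to-harm-me-but-i-will-not-be-intimidated-a-defiant-nyoro-says-275851"},
        {"id": 106999, "slug": "mwalala-lashes-out-at-intimidated-referees-after-leopards-defeat-243224"}
    ],
    "meta": {
        "pagination": {
            "current_page": 2,
            "total_pages": 3,
            "links": {
                "previous": "http:\/\/cms.citizen.digital\/api\/v2\/search?page=1",
                "next": "http:\/\/cms.citizen.digital\/api\/v2\/search?page=3"
            }
        }
    }
}
"""


def read_page(payload):
    """Return (slugs, current_page, next_url) for one page of search results."""
    pagination = payload["meta"]["pagination"]
    slugs = [item["slug"] for item in payload["data"]]
    # Assumption: on the last page "links" may be empty or lack "next",
    # so fall back to None instead of raising.
    links = pagination.get("links") or {}
    next_url = links.get("next")
    return slugs, pagination["current_page"], next_url


payload = json.loads(SAMPLE)
slugs, current_page, next_url = read_page(payload)
print(current_page, next_url)
```

Following the `next` link until it is missing avoids guessing page numbers with a fixed `range`.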
When I run it with scrapy crawl test -O test.csv, it returns an empty CSV file.
1 Answer
af7jpaap #1
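First scrape the API key and the API base URL from the HTML page (optionally, you can hard-code them instead), then add the API key to the request headers and start crawling the API.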
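The steps above could be sketched roughly as follows, using only the standard library so the logic can be checked without a network connection. Everything specific here is an assumption: the `SAMPLE_HTML` snippet, the `apiBaseUrl` and `apiKey` field names, the `X-Api-Key` header name, and the `query` parameter on the API URL are all hypothetical. Check the page source and the browser's network tab for the real ones.

```python
import re
from urllib.parse import urlencode

# Hypothetical snippet of the search page's HTML. The real page may embed the
# key and base URL differently, so the patterns below are assumptions.
SAMPLE_HTML = """
<script>
  window.config = {
    apiBaseUrl: "http://cms.citizen.digital/api/v2",
    apiKey: "EXAMPLE-KEY-123"
  };
</script>
"""

API_BASE_RE = re.compile(r'apiBaseUrl:\s*"([^"]+)"')
API_KEY_RE = re.compile(r'apiKey:\s*"([^"]+)"')


def extract_api_config(html):
    """Pull the API base URL and API key out of the page source."""
    base = API_BASE_RE.search(html)
    key = API_KEY_RE.search(html)
    if base is None or key is None:
        raise ValueError("API base URL or key not found in page source")
    return base.group(1), key.group(1)


def build_search_request(base_url, api_key, query, page):
    """Return (url, headers) for one page of search results."""
    params = urlencode({"query": query, "page": page})
    url = f"{base_url}/search?{params}"
    # The header name is a guess; copy the real one from the browser's network tab.
    headers = {"X-Api-Key": api_key, "Accept": "application/json"}
    return url, headers


base_url, api_key = extract_api_config(SAMPLE_HTML)
url, headers = build_search_request(base_url, api_key, "the", 2)
print(url)
```

In a spider, the resulting `url` and `headers` would be passed to `scrapy.Request(url, headers=headers, callback=...)`, and the callback would read the JSON body with `response.json()`.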