My new spider/crawler won't scrape the next page. After the first page of data, my crawl log shows 'DEBUG: Crawled (200) <GET https://reedsy.com/robots.txt> (referer: None)' twice, and then the next line is '[scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://reedsy.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)'.
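For reference, the DUPEFILTER_DEBUG setting that log line mentions can be switched on so every filtered duplicate is logged rather than only the first. A minimal sketch, assuming it is set in the project's settings.py (any of Scrapy's settings mechanisms would work):

# settings.py -- DUPEFILTER_DEBUG is a built-in Scrapy setting; when True,
# every filtered duplicate request is logged instead of just the first one.
DUPEFILTER_DEBUG = True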
Thanks in advance for your help!
import scrapy


class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=']

    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}

        next_page = response.css('a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Along with the code shown above, I have also tried:
next_page = response.css('a').attrib['href']
yield response.follow(next_page, callback=self.parse, dont_filter=True)

next_page = response.css('a::attr(href)').extract()

next_page = response.css('a::attr(href)').extract_first()
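One way to see which link that broad selector actually matches is Scrapy's interactive shell (a diagnostic sketch; the command and calls are standard Scrapy, and the claim about which href comes first is an assumption about the page layout):

scrapy shell 'https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size='
>>> response.css('a::attr(href)').get()

This returns whichever href appears first in the HTML, which is typically a header or navigation link (here, presumably https://reedsy.com/, given the filtered duplicate in the log) rather than the pagination link.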
1 Answer
Your next_page CSS selector is not specific enough. Currently, it just grabs the first link tag it finds on the page. With an XPath expression, you can target the rel attribute of the actual next-page link at the bottom of the page. For example:
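A minimal sketch of such an expression, assuming the pagination link is marked rel="next" (a common convention, though the exact attribute value on the live page is an assumption):

# Target only the link whose rel attribute marks it as the next page,
# instead of the first <a> tag in the document.
next_page = response.xpath("//a[@rel='next']/@href").get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

Scoped this narrowly, the spider follows only the pagination link, so the duplicate-filter messages for https://reedsy.com/ should no longer appear.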