Scrapy not scraping the next page

Posted by brqmpdu1 on 2023-05-07 in Other

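I'm new to spiders, and my crawler won't scrape the next page. After the data from the first page, my crawl log shows 'DEBUG: Crawled (200) <GET https://reedsy.com/robots.txt> (referer: None)' twice, and the next line is [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://reedsy.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates).
Thanks in advance for your help!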

import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=']
   
    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}
                
        next_page = response.css('a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

Along with the code shown, I also tried:

next_page = response.css('a').attrib['href']
yield response.follow(next_page, callback = self.parse, dont_filter = True)
next_page = response.css('a::attr(href)').extract()
next_page = response.css('a::attr(href)').extract_first()

Answer #1 (4sup72z8)

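Your next_page CSS selector is not specific enough. As written, it just grabs the first link (<a>) tag it finds on the page. With an XPath expression, you can instead target the rel attribute of the actual next-page link at the bottom of the page.
For example: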

import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=']

    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

Output

{'Publisher': 'Akashic Books', 'url': 'http://www.akashicbooks.com/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Chicago Review Press', 'url': 'https://www.chicagoreviewpress.com/information-for-authors--amp--agents-pages-100.php'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Atria Publishing Group', 'url': 'https://www.atriabooks.biz/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Yale University Press', 'url': 'https://yalebooks.yale.edu/about-us/editors#submissions'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Kensington Publishing', 'url': 'https://www.kensingtonbooks.com/pages/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Third World Press Foundation', 'url': 'https://thirdworldpressfoundation.org/submit-a-manuscript-2/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Dafina', 'url': 'https://www.kensingtonbooks.com/pages/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'University of Illinois Press', 'url': 'https://www.press.uillinois.edu/authors/proposal.html'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Arsenal Pulp Press', 'url': 'https://arsenalpulp.com/About-Arsenal-Pulp-Press/Submission-Guidelines'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'University of Georgia Press', 'url': 'https://ugapress.org/resources/frequently-asked-questions/'}
2023-05-03 22:13:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=> (referer: https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=)
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Rosen Publishing', 'url': 'https://www.rosenpublishing.com/faqs'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Peepal Tree', 'url': 'https://peepaltreepress.submittable.com/submit'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'RedBone Press', 'url': 'https://www.redbonepress.com/pages/frontpage'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Just Us Books', 'url': 'https://justusbooks.com/pages/resource-center/submission-guidelines.html'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Good2Go Publishing', 'url': 'https://www.good2gopublishing.com/submissions'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Royalty Publishing House', 'url': 'https://www.royaltypublishinghouse.com/submissions/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Black Classic Press', 'url': 'http://www.blackclassicbooks.com/manuscript-submission/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Noemi Press', 'url': 'http://www.noemipress.org/contest/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Wayne State University Press', 'url': 'https://www.wsupress.wayne.edu/authors'}
