I am trying to learn web scraping/crawling and applied the code below in a Jupyter notebook, but it does not show any output. Can anyone help me and guide me on how to use the Scrapy package from a Jupyter notebook?
Code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksCrawlSpider(CrawlSpider):
    name = 'books_crawl'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-1.html']

    le_book_details = LinkExtractor(restrict_css='h3 > a')
    le_next = LinkExtractor(restrict_css='.next > a')  # next button
    le_cats = LinkExtractor(restrict_css='.side_categories > ul > li > ul > li a')  # categories

    rule_book_details = Rule(le_book_details, callback='parse_item', follow=False)
    rule_next = Rule(le_next, follow=True)
    rule_cats = Rule(le_cats, follow=True)

    rules = (
        rule_book_details,
        rule_next,
        rule_cats,
    )

    def parse_item(self, response):
        yield {
            'Title': response.css('h1 ::text').get(),
            'Category': response.xpath('//ul[@class="breadcrumb"]/li[last()-1]/a/text()').get(),
            'Link': response.url,
        }
The final result: no output at all.
1 Answer
To run the spider, you can add the following snippet in a new cell:
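A minimal sketch of what that cell could look like, assuming the BooksCrawlSpider class from the question has already been defined earlier in the notebook (note that the underlying Twisted reactor cannot be restarted, so re-running the cell generally requires restarting the kernel first):

from scrapy.crawler import CrawlerProcess

# Run the spider defined in the previous cell.
process = CrawlerProcess()
process.crawl(BooksCrawlSpider)
process.start()  # blocks until the crawl finishes; restart the kernel before running again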
More details in the Scrapy docs.
Edit:
The solution for creating a DataFrame from the extracted items is to first export the output to a file (e.g. CSV) by passing a settings argument to CrawlerProcess, and then open that file with Pandas:
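A sketch of that approach, assuming Scrapy 2.4 or later (for the FEEDS setting and its overwrite option); the filename books.csv is only an example:

import pandas as pd
from scrapy.crawler import CrawlerProcess

# Export every item the spider yields to a CSV file via the FEEDS setting.
process = CrawlerProcess(settings={
    "FEEDS": {
        "books.csv": {"format": "csv", "overwrite": True},  # example filename
    },
})
process.crawl(BooksCrawlSpider)
process.start()

# Once the crawl has finished, load the exported file into a DataFrame.
df = pd.read_csv("books.csv")
df.head()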