python - How to use the Scrapy package in a Jupyter Notebook

mwg9r5ms · posted 2023-02-18 in Python

I'm trying to learn web scraping/crawling and tried running the code below in a Jupyter notebook, but it doesn't show any output. Could someone help me and explain how to use the Scrapy package from a Jupyter notebook?

Code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BooksCrawlSpider(CrawlSpider):
    name = 'books_crawl'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-1.html']

    le_book_details = LinkExtractor(restrict_css='h3 > a')  # Book detail pages
    le_next = LinkExtractor(restrict_css='.next > a')  # next_button
    le_cats = LinkExtractor(restrict_css='.side_categories > ul > li > ul > li a')  # Categories

    rule_book_details = Rule(le_book_details, callback='parse_item', follow=False)
    rule_next = Rule(le_next, follow=True)
    rule_cats = Rule(le_cats, follow=True)

    rules = (
        rule_book_details,
        rule_next,
        rule_cats
    )

    def parse_item(self, response):
        yield {
            'Title': response.css('h1 ::text').get(),
            'Category': response.xpath('//ul[@class="breadcrumb"]/li[last()-1]/a/text()').get(),
            'Link': response.url
        }

The final result shows no output at all.

siv3szwd 1#

To run the spider, you can add the following snippet in a new cell:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(BooksCrawlSpider)
process.start()
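
Note that process.start() blocks until the crawl finishes, and because the underlying Twisted reactor cannot be restarted, re-running that cell in the same kernel raises ReactorNotRestartable; restart the kernel before crawling again. If you want the scraped items directly in the notebook rather than buried in the log, one option is to collect them with Scrapy's item_scraped signal. A minimal sketch (the names items and collect_item are just illustrative):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

items = []  # collected item dicts end up here

def collect_item(item, response, spider):
    items.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(BooksCrawlSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

items[:3]  # first few scraped dicts: Title, Category, Link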

More details are in the Scrapy docs.
Edit:
A solution for creating a DataFrame from the extracted items is to first export the output to a file (e.g. CSV) by passing the settings argument to CrawlerProcess:

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.csv": {"format": "csv"},
    },
})
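
With these settings you still have to start the crawl as in the first snippet; items.csv is only written once the crawl has actually run:

process.crawl(BooksCrawlSpider)
process.start()  # blocks until the crawl completes and items.csv is written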

Then open it with pandas:

import pandas as pd

df = pd.read_csv("items.csv")
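
Assuming the crawl finished without errors, the DataFrame will have one row per scraped book, with the three columns yielded by parse_item:

df.head()  # columns: Title, Category, Link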
