I have a simple, straightforward Scrapy spider to crawl https://books.toscrape.com/.
The parsing isn't implemented yet; I just want to see whether the spider can crawl the site.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ["tosrape.com"]
    start_urls = ["https://books.toscrape.com/"]
    rules = (
        Rule(LinkExtractor(allow="catalogue/category")),
    )
Even though I can interact with the site through the Scrapy shell (e.g. response.css("a::text").getall()), the spider doesn't crawl the site and returns:
2023-03-02 14:31:05 [scrapy.core.engine] INFO: Spider opened
2023-03-02 14:31:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-03-02 14:31:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-03-02 14:31:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://books.toscrape.com/robots.txt> (referer: None)
2023-03-02 14:31:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/> (referer: None)
2023-03-02 14:31:07 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'books.toscrape.com': <GET https://books.toscrape.com/catalogue/category/books_1/index.html>
2023-03-02 14:31:07 [scrapy.core.engine] INFO: Closing spider (finished)
...
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
...
2023-03-02 14:31:07 [scrapy.core.engine] INFO: Spider closed (finished)
Where am I going wrong?
2 Answers

Answer 1
I had to disable the offsite middleware.
Scrapy documentation: Spider Middleware
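If the offsite filter really does need to be switched off (though fixing allowed_domains, as the other answer suggests, is usually the better fix), the middleware can be disabled in the project settings. A minimal sketch, assuming the stock middleware path in Scrapy's spidermiddlewares package; mapping a middleware to None disables it:

```python
# settings.py (or custom_settings on the spider) -- hedged sketch:
# setting the middleware's order value to None removes it from the chain.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```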
Answer 2
You need to update allowed_domains. It should be books.toscrape.com, not tosrape.com.
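The log line "Filtered offsite request to 'books.toscrape.com'" follows directly from this typo. The offsite check can be sketched with the standard library (a simplification for illustration, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Simplified offsite check: the request host must equal an allowed
    domain or be a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

link = "https://books.toscrape.com/catalogue/category/books_1/index.html"
print(is_allowed(link, ["tosrape.com"]))         # False: typo, filtered as offsite
print(is_allowed(link, ["books.toscrape.com"]))  # True
print(is_allowed(link, ["toscrape.com"]))        # True: parent domain also matches
```

Note that start_urls are never filtered, which is why the first page was crawled (200) but every extracted link was dropped.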