For learning purposes, I have been trying to recursively crawl and scrape all of the URLs under https://triniate.com/images/, but Scrapy only seems to crawl and scrape the TXT, HTML, and PHP URLs.
Here is my spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem


class HelloSpider(CrawlSpider):
    # Identifier used when running scrapy from the CLI
    name = 'hello'
    # Domains the spider is allowed to explore
    allowed_domains = ["triniate.com"]
    # Starting URL for the crawl
    start_urls = ["https://triniate.com/images/"]
    # A LinkExtractor argument can narrow the rule (for example, scrape only pages whose URL contains "new"),
    # but no argument is passed here because every page is targeted.
    # When a downloaded page matches the Rule, the function named in callback is called.
    # With follow=True the crawl continues recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape
        # (either an XPath or a CSS selector can be used)
        item['title'] = "idc"
        return item
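As background for the Rule and LinkExtractor comments above: by default, LinkExtractor skips links whose file extensions appear in scrapy.linkextractors.IGNORED_EXTENSIONS, which includes the common image extensions, so an unconfigured Rule never yields the image URLs from a directory listing. A minimal sketch of a rule that keeps them, with deny_extensions emptied, might look like the following; the spider name is illustrative and this is only one possible fix, not the thread's accepted answer.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from HelloScrapy.items import PageInfoItem


class HelloImageSpider(CrawlSpider):
    name = 'hello_images'
    allowed_domains = ["triniate.com"]
    start_urls = ["https://triniate.com/images/"]

    # deny_extensions=[] disables the default extension filter, so links to
    # .jpg/.png/etc. are extracted instead of being silently dropped.
    # Note that with follow=True each image URL is also requested once
    # (the file is downloaded) just to record its URL in the callback.
    rules = [Rule(LinkExtractor(deny_extensions=[]),
                  callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        item['title'] = "idc"
        return item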
Contents of items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item, Field


class PageInfoItem(Item):
    URL = Field()
    title = Field()
And the console output is:
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
'downloader/request_count': 176,
'downloader/request_method_count/GET': 176,
'downloader/response_bytes': 227394,
'downloader/response_count': 176,
'downloader/response_status_count/200': 176,
'dupefilter/filtered': 875,
'elapsed_time_seconds': 8.711563,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
'httpcompression/response_bytes': 402654,
'httpcompression/response_count': 175,
'item_scraped_count': 175,
'log_count/DEBUG': 357,
'log_count/INFO': 11,
'request_depth_max': 5,
'response_received_count': 176,
'scheduler/dequeued': 176,
'scheduler/dequeued/memory': 176,
'scheduler/enqueued': 176,
'scheduler/enqueued/memory': 176,
'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)
Can someone suggest how I should modify my code to get the result I expect?
EDIT: To clarify, I am trying to obtain the URLs, not the images or files themselves.
2 Answers
idv4meu81#
To do this, you need to understand how Scrapy works. First, write a spider that recursively crawls every directory starting from the root URL and, on each page it visits, extracts all of the image links.
So I wrote this code for you and tested it on the site you provided. It works perfectly.
This way you get the links to every image available on that page and in any inner directory, recursively.
The get_images method searches the page for all images and collects their links, then queues the directory links for crawling, so you end up with the image links from every directory. The code I provided produces a result that contains all of the links you want.
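A minimal sketch of the approach described here, assuming an Apache-style directory index whose <a href> entries point either to image files or to sub-directories ending in "/"; the spider name, selectors, and extension list are illustrative guesses rather than the answerer's exact code:

import scrapy


class ImageLinkSpider(scrapy.Spider):
    """Walks the directory listing recursively and yields image URLs."""
    name = "image_links"
    allowed_domains = ["triniate.com"]
    start_urls = ["https://triniate.com/images/"]

    # Extensions treated as images; extend this list as needed.
    image_ext = [".jpg", ".jpeg", ".png", ".gif", ".bmp"]

    def parse(self, response):
        # Yield the image links found on this listing page.
        yield from self.get_images(response)
        # Follow sub-directory links (they end with "/"), skipping the parent link.
        for href in response.css("a::attr(href)").getall():
            if href.endswith("/") and not href.startswith(".."):
                yield response.follow(href, callback=self.parse)

    def get_images(self, response):
        # Collect every href whose extension marks it as an image file.
        for href in response.css("a::attr(href)").getall():
            if any(href.lower().endswith(ext) for ext in self.image_ext):
                yield {"URL": response.urljoin(href)}

Running it with a feed export (for example, scrapy crawl image_links -o links.json) would collect the URLs into a file, similar to the haxx.json feed in the question.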
Note: I specified the image file extensions in the image_ext attribute. You can extend it to cover every available image extension, or include only the extensions that actually exist on the site, as I did. The choice is yours.
k10s72fa2#
I tried using a basic Spider together with Scrapy Selenium, and it worked well.
basic.py
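A rough sketch of what a basic Spider using scrapy-selenium's SeleniumRequest could look like for this task; the spider name, selectors, and extension list are assumptions rather than the answerer's exact basic.py:

import scrapy
from scrapy_selenium import SeleniumRequest


class ImagesSeleniumSpider(scrapy.Spider):
    name = "images_selenium"
    allowed_domains = ["triniate.com"]
    image_ext = (".jpg", ".jpeg", ".png", ".gif", ".bmp")

    def start_requests(self):
        # Render the root listing through Selenium instead of plain HTTP.
        yield SeleniumRequest(url="https://triniate.com/images/", callback=self.parse)

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if href.endswith("/") and not href.startswith(".."):
                # Recurse into sub-directories of the listing.
                yield SeleniumRequest(url=url, callback=self.parse)
            elif href.lower().endswith(self.image_ext):
                # Record the image URL without downloading the file.
                yield {"URL": url}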
settings.py (additions)
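The settings additions below follow the standard scrapy-selenium setup from that package's README (headless Firefox with geckodriver); the answerer's exact settings were not quoted, so the driver choice and path are assumptions:

# settings.py additions for scrapy-selenium (assumed headless Firefox setup)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}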
Output