Scrapy only crawls and scrapes HTML and TXT

vh0rcniy asked on 2022-11-09

For learning purposes, I have been trying to recursively crawl and scrape every URL under https://triniate.com/images/, but Scrapy only seems to crawl and scrape the TXT, HTML, and PHP URLs.
Here is my spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem

class HelloSpider(CrawlSpider):
    # Name used when running the spider from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["triniate.com"]
    # URL(s) the crawl starts from
    start_urls = ["https://triniate.com/images/"]

    # LinkExtractor can take arguments to narrow the rule (e.g. only follow URLs
    # containing "new"), but none are passed here because every page is targeted.
    # When a page matching the Rule is downloaded, the function named in callback is called.
    # With follow=True the crawl continues recursively from the extracted links.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape here;
        # besides XPath, CSS selectors can also be used.
        item['title'] = "idc"
        return item

Contents of items.py:


# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()

And the console output is:

2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
 'downloader/request_count': 176,
 'downloader/request_method_count/GET': 176,
 'downloader/response_bytes': 227394,
 'downloader/response_count': 176,
 'downloader/response_status_count/200': 176,
 'dupefilter/filtered': 875,
 'elapsed_time_seconds': 8.711563,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
 'httpcompression/response_bytes': 402654,
 'httpcompression/response_count': 175,
 'item_scraped_count': 175,
 'log_count/DEBUG': 357,
 'log_count/INFO': 11,
 'request_depth_max': 5,
 'response_received_count': 176,
 'scheduler/dequeued': 176,
 'scheduler/dequeued/memory': 176,
 'scheduler/enqueued': 176,
 'scheduler/enqueued/memory': 176,
 'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)

Could anyone suggest how I should modify the code to get the result I expect?
EDIT: To clarify, I am trying to get the URLs, not the images or files themselves.

idv4meu8 #1

To do this you need to understand how Scrapy works. First, write a spider that recursively crawls every directory starting from the root URL; whenever it visits a page, it extracts all of the image links.
So I wrote this code for you and tested it against the site you provided. It works perfectly.

import scrapy

class ImagesSpider(scrapy.Spider):
    name = "images"
    # extensions treated as images
    image_ext = ['png', 'gif']

    images_urls = set()

    start_urls = [
        'https://triniate.com/images/',
        # if there are some other urls you want to scrape the same way,
        # add them to this list
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.get_images)

    def get_images(self, response):
        all_hrefs = response.css('a::attr(href)').getall()
        all_images_links = list(filter(lambda x: x.split('.')[-1] in self.image_ext, all_hrefs))

        for link in all_images_links:
            self.images_urls.add(link)
            yield {'link': f'{response.request.url}{link}'}

        # hrefs ending in "/" are sub-directories: follow them recursively
        next_page_links = list(filter(lambda x: x[-1] == '/', all_hrefs))
        for link in next_page_links:
            yield response.follow(link, callback=self.get_images)

This way you get the links to all images on that page and in any internal directories (recursively).
The get_images method looks for all images on a page, collects their links, and then queues the directory links for crawling, so you end up with the image links from every directory.
The code I provided produces output like this, containing all the links you want:

[
   {"link": "https://triniate.com/images/ChatIcon.png"},
   {"link": "https://triniate.com/images/Sprite1.gif"},
   {"link": "https://triniate.com/images/a.png"},
   ...
   ...
   ...
   {"link": "https://triniate.com/images/objects/house_objects/workbench.png"}
]
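
If it helps, output like the above can be written straight to a file with Scrapy's feed exports, for example by running scrapy crawl images -o links.json (the file name here is just an example, and this assumes the spider is saved inside an ordinary Scrapy project).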

Note: I listed the image file extensions in the image_ext attribute. You can extend it to every common image extension, or, as I did, include only the extensions that actually appear on the site. That choice is up to you.
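
A related point: the original CrawlSpider most likely never reached the image URLs because LinkExtractor drops links to common binary extensions by default (its deny_extensions argument defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes png and gif). Below is a minimal, untested sketch of relaxing that filter, assuming the goal is only to collect the URLs; the spider name is made up for illustration.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImageUrlSpider(CrawlSpider):
    # hypothetical spider, not part of the original question
    name = 'image_urls'
    allowed_domains = ['triniate.com']
    start_urls = ['https://triniate.com/images/']

    # deny_extensions=[] turns off the default extension filter,
    # so links to .png/.gif files are extracted and requested as well
    rules = [Rule(LinkExtractor(deny_extensions=[]), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        # only record the URL, per the question's edit
        yield {'URL': response.url}

Note that this still downloads every image just to record its URL, which is why the answer above extracts the hrefs from the directory pages instead.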

k10s72fa #2

I tried this with a basic Spider together with scrapy-selenium and it worked well.

basic.py

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['triniate.com']

    def start_requests(self):
        # use Selenium directly to load the directory listing and collect the hrefs
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        driver.set_window_size(1920, 1080)
        driver.get("https://triniate.com/images/")

        links = driver.find_elements(By.XPATH, "//html/body/table/tbody/tr/td[2]/a")

        for link in links:
            href = link.get_attribute('href')
            # hand each collected URL back to Scrapy through the Selenium middleware
            yield SeleniumRequest(
                url=href
            )

        driver.quit()
        return super().start_requests()

    def parse(self, response):
        yield {
            'URL': response.url
        }

settings.py

Added:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
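
For this middleware to work, scrapy-selenium also expects the driver itself to be configured in settings.py. A minimal sketch based on the scrapy-selenium README; the browser choice, driver path, and arguments below are assumptions and may differ on your machine:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'                          # which browser to drive
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # assumes chromedriver is on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']               # run the browser without a window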

Output

2022-04-22 12:03:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/stand_right.gif>
{'URL': 'https://triniate.com/images/stand_right.gif'}
2022-04-22 12:03:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://triniate.com/images/walk_right_transparent.gif> (referer: None)
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back.gif>
{'URL': 'https://triniate.com/images/walk_back.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_left_transparent.gif>
{'URL': 'https://triniate.com/images/walk_left_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_front_transparent.gif>
{'URL': 'https://triniate.com/images/walk_front_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back_transparent.gif>
{'URL': 'https://triniate.com/images/walk_back_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right.gif>
{'URL': 'https://triniate.com/images/walk_right.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right_transparent.gif>
{'URL': 'https://triniate.com/images/walk_right_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.engine] INFO: Closing spider (finished)
