Scrapy - replace downloaded image file names with names from the item URL

ncecgwcz · posted 2022-11-09 · category: Other

As stated in the title, I want to replace the image file names with the item's path name. Here is my example:
When I run my spider, the files I get are named with the standard SHA1 hash.
If possible, I would also appreciate it if it could grab just the first image instead of the whole set.
URL: https://www.antaira.com/products/10-100Mbps/LNX-500A
Expected image name: LNX-500A.jpg
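For context, Scrapy's default `ImagesPipeline` derives the file name from the SHA1 digest of the image URL, which is why the files come out as 40-character hex names. A minimal sketch (the image URL below is a made-up example, since the real `src` is not shown in the question):

```python
import hashlib

# Hypothetical image URL on the product page
image_url = "https://www.antaira.com/products/10-100Mbps/LNX-500A/images/01.jpg"

# The default ImagesPipeline names the file after the SHA1 digest of the URL
digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
print("full/%s.jpg" % digest)
```

The digest is always 40 hex characters, regardless of the URL, which is what produces names like the one shown in the edit below.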
Spider.py

import scrapy
from ..items import AntairaItem

class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse(self, response):
        # note: the default callback must be named parse(), otherwise
        # responses for start_urls are never handled
        raw_image_urls = response.css('.image img ::attr(src)').getall()
        clean_image_urls = []
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        # yield once, after the loop, so the item is not emitted repeatedly
        yield {
            'image_urls': clean_image_urls
        }

pipelines.py

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
import json

class AntairaPipeline:
    def process_item(self, item, spider):

        # calling dumps to create json data.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def open_spider(self, spider):
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        self.file.close()

class customImagePipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # this way you can use everything from the item, not just the url
        #image_guid = request.meta.get('filename', '')
        image_guid = request.url.split('/')[-1]
        #image_direct = request.meta.get('directoryname', '')
        return 'full/%s.jpg' % (image_guid)

    #Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + response.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # return [Request(x, meta={'filename': item['image_name']})
        #     for x in item.get(self.image_urls_field, [])]
        for image in item['image_urls']:  # was item['images']; requests must come from image_urls
            yield Request(image)

I know there is a way to do this with the request meta data, but I would like the files to be named after the item's product name, and ideally only the first image. Thanks.
Edit:
Original file name: 12f6537bd206cf58e86365ed6b7c1fb446c533b2.jpg
Desired file name: "LNX-500A_01.jpg" when there is more than one image, using the last part of the start_url path; otherwise just "LNX-500A.jpg"
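The desired base name can be taken from the last path segment of the start URL, for example:

```python
# Deriving the desired base name from the last path segment of the start URL
start_url = "https://www.antaira.com/products/10-100Mbps/LNX-500A"
base = start_url.rstrip("/").split("/")[-1]

single_name = base + ".jpg"        # only one image on the page
numbered_name = base + "_01.jpg"   # more than one image
print(single_name, numbered_name)
```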

acruukt9's answer:

I extracted the item name and all of the images with xpath expressions, then in the image pipeline I added the item name and a file number to the request's meta keyword arg, and joined the two in the pipeline's file_path method.
You could just as easily split the request URL and use that as the file name; either approach works.
Also, for some reason I wasn't getting any images at all with the css selectors, so I switched to an xpath expression. If css works for you, you can switch back and it should still work.
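The URL-splitting alternative mentioned above can be sketched like this (the request URL is a hypothetical example of what an image request might look like):

```python
from urllib.parse import urlparse

# Hypothetical image request URL; the last path segment becomes the file name
request_url = "https://www.antaira.com/products/10-100Mbps/LNX-500A/images/LNX-500A_01.jpg"
filename = urlparse(request_url).path.split("/")[-1]
print(filename)
```

This only works when the site's image URLs already end in a meaningful name, which is why the answer below passes the product name through meta instead.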

Spider file

import scrapy
from ..items import MyItem

class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse(self, response):
        item = MyItem()
        raw_image_urls = response.xpath('//div[@class="selectors"]/a/@href').getall()
        name = response.xpath("//h1[@class='product-name']/text()").get()
        filename = name.split(' ')[0].strip()
        urls = [response.urljoin(i) for i in raw_image_urls]
        item["name"] = filename
        item["image_urls"] = urls
        yield item

items.py

from scrapy import Item, Field

class MyItem(Item):
    name = Field()
    image_urls = Field()
    images = Field()

pipelines.py

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline

class ImagePipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *args, item=None):
        filename = request.meta["filename"].strip()
        number = request.meta["file_num"]
        return filename + "_" + str(number) + ".jpg"

    def get_media_requests(self, item, info):
        name = item["name"]
        for i, url in enumerate(item["image_urls"]):
            meta = {"filename": name, "file_num": i}
            yield Request(url, meta=meta)
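To see how the two methods fit together, here is a plain-Python sketch of the meta dict flowing from get_media_requests into file_path (the URLs are stand-ins; Scrapy passes the meta dict through on each Request unchanged):

```python
# Sketch: get_media_requests attaches {"filename", "file_num"} meta to each
# request, and file_path joins them into the final name.
def build_file_path(meta):
    # mirrors the pipeline's file_path logic above
    return meta["filename"].strip() + "_" + str(meta["file_num"]) + ".jpg"

name = "LNX-500A"
urls = ["u0", "u1", "u2"]  # stand-ins for the real image URLs
metas = [{"filename": name, "file_num": i} for i, _ in enumerate(urls)]
paths = [build_file_path(m) for m in metas]
print(paths)
```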

settings.py

ITEM_PIPELINES = {
   'project.pipelines.ImagePipeline': 1,
}
IMAGES_STORE = 'image_dir'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'

With all of this in place, running scrapy crawl ImageDownload creates the following directory structure:

Project
 | - image_dir
 |     | - LNX-500A_0.jpg
 |     | - LNX-500A_1.jpg
 |     | - LNX-500A_2.jpg
 |     | - LNX-500A_3.jpg
 |     | - LNX-500A_4.jpg  
 |
 | - project
       | - __init__.py
       | - items.py
       | - middlewares.py
       | - pipelines.py
       | - settings.py
       |
       | - spiders
             | - antaira.py

These are the files that get created.
