mapreduce: Hadoop approach to outputting millions of small binary/image files

Posted by 3df52oht on 2021-06-04 in Hadoop


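I need to process and manipulate many images in a Hadoop job. The input will arrive over the network, fetched using MultiThreadedMapper.
But what is the best approach for the reduce output? I think I should write the raw binary image data into a SequenceFile, transfer those files to their final home, and then write a small application that extracts the individual images from the SequenceFile into separate JPG and GIF files.
Or is there a better option to consider?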

hadoop mapreduce reduce

Source: https://stackoverflow.com/questions/14250494/hadoop-approach-to-outputting-millions-of-small-binary-image-files

1 answer


ogsagwnx1#

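If you feel up to it (or can find an existing implementation by searching), you could write a FileOutputFormat that wraps the FSDataOutputStream in a ZipOutputStream, giving you one zip file per reducer. That saves you the effort of writing a SequenceFile extraction program.
Don't be daunted by writing your own OutputFormat; it really isn't that difficult, and it is much easier than writing a custom InputFormat, which has to worry about splits. In fact, here is a starting point. You only need to implement the write method: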

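import java.io.IOException;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {
    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        // One zip file per task, placed in the task's work directory
        Path file = getDefaultWorkFile(job, ".zip");

        FileSystem fs = file.getFileSystem(job.getConfiguration());

        // overwrite = false: fail rather than clobber an existing file
        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends
            RecordWriter<Text, BytesWritable> {
        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            // Closing the zip stream also closes the wrapped FSDataOutputStream
            zos.close();
        }
    }
}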
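
The write method left as a TODO in the skeleton could be filled in roughly as follows. This is only a sketch that uses the JDK's java.util.zip classes so it can run without a Hadoop cluster; the class name ZipEntrySketch and the helpers writeEntry and readAll are hypothetical names introduced for illustration and are not part of the original answer. The one Hadoop-specific detail it tries to mirror is that the array returned by BytesWritable.getBytes() can be longer than the valid data, so only the first getLength() bytes should be written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntrySketch {

    // Hypothetical helper: the logic a ZipRecordWriter.write body could delegate to.
    // Only the first `length` bytes of `buf` are written, because a reused buffer
    // (such as the array behind BytesWritable.getBytes()) may be longer than the data.
    public static void writeEntry(ZipOutputStream zos, String name, byte[] buf, int length)
            throws IOException {
        zos.putNextEntry(new ZipEntry(name)); // start a new entry at this path
        zos.write(buf, 0, length);            // entry body: the valid bytes only
        zos.closeEntry();                     // finish the entry before the next record
    }

    // Demo-only helper: reads every entry back into memory, keyed by entry name.
    public static Map<String, byte[]> readAll(byte[] zipBytes) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zis.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(sink)) {
            // Simulate a padded buffer: 6 valid bytes inside a 16-byte array.
            byte[] padded = new byte[16];
            byte[] payload = "GIF89a".getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(payload, 0, padded, 0, payload.length);
            writeEntry(zos, "images/a.gif", padded, payload.length);
            writeEntry(zos, "images/b.jpg", new byte[] {(byte) 0xFF, (byte) 0xD8}, 2);
        }
        for (Map.Entry<String, byte[]> e : readAll(sink.toByteArray()).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().length + " bytes");
        }
    }
}
```

Inside ZipRecordWriter.write, the equivalent call would presumably be writeEntry(zos, key.toString(), value.getBytes(), value.getLength()). Note that ZipOutputStream rejects a second entry with the same name, so the keys emitted by a single reducer need to be unique paths.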
