SequenceFile compactor: packing several small files into a single .seq file

Posted by wwtsj6pe on 2021-05-30 in Hadoop

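I'm new to HDFS and Hadoop. I am developing a program that should pick up all the files in a specific directory, which contains several small files of any type.
It should take each file and append it to a compressed SequenceFile, where the key must be the path of the file and the value must be the file's content. For now my code is: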

import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.BZip2Codec;

public class Compact {
    public static void main (String [] args) throws Exception{
        try{
            Configuration conf = new Configuration();
            FileSystem fs =
                FileSystem.get(new URI("hdfs://quickstart.cloudera:8020"),conf);
            Path destino = new Path("/user/cloudera/data/testPractice.seq");//test args[1]
            if ((fs.exists(destino))){
                System.out.println("exist : " + destino);
                return;
            }
            BZip2Codec codec=new BZip2Codec();
            SequenceFile.Writer outSeq = SequenceFile.createWriter(conf
                ,SequenceFile.Writer.file(fs.makeQualified(destino))
                ,SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK,codec)
                ,SequenceFile.Writer.keyClass(Text.class)
                ,SequenceFile.Writer.valueClass(FSDataInputStream.class));
            FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt"));//args[0]
            for (int i=0;i<status.length;i++){
                FSDataInputStream in = fs.open(status[i].getPath());
                outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()), new FSDataInputStream(in));
                fs.close();
            }
            outSeq.close();
            System.out.println("End Program");
        }catch(Exception e){
            System.out.println(e.toString());
            System.out.println("File not found");
        }
    }
}

But every time I run it, I get this exception:

java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.fs.FSDataInputStream'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.

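File not found
I understand the error must be in the type of file I am creating and the type of object I declare as the value added to the SequenceFile, but I don't know which type I should use. Can anyone help me?
Thanks in advance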

Java hadoop hdfs sequencefile

Source: https://stackoverflow.com/questions/29685548/sequencefile-compactor-of-several-small-files-in-only-one-file-seq

2 Answers

snvhrwxg1#

Thank you very much for your comments. The problem was the serializer, as you said, and in the end I used BytesWritable:

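// The writer must be created with SequenceFile.Writer.valueClass(BytesWritable.class).
FileStatus[] status = fs.globStatus(new Path("/user/cloudera/data/*.txt")); // args[0]
for (int i = 0; i < status.length; i++) {
    // Allocate a buffer the size of the file; each file must fit in memory.
    byte[] content = new byte[(int) status[i].getLen()];
    try (FSDataInputStream in = fs.open(status[i].getPath())) {
        in.readFully(content); // read the whole file into the buffer
    }
    outSeq.append(new org.apache.hadoop.io.Text(status[i].getPath().toString()),
            new org.apache.hadoop.io.BytesWritable(content));
}
outSeq.close();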

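There are probably better solutions in the Hadoop ecosystem, but this problem is an exercise for a degree I am working on, and for now we are reinventing the wheel to understand the concepts ;-).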

2021-05-30

ttygqcqt2#

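FSDataInputStream, like any other InputStream, is not intended to be serialized. What should serializing an "iterator" over a stream of bytes even do?
What you most likely want is to store the content of the file as the value. For example, you can switch the value type from FSDataInputStream to BytesWritable and simply read all the bytes out of the FSDataInputStream. One drawback of using a key/value SequenceFile for this purpose is that the content of each file has to fit in memory. That is fine for small files, but you have to be aware of the issue.
I am not sure what you are really trying to achieve, but perhaps you could avoid reinventing the wheel by using something like Hadoop Archives?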
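
The "read all the bytes out of the stream" step, and the memory limit that comes with it, can be sketched with the standard library alone. This is a minimal illustration, not the code from the question: `ReadFullySketch` and `readAll` are hypothetical names, and a `ByteArrayInputStream` stands in for the stream that `fs.open(...)` would return. Since FSDataInputStream extends `java.io.DataInputStream`, the same `readFully` call should apply to it, with the length taken from `FileStatus.getLen()` and the resulting array wrapped in `new BytesWritable(content)`.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Hadoop or of the original program.
class ReadFullySketch {

    // Reads exactly `length` bytes from `source` into a new array.
    // In the HDFS case, `length` would come from FileStatus.getLen().
    static byte[] readAll(InputStream source, long length) {
        // A Java array index is an int, so reject lengths that a plain
        // (int) cast would silently truncate or turn negative.
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File too large to buffer: " + length);
        }
        byte[] content = new byte[(int) length];
        try (DataInputStream in = new DataInputStream(source)) {
            // Fills the whole buffer, or throws EOFException if the
            // stream ends before `length` bytes have been read.
            in.readFully(content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content;
    }

    public static void main(String[] args) {
        // An in-memory stream plays the role of a small HDFS file.
        byte[] original = "small file".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readAll(new ByteArrayInputStream(original), original.length);
        System.out.println(new String(copy, StandardCharsets.UTF_8));
    }
}
```

The explicit length check is a design choice for the sketch: it makes the "must fit in memory" constraint fail loudly instead of producing a wrongly sized buffer.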

2021-05-30
