Java: splitting an input ARFF file into smaller chunks to process a very large dataset

Posted by mepcadol on 2021-06-02 in Hadoop


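I am trying to run a Weka classifier on MapReduce. Loading the whole ARFF file (even at 200 MB) causes a heap space error, so I want to split the ARFF file into chunks. The problem is that every chunk has to keep the header information, i.e. the ARFF attribute declarations, so that the classifier can run in each mapper. Here is the code I tried for splitting the data, but it is not efficient: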

List<InputSplit> splits = new ArrayList<InputSplit>();
for (FileStatus file : listStatus(job)) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());

    // Number of bytes in this file
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

    // Make sure this is actually a non-empty file
    if (length != 0) {
        // Number of splits to make. NOTE: the value can be changed to anything
        int count = job.getConfiguration().getInt("Run-num.splits", 1);
        for (int t = 0; t < count; t++) {
            // NOTE: as written, each of the `count` splits spans the entire file
            // (offset 0, full length), so every mapper still reads all of it
            splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
        }
    } else {
        // Create a split with no hosts for zero-length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
    }
}
return splits;
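
The requirement described above, that every chunk carries the ARFF attribute header, can be illustrated outside Hadoop with a small pre-processing step. The following is only a sketch under assumptions: `ArffChunker`, `split`, `rowsPerChunk` and the `chunk-N.arff` file names are hypothetical, the file is assumed to be UTF-8, and each instance is assumed to sit on a single line after the `@data` marker. It streams the input, so the whole file is never held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an ARFF file into chunk files that each repeat the header. */
final class ArffChunker {

    /**
     * Writes chunk-0.arff, chunk-1.arff, ... into outDir. Each chunk starts with a copy
     * of the header (everything up to and including the @data line) followed by at most
     * rowsPerChunk data rows. Returns the chunk paths in order.
     */
    static List<Path> split(Path input, Path outDir, int rowsPerChunk) throws IOException {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }
        List<Path> chunks = new ArrayList<>();
        List<String> header = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            boolean sawData = false;
            // Collect the header: @relation, @attribute lines, comments, and @data itself.
            while (!sawData && (line = in.readLine()) != null) {
                header.add(line);
                sawData = line.trim().regionMatches(true, 0, "@data", 0, 5);
            }
            if (!sawData) {
                throw new IOException("no @data section found in " + input);
            }

            BufferedWriter out = null;
            int rows = 0;
            try {
                while ((line = in.readLine()) != null) {
                    String trimmed = line.trim();
                    if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                        continue; // skip blank lines and ARFF comments in the data section
                    }
                    if (out == null || rows == rowsPerChunk) {
                        if (out != null) {
                            out.close();
                        }
                        // Start a new chunk and write the shared header first.
                        Path chunk = outDir.resolve("chunk-" + chunks.size() + ".arff");
                        chunks.add(chunk);
                        out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        for (String h : header) {
                            out.write(h);
                            out.newLine();
                        }
                        rows = 0;
                    }
                    out.write(line);
                    out.newLine();
                    rows++;
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return chunks;
    }
}
```

Each resulting file is a complete ARFF file on its own, so it could be handed to one mapper as a whole; how the chunks are uploaded and assigned to mappers is left open here.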

Java hadoop mapreduce weka

Source: https://stackoverflow.com/questions/30080643/to-split-input-arff-file-into-smaller-chunks-to-process-very-large-dataset

1 answer


iq3niunx #1

Have you tried this first?
In mapred-site.xml, add the following property:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>

// memory allocation for MR jobs
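
To check whether the `-Xmx2048m` setting above actually reaches the task JVMs, one option is to log the heap limit from inside a task. This is a minimal sketch: `HeapCheck` and `maxHeapMiB` are hypothetical names, while `Runtime.getRuntime().maxMemory()` is standard Java. The reported value may be somewhat lower than the `-Xmx` figure, depending on the JVM.

```java
/** Hypothetical helper for logging the heap limit of the current JVM. */
final class HeapCheck {

    /** Returns the maximum amount of memory the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // In a MapReduce job this line could instead be logged from a mapper.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```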

Posted on 2021-06-03
