Hadoop: method to send output to multiple directories

Posted by sqxo8psd on 2021-06-03 in Hadoop


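My MapReduce job processes data by date and needs to write its output into a specific folder structure. The current expectation is to produce the following structure:

etc.
At any given time I only receive 12 months of data, so I use the MultipleOutputs class to create 12 named outputs with the following function in the driver: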

public void createOutputs(){
    Calendar c = Calendar.getInstance();
    String monthStr, pathStr;

    // Create multiple outputs for last 12 months
    // TODO make 12 configurable
    for(int i = 0; i < 12; ++i ){
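        // Calendar.MONTH is zero-based, so add 1
        int month = c.get(Calendar.MONTH) + 1;
        // Add a leading 0 for single-digit months (>= so that October becomes "10", not "010")
        monthStr = month >= 10 ? "" + month : "0" + month;
        // Named outputs must be alphanumeric, so this yields e.g. 201303etl rather than 2013/03/etl
        pathStr = c.get(Calendar.YEAR) + "" + monthStr + "etl";
        // Add the named output (the real call also takes the output format, key and value classes)
        MultipleOutputs.addNamedOutput(config, pathStr );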
        // Move to previous month
        c.add(Calendar.MONTH, -1); 
    }
}
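The month arithmetic in createOutputs can also be written with java.time, which handles zero-padding and year rollover without manual string building. The following is a minimal sketch, not the code from the question: the class name MonthlyOutputNames and the method lastMonths are hypothetical, and the "etl" suffix is simply carried over from the function above.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds named-output names for a run of months, newest first. */
final class MonthlyOutputNames {
    // yyyyMM produces a four-digit year followed by a zero-padded month.
    private static final DateTimeFormatter YEAR_MONTH = DateTimeFormatter.ofPattern("yyyyMM");

    static List<String> lastMonths(YearMonth latest, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // minusMonths rolls over into the previous year when needed.
            names.add(latest.minusMonths(i).format(YEAR_MONTH) + "etl");
        }
        return names;
    }

    public static void main(String[] args) {
        // Print the 12 names ending at November 2013.
        System.out.println(lastMonths(YearMonth.of(2013, 11), 12));
    }
}
```

Each returned name contains only letters and digits, so it can be passed to MultipleOutputs.addNamedOutput in the driver loop.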

In the reducer, I added a cleanup function to move the generated output into the appropriate directories.

protected void cleanup(Context context) throws IOException, InterruptedException {
        // Custom function to recursively process data; reuse the job's configuration
        moveFiles(FileSystem.get(context.getConfiguration()), new Path("/MyOutputPath"));
}

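Problem: the reducer's cleanup function runs before the output is moved from the _temporary directory to the output directory. As a result, the function above sees no output when it executes, because all the data is still in the temporary directory.
What is the best way to achieve the desired behavior? I would appreciate your insights.
Options I am considering:
Is there a way to use a custom OutputCommitter?
Is it better to chain another job, or is that overkill for this?
Is there a simpler alternative that I am just not aware of?
Below is the log from the cleanup function: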

MyMapReduce: filepath:hdfs://localhost:8020/dev/test
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_logs
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_logs/history/job_201310301015_0224_1383763613843_371979_HtmlEtl
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0/201307etl-r-00000
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0/part-r-00000
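The listing above shows each named output landing as a file such as 201307etl-r-00000, so any move step has to translate that file name into a target directory. The following is a minimal sketch of that translation only, assuming the desired layout is year/month/etl; OutputFileRouter and targetFor are hypothetical names, and the actual moveFiles function is not shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a named-output file name to a relative target path. */
final class OutputFileRouter {
    // Matches names such as 201307etl-r-00000: year, month, "etl", then the task suffix.
    private static final Pattern NAMED_OUTPUT =
            Pattern.compile("(\\d{4})(\\d{2})etl-([mr]-\\d{5})");

    /** Returns the relative target path, or null when the name is not a monthly output. */
    static String targetFor(String fileName) {
        Matcher m = NAMED_OUTPUT.matcher(fileName);
        if (!m.matches()) {
            return null; // e.g. the default part-r-00000 file
        }
        // Assumed layout: <year>/<month>/etl/part-<suffix>
        return m.group(1) + "/" + m.group(2) + "/etl/part-" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(targetFor("201307etl-r-00000"));
    }
}
```

A move step built on this would still need to run after the job has committed its output, for example from the driver once the job has completed, since the files sit under _temporary while the task is running.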

Java hadoop hdfs mapreduce

Source: https://stackoverflow.com/questions/19820985/hadoop-method-to-send-output-to-multiple-directories

2 Answers


llmtgqce #1

Most likely you are not closing mos (the MultipleOutputs instance) in cleanup.
If your mapper or reducer has a setup like this:

public void setup(Context context) {mos = new MultipleOutputs(context);}

then you should close mos at the start of cleanup, like this:

public void cleanup(Context context ) throws IOException, InterruptedException {mos.close();}

Posted 2021-06-04

rbpvctlc #2

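You should not need a second job. I am currently using MultipleOutputs in one of my programs to create a large number of output directories. Although there are more than 30 directories, I get by with only a couple of MultipleOutputs objects. This works because you can set the output directory at write time, so it only has to be determined when it is needed. You only need more than one named output if you want to write in different formats (for example, one with key: Text.class, value: Text.class and another with key: Text.class, value: IntWritable.class).
Driver setup: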

MultipleOutputs.addNamedOutput(job, "Output", TextOutputFormat.class, Text.class, Text.class);

Reducer setup:

mout = new MultipleOutputs<Text, Text>(context);

Calling mout in the reducer:

String key; //set to whatever output key will be
String value; //set to whatever output value will be
String outputFileName; //set to absolute path to file where this should write

mout.write("Output",new Text(key),new Text(value),outputFileName);

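You can have a bit of code determine the directory in the reducer itself. For example, say you want to specify the directory by year and month: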

int year;//extract year from data
int month;//extract month from data
String baseFileName; //parent directory to all outputs from this job
String outputFileName = baseFileName + "/" + year + "/" + month;

mout.write("Output",new Text(key),new Text(value),outputFileName);
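One detail worth keeping in mind with this approach: MultipleOutputs treats the last segment of the base output path as the file-name prefix, so base/2013/3 produces files such as 3-r-00000 inside base/2013. To get one directory per month, a further segment can be appended. The following is a minimal sketch of such a path builder; OutputPaths and monthlyPath are hypothetical names, and the trailing "etl" segment and the zero-padded month are assumptions based on the layout described in the question.

```java
import java.util.Locale;

/** Hypothetical helper: builds a base output path of the form <base>/<yyyy>/<MM>/etl. */
final class OutputPaths {
    static String monthlyPath(String base, int year, int month) {
        if (month < 1 || month > 12) {
            throw new IllegalArgumentException("month out of range: " + month);
        }
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%s/%04d/%02d/etl", base, year, month);
    }

    public static void main(String[] args) {
        System.out.println(monthlyPath("/dev/test", 2013, 3));
    }
}
```

Passing the result as the last argument of mout.write would then yield files named like etl-r-00000 under the month directory.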

Hope this helps.
Edit: output file structure for the example above:

Posted 2021-06-03
