Do away with default output directory completely - MapReduce

Posted by 9rnv2umw on 2021-06-03 in Hadoop


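I have a MapReduce job that uses org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.
The reducer writes its results to pre-created locations, so I don't need the default o/p directory (which contains the _history and _SUCCESS directories).
I have to delete them every time before running the job again.
So I removed the TextOutputFormat.setOutputPath(job1, new Path(outputPath)); line. But this gives me the (expected) error org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.

Driver class: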

MultipleOutputs.addNamedOutput(job1, "path1", TextOutputFormat.class, Text.class,LongWritable.class);
MultipleOutputs.addNamedOutput(job1, "path2", TextOutputFormat.class, Text.class,LongWritable.class);
LazyOutputFormat.setOutputFormatClass(job1,TextOutputFormat.class);

Reducer class:

if(condition1)
    mos.write("path1", key, new LongWritable(value), path_list[0]);
else
    mos.write("path2", key, new LongWritable(value), path_list[1]);

Is there a workaround to avoid specifying the default output directory?

hadoop hdfs cloudera Output

Source: https://stackoverflow.com/questions/18976844/do-away-with-default-output-directory-completely-mapreduce

3 Answers


3duebb1j1#

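What version of Hadoop are you running?
As a quick workaround, you can set the output location programmatically and call FileSystem.delete to remove it when the job completes.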
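A minimal sketch of that cleanup step. It assumes, purely for illustration, that the default output directory is on the local filesystem, so the example runs with the JDK alone. On HDFS the corresponding call would be Hadoop's FileSystem.delete(path, true), made after the job has completed. The class name OutputDirCleanup and the method deleteRecursively are hypothetical, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; it is not a Hadoop class.
final class OutputDirCleanup {

    // Recursively deletes dir if it exists.
    // Returns true when something was removed, false when dir was absent.
    static boolean deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the job's default output directory and its leftovers.
        Path outputDir = Files.createTempDirectory("default-output");
        Files.createFile(outputDir.resolve("_SUCCESS"));
        Files.createDirectories(outputDir.resolve("_logs").resolve("history"));

        // In a real driver, the job would run and finish before this point.
        deleteRecursively(outputDir);
        System.out.println("output dir still exists: " + Files.exists(outputDir));
    }
}
```

Deleting only after the job has finished matters, because Hadoop uses the directory as its working area while tasks are running.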

Posted 2021-06-03

nbnkbykc2#

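I don't think _SUCCESS is a directory, and the other one, the history directory, resides in the _logs directory.
First, TextOutputFormat.setOutputPath(job1, new Path(outputPath)); is important because, when the job runs, Hadoop uses this path as the working directory in which it creates temporary files and so on for the different tasks (the _temporary dir). This _temporary directory and its files are eventually deleted at the end of the job. The _SUCCESS file and the history directory are what actually remain under the working directory after the job finishes successfully. The _SUCCESS file is a flag indicating that the job actually ran successfully. Please see this link.
The _SUCCESS file is created by the TextOutputFormat class that you are actually using, by way of the FileOutputCommitter class. The FileOutputCommitter class defines functions such as the following: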

public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
/**
   * Delete the temporary directory, including all of the work directories.
   * This is called for all jobs whose final run state is SUCCEEDED
   * @param context the job's context.
   */
  public void commitJob(JobContext context) throws IOException {
    // delete the _temporary folder
    cleanupJob(context);
    // check if the o/p dir should be marked
    if (shouldMarkOutputDir(context.getConfiguration())) {
      // create a _success file in the o/p folder
      markOutputDirSuccessful(context);
    }
  }

// Mark the output dir of the job for which the context is passed.
  private void markOutputDirSuccessful(JobContext context)
  throws IOException {
    if (outputPath != null) {
      FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
      if (fileSys.exists(outputPath)) {
        // create a file in the folder to mark it
        Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
        fileSys.create(filePath).close();
      }
    }
  }

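Since markOutputDirSuccessful() is private, you have to override commitJob() instead to bypass the SUCCEEDED_FILE_NAME creation step and get what you want.
The next directory, _logs, is very important if you later want to use the Hadoop HistoryViewer to get a report of how the job ran.
I think that when you use the same output directory as the input to another job, the _SUCCESS file and the _logs directory are ignored because of a filter set in Hadoop.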
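To make the filter concrete: as far as I know, it is the hidden-file rule in FileInputFormat, which skips any path whose name starts with "_" or ".". The sketch below reproduces only that naming rule on plain strings, so it runs without Hadoop. The class HiddenPathFilterSketch and its methods are hypothetical stand-ins, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; it is not a Hadoop class.
final class HiddenPathFilterSketch {

    // Names starting with "_" or "." are treated as hidden and skipped as input.
    static boolean isVisible(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    // Returns the names from a directory listing that would be kept as input.
    static List<String> visibleInputs(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isVisible(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing =
                Arrays.asList("_SUCCESS", "_logs", "part-r-00000", "part-r-00001");
        // prints [part-r-00000, part-r-00001]
        System.out.println(visibleInputs(listing));
    }
}
```

Under this rule a downstream job that reads the directory sees only the part files, which is why _SUCCESS and _logs are usually harmless to leave in place.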
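Moreover, when you define a named output for MultipleOutputs, you can instead write to a sub-directory inside the output path given to TextOutputFormat.setOutputPath(), and then use that path as the input to the next job you run.
I don't really see how _SUCCESS and _logs would bother you?
Thanks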

Posted 2021-06-03

qzwqbdag3#

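The question is quite old, but here is an answer anyway.
This answer fits the situation described in the question.
Define your OutputFormat to say that you are not expecting any output. You can do it this way: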

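// With the new-API Job (org.apache.hadoop.mapreduce); the old-API JobConf uses setOutputFormat(...) instead.
job.setOutputFormatClass(NullOutputFormat.class);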

Or you can also use LazyOutputFormat:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Credit: @charlesmenguy

Posted 2021-06-03
