HBase, Map/Reduce, and SequenceFiles: mapred.output.format.class is incompatible with new map API mode

ne5o7dgx · posted 2021-06-04 in Hadoop

I am trying to generate Mahout vectors from an HBase table. Mahout wants sequence files of vectors as its input, and I am under the impression that I can't write to a sequence file from a map/reduce job that uses HBase as its source. Here goes nothing:

    public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
        JobConf jobConf = new JobConf();
        jobConf.setMapOutputKeyClass(LongWritable.class);
        jobConf.setMapOutputValueClass(VectorWritable.class);
        // we want the vectors written straight to HDFS,
        // the order does not matter.
        jobConf.setNumReduceTasks(0);
        jobConf.setOutputFormat(SequenceFileOutputFormat.class);
        Path outputDir = new Path("/home/cloudera/house_vectors");
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(outputDir)) {
            fs.delete(outputDir, true);
        }
        FileOutputFormat.setOutputPath(jobConf, outputDir);
        // I want the mappers to know the max and min value
        // so they can normalize the data.
        // I will add them as properties in the configuration,
        // by serializing them with avro.
        String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
                maximumHouse));
        jobConf.set("minmax", minmax);
        Job job = Job.getInstance(jobConf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("data"));
        TableMapReduceUtil.initTableMapperJob("homes", scan,
                HouseVectorizingMapper.class, LongWritable.class,
                VectorWritable.class, job);
        job.waitForCompletion(true);
    }

I have some test code that runs it, but I get:

    java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
        at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
        at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
        at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
        at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

So I think my problem is that I'm using the import org.apache.hadoop.mapreduce.Job. The setOutputFormat method wants an instance of org.apache.hadoop.mapreduce.OutputFormat, which is a class. That class only has four implementations, and none of them is for sequence files. Here are its javadocs:
http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html
I would use the old API version of the Job class if I could, but HBase's TableMapReduceUtil only accepts a job from the new API.
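To spell the clash out, this is the old-API combination I would rather be using. A hypothetical fragment (the output path is just the one from my job): it compiles on its own, but as far as I can tell there is nowhere to hand this JobConf-based setup over to the new-API TableMapReduceUtil:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    // Old (mapred) API: JobConf.setOutputFormat accepts the old
    // org.apache.hadoop.mapred.SequenceFileOutputFormat just fine...
    JobConf jobConf = new JobConf();
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(jobConf, new Path("/home/cloudera/house_vectors"));
    // ...but org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
    // .initTableMapperJob(...) only takes an org.apache.hadoop.mapreduce.Job,
    // so there is no way to plug this JobConf in.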
I suppose I could write my results as text first and then have a second map/reduce job that converts the output to sequence files, but that seems very inefficient.
There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated.
My Mahout jar is version 0.7-cdh4.5.0, my HBase jar is version 0.94.6-cdh4.5.0, and all of my Hadoop jars are 2.0.0-cdh4.5.0.
Can someone tell me how to write sequence files from a map/reduce job in my situation?

wfsdck30 #1

SequenceFileOutputFormat is in fact a descendant of the new OutputFormat; you have to look further than the direct subclasses in the javadoc to find it:
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.html
You probably imported the wrong (old) one in your driver class. That can't be confirmed from your question, since your code example doesn't include the imports.
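To make that concrete, here is a minimal sketch of the driver rewritten against the new API only (HouseVectorizingMapper, the table name, and the output path are taken from your question; the essential points are that SequenceFileOutputFormat and FileOutputFormat come from org.apache.hadoop.mapreduce.lib.output, and that JobConf is dropped entirely, so the old mapred.output.format.class key never gets set):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.mahout.math.VectorWritable;

    public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // (the minmax property from the question can be set on conf as before)
        Job job = Job.getInstance(conf);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setNumReduceTasks(0);

        // new-API replacement for jobConf.setOutputFormat(...)
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        Path outputDir = new Path("/home/cloudera/house_vectors");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputDir)) {
            fs.delete(outputDir, true);
        }
        FileOutputFormat.setOutputPath(job, outputDir);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("data"));
        TableMapReduceUtil.initTableMapperJob("homes", scan,
                HouseVectorizingMapper.class, LongWritable.class,
                VectorWritable.class, job);
        job.waitForCompletion(true);
    }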

jyztefdp #2

This is what I was missing for a similar problem when using Oozie. From a braindump:

    <!-- New API for map -->
    <property>
      <name>mapred.mapper.new-api</name>
      <value>true</value>
    </property>
    <!-- New API for reducer -->
    <property>
      <name>mapred.reducer.new-api</name>
      <value>true</value>
    </property>
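When the job is driven from Java rather than an Oozie workflow, the same two flags can be set on the Configuration before the Job is built; a small sketch of the equivalent:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Equivalent of the Oozie properties above: force the new
    // (org.apache.hadoop.mapreduce) API for the map and reduce phases.
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.mapper.new-api", true);
    conf.setBoolean("mapred.reducer.new-api", true);
    Job job = Job.getInstance(conf);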
