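I am trying to generate Mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input. I get the impression that I cannot write sequence files from a MapReduce job that uses HBase as its source. Here is my attempt: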
public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);
    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);
    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }
    FileOutputFormat.setOutputPath(jobConf, outputDir);
    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);
    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);
    job.waitForCompletion(true);
}
I have some test code to run it, but I get this:
java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
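So I think my problem is that I am using org.apache.hadoop.mapreduce.Job, and its setOutputFormatClass method expects a subclass of org.apache.hadoop.mapreduce.OutputFormat, which is a class rather than an interface. That class has only four implementations listed, and none of them is for sequence files. Here are its javadocs: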
http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/outputformat.html
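I would use the old-API version of the Job class if I could, but HBase's TableMapReduceUtil only accepts a new-API Job.
I suppose I could write my results out as text first and then run a second map/reduce job to convert the output into sequence files, but that sounds very inefficient.
There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated in the version I am using.
My Mahout jar is version 0.7-cdh4.5.0, my HBase jar is version 0.94.6-cdh4.5.0, and all of my Hadoop jars are 2.0.0-cdh4.5.0.
Can someone tell me how to write sequence files from M/R in my situation?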
2 Answers
wfsdck30 1#
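Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat. It is not a direct subclass (it extends FileOutputFormat), so you have to look further than the direct subclasses listed in the javadoc to find it.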
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/sequencefileoutputformat.html
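You have probably imported the wrong (old) SequenceFileOutputFormat in your driver class. It is impossible to tell for sure from your question, because your code sample does not include the imports.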
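A minimal sketch of what a new-API driver could look like, under a few assumptions: the class name NewApiSequenceFileJob and the configure helper are made up for illustration, and Text stands in for Mahout's VectorWritable so that the snippet only needs Hadoop on the classpath. The idea is to set the output format on the Job with setOutputFormatClass, using the class from org.apache.hadoop.mapreduce.lib.output, instead of calling setOutputFormat on a JobConf, which writes the old-API mapred.output.format.class property named in the exception.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// New-API classes: note the "mapreduce.lib.output" package, not "mapred".
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper class, not part of the question's code base.
public class NewApiSequenceFileJob {

    /** Builds a map-only job that writes sequence files using only new-API settings. */
    public static Job configure(Configuration conf, Path outputDir) throws IOException {
        Job job = Job.getInstance(conf);

        // Map-only: mapper output goes straight to the output format.
        job.setNumReduceTasks(0);

        // The sequence file writer takes its key/value types from these settings.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class); // assumption: VectorWritable.class in the question

        // New-API replacement for JobConf.setOutputFormat(...).
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputDir);
        return job;
    }
}
```

In the question's code, TableMapReduceUtil.initTableMapperJob("homes", scan, HouseVectorizingMapper.class, LongWritable.class, VectorWritable.class, job) would then be called on the returned job as before, with VectorWritable.class passed to setOutputValueClass. In a map-only job the sequence file writer reads its key and value types from the job's output classes, so those most likely need to be set explicitly rather than relying on the map output classes alone.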
jyztefdp 2#
This was the missing piece in a similar problem I had when using Oozie. From a braindump: