Implementing a word index in Java with Hadoop

e0bqpujr asked on 2021-06-02 in Hadoop

I am working with the code below and running into problems at compile time. What I am trying to implement is an index of word usage, so that for each word and each file, it records the positions at which the word occurs. Say "boy" appears in a .txt file; we would get

boy /usr/.txt: 1 3

meaning "boy" is the first and third word in the file.
I am using the code below and seeing two errors at compile time: one is that GenericOptionsParser cannot be found, the other that filename cannot be found. I was trying to modify the generic WordCount code. Can someone point me in the right direction?

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordIndex {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {

      //context.getInputSplit();
      //Path filePath = ((FileSplit) context.getInputSplit()).getPath();
      //String filename = ((FileSplit)context.getInputSplit()).getPath().getName();

      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      //StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit()).getPath().getName();
        word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ filename); // get rid of special chars
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(DocWordIndex.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

368yc8dk 1#

I kept your code as-is and was able to compile it after making three changes:
1. In the statement below, change `filename` to `fileName` (capital `N`).
Change:

word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ filename);

To:

word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ fileName);

2. Import `GenericOptionsParser`.
Add the following import:

import org.apache.hadoop.util.GenericOptionsParser;
3. `job.setJarByClass()` is wrong. It is set to `DocWordIndex.class` instead of `WordIndex.class`.
Change:

job.setJarByClass(DocWordIndex.class);

To:

job.setJarByClass(WordIndex.class);

With these three changes, the code compiled for me.
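For reference, here is a sketch of `main()` with fixes 2 and 3 folded into the posted code (fix 1 is the one-line mapper change shown above); the import goes at the top of the file:

```java
import org.apache.hadoop.util.GenericOptionsParser; // fix 2: the missing import (top of file)

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // GenericOptionsParser strips Hadoop's generic options (-D, -conf, -fs, ...),
  // leaving only the application arguments
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordIndex.class); // fix 3: WordIndex, not DocWordIndex
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```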
My Maven dependencies were as follows (I was using Hadoop 2.7.0):
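The answer's actual dependency list was cut off here. For Hadoop 2.7.0, a minimal POM snippet covering the classes used above would plausibly look like this (my reconstruction, not the answerer's exact list):

```xml
<dependencies>
  <!-- Configuration, Path, IntWritable, Text, GenericOptionsParser -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
  <!-- Mapper, Reducer, Job, FileInputFormat, FileOutputFormat -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>
```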
