How to read sentences instead of lines with the WordCount MapReduce tutorial

Posted by rwqw0loc on 2021-05-29 in Hadoop

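I am learning Hadoop MapReduce and following the WordCount tutorial.
In the code below, I understand that the map method processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens with a StringTokenizer and emits [<word>, 1]: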

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

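How can I edit this code so that it reads one sentence at a time instead of one line?
E.g., for the input text "This is my first sentence. This is the second sentence." I want to read first "This is my first sentence." and then "This is the second sentence." rather than "This", "is", "my", "first", ...
The output would be: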

1 This is my first sentence.
1 This is the second sentence.

because the sentence "This is my first sentence." appears once in the input text, and the sentence "This is the second sentence." also appears once.
Suppose the input text is: "This is my first sentence. This is my first sentence. This is the second sentence." Then the output would be:

2 This is my first sentence.
1 This is the second sentence.

because the sentence "This is my first sentence." appears twice in the input text, and the sentence "This is the second sentence." appears only once.
For reference, the output of WordCount is:

2 This
2 is
1 my
1 first
2 sentence
1 second

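because the word "This" appears twice in the input text, "is" appears twice, "my" appears once, and so on.
Attempted solution with conf.set("textinputformat.record.delimiter", ". "):
I set ". " (a period followed by a space) as the delimiter. My code now recognizes the sentences, but the output file is wrong. With the input file "This is my first sentence. This is my first sentence. This is the second sentence." it produces an output file like this (some whitespace, then the number 3):

instead of this: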

2 This is my first sentence
1 This is the second sentence

Here is my code:

public class SentenceCount {

    public static class SentenceMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //System.out.println("SENTENCE: " + value.toString());
            context.write(word, one);
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ". ");
        Job job = Job.getInstance(conf, "sentence count");
        job.setJarByClass(SentenceCount.class);
        job.setMapperClass(SentenceMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Where did I go wrong?

Java hadoop mapreduce

Source: https://stackoverflow.com/questions/40850176/how-to-read-sentence-instead-of-line-with-wordcount-mapreduce-tutorial

2 Answers

Answer 1 (pexxcrt2)

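In this example, you need to tokenize on the sentence separator (the period ".") instead of tokenizing on whitespace. A regex may help here.
Also keep some corner cases in mind. For example, how do you want to treat the following text? As two sentences or three?
"This is my first sentence. This is my second sentence." Now I have a third sentence.
Is the double-quoted part one sentence or two (that is, do you split on "." or on the quotes)?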
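The regex idea can be sketched as follows. This is only an illustration, not the answerer's code: the class name RegexSentenceSplitter and the pattern are assumptions. It treats ".", "!" or "?" as a sentence end only when whitespace follows, which is one possible rule among many.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hadoop or the original answer.
final class RegexSentenceSplitter {
    // Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    // The lookbehind keeps the punctuation attached to the sentence.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : BOUNDARY.split(text.trim())) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // "1.55" is not split because no whitespace follows its period.
        for (String s : split("This pen costs 1.55 dollars. Really! Is that so?")) {
            System.out.println(s);
        }
    }
}
```

Under this rule the quoted example above comes out as two pieces, because the period before the closing quote is not followed directly by whitespace. Whether that is the desired result is exactly the corner case the answer raises.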

2021-05-29

Answer 2 (vaj7vani)

The most straightforward solution is to preprocess your input so that each sentence is on its own line, and keep using TextInputFormat as is.
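A minimal sketch of such a preprocessing step, assuming the input is small enough to read into memory on one machine. The class name SentencePerLine and its methods are hypothetical. It uses the JDK's java.text.BreakIterator to find sentence boundaries; how well that matches your notion of a sentence depends on the text and locale, so check it against your own data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical preprocessing helper; run it before submitting the MapReduce job.
final class SentencePerLine {
    // Returns the sentences of the text, trimmed, in order of appearance.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Joins the sentences with newlines so the default TextInputFormat sees one per record.
    static String toLines(String text) {
        return String.join("\n", sentences(text));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SentencePerLine <input file> <output file>");
            return;
        }
        String text = Files.readString(Paths.get(args[0]));
        Files.writeString(Paths.get(args[1]), toLines(text) + "\n");
    }
}
```

The output file can then be fed to the job with the default newline delimiter, and the mapper only has to emit each line as the key.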
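Another way is to change the default delimiter of TextInputFormat (the newline character, \n).
You can change the delimiter to "." like this: conf.set("textinputformat.record.delimiter", ".") in your driver class.
(But be careful if the "." character appears inside a sentence, e.g. "This pen costs 1.55 dollars.", or if a sentence ends with an exclamation mark instead of a period.)
Then, in your map() method, you no longer need to tokenize the sentence: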

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
   context.write(value, one);
}
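Note that the map() above writes value, whereas the SentenceMapper in the question writes word, which is never set. Every record is therefore emitted under the same empty key, which is consistent with the single count of 3 the asker reports. To see what counts to expect once the sentence itself is the key, here is a plain-Java simulation of the job. It is a sketch that does not use Hadoop: the class SentenceCountSimulation is hypothetical, and the trim() call is an addition that the map() above does not perform.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical stand-in for the job: split on the delimiter, then count each record.
final class SentenceCountSimulation {
    static Map<String, Integer> count(String input, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        // Pattern.quote makes the delimiter literal, so "." is not treated as a regex wildcard.
        for (String record : input.split(Pattern.quote(delimiter))) {
            String sentence = record.trim(); // assumption: surrounding whitespace is not significant
            if (!sentence.isEmpty()) {
                counts.merge(sentence, 1, Integer::sum); // plays the role of emit (sentence, 1) and sum
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence. This is the second sentence.";
        count(input, ". ").forEach((sentence, n) -> System.out.println(n + " " + sentence));
    }
}
```

With the delimiter ". " the last sentence keeps its final period, because no space follows it, while the earlier ones lose theirs. If the input repeats a sentence both in the middle and at the end, the two forms would be counted under different keys unless the trailing period is normalized.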

2021-05-29
