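I am learning Hadoop MapReduce and following the WordCount tutorial.
In the code below, I understand that the map method processes one line at a time, as supplied by the specified TextInputFormat. It then splits the line into tokens with StringTokenizer and emits a [<word>, 1] pair for each token: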
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
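How can I edit this code so that it reads one sentence at a time instead of one line?
E.g., given the input text: This is my first sentence. This is the second sentence.
I want to read "This is my first sentence." first,
then "This is the second sentence.",
rather than This, is, my, first, ...
The output should be: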
1 This is my first sentence.
1 This is the second sentence.
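because the sentence "This is my first sentence." occurs once in the input text and the sentence "This is the second sentence." also occurs once.
Suppose instead that the input text is: This is my first sentence. This is my first sentence. This is the second sentence.
Then the output should be: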
2 This is my first sentence.
1 This is the second sentence.
because the sentence "This is my first sentence." occurs twice in the input text and the sentence "This is the second sentence." occurs only once.
For reference, the WordCount output is:
2 This
2 is
1 my
1 first
2 sentence
1 second
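because the term This occurs twice in the input text, the term is occurs twice, the term my occurs once, and so on.
Attempted solution with conf.set("textinputformat.record.delimiter", ". "):
I set ". " (a period followed by a space) as the delimiter. My code now recognizes the sentences, but the output file is wrong. With the following input file: This is my first sentence. This is my first sentence. This is the second sentence.
it generates the following output file (some whitespace, then the number 3):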
3
instead of this:
2 This is my first sentence
1 This is the second sentence
Here is my code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SentenceCount {

    public static class SentenceMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //System.out.println("SENTENCE: " + value.toString());
            context.write(word, one);
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ". ");
        Job job = Job.getInstance(conf, "sentence count");
        job.setJarByClass(SentenceCount.class);
        job.setMapperClass(SentenceMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Where did I go wrong?
2 Answers
pexxcrt2 #1
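In this case you need to tokenize on the sentence delimiter (the period ".") rather than on whitespace. A regex may help here.
Also, keep a few corner cases in mind. For example, how do you want to treat the following? As two sentences or three?
"This is my first sentence. This is my second sentence. Now I have a third sentence.
Is the double-quoted part one sentence or two (depending on whether the quotation marks or the period define the boundary)?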
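The regex suggestion above could look roughly like the following plain-Java sketch. It does not depend on Hadoop; the class SentenceRegexDemo and its methods are hypothetical names introduced only for illustration. It assumes a sentence ends at ".", "!" or "?" followed by whitespace, which is only one possible rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hadoop or the original post.
final class SentenceRegexDemo {

    // Split after '.', '!' or '?' only when whitespace follows,
    // so the period inside "1.55" does not end a sentence.
    static String[] splitSentences(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("(?<=[.!?])\\s+");
    }

    // Count how often each sentence occurs, keeping first-seen order.
    static Map<String, Integer> countSentences(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : splitSentences(text)) {
            counts.merge(sentence, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence. "
                + "This is the second sentence.";
        // Prints one "<count> <sentence>" line per distinct sentence.
        countSentences(input).forEach((s, n) -> System.out.println(n + " " + s));
    }
}
```

Inside a mapper, the same split could presumably be applied to value.toString(), with the caveat that a sentence spanning two input lines would still be cut at the line break. This rule does not settle the corner cases: an abbreviation such as "e.g." followed by a space would still be treated as a sentence end, and quoted text needs its own decision.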
vaj7vani #2
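The most straightforward solution is to preprocess your input so that each sentence is on its own line, and keep using TextInputFormat as is. Alternatively, instead of TextInputFormat's default delimiter (the newline character, \n), you can change the delimiter to "." like this: conf.set("textinputformat.record.delimiter", ".") in the driver class. (But be careful if the "." character appears inside a sentence, e.g. "This pen costs 1.55 dollars.", or if a sentence ends with an exclamation mark instead of a period.) Then, in your map() method, you no longer need to tokenize the sentence.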
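To see why the posted job prints a lone 3, here is a plain-Java simulation with no Hadoop dependency. DelimiterDemo and its methods are hypothetical names, and records() only approximates how a custom record delimiter cuts the input. In the posted SentenceMapper, word is never assigned before context.write(word, one), so every record is emitted under the same empty key. Emitting the record text itself as the key gives the per-sentence counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical plain-Java simulation; these names are illustrative, not Hadoop APIs.
final class DelimiterDemo {

    // Approximates how a custom record delimiter cuts the input into records.
    static List<String> records(String input, String delimiter) {
        List<String> out = new ArrayList<>();
        for (String record : input.split(Pattern.quote(delimiter))) {
            if (!record.isEmpty()) {
                out.add(record);
            }
        }
        return out;
    }

    // Key a mapper could emit: the whole record, trimmed, without a trailing period.
    static String sentenceKey(String record) {
        String s = record.trim();
        return s.endsWith(".") ? s.substring(0, s.length() - 1) : s;
    }

    // Stand-in for the shuffle plus a summing reducer: totals per key, sorted by key.
    static Map<String, Integer> count(List<String> keys) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String key : keys) {
            totals.merge(key, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String input = "This is my first sentence. This is my first sentence. "
                + "This is the second sentence.\n";
        List<String> unsetKeys = new ArrayList<>();    // key never set: always ""
        List<String> sentenceKeys = new ArrayList<>(); // key taken from the record
        for (String record : records(input, ". ")) {
            unsetKeys.add("");
            sentenceKeys.add(sentenceKey(record));
        }
        System.out.println(count(unsetKeys));    // {=3}
        System.out.println(count(sentenceKeys)); // {This is my first sentence=2, This is the second sentence=1}
    }
}
```

In the real mapper, the corresponding change would presumably be to call word.set(value.toString().trim()) before context.write(word, one). Note that with ". " as the delimiter the last record keeps its final period and line break, which is why this sketch strips them before counting.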