Wordcount with a text file

Posted by rjee0c15 on 2021-06-02 in Hadoop


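I want to analyze a text file with Hadoop MapReduce.
A CSV file is easier to analyze because its columns are separated by ','.
A plain text file, however, cannot be split into columns the way a CSV file can.
This is the text file format:
2015-8-02 error2014 blahblahblahblah 2015-8-02 blahblahbalh error2014
I want output like this: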

date      contents  sum of errors

2015-8-02  error2014  2

That is the analysis I want. How should I write the MapReduce program?

hadoop mapreduce

Source: https://stackoverflow.com/questions/31768048/wordcount-with-the-text-file

1 Answer


6mw9ycah1#

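Assume the text file has the following format:
2015-8-02
error2014 blahblahblahblah
2015-8-02
blahblahbalh error2014
You can use NLineInputFormat.
With NLineInputFormat, you can specify how many lines each mapper should receive.
In your case, you can give each mapper 2 lines of input.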
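To make the grouping concrete, here is a minimal plain-Java sketch of how a file is cut into groups of N lines, one group per mapper. It is a simulation only and does not use the Hadoop API; the class name `LineSplitSketch` and the method `splitIntoGroups` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineSplitSketch {

    // Partition the lines into consecutive groups of at most linesPerSplit lines,
    // mimicking how N input lines are assigned to each mapper.
    public static List<List<String>> splitIntoGroups(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            groups.add(new ArrayList<>(lines.subList(i, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample input in the format assumed by this answer.
        List<String> lines = Arrays.asList(
                "2015-8-02",
                "error2014 blahblahblahblah",
                "2015-8-02",
                "blahblahbalh error2014");
        // Each printed group corresponds to the input of one mapper.
        for (List<String> group : splitIntoGroups(lines, 2)) {
            System.out.println(group);
        }
    }
}
```

Note that in Hadoop itself the map() method is still called once per line within a split, so a mapper that needs both the date and the error has to remember the date line until the next line arrives.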
EDIT:
Here is an example that uses NLineInputFormat:
Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {

        // Identity map: emit the line's byte offset and the line itself unchanged.
        context.write(key, value);
    }

}

Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.out
                  .printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
            return -1;
        }

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 2);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

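You can then extract the date and the error from the lines. Once you have them, emit them as the key, either as a composite key or as a concatenated string, with an IntWritable as the value, just as in the wordcount example. The reducer class then does the same basic summation as the wordcount reducer.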
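The extraction and summation can be sketched in plain Java, without Hadoop, to show the logic. This is a simulation under stated assumptions, not the job itself: the class `ErrorCountSketch`, its methods, and the regular expression `error\d+` (an assumed shape for tokens such as error2014) are all hypothetical. It also assumes the input strictly alternates between a date line and a message line.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ErrorCountSketch {

    // Assumed shape of an error token, e.g. "error2014".
    private static final Pattern ERROR = Pattern.compile("error\\d+");

    // "Map" step: build a concatenated key "date<TAB>error" from one
    // (date line, message line) pair, or return null if no error token is found.
    public static String toKey(String dateLine, String messageLine) {
        Matcher m = ERROR.matcher(messageLine);
        return m.find() ? dateLine.trim() + "\t" + m.group() : null;
    }

    // "Reduce" step: add 1 per key, as the wordcount reducer does.
    public static Map<String, Integer> countErrors(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Walk the input two lines at a time: date first, message second.
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            String key = toKey(lines.get(i), lines.get(i + 1));
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2015-8-02",
                "error2014 blahblahblahblah",
                "2015-8-02",
                "blahblahbalh error2014");
        // Print one row per key: date, error, count.
        for (Map.Entry<String, Integer> e : countErrors(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In an actual job, the mapper would write the concatenated key as a Text with an IntWritable of 1, and the reducer would sum the values for each key.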
I hope this answers your question.

