When to choose a custom input format for MapReduce jobs

Posted by kpbpu008 on 2021-05-29 in Hadoop

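When programming with MapReduce, when should we use a custom input format?
Suppose I have a file that I need to read line by line, and it has 15 pipe-delimited columns. Should I go for a custom input format?
In this case I could use either the text input format or a custom input format.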

hadoop mapreduce

Source: https://stackoverflow.com/questions/37405760/when-to-go-for-custom-input-format-for-map-reduce-jobs

2 answers

1zmg4dgp1#

Yes, you can use TextInputFormat for your case.
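With TextInputFormat, each map call receives one line of the file as its value, so the 15 pipe-delimited columns can be split inside the mapper itself. A minimal sketch of that parsing step in plain Java, so it runs without Hadoop on the classpath; the class name PipeDelimitedLine and the strict column-count check are assumptions for illustration, not part of the original question:

```java
// Hypothetical helper: the parsing a mapper could apply to each line
// it receives from TextInputFormat (e.g. to value.toString()).
final class PipeDelimitedLine {
    // Column count taken from the question; adjust for other files.
    static final int EXPECTED_COLUMNS = 15;

    private PipeDelimitedLine() {}

    // Splits one line on '|'. The pipe must be escaped because split()
    // takes a regex, and the -1 limit keeps trailing empty columns.
    static String[] parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length != EXPECTED_COLUMNS) {
            throw new IllegalArgumentException(
                    "expected " + EXPECTED_COLUMNS + " columns but found " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        // 14 filled columns followed by an empty 15th column.
        String line = "c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11|c12|c13|c14|";
        String[] fields = parse(line);
        System.out.println(fields.length + " columns, first = " + fields[0]);
    }
}
```

Rejecting malformed lines with an exception is only one option; a real job might count and skip them instead.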

2021-05-29

brjng4g32#

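You can write a custom InputFormat when you need to customize how input records are read. In your case, however, you don't need such an implementation.
See the custom InputFormat example below.
Example: reading a paragraph as an input record
If you are working with Hadoop MapReduce or AWS EMR, you may have a use case where each key-value record in the input files is a paragraph rather than a single line (think of scenarios such as analyzing comments on news articles). If you need to process a complete paragraph at once as a single record, you will need to customize the default behavior of **TextInputFormat**, i.e. change it from reading one line per record to reading a complete paragraph as one input key-value pair for further processing in the MapReduce job.
This requires a custom record reader, which can be written by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of a single line. See the following implementation, which is self-explanatory: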

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Groups consecutive non-empty lines into one record; a blank line ends the paragraph.
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
    private static final byte[] SPACE = {' '};

    private final LineRecordReader lineRecord;
    private final LongWritable lineKey;
    private final Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public float getProgress() throws IOException {
        // Progress is a fraction between 0 and 1, not a byte offset.
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended;
        boolean isNextLineAvailable = false;
        value.clear();
        do {
            appended = false;
            if (lineRecord.next(lineKey, lineValue)) {
                if (!isNextLineAvailable) {
                    // Use the byte offset of the first line read as the record key.
                    key.set(lineKey.get());
                }
                if (lineValue.getLength() > 0) {
                    // Append the line plus a space that separates it from the next one.
                    value.append(lineValue.getBytes(), 0, lineValue.getLength());
                    value.append(SPACE, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
        } while (appended); // stop at a blank line or at the end of the split

        return isNextLineAvailable;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
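The grouping rule inside next() can be tried without a Hadoop cluster. The following is a minimal sketch of the same idea in plain Java, reading from a string instead of a file split; the class ParagraphGrouping and its methods are hypothetical names, and unlike the record reader above it skips runs of blank lines and does not leave a trailing space:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration of paragraph grouping.
final class ParagraphGrouping {
    private ParagraphGrouping() {}

    // Joins consecutive non-empty lines with single spaces until a blank
    // line or end of input. Returns null when no paragraph is left.
    static String nextParagraph(BufferedReader in) throws IOException {
        StringBuilder paragraph = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                if (paragraph.length() > 0) {
                    break; // blank line closes the current paragraph
                }
                continue; // ignore blank lines before a paragraph starts
            }
            if (paragraph.length() > 0) {
                paragraph.append(' ');
            }
            paragraph.append(line);
        }
        return paragraph.length() > 0 ? paragraph.toString() : null;
    }

    // Collects every paragraph found in the given text.
    static List<String> readAll(String text) {
        List<String> paragraphs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String paragraph;
            while ((paragraph = nextParagraph(in)) != null) {
                paragraphs.add(paragraph);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line\n\nanother paragraph\n";
        for (String paragraph : readAll(text)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Each returned string corresponds to what a mapper would receive as one value when the paragraph-based input format is in use.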

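With the ParagraphRecordReader implementation in place, we need to extend TextInputFormat to create the custom InputFormat: just override the getRecordReader method and return a ParagraphRecordReader object to replace the default behavior. ParagraphInputFormat looks like this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ParagraphInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}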

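Make sure the job configuration uses our custom input format implementation to read data into the MapReduce job. Setting the InputFormat type to ParagraphInputFormat is as simple as: conf.setInputFormat(ParagraphInputFormat.class); With the changes above, we can read paragraphs as input records in a MapReduce program.
Suppose the input file looks like this:
A simple map program looks like this: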

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Print each record: the key, then the paragraph text.
    System.out.println(key + " : " + value);
}

2021-05-29
