I want to use my own FileInputFormat with a custom RecordReader that reads CSV data into <Long, String> pairs. So I created the class MyTextInputFormat:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyTextInputFormat extends FileInputFormat<Long, String> {

    @Override
    public RecordReader<Long, String> getRecordReader(InputSplit input, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new MyStringRecordReader(job, (FileSplit) input);
    }

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return super.isSplitable(fs, filename);
    }
}
And the class MyStringRecordReader:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class MyStringRecordReader implements RecordReader<Long, String> {

    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public MyStringRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
        System.out.println("constructor called");
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }

    @Override
    public Long createKey() {
        return lineKey.get();
    }

    @Override
    public String createValue() {
        System.out.println("createValue called");
        return lineValue.toString();
    }

    @Override
    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    @Override
    public boolean next(Long key, String value) throws IOException {
        System.out.println("next called");
        // get the next line
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key = lineKey.get();
        value = lineValue.toString();
        System.out.println(key);
        System.out.println(value);
        return true;
    }
}
In my Spark application I read the file by calling the sparkContext.hadoopFile method. But I only get empty output from the following code:
public class AssociationRulesAnalysis {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        JavaRDD<String> inputRdd = sc.hadoopFile(inputFilePath, MyTextInputFormat.class, Long.class, String.class)
                .map(new Function<Tuple2<Long, String>, String>() {
                    @Override
                    public String call(Tuple2<Long, String> arg0) throws Exception {
                        System.out.println("map: " + arg0._2());
                        return arg0._2();
                    }
                });
        List<String> asList = inputRdd.take(10);
        for (String s : asList) {
            System.out.println(s);
        }
    }
}
All I get back from the RDD is 10 empty lines. I added some prints; the output looks like this:
=== APP STARTED : local-1467182320798
constructor called
createValue called
next called
0
ä1
map:
next called
8
ö2
map:
next called
13
ü3
map:
next called
18
ß4
map:
next called
23
ä5
map:
next called
28
ö6
map:
next called
33
ü7
map:
next called
38
ß8
map:
next called
43
ä9
map:
next called
48
ü10
map:
next called
54
ä11
map:
next called
60
ß12
map:
next called
12
=====================
constructor called
createValue called
next called
0
ä1
map:
next called
8
ö2
map:
next called
13
ü3
map:
next called
18
ß4
map:
next called
23
ä5
map:
next called
28
ö6
map:
next called
33
ü7
map:
next called
38
ß8
map:
next called
43
ä9
map:
next called
48
ü10
map:
Stopping...
(The RDD data is printed below the ===== line: the 10 empty lines!!! The output above the ===== seems to be produced by the RDD.count call.) Inside the next method the correct keys and values are printed!? What am I doing wrong?
1 Answer
lineKey and lineValue are never written into the key and value objects that are passed into the overridden next method of MyStringRecordReader. Hence the values always come back empty when you try to use your RecordReader. If you want the records in the file to produce different keys and values, you need to take the key and value objects passed into the next method and initialize them with your computed key and value. If you do not intend to change the key/value records at all, then get rid of this custom reader code: every time it executes, the key/value actually read from the file end up replaced by an empty string and 0L.
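A minimal sketch of that fix, assuming the reader is switched to the mutable Hadoop types LongWritable and Text so that next() can actually fill the objects Spark hands back in on every iteration. The class name FixedStringRecordReader is illustrative, not from the original post, and the matching input format would have to be declared as FileInputFormat<LongWritable, Text>:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// Illustrative corrected reader: mutable key/value types let next() copy each
// line into the very objects the caller (Spark's HadoopRDD) reuses and passes back in.
public class FixedStringRecordReader implements RecordReader<LongWritable, Text> {

    private final LineRecordReader lineReader;
    private final LongWritable lineKey;
    private final Text lineValue;

    public FixedStringRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    @Override
    public LongWritable createKey() {
        // fresh, mutable object that the framework will reuse and pass back into next()
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        // read the next line into the reader's own buffers ...
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        // ... and copy it into the objects passed in by the caller,
        // instead of only reassigning the local parameter references
        key.set(lineKey.get());
        value.set(lineValue);
        return true;
    }

    @Override
    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

On the Spark side the driver would then request LongWritable/Text and convert to plain Java types in the map. This is again only a sketch: FixedTextInputFormat stands for an assumed <LongWritable, Text> counterpart of MyTextInputFormat. Converting to String inside the map also copies the data, which matters because Hadoop reuses the same Text object for every record:

// yields one String per CSV line; pair._1() would be the byte offset of the line
JavaRDD<String> inputRdd = sc
        .hadoopFile(inputFilePath, FixedTextInputFormat.class, LongWritable.class, Text.class)
        .map(pair -> pair._2().toString());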