Hadoop MapReduce: handling a text file with a header

Posted by ryoqjall on 2021-06-02 in Hadoop

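I am playing with and learning Hadoop MapReduce.
I am trying to map a VCF file (http://en.wikipedia.org/wiki/variant_call_format): VCF is a tab-delimited file that starts with a (possibly large) header. This header is required to get the semantics of the records in the body.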

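I would like to create a Mapper that uses these data. The header must be accessible from this Mapper in order to decode the lines.
Following http://jayunit100.blogspot.fr/2013/07/hadoop-processing-headers-in-mappers.html, I created this InputFormat with a custom reader: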

public static class VcfInputFormat extends FileInputFormat<LongWritable, Text>
    {
    /* the VCF header is stored here */
    private List<String> headerLines=new ArrayList<String>();

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new VcfRecordReader();
        }  
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
        }

     private class VcfRecordReader extends LineRecordReader
        {
        /* reads all lines starting with '#' */
         @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            super.initialize(genericSplit, context);
            List<String> headerLines=new ArrayList<String>();
            while( super.nextKeyValue())
                {
                String row = super.getCurrentValue().toString();
                if(!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if(row.startsWith("#CHROM")) break;
                }
            }
        }
    }

Now, in the Mapper, is there a way to get a pointer to VcfInputFormat.this.headerLines in order to decode the lines?

public static class VcfMapper
       extends Mapper<LongWritable, Text, Text, IntWritable>{

    public void map(LongWritable key, Text value, Context context ) throws IOException, InterruptedException {
      my.VcfCodec codec=new my.VcfCodec(???????.headerLines);
      my.Variant variant =codec.decode(value.toString());
      //(....)
    }
  }

Java hadoop mapreduce bioinformatics vcf-variant-call-format

Source: https://stackoverflow.com/questions/30052859/hadoop-mapreduce-handling-a-text-file-with-a-header

2 Answers

qxsslcnc #1

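I think your case is different from the example you linked. In that example, the headers are used inside the custom RecordReader class to produce a single "current value", a line composed of all the filtered words, which is then passed to the Mapper. In your case, however, you want to use the header information outside the RecordReader, i.e. in your Mapper, and that cannot be achieved.
I also think you can mimic the behavior of the linked example by handing over information that has already been processed: read the header, store it, and then, when the current value is requested, let your Mapper receive a my.VcfCodec object instead of a Text object (i.e. the getCurrentValue method returns a my.VcfCodec object). Your Mapper could be something like: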

public static class VcfMapper extends Mapper<LongWritable, my.VcfCodec, Text, IntWritable>{
    @Override
    public void map(LongWritable key, my.VcfCodec value, Context context ) throws IOException, InterruptedException {
        // whatever you may want to do with the encoded data...
    }
}
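
To make the "read the header, store it, then hand over already-decoded values" idea concrete, here is a minimal plain-Java sketch. It deliberately uses no Hadoop types so it can run on its own. Everything in it is an assumption made for illustration: the class name DecodingReaderSketch is hypothetical, a Map<String, String> stands in for whatever decoded object the real reader would return, and the sample lines are made up. In a real job the same logic would sit in the custom RecordReader's initialize, nextKeyValue and getCurrentValue methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a record reader that decodes records itself. */
final class DecodingReaderSketch {
    private final Iterator<String> lines;
    private final List<String> headerLines = new ArrayList<>();
    private String[] columns = new String[0];
    private Map<String, String> current;

    DecodingReaderSketch(Iterable<String> input) {
        this.lines = input.iterator();
    }

    /** Mirrors initialize(): consume every line up to and including "#CHROM". */
    void initialize() {
        while (lines.hasNext()) {
            String row = lines.next();
            if (!row.startsWith("#")) {
                throw new IllegalStateException("Bad VCF header");
            }
            headerLines.add(row);
            if (row.startsWith("#CHROM")) {
                // Column names come from the last header line, minus the '#'.
                columns = row.substring(1).split("\t");
                return;
            }
        }
        throw new IllegalStateException("Bad VCF header: no #CHROM line");
    }

    /** Mirrors nextKeyValue(): decode the next body line with the stored header. */
    boolean nextKeyValue() {
        if (!lines.hasNext()) {
            current = null;
            return false;
        }
        String[] fields = lines.next().split("\t", -1);
        Map<String, String> decoded = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < fields.length; i++) {
            decoded.put(columns[i], fields[i]);
        }
        current = decoded;
        return true;
    }

    /** Mirrors getCurrentValue(): the consumer sees a decoded record, not raw text. */
    Map<String, String> getCurrentValue() {
        return current;
    }

    List<String> getHeaderLines() {
        return headerLines;
    }

    public static void main(String[] args) {
        // Made-up sample input: one meta line, the column line, one record.
        List<String> vcf = Arrays.asList(
                "##fileformat=VCFv4.1",
                "#CHROM\tPOS\tID\tREF\tALT",
                "20\t14370\trs6054257\tG\tA");
        DecodingReaderSketch reader = new DecodingReaderSketch(vcf);
        reader.initialize();
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
        }
    }
}
```

The point of the design is that the header never has to leave the reader: the consumer only ever sees values that were decoded with it.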

Answered 2021-06-03

kxe2p93d #2

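Your InputFormat class is fine. As @frb said, the InputFormat class cannot distinguish between metadata and records.
One idea I can suggest:
In the Mapper class, declare static global variables for each metadata attribute of the VCF file (such as fileformat, date, source, etc.).
In the VcfInputFormat class, if a line starts with '##', parse that line and set the value of the matching static variable of the Mapper class, based on the attribute name in the current line.
If a line does not start with '##', simply pass that line to the Mapper.
In the Mapper class, just parse the record content and derive the useful values with the help of the static variables that represent the metadata.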
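
The steps above can be sketched in plain Java, without Hadoop types so the snippet runs on its own. All names here (StaticMetaSketch, META, consumeIfMeta, map) are hypothetical, the sample lines are made up, and a single static map stands in for one static variable per attribute. The approach also assumes that the reader and the Mapper run in the same JVM, so that static state written by one is visible to the other; treat that as an assumption to verify for your setup. The "#CHROM" line starts with a single '#', so under the rule above it would be passed on as a record; the sketch leaves it out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of sharing "##key=value" metadata through static state. */
final class StaticMetaSketch {
    /** Stand-in for the Mapper's static metadata variables, keyed by attribute name. */
    static final Map<String, String> META = new LinkedHashMap<>();

    /**
     * Reader side: returns true if the line was a "##" metadata line (stored in
     * META when it has the form "##key=value"), false if it should go to the mapper.
     */
    static boolean consumeIfMeta(String line) {
        if (!line.startsWith("##")) {
            return false;
        }
        int eq = line.indexOf('=');
        if (eq > 2) {
            META.put(line.substring(2, eq), line.substring(eq + 1));
        }
        return true;
    }

    /** Mapper side: parse a record and combine it with the stored metadata. */
    static String map(String record) {
        String[] fields = record.split("\t", -1);
        String format = META.getOrDefault("fileformat", "unknown");
        return format + "\t" + fields[0] + ":" + fields[1];
    }

    public static void main(String[] args) {
        // Made-up sample input: two meta lines followed by one record.
        List<String> lines = Arrays.asList(
                "##fileformat=VCFv4.1",
                "##source=example",
                "20\t14370\trs6054257\tG\tA");
        for (String line : lines) {
            if (!consumeIfMeta(line)) {
                System.out.println(map(line));
            }
        }
    }
}
```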
Hope this helps.

Answered 2021-06-03
