How is a Record defined for different types of Hadoop datasets in MapReduce?

eqqqjvef · posted 2021-06-02 in Hadoop

I want to understand Records in Hadoop MapReduce for data types other than text.
Generally, Text records end with a newline.
Now, if we want to process XML data, how is that data handled? That is, what defines the Record that the mapper works on?
I have read that there are the concepts of InputFormat and RecordReader, but I could not figure them out.
Can someone help me understand how InputFormat and RecordReader, for various types of datasets (other than text), turn the data into the Records that the mapper works on?


ubbxdtey1#

Let's start with some basic concepts.

From the perspective of a file:
1. A file is a collection of rows.
2. A row is a collection of one or more columns, separated by a delimiter.
3. A file can be of any format: text file, Parquet file, ORC file.

Different file formats store rows (and their columns) in different ways, and the choice of delimiter also differs.

From the perspective of HDFS:
1. A file is a sequence of bytes.
2. HDFS has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS doesn't guarantee that a row will be contained within one HDFS block; a row can span two blocks (see the sketches below).
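To make that concrete, here is a minimal sketch (the path and offset are made up) of reading a file through the HDFS client API: you get back raw bytes at whatever offset you ask for, with no notion of rows.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawBytesDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/input.txt");  // hypothetical path
            byte[] buf = new byte[128];
            try (FSDataInputStream in = fs.open(file)) {
                in.seek(1024);         // jump to an arbitrary byte offset...
                int n = in.read(buf);  // ...and read: HDFS neither knows nor
                                       // cares whether offset 1024 is mid-row
                System.out.println("read " + n + " raw bytes");
            }
        }
    }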

InputFormat: the code that knows how to carve the file into splits and read the chunks back, while ensuring that if a row extends into the next split, it is still treated as part of the first split.
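That boundary rule is easiest to see in code. Below is a self-contained simulation in plain Java (no Hadoop dependency, toy data) of the convention Hadoop's LineRecordReader follows for text: a split that doesn't start at byte 0 skips its partial first line, and every split reads past its own end to finish the last line it started, so each row is processed exactly once.

    public class SplitBoundaryDemo {
        static void readSplit(byte[] file, int start, int length) {
            int pos = start;
            // Rule 1: a split that doesn't begin at byte 0 discards bytes up
            // to and including the first '\n' -- that partial row belongs to
            // the previous split.
            if (start != 0) {
                while (pos < file.length && file[pos++] != '\n') { }
            }
            int end = start + length;
            // Rule 2: keep emitting lines as long as the line *started*
            // inside this split, even if it finishes beyond 'end' (i.e., the
            // line's tail lives in the next block).
            while (pos < file.length && pos < end) {
                int lineStart = pos;
                while (pos < file.length && file[pos] != '\n') pos++;
                System.out.println("record: "
                        + new String(file, lineStart, pos - lineStart));
                pos++; // step over the newline
            }
        }

        public static void main(String[] args) {
            byte[] data = "row1\nrow2 spans the boundary\nrow3\n".getBytes();
            int mid = data.length / 2;  // pretend the block boundary is here
            readSplit(data, 0, mid);                  // emits row1, row2
            readSplit(data, mid, data.length - mid);  // emits row3 only
        }
    }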

RecordReader: as a split is read, some code (the RecordReader) has to know how to interpret a record from the bytes read off HDFS.
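Putting the two together for the XML case you asked about, here is a minimal sketch of a custom InputFormat plus RecordReader that hands the mapper one <record>...</record> block per call. The class names and the <record> tag are assumptions for illustration; Apache Mahout ships a similar XmlInputFormat you can use in practice. For brevity this sketch marks files non-splittable, which sidesteps the boundary handling shown above; a production reader must handle records that cross splits.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class XmlRecordInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split per file, so no record ever crosses a split
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new XmlRecordReader();
        }

        // Emits one <record>...</record> block per call to nextKeyValue().
        public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
            private static final byte[] START = "<record>".getBytes();
            private static final byte[] END = "</record>".getBytes();

            private FSDataInputStream in;
            private long splitEnd;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                FileSplit fileSplit = (FileSplit) split;
                splitEnd = fileSplit.getStart() + fileSplit.getLength();
                in = fileSplit.getPath()
                        .getFileSystem(context.getConfiguration())
                        .open(fileSplit.getPath());
                in.seek(fileSplit.getStart());
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (!scanTo(START, null)) return false;  // locate next start tag
                key.set(in.getPos() - START.length);     // record's byte offset
                StringBuilder body = new StringBuilder();
                if (!scanTo(END, body)) return false;    // collect body until end tag
                value.set("<record>" + body + "</record>");
                return true;
            }

            // Advances the stream until 'tag' has been fully consumed; bytes
            // seen before the tag are appended to 'out' when it is non-null.
            // The naive restart on mismatch is correct here because '<' occurs
            // only at position 0 of both tags. ASCII-oriented for brevity.
            private boolean scanTo(byte[] tag, StringBuilder out) throws IOException {
                int matched = 0;
                int b;
                while ((b = in.read()) != -1) {
                    if (b == tag[matched]) {
                        if (++matched == tag.length) return true;
                    } else {
                        if (out != null) out.append(new String(tag, 0, matched));
                        if (b == tag[0]) {
                            matched = 1;
                        } else {
                            if (out != null) out.append((char) b);
                            matched = 0;
                        }
                    }
                }
                return false;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() throws IOException {
                return splitEnd == 0 ? 1.0f
                        : Math.min(1.0f, in.getPos() / (float) splitEnd);
            }
            @Override public void close() throws IOException { in.close(); }
        }
    }

Wire it in with job.setInputFormatClass(XmlRecordInputFormat.class); each map() call then receives the record's byte offset as the key and one complete XML record as the value.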

More information:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
