Using an HBase table as a MapReduce source

Posted by inb24sb2 on 2021-06-02 in Hadoop

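As far as I understand, when an HBase table is used as the source for a MapReduce job, we define a value for the scan. Suppose we set it to 500: does that mean each mapper gets only 500 rows from the HBase table? Is there any problem with setting it very high?
If the scan size is small, won't we run into the same problem as with small files in MapReduce?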

hadoop hbase mapreduce

Source: https://stackoverflow.com/questions/29856355/using-an-hbase-table-as-mapreduce-source

2 Answers

mgdq6dx11#

Here is sample code from the HBase reference guide showing how to run a MapReduce job that reads from an HBase table.

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleRead");  // the Job(Configuration, String) constructor is deprecated
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
   tableName,        // input HBase table name
   scan,             // Scan instance to control CF and attribute selection
   MyMapper.class,   // mapper
   null,             // mapper output key
   null,             // mapper output value
   job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

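"A value for the scan" is not a real setting. You probably mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().
setCaching tells the server how many rows to load before returning the results to the client. It controls how many rows come back per round trip, not how many rows a mapper reads in total, so with a value of 500 each mapper still sees every row in its split.
setBatch limits the number of columns returned in each call, which is useful when the table is very wide.
setMaxResultSize limits the size, in bytes, of the results returned to the client.
You typically do not set MaxResultSize in a MapReduce job, so you will see all of the data.
See the reference for the information above.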
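To make the point about setCaching concrete, here is a minimal plain-Java simulation. It does not use the HBase client API: the class ScanCachingSketch, the method simulateScan, and the row counts are all hypothetical, and the model is simplified (a real scanner also needs a round trip to open, close, and detect the end of the region). It only shows that the caching value changes the number of round trips, not the number of rows a mapper ends up reading.

```java
public class ScanCachingSketch {

    /** Outcome of one simulated scan over a single region. */
    static final class ScanStats {
        final int rowsSeen;    // rows delivered to the mapper in total
        final int roundTrips;  // simulated fetches from the region server

        ScanStats(int rowsSeen, int roundTrips) {
            this.rowsSeen = rowsSeen;
            this.roundTrips = roundTrips;
        }
    }

    /**
     * Simplified model (an assumption, not HBase code): each round trip
     * returns at most `caching` rows, and the scan keeps fetching until
     * the region is exhausted.
     */
    static ScanStats simulateScan(int rowsInRegion, int caching) {
        if (rowsInRegion < 0 || caching <= 0) {
            throw new IllegalArgumentException("rowsInRegion must be >= 0 and caching > 0");
        }
        int rowsSeen = 0;
        int roundTrips = 0;
        while (rowsSeen < rowsInRegion) {
            int fetched = Math.min(caching, rowsInRegion - rowsSeen);
            rowsSeen += fetched;
            roundTrips++;
        }
        return new ScanStats(rowsSeen, roundTrips);
    }

    public static void main(String[] args) {
        int rowsInRegion = 10_000;  // hypothetical region size
        for (int caching : new int[] {1, 500, 5_000}) {
            ScanStats stats = simulateScan(rowsInRegion, caching);
            System.out.println("caching=" + caching
                    + " rowsSeen=" + stats.rowsSeen
                    + " roundTrips=" + stats.roundTrips);
        }
    }
}
```

In this model every run reads all 10,000 rows; only the round-trip count changes (10,000, then 20, then 2). A very large caching value means fewer round trips but more rows held in memory per fetch, which is the trade-off to weigh when raising it.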