Loading a large HBase table into a Spark RDD takes a long time

Posted by qltillow on 2021-06-09 in HBase


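I am trying to load a large HBase table into a Spark RDD so that I can run Spark SQL queries on the entities. For an entity with about 6 million rows, loading it into the RDD takes about 35 seconds. Is that expected? Is there any way to shorten the loading process? I picked up some tips from http://hbase.apache.org/book/perf.reading.html to speed things up, for example calling scan.setCaching(cacheSize) and adding only the necessary attributes/columns to the scan. I am wondering whether there are other ways to improve the speed.
Here is the code snippet: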

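import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// entityName, cacheSize, logger and convertScanToString(...) are defined elsewhere in the class.
SparkConf sparkConf = new SparkConf().setMaster("spark://url").setAppName("SparkSQLTest");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// HBase connection settings and the table to read.
Configuration hbase_conf = HBaseConfiguration.create();
hbase_conf.set("hbase.zookeeper.quorum", "url");
hbase_conf.set("hbase.regionserver.port", "60020");
hbase_conf.set("hbase.master", "url");
hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);
// Scan only the columns the queries need, with scanner caching set.
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
scan.setCaching(this.cacheSize);
hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));
// Load the table as an RDD of (row key, Result) pairs.
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
        jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
// cache() only marks the RDD for caching; count() is the action that triggers the scan.
logger.info("count is " + hBaseRDD.cache().count());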
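
For a rough sense of scale, the figures above (about 6 million rows in about 35 seconds) can be turned into a throughput number, along with an estimate of how many scanner round trips different scan.setCaching values imply. This is only a back-of-the-envelope sketch using plain Java. The class name ScanEstimate and its methods are hypothetical, the caching values in the loop are illustrative (the actual cacheSize is not given above), and the model assumes each scanner round trip returns up to `caching` rows, which ignores result-size limits and region boundaries.

```java
/** Back-of-the-envelope numbers for a full-table scan; a simplified model, not a benchmark. */
final class ScanEstimate {

    /** Average rows read per second over the whole load. */
    static double rowsPerSecond(long rows, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return rows / seconds;
    }

    /**
     * Approximate number of scanner round trips, ASSUMING each round trip
     * returns up to `caching` rows (a simplification of the real client).
     */
    static long scannerRoundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 6_000_000L; // row count from the question
        double seconds = 35.0;  // load time from the question

        System.out.printf("throughput: %.0f rows/s%n", rowsPerSecond(rows, seconds));

        // Hypothetical caching values, chosen only for illustration.
        for (int caching : new int[] {1, 100, 1000, 10000}) {
            System.out.printf("caching=%d -> about %d round trips%n",
                    caching, scannerRoundTrips(rows, caching));
        }
    }
}
```

Under this model, raising the caching value cuts the number of round trips proportionally, while memory use per scanner on both client and region server grows with it, so the numbers are a way to reason about the trade-off rather than a prediction of load time.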

hbase apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/27305460/loading-a-large-hbase-table-into-spark-rdd-takes-long-time