Inserting a Spark DataFrame into HBase

m0rkklqb · posted 2021-06-09 in HBase · 3 answers · 554 views

I have a DataFrame that I want to insert into HBase. I am following this documentation.
My DataFrame looks like this:

+----+-------+---------+
| id | name  | address |
+----+-------+---------+
| 23 | marry | france  |
| 87 | zied  | italie  |
+----+-------+---------+
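For reference, a DataFrame of this shape can be built as follows (a minimal sketch; the SparkSession setup and the string column types are assumptions):

import org.apache.spark.sql.SparkSession

// Minimal sketch: build the example DataFrame shown above.
// The session setup and string column types are assumptions.
val spark = SparkSession.builder().appName("DataFrameToHBase").getOrCreate()
import spark.implicits._

val df = Seq(
  ("23", "marry", "france"),
  ("87", "zied", "italie")
).toDF("id", "name", "address")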

I create the HBase table with the following code:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin

val tableName = "two"
val conf = HBaseConfiguration.create()
// an Admin instance is needed for the table check below
val admin = new HBaseAdmin(conf)

if (!admin.isTableAvailable(tableName)) {
  print("Creating table " + tableName)
  val tableDesc = new HTableDescriptor(tableName)
  tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
  admin.createTable(tableDesc)
} else {
  print("Table already exists!!")
}

Now, how do I insert this DataFrame into HBase?
In another example, I managed to insert into HBase with the following code:

val myTable = new HTable(conf, tableName)
for (i <- 0 to 1000) {
  val p = new Put(Bytes.toBytes("" + i))
  p.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes("" + (i * 5)))
  p.add("z1".getBytes(), "age".getBytes(), Bytes.toBytes("2017-04-20"))
  p.add("z2".getBytes(), "job".getBytes(), Bytes.toBytes("" + i))
  p.add("z2".getBytes(), "salary".getBytes(), Bytes.toBytes("" + i))
  myTable.put(p)
}
myTable.flushCommits()

But now I am stuck on how to insert each record of my DataFrame into this HBase table.
Thanks for your time and attention.

wnrlj8wa #1

Posting as an answer for the code formatting. The doc says:

sc.parallelize(data).toDF.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.hadoop.hbase.spark")
  .save()

where sc.parallelize(data).toDF is your DataFrame. The doc example uses sc.parallelize(data).toDF to turn a Scala collection into a DataFrame.
You already have your DataFrame, so just try calling:

yourDataFrame.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.hadoop.hbase.spark")
  .save()

It should work. The doc is quite clear...
UPDATE
Given a DataFrame with a specified schema, the above will create an HBase table with 5 regions and save the DataFrame inside. Note that if HBaseTableCatalog.newTable is not specified, the table has to be pre-created.
This is about data partitioning. Each HBase table can have 1...X regions. You should pick the number of regions carefully: too few regions is bad, and too many is bad as well.
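For the id/name/address DataFrame from the question, the catalog could look roughly like this (a sketch: the table name "two" and column family "z1" are taken from the question, the string types are assumptions):

// Sketch of a catalog for the question's DataFrame.
// Table "two" and column family "z1" come from the question; string types are assumed.
def catalog =
  s"""{
     |"table":{"namespace":"default", "name":"two"},
     |"rowkey":"id",
     |"columns":{
     |"id":{"cf":"rowkey", "col":"id", "type":"string"},
     |"name":{"cf":"z1", "col":"name", "type":"string"},
     |"address":{"cf":"z1", "col":"address", "type":"string"}
     |}
     |}""".stripMargin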

wj8zmpe1 #2

Here is a full example using the Hortonworks spark-hbase connector (shc), pulled in via Maven.
This example shows how to:
- check whether an HBase table exists
- create the HBase table if it does not exist
- insert a DataFrame into the HBase table

import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object Main extends App {

  case class Employee(key: String, fName: String, lName: String, mName: String,
                      addressLine: String, city: String, state: String, zipCode: String)

  // as pre-requisites the table 'employee' with column families 'person' and 'address' should exist
  val tableNameString = "default:employee"
  val colFamilyPString = "person"
  val colFamilyAString = "address"
  val tableName = TableName.valueOf(tableNameString)
  val colFamilyP = colFamilyPString.getBytes
  val colFamilyA = colFamilyAString.getBytes

  val hBaseConf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(hBaseConf)
  val admin = connection.getAdmin()

  println("Check if table 'employee' exists:")
  val tableExistsCheck: Boolean = admin.tableExists(tableName)
  println(s"Table ${tableName.toString} exists? $tableExistsCheck")

  if (!tableExistsCheck) {
    println("Create Table employee with column families 'person' and 'address'")
    val colFamilyBuild1 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyP).build()
    val colFamilyBuild2 = ColumnFamilyDescriptorBuilder.newBuilder(colFamilyA).build()
    val tableDescriptorBuild = TableDescriptorBuilder.newBuilder(tableName)
      .setColumnFamily(colFamilyBuild1)
      .setColumnFamily(colFamilyBuild2)
      .build()
    admin.createTable(tableDescriptorBuild)
  }

  // define schema for the dataframe that should be loaded into HBase
  def catalog =
    s"""{
       |"table":{"namespace":"default","name":"employee"},
       |"rowkey":"key",
       |"columns":{
       |"key":{"cf":"rowkey","col":"key","type":"string"},
       |"fName":{"cf":"person","col":"firstName","type":"string"},
       |"lName":{"cf":"person","col":"lastName","type":"string"},
       |"mName":{"cf":"person","col":"middleName","type":"string"},
       |"addressLine":{"cf":"address","col":"addressLine","type":"string"},
       |"city":{"cf":"address","col":"city","type":"string"},
       |"state":{"cf":"address","col":"state","type":"string"},
       |"zipCode":{"cf":"address","col":"zipCode","type":"string"}
       |}
       |}""".stripMargin

  // define some test data
  val data = Seq(
    Employee("1","Horst","Hans","A","12main","NYC","NY","123"),
    Employee("2","Joe","Bill","B","1337ave","LA","CA","456"),
    Employee("3","Mohammed","Mohammed","C","1Apple","SanFran","CA","678")
  )

  // create SparkSession
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("HBaseConnector")
    .getOrCreate()

  // serialize data
  import spark.implicits._
  val df = spark.sparkContext.parallelize(data).toDF

  // write dataframe into HBase
  df.write.options(
    Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "3")) // create 3 regions
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .save()

}

This worked for me with the relevant site XMLs ("core-site.xml", "hbase-site.xml", "hdfs-site.xml") provided in my resources.
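For reference, the connector dependency can be added along these lines (a sketch; the exact version string and repository URL are assumptions and must match your Spark/Scala/HBase versions):

// build.sbt sketch -- the shc-core version below is an assumption, adjust to your setup
resolvers += "Hortonworks Repository" at "https://repo.hortonworks.com/content/groups/public/"

libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"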

7lrncoxx #3

Another option is to look at rdd.saveAsNewAPIHadoopDataset to insert the data into the HBase table.

// imports needed for this snippet
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def main(args: Array[String]): Unit = {

  val spark = SparkSession.builder().appName("sparkToHive").enableHiveSupport().getOrCreate()
  import spark.implicits._

  val config = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", "ip's")
  config.set("hbase.zookeeper.property.clientPort", "2181")
  config.set(TableInputFormat.INPUT_TABLE, "tableName")

  // job configuration that tells TableOutputFormat which table to write to
  val newAPIJobConfiguration1 = Job.getInstance(config)
  newAPIJobConfiguration1.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  newAPIJobConfiguration1.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  val df: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")

  // map each row to an HBase Put keyed by the first column
  val hbasePuts = df.rdd.map((row: Row) => {
    val put = new Put(Bytes.toBytes(row.getString(0)))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value1"), Bytes.toBytes(row.getString(1)))
    put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("value2"), Bytes.toBytes(row.getString(2)))
    (new ImmutableBytesWritable(), put)
  })

  hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())
}
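Applied to the DataFrame from the question (here called yourDataFrame as a placeholder, with table "two" and column family "z1" as in the question), the row-to-Put mapping would look roughly like this sketch:

// Sketch: adapt the Put mapping above to the question's id/name/address DataFrame.
// "yourDataFrame" is a placeholder; column family "z1" comes from the question,
// and TableOutputFormat.OUTPUT_TABLE in the job configuration should be set to "two".
val puts = yourDataFrame.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.getAs[Any]("id").toString))
  put.addColumn(Bytes.toBytes("z1"), Bytes.toBytes("name"), Bytes.toBytes(row.getAs[String]("name")))
  put.addColumn(Bytes.toBytes("z1"), Bytes.toBytes("address"), Bytes.toBytes(row.getAs[String]("address")))
  (new ImmutableBytesWritable(), put)
}
puts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration())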

Ref: https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/
