Delete all data from an HBase table based on a time range?

nue99wik  posted on 2021-06-10  in HBase

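I am trying to delete all data from an HBase table whose timestamp is older than a specified timestamp. This covers all column families and all rows.
Is there a way to do this using the shell and the Java API?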

ivqmmu1c #1

Yes. Set a time range on the scanner, then delete the rows that the scan returns.

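    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BulkDeleteDriver {
        private static final Logger logger = LoggerFactory.getLogger(BulkDeleteDriver.class);

        // Restricting the scan to one column family and column reduces scan I/O
        private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
        private static final byte[] COL = Bytes.toBytes("column");
        private static final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");

        // Placeholders (example values only): adjust to your environment
        private static final String HBASE_SITE_PATH = "<path to hbase-site.xml>";
        private static final int DAY = -1;          // window start, in days relative to now
        private static final int HOUR = -6;         // window end, in hours relative to now
        private static final int SCAN_CACHE = 1000; // rows fetched per scanner RPC
        private static final int BATCH_SIZE = 1000; // any suitable batch size

        public static void main(final String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Path to hbase-site.xml
            conf.addResource(new Path(HBASE_SITE_PATH));

            // try-with-resources closes the table and the connection, even on failure
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf(TEST_TABLE))) {
                logger.info("Connection created successfully");
                logger.info("Got the table: {}", table.getName());

                // Compute the start and end timestamps of the window
                Calendar calStart = Calendar.getInstance();
                calStart.add(Calendar.DAY_OF_MONTH, DAY);
                Calendar calEnd = Calendar.getInstance();
                calEnd.add(Calendar.HOUR, HOUR);
                long startTS = calStart.getTimeInMillis();
                long endTS = calEnd.getTimeInMillis();

                // Most important part of the code: set the range properly!
                // Here the purpose is to delete everything up to (present time - 6 hours).
                // The range is half-open: [startTS, endTS)
                Scan scan = new Scan();
                scan.setTimeRange(startTS, endTS);
                scan.setCaching(SCAN_CACHE);
                scan.addColumn(COL_FAM, COL);

                List<Delete> listOfBatchDeletes = new ArrayList<>();
                long recordCount = 0;

                // Scan the table and collect the row keys
                try (ResultScanner resultScanner = table.getScanner(scan)) {
                    for (Result scanResult : resultScanner) {
                        // Note: Delete(row) removes the whole row, including cells newer
                        // than endTS. Use new Delete(row, ts) to remove only cells with
                        // a timestamp less than or equal to ts.
                        listOfBatchDeletes.add(new Delete(scanResult.getRow()));
                        recordCount++;
                        if (listOfBatchDeletes.size() >= BATCH_SIZE) {
                            logger.info("Firing batch delete now...");
                            table.delete(listOfBatchDeletes);
                            // Don't forget to clear the list before the next batch
                            listOfBatchDeletes.clear();
                        }
                    }
                }

                if (!listOfBatchDeletes.isEmpty()) {
                    logger.info("Firing final batch of deletes...");
                    table.delete(listOfBatchDeletes);
                }
                logger.info("Total records deleted: {}", recordCount);
            }
        }
    }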
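
The driver above derives its scan window from `Calendar` offsets. As a minimal sketch of the same arithmetic using `java.time` (the class `TimeWindow` and its `window` method are hypothetical names, not part of HBase), the two epoch-millisecond bounds could be computed like this:

```java
import java.time.Duration;
import java.time.Instant;

public class TimeWindow {
    /**
     * Returns {startMillis, endMillis} for a window that starts `startAgo`
     * before `now` and ends `endAgo` before `now`.
     */
    public static long[] window(Instant now, Duration startAgo, Duration endAgo) {
        if (startAgo.compareTo(endAgo) < 0) {
            throw new IllegalArgumentException("window start must not be after window end");
        }
        return new long[] {
            now.minus(startAgo).toEpochMilli(),
            now.minus(endAgo).toEpochMilli()
        };
    }

    public static void main(String[] args) {
        // Example values only: from one day ago up to six hours ago
        long[] w = window(Instant.now(), Duration.ofDays(1), Duration.ofHours(6));
        System.out.println("scan.setTimeRange(" + w[0] + ", " + w[1] + ")");
    }
}
```

The resulting pair can be passed straight to `scan.setTimeRange(start, end)`. Passing `now` as a parameter keeps the computation deterministic and easy to check.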

v2g6jxz6 #2

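HBase has no concept of a range delete marker. To delete multiple cells, a delete marker has to be placed for each of them, which means every row must be scanned, either on the client side or on the server side. That leaves two options:
BulkDeleteProtocol: it uses a coprocessor endpoint, so the whole operation runs on the server side. This link has an example of how to use it. A web search will quickly show how to enable coprocessor endpoints in HBase.
Scan and delete: this is the cleanest and simplest option. Since you said you need to delete all column families older than a particular timestamp, the scan-and-delete operation can be optimized considerably by using server-side filtering to read only the first key of each row.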

    // Assumes `table` is an open Table and STOP_TS is the cutoff timestamp (epoch millis)
    Scan scan = new Scan();
    scan.setTimeRange(0, STOP_TS);  // upper bound is exclusive: cells older than STOP_TS
    // Crucial optimization: make sure multiple rows are processed together
    scan.setCaching(1000);
    // Crucial optimization: retrieve only row keys
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(), new KeyOnlyFilter());
    scan.setFilter(filters);
    List<Delete> deletes = new ArrayList<>(1000);
    // try-with-resources closes the scanner when the loop finishes
    try (ResultScanner scanner = table.getScanner(scan)) {
      Result[] rr;
      do {
        // Caching is set to 1000 above;
        // make full use of it and fetch the next 1000 rows in one go
        rr = scanner.next(1000);
        if (rr.length > 0) {
          for (Result r : rr) {
            // Delete's timestamp is inclusive, so STOP_TS - 1 removes
            // only the cells strictly older than STOP_TS
            deletes.add(new Delete(r.getRow(), STOP_TS - 1));
          }
          table.delete(deletes);
          deletes.clear();
        }
      } while (rr.length > 0);
    }
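
One detail worth checking in a loop like this is the boundary: the upper bound of `Scan.setTimeRange(min, max)` is exclusive, while the timestamp passed to `new Delete(row, ts)` is inclusive. The following is a plain-Java model of those two rules, with no HBase calls; the class `TimestampBounds` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TimestampBounds {
    /** Models Scan.setTimeRange(min, max): selects cells with min <= ts < max. */
    public static List<Long> scanned(List<Long> cellTimestamps, long min, long max) {
        return cellTimestamps.stream()
            .filter(ts -> ts >= min && ts < max)
            .collect(Collectors.toList());
    }

    /** Models new Delete(row, deleteTs): cells with ts <= deleteTs are masked. */
    public static List<Long> surviving(List<Long> cellTimestamps, long deleteTs) {
        return cellTimestamps.stream()
            .filter(ts -> ts > deleteTs)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> cells = Arrays.asList(90L, 99L, 100L, 110L);
        long stopTs = 100L;
        System.out.println(scanned(cells, 0L, stopTs));    // [90, 99]
        System.out.println(surviving(cells, stopTs));      // [110]
        System.out.println(surviving(cells, stopTs - 1));  // [100, 110]
    }
}
```

In this model, deleting at `stopTs` also removes the cell stamped exactly `stopTs`, which the scan itself never selected; deleting at `stopTs - 1` removes only the cells that are strictly older.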
