Difference between happybase table.scan() and HBase Thrift scannerGetList()

Posted by aiazj4mn on 2021-06-09 in HBase


I have two versions of a Python script that scans an HBase table 1000 rows at a time inside a while loop. The first uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows

while variable:
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print(key)
    new_key = key

The second uses the HBase Thrift interface, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/

scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000) 
while len(data):
    for dbpost in data:
        print(dbpost.row)  # each result is a TRowResult; .row holds the row key
    data = hbase.scannerGetList(scanner_id, 1000)

The row keys in the database are numbers. My problem is that something odd happens at one particular row.
happybase prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest) 
193622937692155904 
193623435597983745...

and the Thrift scanner prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest)
100162267416506368 
10016241167 
10016296927 ...

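This does not happen at the boundary between 1000-row chunks (where a new scan starts at row_start, or where the next scannerGetList call is made), but in the middle of a batch. It happens every time.
I would say the second script, the one using scannerGetList, is doing it right.
Why does happybase behave differently? Is it taking timestamps into account, or some other piece of happybase/HBase logic? Will it eventually scan the whole table, just in a different order?
Also, I know the happybase version will scan and print the 1000th row twice, and that scannerGetList will skip the first row of the next batch. That is not the point; the magic happens in the middle of a 1000-row batch.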
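
For reference, HBase keeps rows sorted by the raw bytes of the row key, not by numeric value, which is why a short key such as 10016177552 can follow a much longer one. The sketch below uses only keys quoted in the two outputs above and needs no HBase connection. It checks that the Thrift output is in byte-wise order and that the key happybase jumps to sorts after all of them, which suggests rows were skipped rather than reordered. It does not explain why the skip occurs.

```python
# Row keys copied from the Thrift scanner output above, as byte strings.
thrift_order = [
    b"100161632382107648",
    b"10016177552",
    b"10016186396",
    b"10016200693",
    b"10016211838",
    b"100162138374537217",  # point of interest
    b"100162267416506368",
    b"10016241167",
    b"10016296927",
]

# The key happybase printed right after the point of interest.
happybase_next = b"193622937692155904"

# Byte-wise (lexicographic) sorting reproduces the order the scanner returned.
is_bytewise_sorted = sorted(thrift_order) == thrift_order

# Sorting by integer value gives a different order.
is_numeric_sorted = sorted(thrift_order, key=int) == thrift_order

# The happybase key sorts after every key listed here.
jumped_forward = happybase_next > max(thrift_order)

print(is_bytewise_sorted, is_numeric_sorted, jumped_forward)  # True False True
```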

hbase python thrift happybase

Source: https://stackoverflow.com/questions/30939856/difference-between-happybase-table-scan-and-hbase-thrift-scannergetlist

1 answer


kyvafyod1#

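I'm not sure about your data, but those loops are not equivalent. Your Thrift version uses a single scanner, while your happybase example repeatedly creates a new one. In addition, your happybase version sets a scanner limit, while your Thrift version does not.
With Thrift you have to do the bookkeeping yourself, and your loop needs duplicated code (the scannerGetList() call), so perhaps that is what is causing the confusion.
The right approach with happybase is: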

table = connection.table(tablename)
for key, data in table.scan(row_start=new_key, batch_size=1000):
    print(key)
    if some_condition:
        break  # this will cleanly close the scanner

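Note: there are no nested loops here. Another benefit is that happybase properly closes the scanner when you are done with it, while your Thrift version does not.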
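
The Thrift-side bookkeeping mentioned above can be wrapped in a generator, so that the scannerGetList() call appears only once and the scanner gets closed. This is a minimal sketch and not part of the original answer. iter_rows is a hypothetical helper name, and client is assumed to be an HBase Thrift (Thrift1) client like the hbase object in the question. The four-argument scannerOpenWithStop call mirrors the question; some HBase versions expect an extra attributes argument.

```python
def iter_rows(client, table_name, start_row=b"", stop_row=b"", batch_size=1000):
    """Yield row results from a single Thrift scanner, batch_size rows per fetch."""
    # Assumption: same four-argument call as in the question's script.
    scanner_id = client.scannerOpenWithStop(table_name, start_row, stop_row, [])
    try:
        while True:
            batch = client.scannerGetList(scanner_id, batch_size)
            if not batch:
                break  # an empty list means the scanner is exhausted
            for row_result in batch:
                yield row_result
    finally:
        # Runs when the scan finishes, when an error is raised, or when the
        # generator is closed or garbage-collected after an early break.
        client.scannerClose(scanner_id)
```

With the question's names it would be used as: for dbpost in iter_rows(hbase, tablename): print(dbpost.row). As with the happybase loop, there is one flat loop over one scanner for the whole table.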
