Outputting separate HBase columns with happybase

mlmc2os5  posted on 2021-06-09 in HBase

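I have an HBase table like this:

```
total date1:tCount1 date2:tCount2 ...
url1 date1:clickCount1 date2:clickCount2 ...
url2 date1:clickCount1 date2:clickCount2 ...
...
```

`url1, url2, ...` are the row keys. The table has only one column family.

I have a date range (from `datei` to `datej`) as input. I need to output each URL's share of the clicks for every day in that range.

The output must have the following format:

```
datei url1:share1 url2:share1...
...
datej url1:share1 url2:share1...
```

where

```
datei.url1:share1 = url1.datei:clickCount1 / total datei:tCount1
```

I started writing a happybase script, but I don't know how to select individual columns from a row with happybase. My happybase script is below: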

```python
# Python 2 script.
import argparse
import calendar
import getpass
import happybase
import logging
import random
import sys

USAGE = """

To query daily data for a year, run:
$ {0} --action query --year 2014

To query daily data for a particular month, run:
$ {0} --action query --year 2014 --month 10

To query daily data for a particular day, run:
$ {0} --action query --year 2014 --month 10 --day 27

To compute totals add --total argument.

""".format(sys.argv[0])

logging.basicConfig(level="DEBUG")

HOSTS = ["bds%02d.vdi.mipt.ru" % i for i in xrange(7, 10)]
TABLE = "VisitCountPy-" + getpass.getuser()

def connect():
    # Pick one of the Thrift servers at random.
    host = random.choice(HOSTS)
    conn = happybase.Connection(host)

    logging.debug("Connecting to HBase Thrift Server on %s", host)
    conn.open()

    if TABLE not in conn.tables():
        # Create a table with column family `cf` with default settings.
        conn.create_table(TABLE, {"cf": dict()})
        logging.debug("Created table %s", TABLE)
    else:
        logging.debug("Using table %s", TABLE)
    return happybase.Table(TABLE, conn)

def query(args, table):
    r = list(get_time_range(args))
    t = 0L
    # Note: row_stop is exclusive, so the row for max(r) itself is not returned.
    for key, data in table.scan(row_start=min(r), row_stop=max(r)):
        if args.total:
            t += long(data["cf:value"])
        else:
            print "%s\t%s" % (key, data["cf:value"])
    if args.total:
        print "total\t%s" % t

def get_time_range(args):
    # Yield every requested day as a YYYYMMDD string.
    cal = calendar.Calendar()
    years = [args.year]
    months = [args.month] if args.month is not None else range(1, 1+12)

    for year in years:
        for month in months:
            if args.day is not None:
                days = [args.day]
            else:
                days = cal.itermonthdays(year, month)
            for day in days:
                # itermonthdays() pads partial weeks with zeros; skip them.
                if day > 0:
                    yield "%04d%02d%02d" % (year, month, day)

def main():
    parser = argparse.ArgumentParser(description="An HBase example", usage=USAGE)
    parser.add_argument("--action", metavar="ACTION", choices=("generate", "query"), required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, default=None)
    parser.add_argument("--day", type=int, default=None)
    parser.add_argument("--total", action="store_true", default=False)

    args = parser.parse_args()
    table = connect()

    if args.day is not None and args.month is None:
        raise RuntimeError("Please, specify a month when specifying a day.")
    # Day 0 would produce an empty range, so reject it as well.
    if args.day is not None and (args.day < 1 or args.day > 31):
        raise RuntimeError("Please, specify a valid day.")

    query(args, table)

if __name__ == "__main__":
    main()
```

So how should I change my script (really just the `query()` function) to get the separate columns within the given date range?

iezvtpos1#

I think you should use scanner filters, which you can pass via the `scan(filter=...)` argument.
See https://github.com/wbolster/happybase/issues/11 for some pointers (examples, docs).
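To make the suggestion concrete, here is a minimal sketch of how a filter could be combined with the share computation from the question. It rests on several assumptions that are not confirmed by the question: the column qualifiers are dates in `YYYYMMDD` form (so they sort lexicographically), the counts are stored as ASCII decimal strings, the column family is `cf`, and the totals live in a row keyed `total`. The helper names `date_range_filter`, `click_shares` and `format_shares` are hypothetical. `ColumnRangeFilter` comes from HBase's Thrift filter language. Unlike the Python 2 script in the question, the sketch is written for Python 3, where happybase returns row keys, column names and values as `bytes`.

```python
from collections import defaultdict


def date_range_filter(date_i, date_j):
    """Build a filter string that keeps qualifiers in [date_i, date_j].

    Assumes qualifiers are YYYYMMDD strings, so byte order equals date order.
    """
    text = "ColumnRangeFilter('%s', true, '%s', true)" % (date_i, date_j)
    return text.encode("ascii")


def click_shares(table, date_i, date_j, total_key=b"total"):
    """Return {date: {url: share}} for the date columns in the range.

    `table` is anything with a happybase-style scan(filter=...) method that
    yields (row_key, {b"family:qualifier": value}) pairs.
    """
    totals = {}
    clicks = defaultdict(dict)
    for key, data in table.scan(filter=date_range_filter(date_i, date_j)):
        for column, value in data.items():
            # b"cf:20141027" -> "20141027"
            date = column.split(b":", 1)[1].decode()
            if key == total_key:
                totals[date] = int(value)  # assumes ASCII-encoded counts
            else:
                clicks[date][key.decode()] = int(value)
    # Divide only once all rows are read, since the position of the
    # `total` row in the scan depends on how the keys sort.
    return {
        date: {url: n / totals[date] for url, n in sorted(urls.items())}
        for date, urls in sorted(clicks.items())
        if totals.get(date)  # skip dates with a missing or zero total
    }


def format_shares(shares):
    """Render one 'date url1:share url2:share ...' line per date."""
    return [
        "%s %s" % (date, " ".join("%s:%.4f" % item for item in urls.items()))
        for date, urls in shares.items()
    ]
```

With a real connection, `table` would be something like `happybase.Connection(host).table(TABLE)`, and `print("\n".join(format_shares(click_shares(table, "20141001", "20141031"))))` would emit the requested layout. If the range is short, passing an explicit list to `scan(columns=[...])` or `row(key, columns=[...])` may be a simpler alternative to a filter.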
