com.clearspring.analytics.stream.cardinality.HyperLogLog类的使用及代码示例

x33g5p2x  于2022-01-20 转载在 其他  
字(10.1k)|赞(0)|评价(0)|浏览(224)

本文整理了Java中com.clearspring.analytics.stream.cardinality.HyperLogLog类的一些代码示例,展示了HyperLogLog类的具体用法。这些代码示例主要来源于Github/Stackoverflow/Maven等平台,是从一些精选项目中提取出来的代码,具有较强的参考意义,能在一定程度帮忙到你。HyperLogLog类的具体详情如下:
包路径:com.clearspring.analytics.stream.cardinality.HyperLogLog
类名称:HyperLogLog

HyperLogLog介绍

[英]Java implementation of HyperLogLog (HLL) algorithm from this paper:

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

HLL is an improved version of LogLog that is capable of estimating the cardinality of a set with accuracy = 1.04/sqrt(m) where m = 2^b. So we can control accuracy vs space usage by increasing or decreasing b.

The main benefit of using HLL over LL is that it only requires 64% of the space that LL does to get the same accuracy.

This implementation implements a single counter. If a large (millions) number of counters are required you may want to refer to:

http://dsiutils.di.unimi.it/

It has a more complex implementation of HLL that supports multiple counters in a single object, drastically reducing the java overhead from creating a large number of objects.

This implementation leveraged a javascript implementation that Yammer has been working on:

https://github.com/yammer/probablyjs

Note that this implementation does not include the long range correction function defined in the original paper. Empirical evidence shows that the correction function causes more harm than good.

Users have different motivations to use different types of hashing functions. Rather than try to keep up with all available hash functions and to remove the concern of causing future binary incompatibilities this class allows clients to offer the value in hashed int or long form. This way clients are free to change their hash function on their own time line. We recommend using Google's Guava Murmur3_128 implementation as it provides good performance and speed when high precision is required. In our tests the 32bit MurmurHash function included in this project is faster and produces better results than the 32 bit murmur3 implementation google provides.
[中]本文中HyperLogLog(HLL)算法的Java实现:
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
HLL是LogLog的一个改进版本,它能够估计精度为1.04/sqrt(m)的集合的基数,其中m=2^b。因此,我们可以通过增加或减少b来控制精度与空间使用。
在LL上使用HLL的主要好处是,它只需要LL的64%的空间就可以获得相同的精度。
这个实现实现了一个计数器。如果需要大量(百万)计数器,您可能需要参考:
http://dsiutils.di.unimi.it/
它有一个更复杂的HLL实现,支持单个对象中的多个计数器,大大减少了创建大量对象的java开销。
此实现利用了Yammer一直致力于的javascript实现:
https://github.com/yammer/probablyjs
请注意,此实现不包括原始文件中定义的远程校正功能。经验证据表明,修正函数弊大于利。
用户使用不同类型的散列函数有不同的动机。该类允许客户端以哈希整型或长型形式提供值,而不是试图跟上所有可用的哈希函数并消除导致未来二进制不兼容的担忧。这样,客户端就可以在自己的时间线上自由地更改其哈希函数。我们建议使用Google的Guava 3_128实现,因为它在需要高精度时提供了良好的性能和速度。在我们的测试中,这个项目中包含的32位杂音散列函数比google提供的32位杂音3实现更快,产生更好的结果。

代码示例

代码示例来源:origin: addthis/stream-lib

@Override
public HyperLogLog build() {
  return new HyperLogLog(log2m);
}

代码示例来源:origin: addthis/stream-lib

@Test
public void testComputeCount() {
  HyperLogLog hyperLogLog = new HyperLogLog(16);
  hyperLogLog.offer(0);
  hyperLogLog.offer(1);
  hyperLogLog.offer(2);
  hyperLogLog.offer(3);
  hyperLogLog.offer(16);
  hyperLogLog.offer(17);
  hyperLogLog.offer(18);
  hyperLogLog.offer(19);
  hyperLogLog.offer(19);
  assertEquals(8, hyperLogLog.cardinality());
}

代码示例来源:origin: addthis/stream-lib

@Test
public void testSerializationUsingBuilder() throws IOException {
  HyperLogLog hll = new HyperLogLog(8);
  hll.offer("a");
  hll.offer("b");
  hll.offer("c");
  hll.offer("d");
  hll.offer("e");
  HyperLogLog hll2 = HyperLogLog.Builder.build(hll.getBytes());
  assertEquals(hll.cardinality(), hll2.cardinality());
}

代码示例来源:origin: apache/incubator-pinot

public static HyperLogLog clone(HyperLogLog hll, int log2m) {
 try {
  HyperLogLog ret = new HyperLogLog(log2m);
  ret.addAll(hll);
  return ret;
 } catch (CardinalityMergeException e) {
  throw new RuntimeException(e);
 }
}

代码示例来源:origin: apache/incubator-pinot

/**
 * Generate a hll from a single value, and convert it to string type.
 * It is used for default derived field value.
 * @param log2m
 * @param value
 * @return
 */
public static String singleValueHllAsString(int log2m, Object value) {
 HyperLogLog hll = new HyperLogLog(log2m);
 hll.offer(value);
 return convertHllToString(hll);
}

代码示例来源:origin: apache/incubator-pinot

@Override
public HyperLogLog applyRawValue(HyperLogLog value, Object rawValue) {
 if (rawValue instanceof byte[]) {
  try {
   value.addAll(deserializeAggregatedValue((byte[]) rawValue));
  } catch (CardinalityMergeException e) {
   throw new RuntimeException(e);
  }
 } else {
  value.offer(rawValue);
 }
 return value;
}

代码示例来源:origin: addthis/stream-lib

@Test
public void testMerge() throws CardinalityMergeException {
  int numToMerge = 5;
  int bits = 16;
  int cardinality = 1000000;
  HyperLogLog[] hyperLogLogs = new HyperLogLog[numToMerge];
  HyperLogLog baseline = new HyperLogLog(bits);
  for (int i = 0; i < numToMerge; i++) {
    hyperLogLogs[i] = new HyperLogLog(bits);
    for (int j = 0; j < cardinality; j++) {
      double val = Math.random();
      hyperLogLogs[i].offer(val);
      baseline.offer(val);
    }
  }
  long expectedCardinality = numToMerge * cardinality;
  HyperLogLog hll = hyperLogLogs[0];
  hyperLogLogs = Arrays.asList(hyperLogLogs).subList(1, hyperLogLogs.length).toArray(new HyperLogLog[0]);
  long mergedEstimate = hll.merge(hyperLogLogs).cardinality();
  long baselineEstimate = baseline.cardinality();
  double se = expectedCardinality * (1.04 / Math.sqrt(Math.pow(2, bits)));
  System.out.println("Baseline estimate: " + baselineEstimate);
  System.out.println("Expect estimate: " + mergedEstimate + " is between " + (expectedCardinality - (3 * se)) + " and " + (expectedCardinality + (3 * se)));
  assertTrue(mergedEstimate >= expectedCardinality - (3 * se));
  assertTrue(mergedEstimate <= expectedCardinality + (3 * se));
  assertEquals(mergedEstimate, baselineEstimate);
}

代码示例来源:origin: Big-Data-Manning/big-data-code

public void operate(FlowProcess process, BufferCall call) {
    HyperLogLog hll = new HyperLogLog(14);
    Iterator<TupleEntry> it = call.getArgumentsIterator();
    while(it.hasNext()) {
      TupleEntry tuple = it.next();
      hll.offer(tuple.getObject(0));
    }
    try {
      call.getOutputCollector().add(
        new Tuple(hll.getBytes()));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}

代码示例来源:origin: addthis/stream-lib

int cardinality = 1000000000;
int b = 12;
HyperLogLog baseline = new HyperLogLog(b);
HyperLogLog guava128 = new HyperLogLog(b);
HashFunction hf128 = Hashing.murmur3_128();
for (int j = 0; j < cardinality; j++) {
  Double val = Math.random();
  String valString = val.toString();
  baseline.offer(valString);
  guava128.offerHashed(hf128.hashString(valString, Charsets.UTF_8).asLong());
  if (j > 0 && j % 1000000 == 0) {
    System.out.println("current count: " + j);
long baselineEstimate = baseline.cardinality();
long g128Estimate = guava128.cardinality();
double se = cardinality * (1.04 / Math.sqrt(Math.pow(2, b)));
double baselineError = (baselineEstimate - cardinality) / (double) cardinality;

代码示例来源:origin: addthis/stream-lib

/**
 * should not fail with HyperLogLogMergeException: "Cannot merge estimators of different sizes"
 */
@Test
public void testMergeWithRegisterSet() throws CardinalityMergeException {
  HyperLogLog first = new HyperLogLog(16, new RegisterSet(1 << 20));
  HyperLogLog second = new HyperLogLog(16, new RegisterSet(1 << 20));
  first.offer(0);
  second.offer(1);
  first.merge(second);
}

代码示例来源:origin: apache/incubator-pinot

hyperLogLogs = new ArrayList<>(_numMetricColumns);
 for (int i = 0; i < _numMetricColumns; i++) {
  hyperLogLogs.add(new HyperLogLog(HLL_CONFIG.getHllLog2m()));
 int dictId = _metricValIterators[i].nextIntVal();
 HyperLogLog hyperLogLog = hyperLogLogs.get(i);
 hyperLogLog.addAll(HllUtil.convertStringToHll(_metricDictionaries[i].getStringValue(dictId)));
 hyperLogLogs.set(i, hyperLogLog);
List<Double> finalResult = new ArrayList<>();
for (HyperLogLog hyperLogLog : hyperLogLogs) {
 finalResult.add((double) hyperLogLog.cardinality());

代码示例来源:origin: apache/incubator-pinot

@Test
public void testHyperLogLog() {
 for (int i = 0; i < NUM_ITERATIONS; i++) {
  HyperLogLog expected = new HyperLogLog(7);
  byte[] bytes = ObjectSerDeUtils.serialize(expected);
  HyperLogLog actual = ObjectSerDeUtils.deserialize(bytes, ObjectSerDeUtils.ObjectType.HyperLogLog);
  assertEquals(actual.cardinality(), expected.cardinality(), ERROR_MESSAGE);
 }
}

代码示例来源:origin: LiveRamp/cascading_ext

HyperLogLog card = HyperLogLog.Builder.build(Bytes.getBytes((BytesWritable)tuple.getObject("bytes")));
 countParts.add(card);
 totalSum += card.cardinality();
HyperLogLog merged = (HyperLogLog)new HyperLogLog(hllError).merge(countParts.toArray(new ICardinality[countParts.size()]));
long cardinality = merged.cardinality();

代码示例来源:origin: apache/incubator-pinot

@Nonnull
@Override
public Long extractFinalResult(@Nonnull HyperLogLog intermediateResult) {
 return intermediateResult.cardinality();
}

代码示例来源:origin: apache/incubator-pinot

@Override
public HyperLogLog applyAggregatedValue(HyperLogLog value, HyperLogLog aggregatedValue) {
 try {
  value.addAll(aggregatedValue);
  return value;
 } catch (CardinalityMergeException e) {
  throw new RuntimeException(e);
 }
}

代码示例来源:origin: apache/incubator-pinot

/**
 * Helper method to set value for a groupKey into the result holder.
 *
 * @param groupByResultHolder Result holder
 * @param groupKey Group-key for which to set the value
 * @param value Value for the group key
 */
private static void setValueForGroupKey(@Nonnull GroupByResultHolder groupByResultHolder, int groupKey,
  Object value) {
 HyperLogLog hyperLogLog = getHyperLogLog(groupByResultHolder, groupKey);
 hyperLogLog.offer(value);
}

代码示例来源:origin: LiveRamp/cascading_ext

@Override
public void cleanup(FlowProcess flowProcess, OperationCall operationCall) {
 JobConf conf = (JobConf) flowProcess.getConfigCopy();
 try {
  LOG.info("HLL counter found " + approxCounter.cardinality() + " distinct keys");
  Hfs tap = new Hfs(new SequenceFile(new Fields("bytes")), BloomProps.getApproxCountsDir(conf));
  TupleEntryCollector out = tap.openForWrite(new HadoopFlowProcess(conf));
  out.add(new Tuple(new BytesWritable(approxCounter.getBytes())));
  out.close();
 } catch (IOException e) {
  throw new RuntimeException("couldn't write approximate counts to side bucket", e);
 }
}

代码示例来源:origin: apache/incubator-pinot

public static byte[] toBytes(HyperLogLog hll) {
 try {
  return hll.getBytes();
 } catch (IOException e) {
  throw new RuntimeException(e);
 }
}

代码示例来源:origin: Big-Data-Manning/big-data-code

public void operate(FlowProcess process, BufferCall call) {
    Iterator<TupleEntry> it = call.getArgumentsIterator();
    HyperLogLog curr = null;
    try {
      while(it.hasNext()) {
        TupleEntry tuple = it.next();
        byte[] serialized = (byte[]) tuple.getObject(0);
        HyperLogLog hll = HyperLogLog.Builder.build(
                   serialized);
        if(curr==null) {
          curr = hll;
        } else {
          curr = (HyperLogLog) curr.merge(hll);
        }
      }
      call.getOutputCollector().add(
        new Tuple(curr.getBytes()));
    } catch (IOException e) {
      throw new RuntimeException(e);
    } catch(CardinalityMergeException e) {
      throw new RuntimeException(e);
    }
  }
}

代码示例来源:origin: addthis/stream-lib

@Override
public boolean offer(Object o) {
  final int x = MurmurHash.hash(o);
  return offerHashed(x);
}

相关文章