hyperloglog

package module
v0.2.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 28, 2025 License: MIT Imports: 8 Imported by: 138

README

HyperLogLog - an algorithm for approximating the number of distinct elements

GoDoc Go Report Card CircleCI

An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset. This implementation offers enhanced performance, flexibility, and simplicity while maintaining accuracy.

Note on Implementation History

The initial version of this work (tagged as v0.1.0) was based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen". However, the current implementation has evolved significantly from this original basis, notably moving away from the tailcut method.

Current Implementation

The current implementation is based on the LogLog-Beta algorithm, as described in:

"LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting" by Jason Qin, Denys Kim, and Yumei Tung (2016).

Key features of the current implementation:

  • Metro hash used instead of xxhash
  • Sparse representation for lower cardinalities (like HyperLogLog++)
  • LogLog-Beta for dynamic bias correction across all cardinalities
  • 8-bit registers for convenience and simplified implementation
  • Order-independent insertions and merging for consistent results regardless of data input order
  • Removal of tailcut method for a more straightforward approach
  • Flexible precision allowing for 2^4 to 2^18 registers

This implementation is now more straightforward, efficient, and flexible, while remaining backwards compatible with previous versions. It provides a balance between precision, memory usage, speed, and ease of use.

Precision and Memory Usage

This implementation allows for creating HyperLogLog sketches with arbitrary precision between 2^4 and 2^18 registers. The memory usage scales with the number of registers:

  • Minimum (2^4 registers): 16 bytes
  • Default (2^14 registers): 16 KB
  • Maximum (2^18 registers): 256 KB

Users can choose the precision that best fits their use case, balancing memory usage against estimation accuracy.

Note

A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".

Contributing

Kindly check our contributing guide on how to propose bugfixes and improvements, and submitting pull requests to the project

License

© Axiom, Inc., 2024

Distributed under MIT License (The MIT License).

See LICENSE for more information.

Documentation

Index

Constants

This section is empty.

Variables

View Source
var ErrorTooShort = errors.New("too short binary")

ErrorTooShort is an error that UnmarshalBinary try to parse too short binary.

Functions

This section is empty.

Types

type Sketch

type Sketch struct {
	// contains filtered or unexported fields
}

func New

func New() *Sketch

New returns a HyperLogLog Sketch with 2^14 registers (precision 14)

func New14

func New14() *Sketch

New14 returns a HyperLogLog Sketch with 2^14 registers (precision 14)

func New16

func New16() *Sketch

New16 returns a HyperLogLog Sketch with 2^16 registers (precision 16)

func New16NoSparse

func New16NoSparse() *Sketch

New16NoSparse returns a HyperLogLog Sketch with 2^16 registers (precision 16) that will not use a sparse representation

func NewNoSparse

func NewNoSparse() *Sketch

NewNoSparse returns a HyperLogLog Sketch with 2^14 registers (precision 14) that will not use a sparse representation

func NewSketch added in v0.2.0

func NewSketch(precision uint8, sparse bool) (*Sketch, error)

func (*Sketch) AppendBinary added in v0.2.3

func (sk *Sketch) AppendBinary(data []byte) ([]byte, error)

AppendBinary implements the encoding.BinaryAppender interface.

func (*Sketch) Clone

func (sk *Sketch) Clone() *Sketch

Clone returns a deep copy of sk.

func (*Sketch) Estimate

func (sk *Sketch) Estimate() uint64

func (*Sketch) Insert

func (sk *Sketch) Insert(e []byte)

func (*Sketch) InsertHash

func (sk *Sketch) InsertHash(x uint64)

func (*Sketch) MarshalBinary

func (sk *Sketch) MarshalBinary() (data []byte, err error)

MarshalBinary implements the encoding.BinaryMarshaler interface.

When the result will be appended to another buffer, consider using AppendBinary to avoid additional allocations and copying.

func (*Sketch) Merge

func (sk *Sketch) Merge(other *Sketch) error

func (*Sketch) UnmarshalBinary

func (sk *Sketch) UnmarshalBinary(data []byte) error

UnmarshalBinary implements the encoding.BinaryUnmarshaler interface.

Directories

Path Synopsis
demo module

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL