hyperLogLogs
HyperLogLog is a probabilistic algorithm for estimating the cardinality of a set, i.e., the number of distinct elements, in large data streams or distributed computations. It achieves memory efficiency by hashing each element to a fixed-size bit string and updating a small, fixed set of registers rather than storing elements. The hash is divided into two parts: an index that selects a register and the value that records the position of the first 1 bit in the remaining bits. Each register stores the maximum observed position, and the final cardinality estimate is computed from these registers using a scaled harmonic mean.
The practical strength of HyperLogLog lies in its space-time efficiency and mergeability. With m registers (where
HyperLogLog++ is a widely used improvement that reduces bias, provides better small-cardinality handling through linear counting,