I recently learned about HyperLogLog, which feels like cursed counter-intuitive magic, so I am eager to share.
We want to count unique items, like “how many unique words appear across all books at your local library?” or “how many unique Facebook users logged in over the past month?” For a small set of unique tokens, like counting the unique words in this blog post, you might store each word in a set or hash table as you read them, then count the length of your set when you’re done. This is simple, but means the amount of memory used will scale linearly with the number of unique tokens, making such an approach impractical when counting enormous sets of tokens. But what if I told you we could accurately estimate the number of unique words while storing only a single integer?
To start with, we want to hash each of our words. A hash function takes arbitrary data and translates it to a ‘random’ but consistent number. For example, we’ll use a hash function that takes any word and turns it into a number from zero to
2**64, with a uniform probability across all possible numbers. A good hash function will be unpredictable, so changing a single letter in the word or swapping the order of letters will yield a completely different number.
Next, we take the resulting hash, treat it as binary, and count how many leading bits are zero. An example is shown below:
We repeat this process for every word, tracking only the highest number of leading zero-bits we’ve observed, which we’ll call
n. When we reach the end of our data, we return
2**n as our estimate of how many unique words we’ve seen.
So how in the world does this work? The key is that a good hash function returns hashes uniformly across its range, so we have turned each unique word into random numbers. Since hashing functions are deterministic, duplicate words will return the same hash.
A uniformly random number of fixed bit-length (for example, a random 64-bit integer) will start with a zero-bit with a probability of
1/2, and will start with a 1-bit with a probability of
1/2. It will start with two zero-bits with a probability of
1/4, three zero-bits with a probability of
1/8, and so on. A probability tree for this might look like:
We can run this explanation in reverse: if you have observed a hash that starts with three zero-bits, then on average you will have observed about 8 unique hashes, because around 1 in 8 hashes start with three zero-bits.
This sounds great, but there are two problems. First, the words “on average” are pretty important here: if you only examine one word, and it happens to have a hash starting with four leading zeros, then our probabilistic counting algorithm will guess that you’ve examined sixteen words, rather than one. Over 6% of hashes will start with four leading zeros, so this is easily possible. We need some way to overcome these ‘outliers’ and get a more statistically representative count of leading zeros.
Second, our probabilistic counting function can only return integer powers of two as estimates. It can guess that you’ve observed 8, 256, or 1024 words, but it can never estimate that you’ve observed 800 words. We want an estimator with a higher precision.
One strategy for addressing both limitations of probabilistic counting is to use multiple hashes. If we hash each observed word using ten different hash functions (or one hash function with ten different salts, but that’s a technical tangent), then we can maintain ten different counts of the highest number of leading zeros observed. Then at the end, we return the average of the ten estimates.
The more hash functions we use, the less sensitive our algorithm will be to outliers. Additionally, averaging over multiple counts lets us produce non-integer estimates. For example, if half our hash functions yield a maximum of four leading zeros, and half yield a maximum of five leading zeros, then we could estimate
2**4.5 unique tokens, or around 23.
This approach solves both our problems, but at a severe cost: now we need to calculate ten times as many hashes! If we’re counting upwards of billions of words, then this approach requires calculating nine billion additional hashes. Clearly, this won’t scale well.
Fortunately, there is an alternative solution that requires no additional hashing, known as HyperLogLog. Instead of using multiple hash functions and averaging across the results, we can instead pre-divide our words into buckets, and average across those.
For example, we could make 16 buckets, assign incoming hashes to each bucket uniformly, and maintain a “most leading zero-bits observed” counter for each bucket. Then we calculate an estimated number of unique elements from each bucket, and average across all buckets to get a global estimate.
For an easy approach to assigning hashes to each bucket, we can use the first four bits of each hash as a bucket ID, then count the number of leading zeros after this ID.
Once again, averaging across several sets of “most leading zeros” will minimize the impact of outliers, and afford us greater precision, by allowing non-integer exponents for our powers of two. Unlike the multiple hash solution, however, this approach will scale nicely.
One downside to HyperLogLog is that the bucket-averaging process is a little complicated. Dividing hashes across multiple buckets diminishes the impact of outliers, as desired, but it also diminishes the impact of all our hashes. For example, say we have 64 hashes, spread across 16 buckets, so 4 hashes per bucket. With 64 hashes, we can expect, on average, one hash with six leading zeros. However, each bucket has only four hashes, and therefore an expected maximum of two leading zeros. So while one bucket probably has six, most have closer to two, and taking the arithmetic mean of the buckets would severely underestimate the number of unique hashes we’ve observed. Therefore, HyperLogLog has a more convoluted estimation algorithm, consisting of creating estimates from each bucket, taking their harmonic mean, multiplying by the number of buckets squared, and multiplying by a magic number derived from the number of buckets1. This results in dampening outliers while boosting the estimate back into the appropriate range.
Here’s a plot comparing the accuracy of Probabilistic counting (count leading zeros, no compensation for outliers), Probabilistic-Med counting (run Probabilistic using ten hash functions, return median of results), and HyperLogLog (our fancy bucket solution):
I’ve generated random strings as input, and evaluate at 50 points on the x-axis, with 100 draws of random strings per x-axis point to create a distribution and error bars. The y-axis represents each estimation function’s guess as to the number of unique elements, with a 95% confidence interval.
Unsprisingly, plain probabilistic counting does not fare well. When we generate thousands of strings, the likelihood that at least one will have many leading zeros is enormous, and since our algorithm relies on counting the maximum observed leading zeros, it’s extremely outlier sensitive.
Taking the mean across ten hash algorithms is also outlier-sensitive when the outliers are large enough, which is why I’ve opted for the median in this plot. Probabilistic-Med performs much better, but it suffers the same problems over a larger time-scale: as we read more and more unique tokens, the likelihood goes up that all ten hash functions will see at least one hash with many leading zeros. Therefore, as the number of unique tokens increases, Probabilistic-Med steadily begins to over-estimate the number of unique tokens, with increasing error bars.
HyperLogLog reigns supreme. While error increases with the number of unique hashes, it remains more accurate, with tighter error bars, than the multi-hash strategy, while remaining computationally cheap. We can increase HyperLogLog’s error tolerance and accuracy in high-unique-token scenarios by increasing the number of buckets, although this lowers accuracy when the number of unique tokens is small.
This is so darn cool! Tracking the total number of unique elements without keeping a list of those elements seems impossible - and it is if you need absolute precision - but with some clever statistics we can get a shockingly close estimate.
If you’d like to see a working example, here’s the code I wrote for generating the accuracy plot, which includes implementations of Probabilistic counting, Probabilistic-Med, and HyperLogLog. This is toy code in Python that converts all the hashes to strings of one and zero characters for easy manipulation, so it is not efficient and shouldn’t be treated as anything like an ideal reference.
The derivation of this number is quite complex, so in practice it’s drawn from a lookup table or estimated ↩