Posted 3/20/2023
I recently learned about HyperLogLog, which feels like cursed counter-intuitive magic, so I am eager to share.
We want to count unique items, like “how many unique words appear across all books at your local library?” or “how many unique Facebook users logged in over the past month?” For a small set of unique tokens, like counting the unique words in this blog post, you might store each word in a set or hash table as you read them, then count the length of your set when you’re done. This is simple, but means the amount of memory used will scale linearly with the number of unique tokens, making such an approach impractical when counting enormous sets of tokens. But what if I told you we could accurately estimate the number of unique words while storing only a single integer?
To start with, we want to hash each of our words. A hash function takes arbitrary data and translates it to a ‘random’ but consistent number. For example, we’ll use a hash function that takes any word and turns it into a number from zero to 2**64
, with a uniform probability across all possible numbers. A good hash function will be unpredictable, so changing a single letter in the word or swapping the order of letters will yield a completely different number.
Next, we take the resulting hash, treat it as binary, and count how many leading bits are zero. An example is shown below:
We repeat this process for every word, tracking only the highest number of leading zero-bits we’ve observed, which we’ll call n
. When we reach the end of our data, we return 2**n
as our estimate of how many unique words we’ve seen.
So how in the world does this work? The key is that a good hash function returns hashes uniformly across its range, so we have turned each unique word into random numbers. Since hashing functions are deterministic, duplicate words will return the same hash.
A uniformly random number of fixed bit-length (for example, a random 64-bit integer) will start with a zero-bit with a probability of 1/2
, and will start with a 1-bit with a probability of 1/2
. It will start with two zero-bits with a probability of 1/4
, three zero-bits with a probability of 1/8
, and so on. A probability tree for this might look like:
We can run this explanation in reverse: if you have observed a hash that starts with three zero-bits, then on average you will have observed about 8 unique hashes, because around 1 in 8 hashes start with three zero-bits.
This sounds great, but there are two problems. First, the words “on average” are pretty important here: if you only examine one word, and it happens to have a hash starting with four leading zeros, then our probabilistic counting algorithm will guess that you’ve examined sixteen words, rather than one. Over 6% of hashes will start with four leading zeros, so this is easily possible. We need some way to overcome these ‘outliers’ and get a more statistically representative count of leading zeros.
Second, our probabilistic counting function can only return integer powers of two as estimates. It can guess that you’ve observed 8, 256, or 1024 words, but it can never estimate that you’ve observed 800 words. We want an estimator with a higher precision.
One strategy for addressing both limitations of probabilistic counting is to use multiple hashes. If we hash each observed word using ten different hash functions (or one hash function with ten different salts, but that’s a technical tangent), then we can maintain ten different counts of the highest number of leading zeros observed. Then at the end, we return the average of the ten estimates.
The more hash functions we use, the less sensitive our algorithm will be to outliers. Additionally, averaging over multiple counts lets us produce non-integer estimates. For example, if half our hash functions yield a maximum of four leading zeros, and half yield a maximum of five leading zeros, then we could estimate 2**4.5
unique tokens, or around 23.
This approach solves both our problems, but at a severe cost: now we need to calculate ten times as many hashes! If we’re counting upwards of billions of words, then this approach requires calculating nine billion additional hashes. Clearly, this won’t scale well.
Fortunately, there is an alternative solution that requires no additional hashing, known as HyperLogLog. Instead of using multiple hash functions and averaging across the results, we can instead pre-divide our words into buckets, and average across those.
For example, we could make 16 buckets, assign incoming hashes to each bucket uniformly, and maintain a “most leading zero-bits observed” counter for each bucket. Then we calculate an estimated number of unique elements from each bucket, and average across all buckets to get a global estimate.
For an easy approach to assigning hashes to each bucket, we can use the first four bits of each hash as a bucket ID, then count the number of leading zeros after this ID.
Once again, averaging across several sets of “most leading zeros” will minimize the impact of outliers, and afford us greater precision, by allowing non-integer exponents for our powers of two. Unlike the multiple hash solution, however, this approach will scale nicely.
One downside to HyperLogLog is that the bucket-averaging process is a little complicated. Dividing hashes across multiple buckets diminishes the impact of outliers, as desired, but it also diminishes the impact of all our hashes. For example, say we have 64 hashes, spread across 16 buckets, so 4 hashes per bucket. With 64 hashes, we can expect, on average, one hash with six leading zeros. However, each bucket has only four hashes, and therefore an expected maximum of two leading zeros. So while one bucket probably has six, most have closer to two, and taking the arithmetic mean of the buckets would severely underestimate the number of unique hashes we’ve observed. Therefore, HyperLogLog has a more convoluted estimation algorithm, consisting of creating estimates from each bucket, taking their harmonic mean, multiplying by the number of buckets squared, and multiplying by a magic number derived from the number of buckets^{1}. This results in dampening outliers while boosting the estimate back into the appropriate range.
Here’s a plot comparing the accuracy of Probabilistic counting (count leading zeros, no compensation for outliers), Probabilistic-Med counting (run Probabilistic using ten hash functions, return median of results), and HyperLogLog (our fancy bucket solution):
I’ve generated random strings as input, and evaluate at 50 points on the x-axis, with 100 draws of random strings per x-axis point to create a distribution and error bars. The y-axis represents each estimation function’s guess as to the number of unique elements, with a 95% confidence interval.
Unsprisingly, plain probabilistic counting does not fare well. When we generate thousands of strings, the likelihood that at least one will have many leading zeros is enormous, and since our algorithm relies on counting the maximum observed leading zeros, it’s extremely outlier sensitive.
Taking the mean across ten hash algorithms is also outlier-sensitive when the outliers are large enough, which is why I’ve opted for the median in this plot. Probabilistic-Med performs much better, but it suffers the same problems over a larger time-scale: as we read more and more unique tokens, the likelihood goes up that all ten hash functions will see at least one hash with many leading zeros. Therefore, as the number of unique tokens increases, Probabilistic-Med steadily begins to over-estimate the number of unique tokens, with increasing error bars.
HyperLogLog reigns supreme. While error increases with the number of unique hashes, it remains more accurate, with tighter error bars, than the multi-hash strategy, while remaining computationally cheap. We can increase HyperLogLog’s error tolerance and accuracy in high-unique-token scenarios by increasing the number of buckets, although this lowers accuracy when the number of unique tokens is small.
This is so darn cool! Tracking the total number of unique elements without keeping a list of those elements seems impossible - and it is if you need absolute precision - but with some clever statistics we can get a shockingly close estimate.
If you’d like to see a working example, here’s the code I wrote for generating the accuracy plot, which includes implementations of Probabilistic counting, Probabilistic-Med, and HyperLogLog. This is toy code in Python that converts all the hashes to strings of one and zero characters for easy manipulation, so it is not efficient and shouldn’t be treated as anything like an ideal reference.
If you enjoyed this post, you may enjoy my other writing on dimensional analysis, network science for social modeling, or algorithmic complexity.
The derivation of this number is quite complex, so in practice it’s drawn from a lookup table or estimated ↩
Posted 3/6/2023
This is a post about Big-O notation and measuring algorithmic complexity; topics usually taught to computer science undergraduates in their second to fourth semester. It’s intended for curious people outside the field, or new students. There are many posts on this subject, but this one is mine.
In computer science we often care about whether an algorithm is an efficient solution to a problem, or whether one algorithm is more efficient than another approach. One might be tempted to measure efficiency in terms of microseconds it takes a process to run, or perhaps number of assembly instructions needed. However, these metrics will vary widely depending on what language an algorithm is implemented in, what hardware it’s run on, what other software is running on the system competing for resources, and a host of other factors. We’d prefer to think more abstractly, and compare one strategy to another rather than their implementations. In particular, computer scientists often examine how an algorithm scales, or how quickly it slows down as inputs grow very large.
Let’s start with a trivial example: given a list of numbers, return their sum. Looks something like:
def sum(list): total = 0 for item in list total += item end return total end
Since we need to read the entire list, this algorithm scales linearly with the length of the list - make the list a hundred times longer, and it will take roughly a hundred times longer to get a sum. We write this formally as O(n)
, meaning “scales linearly with n
, the size of the input.” We call this formal syntax “Big O notation,” where the ‘O’ stands for “order of approximation” (or in the original German, “Ordnung”).
Not all algorithms scale. If we were asked “return the third element in the list” then it wouldn’t matter whether the list is three elements long or three million elements long, we can get to the third element in a constant amount of time. This is written as O(1)
, indicating no reliance on the input size.
Search algorithms give us our first example problem with divergent solutions. Given a stack of papers with names on them, tell me whether “Rohan” is in the stack. A trivial solution might look like:
def hasName(list) for name in list if name == "Rohan" return true end end return false end
This scales linearly with the length of the list, just like summing the elements. If the list is in an unknown order then we have no choice but to examine every element. However, if we know the list is in alphabetical order then we can do better. Start in the middle of the list - if the name is Rohan, we’re done. If we’re after Rohan alphabetically, then discard the second half of the list, and repeat on the first half. If we’re before Rohan alphabetically, then discard the first half of the list and repeat on the second. If we exhaust the list, then Rohan’s not in it. This approach is called a binary search, and visually looks like:
In code, a binary search looks something like:
def hasName(list) if( list.length == 0 ) return false end middle = list.length / 2 if( list[middle] == "Rohan" ) return true elsif( list[middle] > "Rohan" ) # Search left half return hasName(list.first(middle)) else # Search right half return hasName(list[middle .. list.length] end end
With every step in the algorithm we discard half the list, so we look at far fewer than all the elements. Our binary search still gets slower as the input list grows longer - if we double the length of the list we need one extra search step - so the algorithm scales logarithmically rather than linearly, denoted O(log n)
.
We’ll end this section by looking at two sorting algorithms: insertion sort, and merge sort.
We want to sort a list, provided to us in random order. One simple approach is to build a new sorted list: one at a time, we take elements from the front of the main list, and find their correct position among the sorted list we’ve built so far. To find the correct position we just look at the value left of our new element, and check whether they should be swapped or not. Keep swapping left until the new element finds its correct position. This visually looks like:
One implementation might look like:
def insertionSort(list) for i in 0.upto(list.length-1) for j in (i-1).downto(0) if( list[j] > list[j+1] ) list[j], list[j+1] = list[j+1], list[j] else break # Done swapping, found the right spot! end end end return list end
Insertion sort is simple and easy to implement. If you were coming up with a sorting algorithm on the spot for something like sorting a deck of cards, you might invent something similar. So what’s the runtime?
In insertion sort, we walk the list from start to end, which is O(n)
. For every new element we examine, however, we walk the list backwards from our current position to the start. This operation also scales linearly with the length of the list, and so is also O(n)
. If we perform a backwards O(n)
walk for every step of the forwards O(n)
walk, that’s O(n) * O(n)
for a total of O(n^2)
. Can we do better?
An alternative approach to sorting is to think of it as a divide-and-conquer problem. Split the list in half, and hand the first half to one underling and the second half to another underling, and instruct them each to sort their lists. Each underling does the same, splitting their lists in half and handing them to two further underlings. Eventually, an underling receives a list of length one, which is by definition already sorted. This splitting stage looks something like:
Now we want to merge our results upwards. Each underling hands their sorted list back up to their superiors, who now have two sorted sub-lists. The superior combines the two sorted lists by first making a new empty “merged” list that’s twice as long. For every position in the merged list, the superior compares the top element of each sorted sub-list, and moves the lower element to the merged list. This process looks like:
Once all elements from the two sub-lists have been combined into a merged list, the superior hands their newly sorted list upwards to their superior. We continue this process until we reach the top of the tree, at which point our work is done. This merge step looks like:
In code, the full algorithm might look something like:
# Combine two sorted lists def merge(left, right) merged = [] while( left.length + right.length > 0 ) if( left.length == 0 ) # Left empty, take from right merged += right.shift(1) elsif( right.length == 0 ) # Right empty, take from left merged += left.shift(1) elsif( left[0] < right[0] ) # Top of left stack is less, take it merged += left.shift(1) else # Top of right stack is less, take it merged += right.shift(1) end end return merged end # Takes a single list, sub-divides it, sorts results def mergeSort(list) if( list.length <= 1 ) return list # Sorted already :) end middle = list.length / 2 left = list[0 .. middle-1] right = list[middle .. list.length-1] leftSorted = mergeSort(left) rightSorted = mergeSort(right) return merge(leftSorted, rightSorted) end
So what’s the runtime of merge sort? Well it takes log n
steps to divide the list in half down to one element. We do this division process for every element in the list. That gives us a runtime of n * log n
to break the list apart and create the full tree diagram.
Merging two sorted lists together scales linearly with the size of the lists, so the merge step is O(n)
. We need to perform a merge each time we move up a “level” of the tree, and there are log n
levels to this tree. Therefore, the full merge process also scales with O(n log n)
.
This gives us a total runtime of O(n log n + n log n)
or O(2n log n)
to create the tree and merge it back together. However, because we are concerned with how algorithms scale as the inputs become very large, we drop constants and all expressions but the dominant term - multiplying by 2 doesn’t mean much as n
approaches infinity - and simplify the run time to O(n log n)
. That’s a lot better than insertion sort’s O(n^2)
!
Big O notation typically describes an “average” or “expected” performance and not a “best case” or “worst-case”. For example, if a list is in a thoroughly random order, then insertion sort will have a performance of O(n^2)
. However, if the list is already sorted, or only one or two elements are out of place, then insertion sort’s best-case performance is O(n)
. That is, insertion sort will walk the list forwards, and if no elements are out of place, there will be no need to walk the list backwards to find a new position for any elements. By contrast, merge sort will always split the list into a tree and merge the branches back together, so even when handed a completely sorted list, merge sort’s best-case performance is still O(n log n)
.
Big O notation also does not describe memory complexity. The description of merge sort above creates a temporary merged
list during the merge step, meaning however long the input list is, merge sort needs at least twice as much memory space for its overhead. By contrast, insertion sort works “in place,” sorting the input list without creating a second list as a workspace. Many algorithms make a trade-off between time and space in this way.
Finally, Big O notation describes how an algorithm scales as n
gets very large. For small values of n
, insertion sort may outperform merge sort, because merge sort has some extra bookkeeping to allocate temporary space for merging and coordinate which minions are sorting which parts of the list.
In summary, Big O notation is a valuable tool for quickly comparing two algorithms, and can provide programmers with easy estimates as to which parts of a problem will be the most time-consuming. However, Big O notation is not the only metric that matters, and should not be treated as such.
All of the algorithms described above can be run in polynomial time. This means their scaling rate, or Big O value, can be upper-bounded by a polynomial of the form O(n^k)
. For example, while merge sort scales with O(n log n)
, and logarithms are not polynomials, n log n
is strictly less than n^2
, so merge sort is considered to run in polynomial time. By contrast, algorithms with runtimes like O(2^n)
or O(n!)
are not bounded by a polynomial, and perform abysmally slowly as n
grows large.
These definitions allow us to describe categories of algorithms. We describe algorithms that run in polynomial time as part of set P, and we typically describe P as a subset of NP - the algorithms where we can verify whether a solution is correct in polynomial time.
To illustrate the difference between running and verifying an algorithm, consider the graph coloring problem: given a particular map, and a set of three or more colors, can you color all the countries so that no two bordering countries have the same color? The known algorithms for this problem are tedious. Brute forcing all possible colorings scales with O(k^n)
for k-colors and n-countries, and the fastest known general algorithms run in O(n * 2^n)
. However, given a colored-in map, it’s easy to look at each country and its neighbors and verify that none violate the coloring rules. At worst, verifying takes O(n^2)
time if all countries border most others, but more realistically O(n)
if we assume that each country only borders a small number of neighbors rather a significant fraction of all countries.
Next, we have NP-Hard: these are the set of problems at least as hard as the most computationally intensive NP problems, but maybe harder - some NP-Hard problems cannot even have their solutions verified in polynomial time. When we describe a problem as NP-Hard we are often referring to this last property, even though the most challenging NP problems are also NP-Hard.
One example of an NP-Hard problem without polynomial verification is the Traveling Salesman: given a list of cities and distances between cities, find the shortest path that travels through every city exactly once, ending with a return to the original city. Trying all paths through cities scales with O(n!)
. More clever dynamic programming solutions improve this to O(n^2 2^n)
. But if someone claims to have run a traveling salesman algorithm, and hands you a path, how do you know it’s the shortest possible path? The only way to be certain is to solve the traveling salesman problem yourself, and determine whether your solution has the same length as the provided answer.
Finally, we have NP-Complete. These are the most challenging problems in NP, meaning:
Solutions to these algorithms can be verified in polynomial time
There is no known polynomial-time solution to these algorithms
Any problem in NP can be translated into an input to an NP-Complete problem in polynomial time, and the result of the NP-Complete algorithm can be translated back, again in polynomial time
Here’s a visualization of these problem classes:
Broad consensus in computer science is that the NP problem space is larger than the P problem space. That is, there are some problems that cannot be solved in polynomial time, but can be verified in polynomial time. However, no one has been able to definitively prove this, in large part because making formal arguments about such abstract questions is exceedingly difficult. There are many problems we do not know how to solve in polynomial time, but how do we prove there isn’t a faster, more clever solution that we haven’t thought of?
Therefore, a minority of computer scientists hold that P = NP, or in other words, all problems that can be verified in polynomial time can also be solved in polynomial time. This would make our set of problem classes look more like:
To prove that P equals NP, all someone would need to do is find a polynomial-time solution to any NP-Complete problem. Since we know all NP problems can be translated back and forth to NP-Complete problems in polynomial time, a fast solution to any of these most challenging problems would be a fast solution to every poly-verifiable algorithm. No such solution has been found.
Posted 2/10/2023
I recently needed to make a graph with a hex lattice shape, like this:
I needed to find distances and paths between different hexagonal tiles, which proved challenging in a cartesian coordinate system. I tried a few solutions, and it was a fun process, so let’s examine each option.
The most “obvious” way to index hexagonal tiles is to label each according to their row and column, like:
This feels familiar if we’re used to a rectangular grid and cartesian coordinate system. It also allows us to use integer coordinates. However, it has a few severe disadvantages:
Moving in the y-axis implies moving in the x-axis. For example, moving from (0,0) to (0,1) sounds like we’re only moving vertically, but additionally shifts us to the right!
Coordinates are not mirrored. Northwest of (0,0) is (-1,1), so we might expect that Southeast of (0,0) would be flipped across the vertical and horizontal, yielding (1,-1). But this is not the case! Southeast of (0,0) is (0,-1) instead, because by dropping two rows we’ve implicitly moved twice to the right already (see point one)
These issues make navigation challenging, because the offsets of neighboring tiles depend on the row. Southeast of (0,0) is (0,-1), but Southeast of (0,1) is (1,0), so the same relative direction sometimes requires changing the column, and sometimes does not.
Rather than using row and column coordinates we could re-index each tile by its “true” cartesian coordinates:
This makes the unintuitive aspects of offset coordinates intuitive:
It is now obvious that moving from (0,0) to (0.5,1) implies both a vertical and horizontal change
Coordinates now mirror nicely: Northwest of (0,0) is (-0.5,1), and Southeast of (0,0) is (0.5,-1).
Following from point 1, it’s now clear why the distance between (0,0) and (3,0) isn’t equal to the distance between (0,0) and (0.5,3).
But while cartesian coordinates are more “intuitive” than offset coordinates, they have a range of downsides:
We no longer have integer coordinates. We could compensate by doubling all the coordinates, but then (0,0) is adjacent to (2,0), and keeping a distance of one between adjacent tiles would be ideal.
While euclidean-distances are easy to calculate in cartesian space, it’s still difficult to calculate tile-distances using these indices. For example, if we want to find all tiles within two “steps” of (0,0) we need to use a maximum range of about 2.237, or the distance to (1,2).
Fortunately there is a third indexing scheme, with integer coordinates, coordinate mirroring, and easy distance calculations in terms of steps! It just requires thinking in three dimensions!
In a cartesian coordinate system we use two axes, since we can move up/down, and left/right. However, on a hexagonal grid, we have three degrees of freedom: we can move West/East, Northwest/Southeast, and Northeast/Southwest. We can define the coordinate of each tile in terms of the distance along each of these three directions, like so:
These “cube coordinates” have one special constraint: the sum of the coordinates is always zero. This allows us to maintain a canonical coordinate for each tile.
To understand why this is necessary, imagine a system where the three coordinates (typically referred to as (q,r,s) to distinguish between systems when we are converting to or from an (x,y) system) correspond directly with the three axes: q refers to distance West/East, r to Northwest/Southeast, and s to Northeast/Southwest. Here’s a visualization of such a scheme:
We could take several paths, such as (0,1,1) or (1,2,0) or (-1,0,2), and all get to the same tile! That would be a mess for comparing coordinates, and would make distance calculations almost impossible. With the addition of this “sum to zero” constraint, all paths to the tile yield the same coordinate of (-1,2,-1).
Distances in cube coordinates are also easy to calculate - just half the “Manhattan distance” between the two points:
def distance(q1, r1, s1, q2, r2, s2): return (abs(q1-q2) + abs(r1-r2) + abs(s1-s2)) / 2
We can add coordinates, multiply coordinates, calculate distances, and everything is simple so long as we remain in cube coordinates.
However, we will unavoidably sometimes need to convert from cube to cartesian coordinates. For example, while I built the above hex grids using cube coordinates, I plotted them in matplotlib, which wants cartesian coordinates to place each hex. Converting to cartesian coordinates will also allow us to find the distance between hex tiles “as the crow flies,” rather than in path-length, which may be desirable. So how do we convert back to xy coordinates?
First, we can disregard the s
coordinate. Since all coordinates sum to zero, s = -1 * (q + r)
, so it represents redundant information, and we can describe the positions of each tile solely using the first two coordinates.
We can also tell through the example above that changing the q
coordinate contributes only to changing the x-axis, while changing the r
coordinate shifts both the x- and y-axes. Let’s set aside the q
coordinate for the moment and focus on how much r
contributes to each cartesian dimension.
Let’s visualize the arrow from (0,0,0) to (0,1,-1) as the hypotenuse of a triangle:
We want to break down the vector of length r=1
into x
and y
components. You may recognize this as a 30-60-90 triangle, or you could use some geometric identities: the internal angles of a hexagon are 120-degrees, and this triangle will bisect one, so theta must be 60-degrees. Regardless of how you get there, we land at our triangle identities:
From here we can easily solve for the x
and y
components of r
, using 2a = r
:
We know that (0,1,-1) is halfway between (0,0) and (1,0,-1) on the x-axis, so q
must contribute twice as much to the x-axis as r
does. Therefore, we can solve for the full cartesian coordinates of a hex using the cube coordinates as follows:
This works great! But it leaves the hexagons with a radius of sqrt(3) / 3
, which may be inconvenient for some applications. For example, if you were physically manufacturing these hexagons, like making tiles for a board-game, they’d be much easier to cut to size if they had a radius of one. Therefore, you will often see the conversion math from cube to cartesian coordinates written with a constant multiple of sqrt(3)
, like:
Since this is a constant multiple, it just re-scales the graph, so all the distance measurements and convenient properties of the system remain the same, but hexagons now have an integer radius.
If you are also excited by these coordinate systems, and want to read more about the logic behind cube coordinates, path-finding, line-drawing, wrapping around the borders of a map, and so on, then I highly recommend the Red Blob Games Hexagon article, which goes into much more detail.
Posted 1/17/2023
In my last post I demonstrated how to perform image dithering to convert colored images to black and white. This consists of converting each pixel to either black or white (whichever is closer), recording the amount of “error,” or the difference between the original luminoscity and the new black/white value, and propagating this error to adjoining pixels to brighten or darken them in compensation. This introduces local error (some pixels will be converted to white when their original value is closer to black, and vice versa), but globally lowers error, producing an image that appears much closer to the original.
I’m still playing with dithering, so in this post I will extend the idea to color images. Reducing the number of colors in an image used to be a common task: while digital cameras may be able to record photos with millions of unique colors, computers throughout the 90s often ran in “256 color” mode, where they could only display a small range of colors at once. This reduces the memory footprint of images significantly, since you only need 8-bits per pixel rather than 24 to represent their color. Some image compression algorithms still use palette compression today, announcing a palette of colors for a region of the image, then listing an 8- or 16-bit palette index for each pixel in the region rather than a full 24-bit color value.
Reducing a full color image to a limited palette presents a similar challenge to black-and-white image dithering: how do we choose what palette color to use for each pixel, and how do we avoid harsh color banding?
We’ll start with a photo of a hiking trail featuring a range of greens, browns, and whites:
Let’s reduce this to a harsh palette of 32 colors. First, we need to generate such a palette:
#!/usr/bin/env python3 import numpy as np def getPalette(palette_size=32): colors = [] values = np.linspace(0, 0xFFFFFF, palette_size, dtype=int) for val in values: r = val >> 16 g = (val & 0x00FF00) >> 8 b = val & 0x0000FF colors.append((r,g,b)) return colors
I don’t know much color theory, so this is far from an “ideal” spread of colors. However, it is 32 equally spaced values on the numeric range 0x000000 to 0xFFFFFF, which we can convert to RGB values. We can think of color as a three dimensional space, where the X, Y, and Z axes represent red, green, and blue. This lets us visualize our color palette as follows:
import matplotlib.pyplot as plt def plotPalette(palette): fig = plt.figure(figsize=(6,6)) ax = fig.add_subplot(111, projection='3d') r = [] g = [] b = [] c = [] for color in palette: r.append(color[0]) g.append(color[1]) b.append(color[2]) c.append("#%02x%02x%02x" % color) g = ax.scatter(r, g, b, c=c, marker='o', depthshade=False) ax.invert_xaxis() ax.set_xlabel('Red') ax.set_ylabel('Green') ax.set_zlabel('Blue') plt.show()
Which looks something like:
Just as in black-and-white image conversion, we can take each pixel and round it to the closest available color - but instead of two colors in our palette, we now have 32. Here’s a simple (and highly inefficient) conversion:
# Returns the closest rgb value on the palette, as (red,green,blue) def getClosest(color, palette): (r,g,b) = color closest = None #(color, distance) for p in palette: # A real distance should be sqrt(x^2 + y^2 + z^2), but # we only care about relative distance, so faster to leave it off distance = (r-p[0])**2 + (g-p[1])**2 + (b-p[2])**2 if( closest == None or distance < closest[1] ): closest = (p,distance) return closest[0] def reduceNoDither(img, palette, filename): pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): pixels[y,x] = getClosest(pixels[y,x], palette) reduced = Image.fromarray(pixels) reduced.save(filename) img = Image.open("bridge.png") palette = getPalette() reduceNoDither(img, palette, "bridge_32.png")
The results are predictably messy:
Our palette only contains four colors close to brown, and most are far too red. If we convert each pixel to the closest color on the palette, we massively over-emphasize red, drowning out our greens and yellows.
Dithering to the rescue! Where before we had an integer error for each pixel (representing how much we’d over or under-brightened the pixel when we rounded it to black/white), we now have an error vector, representing how much we’ve over or under emphasized red, green, and blue in our rounding.
As before, we can apply Atkinson dithering, with the twist of applying a vector error to three dimensional color points:
# Returns an error vector (delta red, delta green, delta blue) def getError(oldcolor, newcolor): dr = oldcolor[0] - newcolor[0] dg = oldcolor[1] - newcolor[1] db = oldcolor[2] - newcolor[2] return (dr, dg, db) def applyError(pixels, y, x, error, factor): if( y >= pixels.shape[0] or x >= pixels.shape[1] ): return # Don't run off edge of image er = error[0] * factor eg = error[1] * factor eb = error[2] * factor pixels[y,x,RED] += er pixels[y,x,GREEN] += eg pixels[y,x,BLUE] += eb def ditherAtkinson(img, palette, filename): pixels = np.array(img) total_pixels = pixels.shape[0] * pixels.shape[1] for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] # Returns reference new = getClosest(old, palette) quant_error = getError(old, new) pixels[y,x] = new applyError(pixels, y, x+1, quant_error, 1/8) applyError(pixels, y, x+2, quant_error, 1/8) applyError(pixels, y+1, x+1, quant_error, 1/8) applyError(pixels, y+1, x, quant_error, 1/8) applyError(pixels, y+1, x-1, quant_error, 1/8) applyError(pixels, y+2, x, quant_error, 1/8) dithered = Image.fromarray(pixels) dithered.save(filename)
Aaaaaand presto!
It’s far from perfect, but our dithered black and white images were facsimiles of their greyscale counterparts, too. Pretty good for only 32 colors! The image no longer appears too red, and the green pine needles stand out better. Interestingly, the dithered image now appears flecked with blue, with a blue glow in the shadows. This is especially striking on my old Linux laptop, but is more subtle on a newer screen with a better color profile, so your mileage may vary.
We might expect the image to be slightly blue-tinged, both because reducing red values will make green and blue stand out, and because we are using an extremely limited color palette. However, the human eye is also better at picking up some colors than others, so perhaps these blue changes stand out disproportionately. We can try compensating, by reducing blue error to one third:
That’s an arbitrary and unscientific compensation factor, but it’s removed the blue tint from the shadows in the image, and reduced the number of blue “snow” effects, suggesting there’s some merit to per-channel tuning. Here’s a side-by-side comparison of the original, palette reduction, and each dithering approach:
Especially at a smaller resolution, we can do a pretty good approximation with a color selection no wider than a big box of crayons. Cool!
Posted 1/16/2023
Dithering means intentionally adding noise to a signal to reduce large artifacts like color banding. A classic example is reducing a color image to black and white. Take this magnificent photo of my neighbor’s cat:
To trivially convert this image to black and white we can take each pixel, decide which color it’s closest to, and set it to that:
#!/usr/bin/env python3 from PIL import Image import numpy as np # Load image as grayscale img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y, row in enumerate(pixels): for x,col in enumerate(row): if( pixels[y,x] >= 127 ): pixels[y,x] = 255 else: pixels[y,x] = 0 bw = Image.fromarray(pixels) bw.save("kacie_bw.png")
But the result is not very satisfying:
The cat is white. Every pixel will be closer to white than black, and we lose the whole cat except the eyes and nose, along with most of the background detail. But we can do better! What if we set the density of black pixels based on the brightness of a region? That is, black regions will receive all black pixels, white regions all white, but something that should be a mid-gray will get closer to a checkerboard of black and white pixels to approximate the correct brightness.
One particularly satisfying way to approach this regional checkerboarding is called error diffusion. For every pixel, when we set it to black or white, we record how far off the original color is from the new one. Then we adjust the color of the adjacent pixels based on this error. For example, if we set a gray pixel to black, then we record that we’ve made an error by making this pixel darker than it should be, and we’ll brighten the surrounding pixels we haven’t evaluated yet to make them more likely to be set to white. Similarly, if we round a gray pixel up to white, then we darken the nearby pixels to make them more likely to be rounded down to black.
In Floyd-Steinberg dithering we process pixels left to right, top to bottom, and propagate the error of each pixel to its neighbors with the following distribution:
That is, pass on 7/16 of the error to the pixel right of the one we’re examining. Pass on 5/16 of the error to the pixel below, and a little to the two diagonals we haven’t examined yet. We can implement Floyd-Steinberg dithering as follows:
def getClosest(color): if( color >= 127 ): return 255 # White return 0 # Black def setAdjacent(pixels, y, x, error): (rows,cols) = pixels.shape[0:2] if( y >= rows or x >= cols ): return # Don't run past edge of image pixels[y,x] += error # Load image as grayscale img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] new = getClosest(old) pixels[y,x] = new quant_error = old - new setAdjacent(pixels, y, x+1, quant_error*(7/16)) setAdjacent(pixels, y+1, x-1, quant_error*(3/16)) setAdjacent(pixels, y+1, x, quant_error*(5/16)) setAdjacent(pixels, y+1, x+1, quant_error*(1/16)) dithered = Image.fromarray(pixels) dithered.save("kacie_dithered_fs.png")
The results are a stunning improvement:
We’ve got the whole cat, ruffles on her fur, the asphalt and wood chips, details on rocks, gradients within shadows, the works! But what are those big black flecks across the cat’s fur? These flecks of “snow” impact the whole image, but they don’t stand out much on the background where we alternate between black and white pixels frequently. On the cat, even small errors setting near-white fur to white pixels build up, and we periodically set a clump of pixels to black.
We can try to reduce this snow by fiddling with the error propagation matrix. Rather than passing all of the error on to adjacent pixels, and mostly to the pixel to the right and below, what if we ‘discount’ the error, only passing on 75% of it? This is the diffusion matrix used in Atkinson dithering:
The code hardly needs a change:
img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] new = getClosest(old) pixels[y,x] = new quant_error = old - new setAdjacent(pixels, y, x+1, quant_error*(1/8)) setAdjacent(pixels, y, x+2, quant_error*(1/8)) setAdjacent(pixels, y+1, x+1, quant_error*(1/8)) setAdjacent(pixels, y+1, x, quant_error*(1/8)) setAdjacent(pixels, y+1, x-1, quant_error*(1/8)) setAdjacent(pixels, y+2, x, quant_error*(1/8)) dithered = Image.fromarray(pixels) dithered.save("kacie_dithered_at.png")
And the snow vanishes:
This is a lot more pleasing to the eye, but it’s important to note that the change isn’t free: if you look closely, we’ve lost some detail on the cat’s fur, particularly where the edges of her legs and tail have been ‘washed out.’ After all, we’re now ignoring some of the error caused by our black and white conversion, so we’re no longer compensating for all our mistakes in nearby pixels. This is most noticeable in bright and dark areas where the errors are small.
I really like this idea of adding noise and propagating errors to reduce overall error. It’s a little counter-intuitive; by artificially brightening or darkening a pixel, we’re making an objectively worse local choice when converting a pixel to black or white. Globally, however, this preserves much more of the original structure and detail. This type of error diffusion is most often used in digital signal processing of images, video, and audio, but I am curious whether it has good applications in more distant domains.
If you enjoyed this post and want to read more about mucking with images and color, you may enjoy reading my post on color filter array forensics.