Posted 3/20/2023
I recently learned about HyperLogLog, which feels like cursed counter-intuitive magic, so I am eager to share.
We want to count unique items, like “how many unique words appear across all books at your local library?” or “how many unique Facebook users logged in over the past month?” For a small set of unique tokens, like counting the unique words in this blog post, you might store each word in a set or hash table as you read them, then count the length of your set when you’re done. This is simple, but means the amount of memory used will scale linearly with the number of unique tokens, making such an approach impractical when counting enormous sets of tokens. But what if I told you we could accurately estimate the number of unique words while storing only a single integer?
To start with, we want to hash each of our words. A hash function takes arbitrary data and translates it to a ‘random’ but consistent number. For example, we’ll use a hash function that takes any word and turns it into a number from zero to 2**64 - 1, with a uniform probability across all possible numbers. A good hash function will be unpredictable, so changing a single letter in the word or swapping the order of letters will yield a completely different number.
Next, we take the resulting hash, treat it as binary, and count how many leading bits are zero. An example is shown below:
We repeat this process for every word, tracking only the highest number of leading zero-bits we’ve observed, which we’ll call n. When we reach the end of our data, we return 2**n as our estimate of how many unique words we’ve seen.
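As a concrete sketch, here is the whole scheme in a few lines of Python. SHA-256 truncated to 64 bits is just one convenient stand-in for the hash function; any uniform 64-bit hash would do.

```python
import hashlib

def hash64(word):
    """Hash a word to a uniform 64-bit integer (one possible choice of hash)."""
    return int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")

def leading_zero_bits(h, bits=64):
    """Count leading zero bits in a `bits`-wide integer."""
    return bits - h.bit_length()

def probabilistic_count(words):
    n = 0  # highest number of leading zero-bits seen so far
    for word in words:
        n = max(n, leading_zero_bits(hash64(word)))
    return 2 ** n
```

Because the hash is deterministic, feeding in the same word twice never changes the estimate: the only state we carry is the single integer n.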
So how in the world does this work? The key is that a good hash function returns hashes uniformly across its range, so we have turned each unique word into random numbers. Since hashing functions are deterministic, duplicate words will return the same hash.
A uniformly random number of fixed bit-length (for example, a random 64-bit integer) will start with a zero-bit with a probability of 1/2, and will start with a 1-bit with a probability of 1/2. It will start with two zero-bits with a probability of 1/4, three zero-bits with a probability of 1/8, and so on. A probability tree for this might look like:
We can run this explanation in reverse: if you have observed a hash that starts with three zero-bits, then on average you will have observed about 8 unique hashes, because around 1 in 8 hashes start with three zero-bits.
This sounds great, but there are two problems. First, the words “on average” are pretty important here: if you only examine one word, and it happens to have a hash starting with four leading zeros, then our probabilistic counting algorithm will guess that you’ve examined sixteen words, rather than one. Over 6% of hashes will start with at least four leading zeros, so this is easily possible. We need some way to overcome these ‘outliers’ and get a more statistically representative count of leading zeros.
Second, our probabilistic counting function can only return integer powers of two as estimates. It can guess that you’ve observed 8, 256, or 1024 words, but it can never estimate that you’ve observed 800 words. We want an estimator with a higher precision.
One strategy for addressing both limitations of probabilistic counting is to use multiple hashes. If we hash each observed word using ten different hash functions (or one hash function with ten different salts, but that’s a technical tangent), then we can maintain ten different counts of the highest number of leading zeros observed. Then at the end, we return the average of the ten estimates.
The more hash functions we use, the less sensitive our algorithm will be to outliers. Additionally, averaging over multiple counts lets us produce non-integer estimates. For example, if half our hash functions yield a maximum of four leading zeros, and half yield a maximum of five leading zeros, then we could estimate 2**4.5 unique tokens, or around 23.
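Sketched in Python, salting a single hash to simulate ten independent hash functions (as mentioned above, this is one way to avoid writing ten distinct hash functions):

```python
import hashlib

def salted_hash64(word, salt):
    """One 64-bit hash per salt, simulating independent hash functions."""
    digest = hashlib.sha256((salt + word).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def multi_hash_count(words, k=10):
    maxima = [0] * k  # per-hash-function 'most leading zeros seen'
    for word in words:
        for i in range(k):
            h = salted_hash64(word, str(i))
            maxima[i] = max(maxima[i], 64 - h.bit_length())
    # Averaging the exponents permits non-integer estimates like 2**4.5
    return 2 ** (sum(maxima) / k)
```

Note the inner loop: every word now costs k hashes, which is exactly the scaling problem discussed next.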
This approach solves both our problems, but at a severe cost: now we need to calculate ten times as many hashes! If we’re counting upwards of billions of words, then this approach requires calculating nine billion additional hashes. Clearly, this won’t scale well.
Fortunately, there is an alternative solution that requires no additional hashing, known as HyperLogLog. Instead of using multiple hash functions and averaging across the results, we can instead pre-divide our words into buckets, and average across those.
For example, we could make 16 buckets, assign incoming hashes to each bucket uniformly, and maintain a “most leading zero-bits observed” counter for each bucket. Then we calculate an estimated number of unique elements from each bucket, and average across all buckets to get a global estimate.
For an easy approach to assigning hashes to each bucket, we can use the first four bits of each hash as a bucket ID, then count the number of leading zeros after this ID.
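With 64-bit hashes and 16 buckets, that split might look like the following (the bit widths here are assumptions matching the example above):

```python
def split_hash(h, bucket_bits=4, hash_bits=64):
    """Use the top `bucket_bits` bits as a bucket ID, and count
    leading zeros in the remaining bits."""
    remaining_bits = hash_bits - bucket_bits
    bucket = h >> remaining_bits          # first four bits pick the bucket
    rest = h & ((1 << remaining_bits) - 1)  # the other sixty bits
    zeros = remaining_bits - rest.bit_length()
    return bucket, zeros
```

Because the bucket ID comes from the hash itself, assignment is uniform across buckets with no extra hashing.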
Once again, averaging across several sets of “most leading zeros” will minimize the impact of outliers, and afford us greater precision, by allowing non-integer exponents for our powers of two. Unlike the multiple hash solution, however, this approach will scale nicely.
One downside to HyperLogLog is that the bucket-averaging process is a little complicated. Dividing hashes across multiple buckets diminishes the impact of outliers, as desired, but it also diminishes the impact of all our hashes. For example, say we have 64 hashes, spread across 16 buckets, so 4 hashes per bucket. With 64 hashes, we can expect, on average, one hash with six leading zeros. However, each bucket has only four hashes, and therefore an expected maximum of two leading zeros. So while one bucket probably has six, most have closer to two, and taking the arithmetic mean of the buckets would severely underestimate the number of unique hashes we’ve observed. Therefore, HyperLogLog has a more convoluted estimation algorithm, consisting of creating estimates from each bucket, taking their harmonic mean, multiplying by the number of buckets, and multiplying by a magic number derived from the number of buckets^{1}. This results in dampening outliers while boosting the estimate back into the appropriate range.
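A sketch of that combination step, using the published bias-correction constant for 16 buckets (other bucket counts use other constants):

```python
def hll_estimate(bucket_maxima):
    """Combine per-bucket 'most leading zeros' counters into one estimate."""
    m = len(bucket_maxima)
    alpha = 0.673  # bias-correction constant for m = 16 buckets
    # m^2 / sum(2^-M_j) equals the harmonic mean of the per-bucket
    # estimates 2^M_j, multiplied by the number of buckets m
    return alpha * m * m / sum(2.0 ** -mj for mj in bucket_maxima)
```

The harmonic mean is what dampens the outlier buckets, and the m and alpha factors scale the result back up into the right range.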
Here’s a plot comparing the accuracy of Probabilistic counting (count leading zeros, no compensation for outliers), Probabilistic-Med counting (run Probabilistic using ten hash functions, return median of results), and HyperLogLog (our fancy bucket solution):
I’ve generated random strings as input, and evaluated each algorithm at 50 points on the x-axis, with 100 draws of random strings per x-axis point, to create a distribution and error bars. The y-axis represents each estimation function’s guess as to the number of unique elements, with a 95% confidence interval.
Unsurprisingly, plain probabilistic counting does not fare well. When we generate thousands of strings, the likelihood that at least one will have many leading zeros is enormous, and since our algorithm relies on counting the maximum observed leading zeros, it’s extremely outlier sensitive.
Taking the mean across ten hash algorithms is also outlier-sensitive when the outliers are large enough, which is why I’ve opted for the median in this plot. Probabilistic-Med performs much better, but it suffers the same problems over a larger time-scale: as we read more and more unique tokens, the likelihood goes up that all ten hash functions will see at least one hash with many leading zeros. Therefore, as the number of unique tokens increases, Probabilistic-Med steadily begins to over-estimate the number of unique tokens, with increasing error bars.
HyperLogLog reigns supreme. While error increases with the number of unique hashes, it remains more accurate, with tighter error bars, than the multi-hash strategy, while remaining computationally cheap. We can increase HyperLogLog’s error tolerance and accuracy in high-unique-token scenarios by increasing the number of buckets, although this lowers accuracy when the number of unique tokens is small.
This is so darn cool! Tracking the total number of unique elements without keeping a list of those elements seems impossible - and it is if you need absolute precision - but with some clever statistics we can get a shockingly close estimate.
If you’d like to see a working example, here’s the code I wrote for generating the accuracy plot, which includes implementations of Probabilistic counting, Probabilistic-Med, and HyperLogLog. This is toy code in Python that converts all the hashes to strings of one and zero characters for easy manipulation, so it is not efficient and shouldn’t be treated as anything like an ideal reference.
If you enjoyed this post, you may enjoy my other writing on dimensional analysis, network science for social modeling, or algorithmic complexity.
The derivation of this number is quite complex, so in practice it’s drawn from a lookup table or estimated. ↩
Posted 3/6/2023
This is a post about Big-O notation and measuring algorithmic complexity; topics usually taught to computer science undergraduates in their second to fourth semester. It’s intended for curious people outside the field, or new students. There are many posts on this subject, but this one is mine.
In computer science we often care about whether an algorithm is an efficient solution to a problem, or whether one algorithm is more efficient than another approach. One might be tempted to measure efficiency in terms of microseconds it takes a process to run, or perhaps number of assembly instructions needed. However, these metrics will vary widely depending on what language an algorithm is implemented in, what hardware it’s run on, what other software is running on the system competing for resources, and a host of other factors. We’d prefer to think more abstractly, and compare one strategy to another rather than their implementations. In particular, computer scientists often examine how an algorithm scales, or how quickly it slows down as inputs grow very large.
Let’s start with a trivial example: given a list of numbers, return their sum. Looks something like:
def sum(list)
  total = 0
  for item in list
    total += item
  end
  return total
end
Since we need to read the entire list, this algorithm scales linearly with the length of the list - make the list a hundred times longer, and it will take roughly a hundred times longer to get a sum. We write this formally as O(n), meaning “scales linearly with n, the size of the input.” We call this formal syntax “Big O notation,” where the ‘O’ stands for “order of approximation” (or in the original German, “Ordnung”).
Not all algorithms scale. If we were asked “return the third element in the list” then it wouldn’t matter whether the list is three elements long or three million elements long, we can get to the third element in a constant amount of time. This is written as O(1), indicating no reliance on the input size.
Search algorithms give us our first example problem with divergent solutions. Given a stack of papers with names on them, tell me whether “Rohan” is in the stack. A trivial solution might look like:
def hasName(list)
  for name in list
    if name == "Rohan"
      return true
    end
  end
  return false
end
This scales linearly with the length of the list, just like summing the elements. If the list is in an unknown order then we have no choice but to examine every element. However, if we know the list is in alphabetical order then we can do better. Start in the middle of the list - if the name is Rohan, we’re done. If we’re after Rohan alphabetically, then discard the second half of the list, and repeat on the first half. If we’re before Rohan alphabetically, then discard the first half of the list and repeat on the second. If we exhaust the list, then Rohan’s not in it. This approach is called a binary search, and visually looks like:
In code, a binary search looks something like:
def hasName(list)
  if( list.length == 0 )
    return false
  end
  middle = list.length / 2
  if( list[middle] == "Rohan" )
    return true
  elsif( list[middle] > "Rohan" )
    # Search left half
    return hasName(list.first(middle))
  else
    # Search right half (excluding the middle element we just checked)
    return hasName(list[middle+1 .. list.length])
  end
end
With every step in the algorithm we discard half the list, so we look at far fewer than all the elements. Our binary search still gets slower as the input list grows longer - if we double the length of the list we need one extra search step - so the algorithm scales logarithmically rather than linearly, denoted O(log n).
We’ll end this section by looking at two sorting algorithms: insertion sort, and merge sort.
We want to sort a list, provided to us in random order. One simple approach is to build a new sorted list: one at a time, we take elements from the front of the main list, and find their correct position among the sorted list we’ve built so far. To find the correct position we just look at the value left of our new element, and check whether they should be swapped or not. Keep swapping left until the new element finds its correct position. This visually looks like:
One implementation might look like:
def insertionSort(list)
  for i in 0.upto(list.length-1)
    for j in (i-1).downto(0)
      if( list[j] > list[j+1] )
        list[j], list[j+1] = list[j+1], list[j]
      else
        break # Done swapping, found the right spot!
      end
    end
  end
  return list
end
Insertion sort is simple and easy to implement. If you were coming up with a sorting algorithm on the spot for something like sorting a deck of cards, you might invent something similar. So what’s the runtime?
In insertion sort, we walk the list from start to end, which is O(n). For every new element we examine, however, we walk the list backwards from our current position to the start. This operation also scales linearly with the length of the list, and so is also O(n). If we perform a backwards O(n) walk for every step of the forwards O(n) walk, that’s O(n) * O(n), for a total of O(n^2). Can we do better?
An alternative approach to sorting is to think of it as a divide-and-conquer problem. Split the list in half, and hand the first half to one underling and the second half to another underling, and instruct them each to sort their lists. Each underling does the same, splitting their lists in half and handing them to two further underlings. Eventually, an underling receives a list of length one, which is by definition already sorted. This splitting stage looks something like:
Now we want to merge our results upwards. Each underling hands their sorted list back up to their superiors, who now have two sorted sub-lists. The superior combines the two sorted lists by first making a new empty “merged” list that’s twice as long. For every position in the merged list, the superior compares the top element of each sorted sub-list, and moves the lower element to the merged list. This process looks like:
Once all elements from the two sub-lists have been combined into a merged list, the superior hands their newly sorted list upwards to their superior. We continue this process until we reach the top of the tree, at which point our work is done. This merge step looks like:
In code, the full algorithm might look something like:
# Combine two sorted lists
def merge(left, right)
  merged = []
  while( left.length + right.length > 0 )
    if( left.length == 0 )
      # Left empty, take from right
      merged += right.shift(1)
    elsif( right.length == 0 )
      # Right empty, take from left
      merged += left.shift(1)
    elsif( left[0] < right[0] )
      # Top of left stack is less, take it
      merged += left.shift(1)
    else
      # Top of right stack is less, take it
      merged += right.shift(1)
    end
  end
  return merged
end

# Takes a single list, sub-divides it, sorts results
def mergeSort(list)
  if( list.length <= 1 )
    return list # Sorted already :)
  end
  middle = list.length / 2
  left = list[0 .. middle-1]
  right = list[middle .. list.length-1]
  leftSorted = mergeSort(left)
  rightSorted = mergeSort(right)
  return merge(leftSorted, rightSorted)
end
So what’s the runtime of merge sort? Well, it takes log n steps to divide the list in half down to one element. We do this division process for every element in the list. That gives us a runtime of n * log n to break the list apart and create the full tree diagram.
Merging two sorted lists together scales linearly with the size of the lists, so the merge step is O(n). We need to perform a merge each time we move up a “level” of the tree, and there are log n levels to this tree. Therefore, the full merge process also scales with O(n log n).
This gives us a total runtime of O(n log n + n log n), or O(2n log n), to create the tree and merge it back together. However, because we are concerned with how algorithms scale as the inputs become very large, we drop constants and all expressions but the dominant term - multiplying by 2 doesn’t mean much as n approaches infinity - and simplify the runtime to O(n log n). That’s a lot better than insertion sort’s O(n^2)!
Big O notation typically describes an “average” or “expected” performance and not a “best case” or “worst case”. For example, if a list is in a thoroughly random order, then insertion sort will have a performance of O(n^2). However, if the list is already sorted, or only one or two elements are out of place, then insertion sort’s best-case performance is O(n). That is, insertion sort will walk the list forwards, and if no elements are out of place, there will be no need to walk the list backwards to find a new position for any elements. By contrast, merge sort will always split the list into a tree and merge the branches back together, so even when handed a completely sorted list, merge sort’s best-case performance is still O(n log n).
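We can see this best-case/worst-case gap empirically by counting comparisons. Here is a Python sketch that mirrors the insertion sort above but tallies inner-loop steps:

```python
def insertion_sort_steps(lst):
    """Insertion sort that also counts inner-loop comparisons."""
    lst = list(lst)
    steps = 0
    for i in range(1, len(lst)):
        j = i - 1
        while j >= 0:
            steps += 1
            if lst[j] > lst[j + 1]:
                # Element still out of place: swap and keep walking left
                lst[j], lst[j + 1] = lst[j + 1], lst[j]
                j -= 1
            else:
                break  # Found its spot, stop the backwards walk
    return lst, steps
```

On an already-sorted list of five elements this performs four comparisons (one per forward step), while on a reversed list it performs ten, matching the O(n) and O(n^2) best and worst cases.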
Big O notation also does not describe memory complexity. The description of merge sort above creates a temporary merged list during the merge step, meaning however long the input list is, merge sort needs at least twice as much memory space for its overhead. By contrast, insertion sort works “in place,” sorting the input list without creating a second list as a workspace. Many algorithms make a trade-off between time and space in this way.
Finally, Big O notation describes how an algorithm scales as n gets very large. For small values of n, insertion sort may outperform merge sort, because merge sort has some extra bookkeeping to allocate temporary space for merging and coordinate which underlings are sorting which parts of the list.
In summary, Big O notation is a valuable tool for quickly comparing two algorithms, and can provide programmers with easy estimates as to which parts of a problem will be the most time-consuming. However, Big O notation is not the only metric that matters, and should not be treated as such.
All of the algorithms described above can be run in polynomial time. This means their scaling rate, or Big O value, can be upper-bounded by a polynomial of the form O(n^k). For example, while merge sort scales with O(n log n), and logarithms are not polynomials, n log n is strictly less than n^2, so merge sort is considered to run in polynomial time. By contrast, algorithms with runtimes like O(2^n) or O(n!) are not bounded by a polynomial, and perform abysmally slowly as n grows large.
These definitions allow us to describe categories of algorithms. We describe algorithms that run in polynomial time as part of set P, and we typically describe P as a subset of NP - the algorithms where we can verify whether a solution is correct in polynomial time.
To illustrate the difference between running and verifying an algorithm, consider the graph coloring problem: given a particular map, and a set of three or more colors, can you color all the countries so that no two bordering countries have the same color? The known algorithms for this problem are tedious. Brute-forcing all possible colorings scales with O(k^n) for k colors and n countries, and the fastest known general algorithms run in O(n * 2^n). However, given a colored-in map, it’s easy to look at each country and its neighbors and verify that none violate the coloring rules. At worst, verifying takes O(n^2) time if all countries border most others, but more realistically O(n) if we assume that each country only borders a small number of neighbors rather than a significant fraction of all countries.
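Verification really is that simple. Here is a sketch, assuming the map is given as a list of bordering pairs:

```python
def is_valid_coloring(borders, coloring):
    """Verify a map coloring: no two bordering countries share a color.
    `borders` is a list of (country_a, country_b) pairs, so this runs in
    time linear in the number of borders."""
    return all(coloring[a] != coloring[b] for a, b in borders)
```

Finding a valid coloring may take exponential time, but checking a proposed one is a single pass over the borders.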
Next, we have NP-Hard: these are the set of problems at least as hard as the most computationally intensive NP problems, but maybe harder - some NP-Hard problems cannot even have their solutions verified in polynomial time. When we describe a problem as NP-Hard we are often referring to this last property, even though the most challenging NP problems are also NP-Hard.
One example of an NP-Hard problem without polynomial verification is the Traveling Salesman: given a list of cities and distances between cities, find the shortest path that travels through every city exactly once, ending with a return to the original city. Trying all paths through cities scales with O(n!). More clever dynamic programming solutions improve this to O(n^2 * 2^n). But if someone claims to have run a traveling salesman algorithm, and hands you a path, how do you know it’s the shortest possible path? The only way to be certain is to solve the traveling salesman problem yourself, and determine whether your solution has the same length as the provided answer.
Finally, we have NP-Complete. These are the most challenging problems in NP, meaning:
Solutions to these algorithms can be verified in polynomial time
There is no known polynomial-time solution to these algorithms
Any problem in NP can be translated into an input to an NP-Complete problem in polynomial time, and the result of the NP-Complete algorithm can be translated back, again in polynomial time
Here’s a visualization of these problem classes:
Broad consensus in computer science is that the NP problem space is larger than the P problem space. That is, there are some problems that cannot be solved in polynomial time, but can be verified in polynomial time. However, no one has been able to definitively prove this, in large part because making formal arguments about such abstract questions is exceedingly difficult. There are many problems we do not know how to solve in polynomial time, but how do we prove there isn’t a faster, more clever solution that we haven’t thought of?
Therefore, a minority of computer scientists hold that P = NP, or in other words, all problems that can be verified in polynomial time can also be solved in polynomial time. This would make our set of problem classes look more like:
To prove that P equals NP, all someone would need to do is find a polynomial-time solution to any NP-Complete problem. Since we know all NP problems can be translated back and forth to NP-Complete problems in polynomial time, a fast solution to any of these most challenging problems would be a fast solution to every poly-verifiable algorithm. No such solution has been found.
Posted 2/10/2023
I recently needed to make a graph with a hex lattice shape, like this:
I needed to find distances and paths between different hexagonal tiles, which proved challenging in a cartesian coordinate system. I tried a few solutions, and it was a fun process, so let’s examine each option.
The most “obvious” way to index hexagonal tiles is to label each according to their row and column, like:
This feels familiar if we’re used to a rectangular grid and cartesian coordinate system. It also allows us to use integer coordinates. However, it has a few severe disadvantages:
Moving in the y-axis implies moving in the x-axis. For example, moving from (0,0) to (0,1) sounds like we’re only moving vertically, but additionally shifts us to the right!
Coordinates are not mirrored. Northwest of (0,0) is (-1,1), so we might expect that Southeast of (0,0) would be flipped across the vertical and horizontal, yielding (1,-1). But this is not the case! Southeast of (0,0) is (0,-1) instead, because by dropping two rows we’ve implicitly moved twice to the right already (see point one)
These issues make navigation challenging, because the offsets of neighboring tiles depend on the row. Southeast of (0,0) is (0,-1), but Southeast of (0,1) is (1,0), so the same relative direction sometimes requires changing the column, and sometimes does not.
Rather than using row and column coordinates we could re-index each tile by its “true” cartesian coordinates:
This makes the unintuitive aspects of offset coordinates intuitive:
It is now obvious that moving from (0,0) to (0.5,1) implies both a vertical and horizontal change
Coordinates now mirror nicely: Northwest of (0,0) is (-0.5,1), and Southeast of (0,0) is (0.5,-1).
Following from point 1, it’s now clear why the distance between (0,0) and (3,0) isn’t equal to the distance between (0,0) and (0.5,3).
But while cartesian coordinates are more “intuitive” than offset coordinates, they have a range of downsides:
We no longer have integer coordinates. We could compensate by doubling all the coordinates, but then (0,0) is adjacent to (2,0), and keeping a distance of one between adjacent tiles would be ideal.
While euclidean-distances are easy to calculate in cartesian space, it’s still difficult to calculate tile-distances using these indices. For example, if we want to find all tiles within two “steps” of (0,0) we need to use a maximum range of about 2.237, or the distance to (1,2).
Fortunately there is a third indexing scheme, with integer coordinates, coordinate mirroring, and easy distance calculations in terms of steps! It just requires thinking in three dimensions!
In a cartesian coordinate system we use two axes, since we can move up/down, and left/right. However, on a hexagonal grid, we have three degrees of freedom: we can move West/East, Northwest/Southeast, and Northeast/Southwest. We can define the coordinate of each tile in terms of the distance along each of these three directions, like so:
These “cube coordinates” have one special constraint: the sum of the coordinates is always zero. This allows us to maintain a canonical coordinate for each tile.
To understand why this is necessary, imagine a system where the three coordinates (typically referred to as (q,r,s) to distinguish between systems when we are converting to or from an (x,y) system) correspond directly with the three axes: q refers to distance West/East, r to Northwest/Southeast, and s to Northeast/Southwest. Here’s a visualization of such a scheme:
We could take several paths, such as (0,1,1) or (1,2,0) or (-1,0,2), and all get to the same tile! That would be a mess for comparing coordinates, and would make distance calculations almost impossible. With the addition of this “sum to zero” constraint, all paths to the tile yield the same coordinate of (-1,2,-1).
Distances in cube coordinates are also easy to calculate - just half the “Manhattan distance” between the two points:
def distance(q1, r1, s1, q2, r2, s2):
    return (abs(q1-q2) + abs(r1-r2) + abs(s1-s2)) / 2
We can add coordinates, multiply coordinates, calculate distances, and everything is simple so long as we remain in cube coordinates.
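For example, here are a few of those operations in Python. The six direction vectors below are the standard unit steps in cube coordinates; each adjusts two of (q,r,s) so the sum stays zero:

```python
# The six neighbors of a hex differ by one step along one axis
CUBE_DIRECTIONS = [
    (+1, -1, 0), (+1, 0, -1), (0, +1, -1),
    (-1, +1, 0), (-1, 0, +1), (0, -1, +1),
]

def cube_add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def cube_distance(a, b):
    # Half the Manhattan distance, as defined above
    return (abs(a[0] - b[0]) + abs(a[1] - b[1]) + abs(a[2] - b[2])) // 2

def neighbors(tile):
    return [cube_add(tile, d) for d in CUBE_DIRECTIONS]
```

Every neighbor of a valid tile is itself a valid tile (its coordinates still sum to zero), and sits exactly one step away.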
However, we will unavoidably sometimes need to convert from cube to cartesian coordinates. For example, while I built the above hex grids using cube coordinates, I plotted them in matplotlib, which wants cartesian coordinates to place each hex. Converting to cartesian coordinates will also allow us to find the distance between hex tiles “as the crow flies,” rather than in path-length, which may be desirable. So how do we convert back to xy coordinates?
First, we can disregard the s coordinate. Since all coordinates sum to zero, s = -1 * (q + r), so it represents redundant information, and we can describe the position of each tile solely using the first two coordinates.
We can also tell through the example above that changing the q coordinate contributes only to changing the x-axis, while changing the r coordinate shifts both the x- and y-axes. Let’s set aside the q coordinate for the moment and focus on how much r contributes to each cartesian dimension.
Let’s visualize the arrow from (0,0,0) to (0,1,-1) as the hypotenuse of a triangle:
We want to break down the vector of length r=1 into x and y components. You may recognize this as a 30-60-90 triangle, or you could use some geometric identities: the internal angles of a hexagon are 120 degrees, and this triangle bisects one, so theta must be 60 degrees. Regardless of how you get there, we land at our triangle identities:
From here we can easily solve for the x and y components of r, using 2a = r:
We know that (0,1,-1) is halfway between (0,0,0) and (1,0,-1) on the x-axis, so q must contribute twice as much to the x-axis as r does. Therefore, we can solve for the full cartesian coordinates of a hex using the cube coordinates as follows:
This works great! But it leaves the hexagons with a radius of sqrt(3)/3, which may be inconvenient for some applications. For example, if you were physically manufacturing these hexagons, like making tiles for a board game, they’d be much easier to cut to size if they had a radius of one. Therefore, you will often see the conversion math from cube to cartesian coordinates written with a constant multiple of sqrt(3), like:
Since this is a constant multiple, it just re-scales the graph, so all the distance measurements and convenient properties of the system remain the same, but hexagons now have an integer radius.
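Put together, the scaled conversion might look like this in Python. This assumes pointy-top hexagons laid out as in the figures; the sign conventions depend on which way your axes point:

```python
from math import sqrt

def cube_to_cartesian(q, r):
    """Convert cube coordinates to cartesian; s is redundant (s = -(q + r)).
    Scaled by sqrt(3) so each hexagon has a radius of one."""
    x = sqrt(3) * q + (sqrt(3) / 2) * r  # q moves twice as far in x as r does
    y = 1.5 * r                          # only r contributes to y
    return (x, y)
```

With this scaling, the centers of any two adjacent hexes are sqrt(3) apart, whether the step is along the q axis or the r axis.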
If you are also excited by these coordinate systems, and want to read more about the logic behind cube coordinates, path-finding, line-drawing, wrapping around the borders of a map, and so on, then I highly recommend the Red Blob Games Hexagon article, which goes into much more detail.
Posted 1/17/2023
In my last post I demonstrated how to perform image dithering to convert colored images to black and white. This consists of converting each pixel to either black or white (whichever is closer), recording the amount of “error,” or the difference between the original luminosity and the new black/white value, and propagating this error to adjoining pixels to brighten or darken them in compensation. This introduces local error (some pixels will be converted to white when their original value is closer to black, and vice versa), but globally lowers error, producing an image that appears much closer to the original.
I’m still playing with dithering, so in this post I will extend the idea to color images. Reducing the number of colors in an image used to be a common task: while digital cameras may be able to record photos with millions of unique colors, computers throughout the 90s often ran in “256 color” mode, where they could only display a small range of colors at once. This reduces the memory footprint of images significantly, since you only need 8-bits per pixel rather than 24 to represent their color. Some image compression algorithms still use palette compression today, announcing a palette of colors for a region of the image, then listing an 8- or 16-bit palette index for each pixel in the region rather than a full 24-bit color value.
Reducing a full color image to a limited palette presents a similar challenge to black-and-white image dithering: how do we choose what palette color to use for each pixel, and how do we avoid harsh color banding?
We’ll start with a photo of a hiking trail featuring a range of greens, browns, and whites:
Let’s reduce this to a harsh palette of 32 colors. First, we need to generate such a palette:
#!/usr/bin/env python3
import numpy as np

def getPalette(palette_size=32):
    colors = []
    values = np.linspace(0, 0xFFFFFF, palette_size, dtype=int)
    for val in values:
        r = val >> 16
        g = (val & 0x00FF00) >> 8
        b = val & 0x0000FF
        colors.append((r,g,b))
    return colors
I don’t know much color theory, so this is far from an “ideal” spread of colors. However, it is 32 equally spaced values on the numeric range 0x000000 to 0xFFFFFF, which we can convert to RGB values. We can think of color as a three dimensional space, where the X, Y, and Z axes represent red, green, and blue. This lets us visualize our color palette as follows:
import matplotlib.pyplot as plt

def plotPalette(palette):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(111, projection='3d')
    r = []
    g = []
    b = []
    c = []
    for color in palette:
        r.append(color[0])
        g.append(color[1])
        b.append(color[2])
        c.append("#%02x%02x%02x" % color)
    scatter = ax.scatter(r, g, b, c=c, marker='o', depthshade=False)
    ax.invert_xaxis()
    ax.set_xlabel('Red')
    ax.set_ylabel('Green')
    ax.set_zlabel('Blue')
    plt.show()
Which looks something like:
Just as in black-and-white image conversion, we can take each pixel and round it to the closest available color - but instead of two colors in our palette, we now have 32. Here’s a simple (and highly inefficient) conversion:
from PIL import Image

# Returns the closest rgb value on the palette, as (red,green,blue)
def getClosest(color, palette):
    # Cast to Python ints so the subtraction below can't overflow uint8
    (r,g,b) = (int(v) for v in color)
    closest = None # (color, distance)
    for p in palette:
        # A real distance would be sqrt(x^2 + y^2 + z^2), but we only care
        # about relative distance, so it's faster to leave off the sqrt
        distance = (r-p[0])**2 + (g-p[1])**2 + (b-p[2])**2
        if( closest is None or distance < closest[1] ):
            closest = (p,distance)
    return closest[0]

def reduceNoDither(img, palette, filename):
    pixels = np.array(img)
    for y,row in enumerate(pixels):
        for x,col in enumerate(row):
            pixels[y,x] = getClosest(pixels[y,x], palette)
    reduced = Image.fromarray(pixels)
    reduced.save(filename)

img = Image.open("bridge.png")
palette = getPalette()
reduceNoDither(img, palette, "bridge_32.png")
The results are predictably messy:
Our palette only contains four colors close to brown, and most are far too red. If we convert each pixel to the closest color on the palette, we massively over-emphasize red, drowning out our greens and yellows.
Dithering to the rescue! Where before we had an integer error for each pixel (representing how much we’d over or under-brightened the pixel when we rounded it to black/white), we now have an error vector, representing how much we’ve over or under emphasized red, green, and blue in our rounding.
As before, we can apply Atkinson dithering, with the twist of applying a vector error to three dimensional color points:
RED, GREEN, BLUE = 0, 1, 2 # Color channel indices

# Returns an error vector (delta red, delta green, delta blue)
def getError(oldcolor, newcolor):
    dr = oldcolor[0] - newcolor[0]
    dg = oldcolor[1] - newcolor[1]
    db = oldcolor[2] - newcolor[2]
    return (dr, dg, db)

def applyError(pixels, y, x, error, factor):
    if( y >= pixels.shape[0] or x >= pixels.shape[1] or x < 0 ):
        return # Don't run off edge of image
    er = error[0] * factor
    eg = error[1] * factor
    eb = error[2] * factor
    pixels[y,x,RED] += er
    pixels[y,x,GREEN] += eg
    pixels[y,x,BLUE] += eb

def ditherAtkinson(img, palette, filename):
    # Use a signed int array so diffused error can temporarily push
    # values outside the 0-255 range without wrapping around
    pixels = np.array(img).astype(int)
    for y,row in enumerate(pixels):
        for x,col in enumerate(row):
            old = pixels[y,x] # Returns a reference into the array
            new = getClosest(old, palette)
            quant_error = getError(old, new)
            pixels[y,x] = new
            applyError(pixels, y, x+1, quant_error, 1/8)
            applyError(pixels, y, x+2, quant_error, 1/8)
            applyError(pixels, y+1, x+1, quant_error, 1/8)
            applyError(pixels, y+1, x, quant_error, 1/8)
            applyError(pixels, y+1, x-1, quant_error, 1/8)
            applyError(pixels, y+2, x, quant_error, 1/8)
    dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
    dithered.save(filename)
Aaaaaand presto!
It’s far from perfect, but our dithered black and white images were facsimiles of their greyscale counterparts, too. Pretty good for only 32 colors! The image no longer appears too red, and the green pine needles stand out better. Interestingly, the dithered image now appears flecked with blue, with a blue glow in the shadows. This is especially striking on my old Linux laptop, but is more subtle on a newer screen with a better color profile, so your mileage may vary.
We might expect the image to be slightly blue-tinged, both because reducing red values will make green and blue stand out, and because we are using an extremely limited color palette. However, the human eye is also better at picking up some colors than others, so perhaps these blue changes stand out disproportionately. We can try compensating, by reducing blue error to one third:
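One way to implement that compensation is a hypothetical variant of the `getError` helper from the Atkinson code above, which dampens only the blue channel's error before it is diffused (the 1/3 factor is the arbitrary value discussed here):

```python
BLUE_FACTOR = 1/3  # arbitrary compensation factor from the text

# Hypothetical variant of the earlier getError helper: same red and
# green error, but only a third of the blue error gets diffused
def getErrorBlueDamped(oldcolor, newcolor):
    dr = oldcolor[0] - newcolor[0]
    dg = oldcolor[1] - newcolor[1]
    db = (oldcolor[2] - newcolor[2]) * BLUE_FACTOR
    return (dr, dg, db)

print(getErrorBlueDamped((10, 20, 30), (0, 0, 0)))
```

Swapping this in for `getError` leaves the red and green diffusion untouched while under-correcting blue.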
That’s an arbitrary and unscientific compensation factor, but it’s removed the blue tint from the shadows in the image, and reduced the number of blue “snow” effects, suggesting there’s some merit to per-channel tuning. Here’s a side-by-side comparison of the original, palette reduction, and each dithering approach:
Especially at a smaller resolution, we can do a pretty good approximation with a color selection no wider than a big box of crayons. Cool!
Posted 1/16/2023
Dithering means intentionally adding noise to a signal to reduce large artifacts like color banding. A classic example is reducing a color image to black and white. Take this magnificent photo of my neighbor’s cat:
To trivially convert this image to black and white we can take each pixel, decide which color it’s closest to, and set it to that:
#!/usr/bin/env python3
from PIL import Image
import numpy as np

# Load image as grayscale
img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img)

for y, row in enumerate(pixels):
    for x,col in enumerate(row):
        if( pixels[y,x] >= 127 ):
            pixels[y,x] = 255
        else:
            pixels[y,x] = 0

bw = Image.fromarray(pixels)
bw.save("kacie_bw.png")
But the result is not very satisfying:
The cat is white. Every pixel will be closer to white than black, and we lose the whole cat except the eyes and nose, along with most of the background detail. But we can do better! What if we set the density of black pixels based on the brightness of a region? That is, black regions will receive all black pixels, white regions all white, but something that should be a mid-gray will get closer to a checkerboard of black and white pixels to approximate the correct brightness.
One particularly satisfying way to approach this regional checkerboarding is called error diffusion. For every pixel, when we set it to black or white, we record how far off the original color is from the new one. Then we adjust the color of the adjacent pixels based on this error. For example, if we set a gray pixel to black, then we record that we’ve made an error by making this pixel darker than it should be, and we’ll brighten the surrounding pixels we haven’t evaluated yet to make them more likely to be set to white. Similarly, if we round a gray pixel up to white, then we darken the nearby pixels to make them more likely to be rounded down to black.
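As a toy illustration of the idea (not the algorithm used below, just one-dimensional diffusion where each pixel passes its entire error to the next), consider dithering a single row of mid-gray pixels:

```python
# Toy one-dimensional error diffusion: round each value to 0 or 255 and
# pass the full rounding error to the next pixel in the row.
def diffuse_row(row):
    row = [float(v) for v in row]
    out = []
    for i, v in enumerate(row):
        new = 255 if v >= 127 else 0
        out.append(new)
        if i + 1 < len(row):
            row[i + 1] += v - new  # propagate the error forward
    return out

# A run of mid-gray becomes an alternating pattern whose average
# brightness matches the original
print(diffuse_row([128] * 6))  # [255, 0, 255, 0, 255, 0]
```

The mid-gray row comes out as exactly the checkerboard pattern described above, with roughly the right average brightness.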
In Floyd-Steinberg dithering we process pixels left to right, top to bottom, and propagate the error of each pixel to its neighbors with the following distribution:
That is, pass on 7/16 of the error to the pixel right of the one we’re examining. Pass on 5/16 of the error to the pixel below, and a little to the two diagonals we haven’t examined yet. We can implement Floyd-Steinberg dithering as follows:
def getClosest(color):
    if( color >= 127 ):
        return 255 # White
    return 0 # Black

def setAdjacent(pixels, y, x, error):
    (rows,cols) = pixels.shape[0:2]
    if( y >= rows or x >= cols or x < 0 ):
        return # Don't run past edge of image
    pixels[y,x] += error

# Load image as grayscale, into a signed int array so propagated
# error can push values outside 0-255 without wrapping around
img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img).astype(int)

for y,row in enumerate(pixels):
    for x,col in enumerate(row):
        old = pixels[y,x]
        new = getClosest(old)
        pixels[y,x] = new
        quant_error = old - new
        setAdjacent(pixels, y, x+1, quant_error*(7/16))
        setAdjacent(pixels, y+1, x-1, quant_error*(3/16))
        setAdjacent(pixels, y+1, x, quant_error*(5/16))
        setAdjacent(pixels, y+1, x+1, quant_error*(1/16))

dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
dithered.save("kacie_dithered_fs.png")
The results are a stunning improvement:
We’ve got the whole cat, ruffles on her fur, the asphalt and wood chips, details on rocks, gradients within shadows, the works! But what are those big black flecks across the cat’s fur? These flecks of “snow” impact the whole image, but they don’t stand out much on the background where we alternate between black and white pixels frequently. On the cat, even small errors setting near-white fur to white pixels build up, and we periodically set a clump of pixels to black.
We can try to reduce this snow by fiddling with the error propagation matrix. Rather than passing all of the error on to adjacent pixels, and mostly to the pixel to the right and below, what if we ‘discount’ the error, only passing on 75% of it? This is the diffusion matrix used in Atkinson dithering:
The code hardly needs a change:
img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img).astype(int) # Signed ints again, to avoid uint8 wraparound

for y,row in enumerate(pixels):
    for x,col in enumerate(row):
        old = pixels[y,x]
        new = getClosest(old)
        pixels[y,x] = new
        quant_error = old - new
        setAdjacent(pixels, y, x+1, quant_error*(1/8))
        setAdjacent(pixels, y, x+2, quant_error*(1/8))
        setAdjacent(pixels, y+1, x+1, quant_error*(1/8))
        setAdjacent(pixels, y+1, x, quant_error*(1/8))
        setAdjacent(pixels, y+1, x-1, quant_error*(1/8))
        setAdjacent(pixels, y+2, x, quant_error*(1/8))

dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
dithered.save("kacie_dithered_at.png")
And the snow vanishes:
This is a lot more pleasing to the eye, but it’s important to note that the change isn’t free: if you look closely, we’ve lost some detail on the cat’s fur, particularly where the edges of her legs and tail have been ‘washed out.’ After all, we’re now ignoring some of the error caused by our black and white conversion, so we’re no longer compensating for all our mistakes in nearby pixels. This is most noticeable in bright and dark areas where the errors are small.
I really like this idea of adding noise and propagating errors to reduce overall error. It’s a little counter-intuitive; by artificially brightening or darkening a pixel, we’re making an objectively worse local choice when converting a pixel to black or white. Globally, however, this preserves much more of the original structure and detail. This type of error diffusion is most often used in digital signal processing of images, video, and audio, but I am curious whether it has good applications in more distant domains.
If you enjoyed this post and want to read more about mucking with images and color, you may enjoy reading my post on color filter array forensics.
Posted 12/03/2022
My lab group recently asked me to give a tutorial on using SQL databases in science. While we are all complex systems scientists, my background is in computer science and STS, and many of my colleagues come from physics, mathematics, and philosophy, so we learn a great deal from one another. I’ve turned my slides into a blog post here, like my last lab talk on using Git for scientific software development.
A database is a piece of software for storing and organizing your data. Most importantly, databases make it easy to query your data, asking for subsets of your data that match a specific pattern you are interested in.
If you currently store your data in formats like CSV or JSON, and write lots of code for reading this data and searching through it for pieces relevant to your research question, our goal will be to offload all of this logic from your own code to a database. It will run much faster, it will be faster to write, and it will help you avoid bugs while expressing complicated questions simply.
There are many types of databases, but for this post I’ll split them along two axes: do they run locally (as part of your research code, storing data in a single file) or remotely (running as an independent process you speak to over the network), and does the database use SQL (a language for expressing sophisticated data queries) or not. Here’s a small subset of databases along these axes:
| | Local | Remote |
|---|---|---|
| SQL | SQLite, DuckDB | Postgresql, MySQL, MariaDB, MSSQL, … |
| NoSQL | Pandas (sorta), BerkeleyDB, … | Redis, MongoDB, Firebase, … |
In this post I’ll be focusing on SQLite and Postgresql as examples. I’ll briefly talk about NoSQL databases at the end, and the scenarios where they might be preferable to SQL databases.
SQLite stores all data in one file on your hard drive. SQLite is a library, so the database software runs inside of the software you write. It is trivial to set up, pretty fast (especially for queries), and has most database features we will want.
Critically, SQLite is ill-suited to concurrency. Since SQLite runs inside of your software, two different Python scripts can easily try to write to the same database file at the same time, risking catastrophic data corruption. You can build sophisticated locking mechanisms to ensure only one program accesses a database at once, but this adds a serious performance bottleneck. SQLite is really intended for a single piece of software to store data, not live setups where several applications write data at the same time.
Postgres runs as a software daemon: a process that runs all the time, storing data in a series of files and caches that it manages. Whether Postgres is running on your own computer or another machine, your research software communicates with it over the network.
This difference in design means that Postgres requires some additional bureaucracy to set up: users and passwords, databases and permissions and authentication. In return, however, Postgres is even faster than SQLite, and handles concurrent access from many applications trivially. Postgres also has a number of advanced features that are unavailable in SQLite.
Relational databases store data in tables (think spreadsheets), and in the relationships between tables.
| userid | firstname | lastname | status |
|---|---|---|---|
| zerocool | Dade | Murphy | Undergraduate |
| acidburn | Kate | Libby | Undergraduate |
| joey | Joey | Pardella | Undergraduate |
| cerealkiller | Emmanuel | Goldstein | Undergraduate |
| phantomphreak | Ramon | Sanchez | Undergraduate |
| lord_nikon | Paul | Cook | Graduate |
| building | room | desk | userid |
|---|---|---|---|
| Sage | 113 | 1 | zerocool |
| Sage | 113 | 2 | acidburn |
| Perkins | 208 | 7 | joey |
| West | 302 | 4 | lord_nikon |
You request data from a table using a `SELECT` statement of the form `SELECT columns FROM table WHERE condition`. For example:
SELECT userid FROM desks WHERE building='Innovation' AND room=413;
You can also combine multiple tables during a `SELECT` to gather related information. Here we fetch the names of all graduate students with a desk assigned to them, by selecting rows from the desks table and combining them with matching entries from the user table where the user IDs of both rows match:
SELECT firstname,lastname FROM desks LEFT JOIN users ON desks.userid=users.userid WHERE status='Graduate';
The following are the main commands for interacting with a SQL database to create relations, add, remove, and update information in relations, and select information from relations:
| Command | Description |
|---|---|
| SELECT | Return some columns from a relation |
| INSERT | Add data to a relation |
| DELETE | Remove data from a relation |
| UPDATE | Modify data in a relation |
| CREATE | Create a new relation (table, view, index) |
| DROP | Remove a relation |
| EXPLAIN | Show how a query will access data |
So that’s all fine in theory, but how do we write software that actually uses a database?
To connect to a SQLite database from Python we first supply a database filename, open a cursor from the connection, then use the cursor to send a query and get back a 2D array of results.
import sqlite3

conn = sqlite3.connect("university.db")
c = conn.cursor()
c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
for (f,l,b,r) in c.fetchall():
    print("%s %s has a desk in %s %d" % (f,l,b,r))
conn.commit() # Save any CREATE/INSERT changes to the database
conn.close()
You can think of the cursor as a finger tracking your position in the database. Multiple cursors allow us to make multiple queries from the same database and track which results were associated with which request.
Interacting with Postgres is similar to SQLite: we connect to the database, then open a cursor from the connection, and use the cursor to send queries and get results. However, Postgres is a daemon accessible over the network, so we’ll need to supply a hostname and port number where the SQL server can be found, the name of the database we want to reach, and a username and password authorized to connect to that database.
import psycopg2

conn = None
try:
    conn = psycopg2.connect(host="127.0.0.1",
                            port="5432",
                            user="registrar",
                            password="hunter2",
                            database="university_users")
    c = conn.cursor()
    c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
    for (f,l,b,r) in c.fetchall():
        print("%s %s has a desk in %s %d" % (f,l,b,r))
except psycopg2.Error as error:
    print("Error while connecting to PostgreSQL", error)
finally:
    if( conn ):
        conn.commit()
        conn.close()
Often your SQL statements will depend on other variables, and can’t be written as constant strings ahead of time. It may be tempting to assemble the SQL statement using string concatenation to insert variables. Never do this.
Consider the following example:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '" + name + "'")
matches = c.fetchall()
Given a student’s last name, look up all students with that name. You might find functionality like this on your university’s student directory. But what if a user enters input like `' OR 'a'='a`? The query now reads:
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR 'a'='a'
While a little clunky, this will return every user in the database. Worse yet, a malicious user might construct a query like:
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR password LIKE 'A%'
This would get them a list of all users whose password hashes start with ‘A’, then another query for ‘AA’, ‘AB’, and slowly an attacker can reconstruct the password hashes of every user at the university. This kind of attack is called a SQL Injection, and is a common vulnerability in websites. While scientific code is less likely to be directly attacked than a website, if you’re working with real-world data, especially web-scraped or user-gathered data, there can be all kinds of garbage in your input.
To avoid this vulnerability you can write your query first with placeholders for parameters, then tell SQL to complete the statement on your behalf. In SQLite this looks like:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE ?", [name])
Or in Postgresql:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE %s", [name])
Note that the placeholder is not wrapped in quotes: the driver handles quoting itself. In either case, the SQL engine will properly escape any text in the `name` field, ensuring that it’s interpreted as a string, and never as a SQL statement.
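To see the difference end to end, here is a hypothetical sketch against an in-memory SQLite database (table and data invented for the demo): the hostile input from above is bound as a plain string and matches nothing, while a normal search still works.

```python
import sqlite3

# Invented table and data, just to demonstrate parameter binding
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE users(userid TEXT PRIMARY KEY, firstname TEXT, lastname TEXT)")
c.execute("INSERT INTO users VALUES ('zerocool', 'Dade', 'Murphy')")

# Hostile input that would dump every row if concatenated into the query
name = "' OR 'a'='a"
c.execute("SELECT userid FROM users WHERE lastname LIKE ?", [name])
hostile_matches = c.fetchall()  # bound as a plain string, so it matches nothing

c.execute("SELECT userid FROM users WHERE lastname LIKE ?", ["Murphy"])
real_matches = c.fetchall()

print(hostile_matches, real_matches)
conn.close()
```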
Real-world data is messy. Maybe you assume that every office at a school is assigned to an employee or graduate student, but some were assigned to recent graduates or retirees and haven’t been re-assigned. Maybe you assume that all students have last names, but some international students come from cultures that use mononyms, and the last name field is empty. If you don’t check these underlying assumptions, you might not learn that you’ve made a mistake until hours of debugging later! Fortunately, SQL provides an easy way to describe and enforce your assumptions about data through constraints.
CREATE TABLE users(
    userid TEXT PRIMARY KEY,
    firstname TEXT NOT NULL,
    lastname TEXT,
    status TEXT NOT NULL
);
This definition of the user table includes four text fields, three of which cannot be empty. Further, the `userid` field must be unique: you can have two students with the same first and last name, but they must have different usernames. We can add more detailed restrictions to the desk-assignment table:
CREATE TABLE desks(
    building TEXT NOT NULL,
    room INT NOT NULL,
    desk INT NOT NULL,
    userid TEXT,
    FOREIGN KEY(userid) REFERENCES users(userid),
    UNIQUE(building,room,desk)
);
Here, we’ve explicitly said that the `userid` field must match some user ID in the `users` table. We’ve also said that while there can be multiple rooms in a building, and multiple desks in a room, there cannot be multiple desk 4’s in room 112 of Sage hall: the combination of building name, room number, and desk number must be unique.
If we try to insert any data into these tables that violates the described constraints, SQL will throw an exception instead of adding the new rows. Like unit testing but for your input data, constraints can help you be confident that your data follows the logic you think it does.
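For instance, with the users table defined above, violating either the primary key or a NOT NULL column raises `sqlite3.IntegrityError` (a sketch against an in-memory database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE users(
    userid TEXT PRIMARY KEY,
    firstname TEXT NOT NULL,
    lastname TEXT,
    status TEXT NOT NULL)""")
c.execute("INSERT INTO users VALUES ('lord_nikon', 'Paul', 'Cook', 'Graduate')")

duplicate_rejected = False
try:
    # A second row reusing the same primary key violates uniqueness
    c.execute("INSERT INTO users VALUES ('lord_nikon', 'Somebody', 'Else', 'Graduate')")
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_rejected = False
try:
    # NULL in a NOT NULL column is also refused
    c.execute("INSERT INTO users VALUES ('newuser', NULL, 'Smith', 'Undergraduate')")
except sqlite3.IntegrityError:
    null_rejected = True

print(duplicate_rejected, null_rejected)
```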
Compared to parsing CSV or JSON in Python and searching for the data you want, SQL will run inconceivably fast. But if you’re storing several gigabytes or more in your tables, even SQL databases will slow down. With a little bit of forethought we can make SQL queries run much faster.
Let’s say you have all your email stored in a database, with a structure something like:
CREATE TABLE emails(
    msgid TEXT PRIMARY KEY,
    from_address TEXT NOT NULL,
    to_address TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent_date INT NOT NULL
);
If we want to search for all emails received in the last week then SQL needs to search through every email in the table to check their sent dates. This is obviously highly inefficient, but we can warn SQL that we’ll be making these kinds of queries:
CREATE INDEX time_index ON emails(sent_date);
Creating an index tells SQL to build a binary tree, sorting the emails by `sent_date` to reduce lookups from O(n) to O(log n), dramatically improving performance. We can also build indices on multiple columns at once:
CREATE INDEX from_time_index ON emails(from_address,sent_date);
And now we can look up emails from a particular user in a particular time window in O(log n) - even better! Both SQLite and Postgresql will automatically create indices for primary keys and unique constraints, since they’ll need to perform lookups during every new insert to make sure the constraints aren’t violated. You’ll often be selecting data based on unique characteristics too, so in practice it isn’t always necessary to declare indices explicitly.
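You can check whether a query will actually use an index with the `EXPLAIN QUERY PLAN` prefix (the EXPLAIN command from the table earlier). A sketch using the emails table in an in-memory database; SQLite's exact plan text varies between versions, but the index name appears once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE emails(
    msgid TEXT PRIMARY KEY,
    from_address TEXT NOT NULL,
    to_address TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent_date INT NOT NULL)""")

def plan(query):
    # The last column of each EXPLAIN QUERY PLAN row describes the access strategy
    return " ".join(row[-1] for row in c.execute("EXPLAIN QUERY PLAN " + query))

query = "SELECT msgid FROM emails WHERE sent_date > 12345"
plan_before = plan(query)  # a full scan of the table
c.execute("CREATE INDEX time_index ON emails(sent_date)")
plan_after = plan(query)   # now a search using time_index
print(plan_before)
print(plan_after)
```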
Many SQL functions aggregate information from multiple rows. For example, we can count the number of users with:
SELECT COUNT(*) FROM users;
There are a variety of aggregate functions, including `AVG`, `MAX`, `MIN`, and `SUM`.
Often we don’t want to apply aggregate functions to every row, but to a sub-group of rows. Imagine we have a table of course registrations, like:
CREATE TABLE course_registration(
    userid TEXT,
    coursecode INT,
    credits INT,
    FOREIGN KEY(userid) REFERENCES users(userid),
    UNIQUE(userid,coursecode)
);
To ask how many credits each student is registered for we might query:
SELECT userid,SUM(credits) FROM course_registration GROUP BY userid;
The `GROUP BY` clause clusters rows based on their `userid`, then runs the aggregate function on each group rather than on all rows. We could also list the students in descending order by credit count like:
SELECT userid,SUM(credits) AS total_credits FROM course_registration GROUP BY userid ORDER BY total_credits DESC;
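Run against a small invented registration table in an in-memory SQLite database, the grouped query behaves as described (usernames and credit counts made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE course_registration(
    userid TEXT,
    coursecode INT,
    credits INT,
    UNIQUE(userid,coursecode))""")
c.executemany("INSERT INTO course_registration VALUES (?,?,?)",
              [("zerocool", 101, 3), ("zerocool", 250, 4), ("acidburn", 101, 3)])

c.execute("""SELECT userid, SUM(credits) AS total_credits
             FROM course_registration
             GROUP BY userid
             ORDER BY total_credits DESC""")
totals = c.fetchall()
print(totals)  # [('zerocool', 7), ('acidburn', 3)]
```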
Pandas is a ubiquitous Python package in data science. It makes it easy to store a table or a sequence of values as a Python object and perform some data analysis. It also integrates well with Seaborn, a package for statistical data visualization built on top of matplotlib. The two also integrate well with SQL. In just a couple lines, we can plot a histogram of how many credits students have registered for, from SQL to Pandas to Seaborn:
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

conn = sqlite3.connect("registration.db")
df = pd.read_sql("SELECT userid,SUM(credits) AS registered_credits FROM course_registration GROUP BY userid", conn)
ax = sns.histplot(data=df, x="registered_credits")
ax.set_title("Credit load across student body")
plt.savefig("credit_load.pdf", bbox_inches="tight")
conn.close()
Pandas will run a query for us (against SQLite, Postgresql, or a variety of other database types), put the result in a table with appropriate column names, and hand it off to Seaborn, which understands those same column names. Data analysis made easy!
For all the awesomeness of SQL, there are some tasks it is ill-suited to. If all you need is a way to store a dictionary so it persists, and make that dictionary accessible to multiple programs, then SQL is way more complexity and overhead than you need. Redis fills this niche well, and is simple and fast as long as you use it for this purpose.
If you have an enormous amount of data, terabytes worth, and need to update that data continuously, then SQL is a poor fit: every SQL server can only make one change to a table at a time and will have some kind of internal locking mechanism to prevent multiple writes from conflicting. This would be disastrous if, for example, you wanted to store all tweets in a database and need to save data as millions of users tweet at once. Here, tools like MongoDB step up, offering multiple database “shards” that will periodically sync with one another. This setup offers “eventual consistency” where a new tweet might not be available to all users right away, but things propagate pretty quickly, and in return we can handle huge numbers of updates at once.
More generally, SQL is a poor choice for:
Storing large files: maybe store metadata about files in a table, along with a pathname to where the file can be found on disk?
Storing unstructured data: you need to know what your rows and columns will be to put information in a spreadsheet. If your data is not uniform enough to describe in this way, then a spreadsheet is inappropriate.
Storing arbitrarily structured or nested data: if your data comes in the form of deeply nested JSON or XML, then a spreadsheet may be a poor choice. This is not always the case: if you have some nested JSON representing a tree of comments and replies on a website, then you may be able to “flatten” the tree by making each comment into a unique row, and including a “parent commentID” as a column. However, if different levels of nesting can have a wide variety of tags and meanings, this conversion may not always make sense. If you find that you’re storing a blob of JSON as a column in your table, then a table may not be the best representation for you.
For very specific types of data, like GIS, or network/graph data, there are specialized databases that may offer more task-appropriate tools than SQL.
SQL databases are an invaluable tool for any data scientist. They allow researchers to organize a wide variety of data so that it is easily and quickly queried to identify patterns and answer questions. SQL can simplify your code, preclude nasty bugs via constraints, and integrate nicely with most programming languages, especially common data science software packages.
This post offers a brief tutorial on using SQL, but there is an enormous depth available to accomplish more complicated tasks. In particular, I have left out:
Views and Materialized Views
Subqueries and Common Table Expressions
Much more detail on joins, unions, grouping, and partitioning
Functions, stored procedures, and triggers
Hopefully this gives you enough of a foundation to start using SQL in scientific contexts, and look up more details as you need them.
Posted 8/21/2022
Distributed Denial of Secrets publishes most of our datasets via torrent. This minimizes infrastructural requirements for us: every time someone downloads a release, if they leave their torrent client running, they help us upload to other interested people. Once many people have mirrored our release it can remain available even if we stop seeding, completely self-hosted by the public. This is ideal, because with our budget we’re unable to maintain seed boxes for every release simultaneously; we keep offline backups, but virtual machine storage is more expensive.
This system typically works well, especially for rapidly distributing new releases. However, occasionally an older release will become unavailable, either because interest has waned and seeds have dropped offline, or because the trackers used by the torrent are no longer functional. If someone reports that a torrent is unavailable then we can pull the data from our backups and resume seeding, and issue a new magnet link containing an updated list of trackers. Unfortunately, that’s reactive, slow, and tedious. How can we proactively monitor the availability of all our torrents, to notify us when one requires attention?
Specifically, we want to build a dashboard that displays a list of torrents, and for each indicates how many trackers are online, how many peers those trackers are aware of, and how many peers can be found in the distributed hash table (DHT). It should track this information over the course of a day, a week, and a month, so we can distinguish between short-term and permanent loss of availability.
Every torrent client has the functionality to locate peers through trackers, and most modern clients can also find peers through the DHT. However, most clients do not provide a way to use that functionality without starting a download for the torrent, nor do they provide a way to export that peer information so we can plot availability over time. There are a few libraries for handling torrents, like libtorrent, but these also don’t easily expose peer-discovery independently from downloading. Fortunately, there are libraries for performing bittorrent DHT lookups, so our primary hurdle is implementing the client side of the bittorrent tracker protocol, described in BEP 0003, BEP 0015, and BEP 0023.
Torrent trackers are conceptually simple:
A torrent or magnet link contains a list of trackers
Any client interested in downloading the torrent data contacts each tracker
The client announces the hash of the torrent they’re interested in, registering their interest with the tracker
The tracker returns a list of any other IP addresses that have recently registered interest in the same content
The client periodically re-registers its interest with the tracker, to identify any new peers, and ensure it remains discoverable to others
From there the client contacts each discovered peer directly, and negotiates a download. Since we’re only interested in peer discovery, we don’t have to follow along further than this.
Clients can communicate with trackers using two protocols: older trackers communicate using HTTP, but far more common is the newer, simpler, faster UDP-based protocol. In both protocols, clients can make announce requests, which announce their interest in a torrent, and scrape requests, which fetch some aggregated metadata about the number of clients interested in a torrent.
Unfortunately, scrape requests have little utility for our purposes: If one tracker says that it knows 7 peers, and another tracker says it knows 3, how many peers are there? 7? 10? Somewhere in-between? We can’t aggregate information across trackers without fetching the list of peer IP addresses from each tracker, which requires using an announce request.
The tracker HTTP protocol is deceptively simple. A tracker URL looks something like `http://tracker.opentrackr.org:1337/announce`. This contains the domain name of the tracker, the port number, and the resource for the request (typically “announce”). To send a request, the client adds several fields:
| Field | Description |
|---|---|
| info_hash | A URL-encoded version of the torrent’s SHA-1 info hash |
| peer_id | A random string uniquely identifying the client |
| port | The port number on which the client can be reached |
| uploaded | The number of blocks the client has already uploaded |
| downloaded | The number of blocks the client has downloaded |
| left | How many blocks the client still needs to download |
Therefore a full request to a tracker may look something like:
http://tracker.opentrackr.org:1337/announce?info_hash=%5Bg%03%95%28%0A%3F%3F**%0A%CFs%D4K%2C%CE%0F%E1%AE&peer_id=foo&port=1234&uploaded=0&downloaded=0&left=0
Note that the `uploaded`, `downloaded`, and `left` fields are required, but are only hints. If the client is downloading a magnet link, it may not know how large the torrent data is, and therefore how much is left to download. This self-reported metadata isn’t verified in any way; the tracker just uses it to report some analytics.
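Putting these fields together, a client might assemble the announce URL like so (the tracker hostname and 20-byte infohash below are made up for illustration; `urlencode` percent-escapes the raw hash bytes for us):

```python
from urllib.parse import urlencode

def buildAnnounceURL(tracker_url, infohash, peer_id, port):
    # urlencode percent-escapes the raw infohash bytes
    params = urlencode({
        "info_hash": infohash,  # raw bytes of the hash
        "peer_id": peer_id,
        "port": port,
        "uploaded": 0,    # required, but only hints
        "downloaded": 0,
        "left": 0,
    })
    return tracker_url + "?" + params

# Made-up 20-byte infohash and tracker, for illustration only
fake_hash = bytes(range(20))
url = buildAnnounceURL("http://tracker.example.org:1337/announce", fake_hash, "foo", 1234)
print(url)
```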
Once the client makes an announce request to a tracker, the tracker responds with either an HTTP error, or with a text-encoded dictionary describing available peer data for the torrent. Great, so does the tracker respond with some JSON? XML? YAML? No, it responds with Bencode! This is a custom text-encoding scheme made for bittorrent metadata that can encode:
Field type | Encoding rule | Example |
---|---|---|
integers | Prefix with an i , then the integer in ascii-base10, then an e | 7 becomes i7e |
bytestrings | Length-prefixed, then a colon, then the string | “foo” becomes 3:foo |
lists | Start with an l , then the contents of the list, then an e | [2,3] becomes li2ei3ee |
dictionaries | Start with a d , then the contents of the dictionary, then an e . Entries are sorted by key, each consisting of a string key followed immediately by a value | {"foo": 1, "bar": 2} becomes d3:bari2e3:fooi1ee |
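The encoding rules above fit in a few lines of Python. Here’s a minimal, hypothetical encoder (not any client’s actual implementation; note the spec requires dictionary keys in sorted order):

```python
def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists, and dicts into bencode."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode()
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Keys must be bytestrings, and must appear in sorted order
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")

print(bencode(7))                     # b'i7e'
print(bencode("foo"))                 # b'3:foo'
print(bencode([2, 3]))                # b'li2ei3ee'
print(bencode({"foo": 1, "bar": 2}))  # b'd3:bari2e3:fooi1ee'
```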
The tracker may respond with a Bencoded dictionary with a key of failure reason and a value of some explanatory text string like “this tracker doesn’t have information on that torrent” or “you’ve been rate-limited”. Otherwise, it’ll respond in one of two ways:
In the older bittorrent tracker standard 3, trackers respond with a dictionary containing the key peers and a value of a list, where each entry is a dictionary containing contact information for that peer. For example (translated to json):
{
"peers":
[
{"ip": "1.2.3.4", "port": 4567},
{"ip": "2.3.4.5", "port": 5678}
]
}
Or in the glorious bencode:
d5:peersld2:ip7:1.2.3.44:porti4567eed2:ip7:2.3.4.54:porti5678eeee
There may be a variety of other keys (a “peer ID” in the peer dictionary, or metadata like “number of seeds, peers, and leeches” at the top level), but this is all we need for our purposes.
All this text encoding gets a little tedious, so in an amendment to the tracker spec (standard 23), trackers may now instead return a binary string in the “peers” field, like:
{
"peers": "\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2E"
}
Or in bencode again:
d5:peers12:\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2Ee
This is equivalent to the dictionary above: the first four bytes are an integer IP address, followed by two bytes for a port, then another six bytes for the next IP address and port. The hex-escaping is added here for illustration purposes; the tracker would return those raw bytes.
While this string compression doesn’t save much in our two-peer example, it’s significantly more compact when handling dozens or hundreds of peers.
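Decoding the compact format is just slicing six-byte chunks. A sketch in Python, assuming an IPv4-only peers string:

```python
import struct

def parse_compact_peers(data: bytes) -> list[tuple[str, int]]:
    """Split a compact 'peers' string into (ip, port) pairs."""
    peers = []
    for i in range(0, len(data), 6):
        ip = ".".join(str(b) for b in data[i:i + 4])       # four IP octets
        (port,) = struct.unpack(">H", data[i + 4:i + 6])   # big-endian u16
        peers.append((ip, port))
    return peers

# The two peers from the dictionary example above
blob = bytes([1, 2, 3, 4]) + struct.pack(">H", 4567) \
     + bytes([2, 3, 4, 5]) + struct.pack(">H", 5678)
print(parse_compact_peers(blob))  # [('1.2.3.4', 4567), ('2.3.4.5', 5678)]
```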
HTTP is unwieldy. It takes many packets, the server might use gzip compression, maybe the server requires HTTPS, or goes through some redirects before responding. Once the server responds, it might respond with a variety of HTTP errors, and while it should respond with bencoded data, servers often return HTML in error. Even when they return bencoded data, they sometimes follow the bencode spec incorrectly. In short, supporting HTTP in torrent clients is a complicated mess. But it doesn’t need to be this way! The information the client and server are exchanging is relatively simple, and we can communicate it in just a handful of UDP packets. So begins bittorrent specification 15.
First, we need to perform a handshake with the server:
The client sends a magic number confirming that they are using the torrent tracker protocol, as opposed to random Internet traffic like a port scan. Then they send an action (0: connect), and a random transaction ID to identify datagrams connected to this session.
If the tracker is online, it will respond to complete the handshake:
The tracker sends back action 0 (responding to the connect request), the same transaction ID the client sent, and a random connection ID. The client will include this connection ID in future datagrams. This handshake prevents IP address spoofing, as used in DNS amplification attacks where an attacker coerces a DNS server into flooding a third party with traffic.
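The connect handshake fits in two struct calls. A sketch of building and parsing those datagrams (field layout and magic constant per the UDP tracker spec; socket handling omitted):

```python
import secrets
import struct

PROTOCOL_MAGIC = 0x41727101980  # fixed magic number identifying the protocol
ACTION_CONNECT = 0

def build_connect_request() -> tuple[bytes, int]:
    """Build the 16-byte connect datagram and its transaction ID."""
    transaction_id = secrets.randbits(32)
    packet = struct.pack(">QII", PROTOCOL_MAGIC, ACTION_CONNECT, transaction_id)
    return packet, transaction_id

def parse_connect_response(packet: bytes, expected_txid: int) -> int:
    """Verify the tracker's reply and extract the connection ID."""
    action, txid, connection_id = struct.unpack(">IIQ", packet)
    if action != ACTION_CONNECT or txid != expected_txid:
        raise ValueError("unexpected connect response")
    return connection_id  # echo this in later announce datagrams
```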
The client may now send its announce request (action code 1: announce):
This uses the connection ID from the previous step and a new transaction ID, followed by the info hash of the torrent and a peer ID representing this client. Then the client sends some metadata regarding how far along its download is (matching the downloaded, left, and uploaded fields in the HTTP spec). Finally, the client sends the IP address and port it can be reached at (although trackers will typically ignore the IP address field and use the IP that sent the request, again to prevent spoofing), a key identifying the client, and an unused num_wanted field.
If the client has both an IPv4 and an IPv6 address, and is therefore looking for both v4 and v6 peers, then it must make two announce requests, over v4 and v6, using the same key. This allows the tracker to avoid “double-counting” the number of peers interested in a torrent.
Finally, the tracker responds with peer data:
Here, the action and transaction ID match the previous datagram, and the interval indicates how long the client should cache results for before polling the tracker again. The leechers and seeders counts are the tracker’s guess as to how many peers are mostly-downloading or mostly-uploading, based on the downloaded, left, and uploaded fields from each announce request. These counts are not authoritative, or confirmed by the tracker in any way.
And at last, the tracker responds with a series of IP addresses and port numbers: 4 bytes per address (assuming IPv4, 16 bytes for IPv6), and two bytes per port number.
That’s all there is to the UDP protocol! Keep in mind that all values should follow network byte-order (big endian). While the diagrams make this protocol look complicated, there’s far less parsing or error handling needed than for the HTTP version, no external libraries required, and the entire exchange occurs in just 4 packets. No wonder the majority of torrents only use UDP trackers!
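Parsing the final announce response follows the same pattern: five big-endian 32-bit header fields, then six bytes per IPv4 peer. A sketch, with network I/O omitted:

```python
import struct

def parse_announce_response(packet: bytes, expected_txid: int):
    """Extract interval, peer counts, and the peer list from the response."""
    action, txid, interval, leechers, seeders = struct.unpack_from(">IIIII", packet)
    if action != 1 or txid != expected_txid:
        raise ValueError("unexpected announce response")
    peers = []
    for i in range(20, len(packet), 6):   # header is 5 * 4 = 20 bytes
        ip = ".".join(str(b) for b in packet[i:i + 4])
        (port,) = struct.unpack(">H", packet[i + 4:i + 6])
        peers.append((ip, port))
    return interval, leechers, seeders, peers
```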
With the tracker protocol implemented, we can take a list of torrents, extract their list of trackers, and look up peers from each tracker. We can also look up peers in the DHT using third party code. From here, it’s a simple process to make a SQL database to track all that information with timestamps, select results from those tables based on their age, and at last throw up an interface to peruse it:
In the hopes that this code might benefit others, it’s been released on GitHub.
Posted 6/8/2022
This post is a non-academic summary of my most recent paper, which can be found here. It’s in a similar theme as a previous paper, which I discussed here, but this post can be read on its own. An enormous thank you to my fantastic co-authors Josh Minot, Sam Rosenblatt, Guillermo de Anda Jáuregui, Emily Moog, Briane Paul V. Samson, Laurent Hébert-Dufresne, and Allison M. Roth.
If you wanted to find QAnon believers on Twitter, YouTube, or Reddit, you might search for some of their flavorful unique vocabulary like WWG1WGA (“Where we go one, we go all”). To find cryptocurrency enthusiasts, you might search for in-group phrases like HODL or WAGMI, or “shitcoins”, or specific technologies like “NFT” or “ETH”. This works well for new, obscure communities, when no one else has picked up on their vocabulary. However, once a community reaches the limelight, the keyword-search strategy quickly deteriorates: a search for “WWG1WGA” is now as likely to find posts discussing QAnon, or ridiculing them, as it is to identify true believers.
Human observers with some contextual understanding of a community can quickly distinguish between participants in a group, and discussion about (or jokes about) a group. Training a computer to do the same is decidedly more complicated, but would allow us to examine exponentially more posts. This could be useful for tasks like identifying covid conspiracy communities (but distinguishing them from people talking about the conspiracists) or identifying a hate group (but distinguishing from people discussing hate groups). This, in turn, could help us to study the broad effects of deplatforming, by more systematically examining where communities migrate when they’re kicked off a major site. Those possibilities are a long way off, but distinguishing participants in a group from onlookers talking about the group is a step towards the nuance in language processing we need.
Our study focuses on a simple version of this problem: given a subreddit representing an in-group, and a subreddit dedicated to discussing the in-group, automatically label commenters as being part of the in-group or onlookers based on the text of their comments. We use the following list of subreddit pairs:
In-Group | Onlooker | Description |
---|---|---|
r/NoNewNormal | r/CovIdiots | NoNewNormal discussed perceived government overreach and fear-mongering related to Covid-19 |
r/TheRedPill | r/TheBluePill | TheRedPill is part of the “manosphere” of misogynistic anti-feminist communities |
r/BigMouth | r/BanBigMouth | Big Mouth is a sitcom focusing on puberty; BanBigMouth claimed the show was associated with pedophilia and child-grooming, and petitioned for the show to be discontinued |
r/SuperStraight | r/SuperStraightPhobic | SuperStraight was an anti-trans subreddit, SuperStraightPhobic antagonized its userbase and content |
r/ProtectAndServe | r/Bad_Cop_No_Donut | ProtectAndServe is a subreddit of verified law-enforcement officers, while Bad_Cop_No_Donut documents law enforcement abuse of power and misconduct |
r/LatterDaySaints | r/ExMormon | LatterDaySaints is an unofficial subreddit for Mormon practitioners, while ExMormon hosts typically critical discussion about experiences with the church |
r/vegan | r/antivegan | Vegan discusses cooking tips, environmental impact, animal cruelty, and other vegan topics. AntiVegan is mostly satirical, making fun of “vegan activists” |
Some of these subreddit pairs are directly related: r/TheBluePill is explicitly about r/TheRedPill. Other subreddit pairs are only conceptually connected: r/Bad_Cop_No_Donut is about law enforcement, but it’s not specifically about discussing r/ProtectAndServe. This variety should help illustrate under what conditions we can clearly distinguish in-groups from onlookers.
For each subreddit pair, we downloaded all comments made in each subreddit during the last year in which they were both active. In other words, if one or both subreddits have been banned, we grab the year of comments leading up to the first ban. If both subreddits are still active, we grab the comments from the last 365 days to present.
We discarded comments from bots, and comments from users with an in-subreddit average karma below one. This is to limit the effect of users from an onlooking subreddit “raiding” the in-group subreddit (or vice versa), and therefore muddying our understanding of how each subreddit typically writes.
Next, we want to identify the words used far more in the in-group than the onlooking group, or vice versa. There are a variety of ways of measuring changes in word-usage, including rank turbulence divergence (which words have changed the most in terms of their order of occurrence between one dataset and another) and Jensen-Shannon divergence (the difference in word frequency between each subreddit and a combination of the two subreddits).
For example, here’s a plot illustrating which words appear more prominently in r/NoNewNormal or r/CovIdiots, based on each word’s “rank”, where rank 1 is the most-used word, and rank 10,000 is the 10,000th most-used word:
While we know both subreddits feature terms like “vaccine”, “mask”, and “covid”, this plot tells us that terms like “doomer”, “trump”, and “lockdown” are used disproportionately in our in-group, while disparaging terms like “idiot”, “stupid”, and “moron” are far more common in the onlooker group.
We can already see one limitation of this study: the most distinguishing term between our two subreddits is “covidiot”, a term developed on r/CovIdiots. We’re not just capturing some context around the in-group’s use of terminology, we’re identifying keywords specific to this community of onlookers, too.
Now that we’ve had a peek at the data, and have confirmed that there are terms that strongly distinguish one community from its onlookers, we want to build a classifier around these distinguishing terms. Specifically, for every user we want to get a big text string consisting of all of their comments, the classifier should take this comment string as input, and return whether the user is in the in-group or the onlooker group.
Since we know whether each user participates mostly in the in-group subreddit, or the onlooking subreddit, we’ll treat that as ground-truth to measure how well our classifier performs.
We built two classifiers: a very simple logistic-regression approach that’s easy to reverse-engineer and examine, and a “Longformer” transformer deep-learning model that’s much closer to state-of-the-art, but more challenging to interrogate. This is a common approach that allows us to examine and debug our results using our simple method, while showing the performance we can achieve with modern techniques.
We trained the logistic regression model on term frequency-inverse document frequency; basically looking for words common in one subreddit and uncommon in another, just like in the plot above. We configured the Longformer model as a sequence classifier; effectively “given this sequence of words, classify which subreddit they came from, based on a sparse memory of prior comments from each subreddit.”
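This isn’t our exact pipeline, but the TF-IDF features behind the simple classifier can be sketched in pure Python (unsmoothed IDF; the tokens here are invented toy data):

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency-inverse document frequency for a list of token lists."""
    n = len(docs)
    df = Counter()                 # in how many documents does each term appear?
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# One "document" of concatenated comments per toy user
docs = [
    "doomer lockdown doomer overreach".split(),
    "covidiot moron covidiot mask".split(),
]
scores = tfidf(docs)
# "doomer" is frequent in doc 0 and absent from doc 1, so it scores highest
print(max(scores[0], key=scores[0].get))  # doomer
```

These per-user score vectors are the kind of input a logistic regression can separate.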
Here’s our performance on a scale from -1 (labeled every user incorrectly) to 0 (did no better than proportional random guessing) to 1 (labeled every user correctly):
In-Group | Onlooker | Logistic Regression Performance | Longformer Performance |
---|---|---|---|
r/NoNewNormal | r/CovIdiots | 0.41 | 0.48 |
r/TheRedPill | r/TheBluePill | 0.55 | 0.65 |
r/BigMouth | r/BanBigMouth | 0.64 | 0.80 |
r/SuperStraight | r/SuperStraightPhobic | 0.35 | 0.43 |
r/ProtectAndServe | r/Bad_Cop_No_Donut | 0.50 | 0.55 |
r/LatterDaySaints | r/ExMormon | 0.65 | 0.72 |
r/vegan | r/antivegan | 0.49 | 0.56 |
Or, visually:
Much better than guessing in all cases, and for some subreddits (BigMouth, LatterDaySaints, and TheRedPill) quite well!
If a user has barely commented, or their comments all consist of responses like “lol”, classification will be near-impossible. Therefore, we can re-run our analysis, this time only considering users who have made at least ten comments, with at least one hundred unique words.
In-Group | Onlooker | Logistic Regression Performance | Longformer Performance |
---|---|---|---|
r/NoNewNormal | r/CovIdiots | 0.57 | 0.60 |
r/ProtectAndServe | r/Bad_Cop_No_Donut | 0.65 | 0.76 |
r/LatterDaySaints | r/ExMormon | 0.80 | 0.83 |
r/vegan | r/antivegan | 0.65 | 0.72 |
And visually again:
For a few subreddit pairs, the onlooking subreddit has too few comments left over after filtering for analysis to be meaningful. For the four pairs that remain, performance improves significantly when we ignore low-engagement users.
Similarly, we can examine what kinds of users the classifier labels correctly most-often:
The classifier performs better on users with more comments (and therefore more text to draw from), and more karma in the subreddit (which typically correlates with number of comments unless the user is immensely unpopular), but does not significantly differ with mean subreddit karma. In other words, popular users who receive lots of karma on many of their comments, and therefore might be more representative of the subreddit’s views, are not easier to classify.
For a first attempt at solving a new problem, we have some promising results. We can consistently distinguish users from an in-group and users from a specific onlooking group, based on the language of users’ posts. Our study focuses on subreddits, which provide a best-case scenario for classification: comments are neatly partitioned into the in-group and onlooker subreddits. If we studied Twitter users, for example, we’d have no baseline to determine whether our classifier was guessing correctly, or even a good way to feed it training data, without human annotators labeling thousands of Twitter accounts by hand.
It’s also unclear how well this classifier would function in a cross-platform environment. For example, could we train the classifier on a subreddit, and then classify Twitter or Discord users based on their comments? Theoretically, the same community will discuss the same topics on multiple platforms, likely with similar keywords. However, the design of each platform (such as the short character limits on Tweets) may constrain authors enough to make classification harder.
Finally, it’s unclear how well this classification will hold up over time. Would a classifier trained on last year’s comments still perform well on users from this year? Or will the discussion topics of a community have drifted too far for those old word frequencies to be useful? This could be especially important when communities migrate between platforms, when we may for example have old Reddit data and new Discord data.
Lots more to do, but I’m excited about these first steps!
Posted 5/10/2022
I recently gave a talk to the Joint Lab on using git to write code collaboratively with other scientists. I’m going to recycle that talk in blog form, in case anyone else would like a quick crash-course. This post assumes that you’ve used git a few times before, and are trying to get a better handle on it.
My examples use the command line and diagrams to explain the concepts behind git operations. You may find graphical git software, like GitHub Desktop, or the git integration in your text editor, more approachable. That’s perfectly fine; I just find the command line makes git operations more explicit, and so is clearer for teaching.
Git is a version-control system: it tracks changes to files in a folder, typically source code, allowing developers to undo their changes, examine the history of changes, and merge changes that they and others have made.
A group of changes to one or more files is called a commit. Each commit includes a reference to the previous commit, creating a timeline of every change since the start of the project.
The current state of the repository can be described as the sum of the most recent commit, and every ancestor commit that came before it. This state is referred to as HEAD.
A git branch is a series of commits. The default branch is typically called main or master. More on branches later.
Git can technically be used offline: you can create a local repository on your computer, track changes to files in a project, and never share the commits with anyone. However, we most frequently use git as a collaboration tool, typically by creating a repository on a central site like GitHub or GitLab and giving multiple developers access.
When you clone a GitHub repository, your computer assigns a name to the remote server, origin by default. Your computer then downloads the repository contents, and creates two branches: an origin/main branch, representing the commits GitHub is aware of, and a main branch, representing your own work.
When you pull new changes down from GitHub, git first downloads the commits to the origin/main branch, then fast-forwards your own main branch to match:
Similarly, when you push local changes to GitHub, you’re sending changes from your own main branch to GitHub, so the origin/main branch catches up to your own work:
If you are the only developer on a repository, and you only develop from one computer, then you can push and pull to your heart’s content. However, if someone else has pushed changes to GitHub since you have, then you have a conflict, where GitHub has commits that you don’t, and you have commits that GitHub doesn’t:
In order to sync your local commits with GitHub, you need to resolve the conflict with one of two strategies.
If the commits on GitHub (C in this diagram) edit different files than the local commits (D and E), then the “conflict” is purely bureaucratic. The file changes aren’t incompatible, you just need to re-order the commits to create a coherent timeline. The rebase command will rewrite your local commits so that they come after C:
The full commands to accomplish this are:
git fetch origin main
git rebase origin/main
The first downloads changes from GitHub (origin) to the origin/main branch, but does not attempt to combine them with the local main branch. The second rebases the local commits to occur after the origin/main commits.
If commit C does change the same files as commits D or E, then the two histories may be incompatible. In order to combine them, we need to add a new commit, F, that incorporates the changes from all three commits, possibly changing files further to make the changes mesh.
Similarly, the commands for this are:
git fetch origin main
git merge origin/main
If you’ve never run these commands before, that’s because git pull is shorthand for the above two commands!
So far we’ve only considered repositories with a single branch: main. However, using multiple branches is key to successful collaboration. You can create side branches to work on a new feature, or otherwise make large breaking changes to a codebase. Branches let you work in a corner, contain your mess, and only merge those changes back to the main branch when you’re done and ready to share.
Creating a local branch is as easy as running:
git checkout -b new_feature
This creates a new branch called new_feature, and switches HEAD to track the new branch. When you add new commits, they’ll now be added to the new_feature branch instead of main:
To switch back to the main branch, run git checkout main:
This will un-apply the changes from commits D and E, reverting the codebase to the state of the main branch. If you’ve made changes that you haven’t committed, git will not let you change branches. In other words, switching branches will never destroy your work.
When you’re ready to combine your changes from your side branch to the main branch, simply checkout the main branch, and run git merge new_feature or git rebase new_feature as appropriate.
By default, newly created branches only exist in your local repository, and are not pushed up to GitHub. This is so that you can do as much internal bookkeeping and branching as you want, without cluttering what other developers see. Make side branches to your heart’s content!
However, if you want to share your branch with other developers, to collaborate on a side branch without merging everything to main, then you need to create a “remote” branch.
If you already have a local branch checked out, you can create a corresponding remote branch (and then push to send those changes to GitHub) with:
git branch --set-upstream-to origin/new_feature
git push
If you are creating a new branch for the first time, and already know you want to share it with others, you can create the branch like:
git checkout -b new_feature origin/new_feature
In either case, your commit tree will now look like:
From now on, you can git push and git pull in the new_feature branch to sync it to GitHub’s origin/new_feature branch.
If you’ve made some changes to the main branch that you wanted to put on new_feature, and you haven’t committed yet, then you can move those changes over by stashing them in a corner:
git stash
git checkout new_feature
git stash pop
Stashing uncommitted changes bypasses git’s rule about not permitting you to switch branches before committing, because your changes will not be overwritten this way.
If you’ve already committed your changes to main instead of new_feature, and you haven’t pushed yet, then you can fix the mishap with:
git checkout new_feature
git merge main
git checkout main
git reset --hard HEAD~3
This moves the changes to the new_feature branch (by merging them in from main), then rewrites history on the main branch by undoing the last three commits.
Since the commits are still referenced by the new_feature branch, no work is lost, but they no longer appear in the main branch’s history.
Usually when you make a mistake, like committing a typo, the solution is to add a new commit that corrects the mistake. There is rarely any need to rewrite history and un-commit something.
One glaring exception is when you commit security-sensitive information. For example, if you accidentally commit your Amazon EC2 API keys to a public GitHub repository, then adding a new commit to erase them is insufficient. Anyone could look in the commit history of the project and fetch the keys back out.
In this rare circumstance, you can roll back the most recent commits on the local main branch with:
git reset --hard HEAD~3 # Replace '3' with the number of commits to undo
And then push those changes up to GitHub, being very explicit about forcing GitHub to update their origin/main to match your local main, even though you aren’t adding new commits:
git push origin main --force
Note that if anyone else has pulled the mistaken commit down from GitHub, they’ll still be able to view the withdrawn credentials, and they’ll now get an error when they try to pull new changes down. This is because their own git history is now incompatible with the one on GitHub. To fix this, they need to reset their main branch to overwrite their local history with GitHub’s:
git fetch origin main
git reset --hard origin/main
Needless to say, this is extremely messy, and should be avoided whenever possible.
Git has a special configuration file, .gitignore, that tells it to ignore specific files. For example, you might include a .gitignore file at the top level of the repository containing:
*.png
database_credentials.json
logs/
This will tell git to ignore all PNG files, the database credentials file, and the entire logs folder. You will never be prompted to add or commit those files, so there is no risk of accidentally pushing them to a public repository.
Git has poor support for moving and renaming files. When you rename a file without telling git, it compares the contents of the “new” file and the “missing” file, and if the files are identical or extremely similar, it assumes the file has simply been moved. However, if you rename a file and then make some changes to it, chances are git will no longer recognize it as the same file, and will prompt you to commit deleting the old file and creating a new one. This is inconvenient, since it destroys the history associated with the file, making it difficult to track changes. Fortunately, the solution is simple: move your files using git:
git mv oldfile newfile
This guarantees that git knows the file is renamed, and maintains history.
Git is a powerful tool for collaborating with others on code. It has an extraordinary breadth of functionality, because it was created by the Linux kernel developers to help them manage what is likely the single most complicated open source project in human history. However, this post covers most of the basics, and as much as I’ve needed for almost any project. I picked up git on my own, as I think many scientists do, and for several years felt like I half-understood what I was doing and stumbled my way through getting git to do what I wanted. Hopefully this can help someone in a similar place!
Posted 4/10/2022
Distributed Denial of Secrets (DDoSecrets) is a transparency collective. We’ve seen some questions and confusion recently as to what that means, so I’d like to elaborate. See also our about page, our Wikipedia page, and articles here and here.
When whistleblowers within an organization, or hackers that have gained access to files, want to get the word out, DDoSecrets can:
Announce the release to a wide audience
Check the files for personally identifying information
Package a release for public consumption, via torrents or web-search interfaces
Contact journalists at a wide range of publications
Collaborate with academic researchers
Share more sensitive documents with journalists and researchers, without making all data public
That’s our primary role: connectors and proliferators of data.
For most releases, a source contacts us anonymously. They describe what data they have, we have a conversation to establish veracity and the scope of documents, and negotiate a safe way to move the data to us. We typically do not know who our sources are, and do not have contact with them after they submit documents. Occasionally a source will ask to be credited publicly, in which case we verify their identity and attribute the leak to them.
We also mirror releases published by other groups. For example, ransomware groups sometimes publish their victims’ data to criminal forums when their (typically corporate) victims refuse to pay. Other criminals can then peruse the release looking for usernames and passwords, personal information, and anything else they can profit off of. By making a copy of these releases for journalists, we hope that some social good can come out of what would otherwise only benefit other bad actors.
We’ve also published datasets on behalf of other public organizations, notably including the .Win Network archive originally scraped by the Social Media Analysis Toolkit (SMAT) team, and the Patriot Front files, sourced by Unicorn Riot.
When considering a new release, we have four main outcomes: discard the data, publish it publicly in full, publish it only to journalists and researchers, or publish a redacted subset of the documents publicly.
When we publish a release publicly, we typically release the documents via torrent, with private backups to reseed the torrents if needed. When possible, we also release documents through websites like DDoS Search, which can make data easier to browse, explore, and analyze.
When we release data with limited distribution, journalists and researchers contact us (or vice versa), and we negotiate a way to share files with them. If it’s feasible to filter out personal information, then we’ll publish the subset of low-risk documents publicly. However, given our limited staff, and the large size of many of these datasets (millions of emails and documents), we frequently mark datasets for limited distribution to vetted researchers.
In either case, we add a detailed summary of the release to our website and usually announce publication via our newsletter, our Telegram channel, Twitter, and whatever other platforms are useful.
Usually our role ends as a publisher, but when resources permit, we also work with some journalists and academics to analyze datasets. Sometimes this aid is short-term; when our releases include content like database dumps that not all newsrooms are equipped to investigate, we’ve been able to help some news groups run a SQL query to find relevant information and export it in a simpler format like a spreadsheet.
In other cases, our collaborations have been more involved. DDoSecrets has worked on months-long investigations with:
The Organized Crime and Corruption Reporting Project (OCCRP) on 29Leaks
The University of Southern Maine and Harvard’s Institute of Quantitative Social Science, resulting in this paper so far
65 journalists and 20 outlets, including Forbidden Stories, on Mining Secrets
Our role in these collaborations has ranged from coordinating who’s investigating which sections of a release and collating results, to providing infrastructure (wikis for coordination, interactive sites for document search, and virtual machines for analyzing datasets), to writing our own document parsers and aiding in network analysis.
Finally, DDoSecrets aggregates investigations into our releases, summarizing and linking to articles on our newsletter and our wiki entries. Some of us interview with journalists, speak on podcasts or on television, and present at conferences about our work, and what our releases and investigations have revealed.
If you’re a fan of what we do, please consider donating, volunteering, and helping spread information about our work. We’re a small team with limited funding, and every bit helps!
Posted 3/30/2022
We know that metadata is incredibly revealing: given information about who you talk to, we can construct a social graph that shows which social circles you’re in and adjacent to, and we can predict your politics, age, and a host of other attributes.
But how does social graph prediction work? How do you filter out noise from “real” social connections? How accurate is it, and in what cases does it make mistakes? This post introduces one approach based on the expectation-maximization algorithm, to start a discussion.
We have texting logs from 20 individuals, showing how many times they texted every other participant over the course of a week. We have no additional data, like the timestamps of messages, length, or contents. We also won’t consider the directionality of who sent the texts, just “how many texts were sent between person 1 and 2?” Given these texting logs, we want to find the most probable friendship graph between these 20 people. We will also assume that friendships are always reciprocated, no one-way friends.
We can represent this input as a matrix, where the row and column indicate who is speaking, and the value represents number of texts:
| | Person 1 | Person 2 | Person 3 | … |
|---|---|---|---|---|
| Person 1 | 0 | 6 | 1 | |
| Person 2 | 6 | 0 | 11 | |
| Person 3 | 1 | 11 | 0 | |
| … | | | | |
It may be tempting to apply a cutoff here. For example, if person 1 and 2 text more than X times, we’ll assume they’re friends. However, this doesn’t easily let us represent uncertainty: if the number of texts is close to X, how do we represent how sure we are that they might be friends? Even for values much greater or lower than X, how do we represent our confidence that we haven’t found two non-friends who text a surprising amount, or two friends who text surprisingly infrequently? Instead, we’ll use a slightly more sophisticated approach that lends itself to probability.
We will assume that friends text one another at an unknown rate, and that non-friends text one another at a lower unknown rate. This is a big assumption, and we’ll revisit it later, but for now take it as a given.
We can represent the two texting rates using Poisson distributions. This allows us to ask “what’s the probability of seeing k events (texts), given an underlying rate at which events occur?” The math for this looks like:

$$P(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
We can use this building block to ask a more useful question: given that we did see k texts, is it more likely that these texts came from a distribution of friends texting, or a distribution of non-friends texting?
This is equivalent to asking “what is the probability that person A and B are friends, given the number of texts sent between them?” So, all we need to do now is run through every pair of people, and calculate the probability that they’re friends!
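As a sketch, this per-pair calculation is just Bayes’ rule over the two Poisson hypotheses. The parameter values below (10 texts for friends, 2 for non-friends, a 20% prior chance of friendship) are the values our optimization converges to later in the post:

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of observing k events under a Poisson with the given rate."""
    return rate**k * exp(-rate) / factorial(k)

def friend_probability(k, friend_rate, nonfriend_rate, rho):
    """P(friends | k texts), by Bayes' rule over the two Poisson hypotheses.

    rho is the prior probability that any two people are friends.
    """
    p_friend = rho * poisson_pmf(k, friend_rate)
    p_nonfriend = (1 - rho) * poisson_pmf(k, nonfriend_rate)
    return p_friend / (p_friend + p_nonfriend)

# Six texts in a week is ambiguous evidence; twelve is strong evidence of friendship
print(friend_probability(6, friend_rate=10, nonfriend_rate=2, rho=0.2))
print(friend_probability(12, friend_rate=10, nonfriend_rate=2, rho=0.2))
```

Note how the prior matters: with only a 20% base rate of friendship, a borderline text count still leaves us genuinely uncertain rather than defaulting to “friends.”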
There’s just one problem: We have no idea what the two texting rates are. To estimate them, we’ll need to add a new level of complexity.
When determining our friends texting rate and non-friends texting rate, it would be very helpful if we knew the probability that two people are friends. For example, if 80% of all possible friendships exist, then we know most logs of texts represent the logs of friends texting, and only about the lowest 20% of text counts are likely to represent non-friends.
This sounds like it’s making things worse: now we have a third unknown variable, the likelihood of friendship, which we also don’t know the value of! In reality, it will make the problem much easier to solve.
Let’s make a second huge starting assumption: There is an equal likelihood that any two randomly chosen people in the group will be friends. This is generally not true in social graphs - highly charismatic, popular people usually have far more friends, so the probability of friendship is not at all equal - but it will make the math simpler, and it’s not a terrible assumption with only 20 people in our social graph.
We can represent this probability as follows, writing ρ for the probability of friendship and F_ij = 1 when persons i and j are friends:

$$P(F_{ij} = 1) = \rho$$

$$P(F \mid \rho) = \prod_{i<j} \rho^{F_{ij}} (1-\rho)^{1-F_{ij}}$$

To re-iterate the second line, the probability of any given friendship network F is equal to the probability of each friendship in the network existing, times the probability of each non-friendship not existing. In other words, if our friendship probability is 0.8, then about 80% of all possible friendships should exist, and if we propose a friendship network with only five friendships, the above math will tell us that the scenario is highly unlikely.
It’s important to note that this network model represents our prior assumption about the underlying friendship network, but doesn’t lock us in: given enough evidence (text messages) we will override this prior assumption, and add friendship edges even if they are unlikely under a random network.
Next, let’s approach the original problem backwards: given a friendship network F, what’s the probability that we’d get the text logs we’ve received?
That is, for each friendship that does exist, get the probability of seeing our text observations from the friends texting distribution, and for each friendship that does not exist, get the probability of our text observations from the non-friends texting distribution. Multiply all those probabilities together, and you have the probability of seeing our full set of logs.
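A Python sketch of that likelihood, working in log-probabilities so that multiplying hundreds of small probabilities doesn’t underflow (the matrix layout here is an assumption, not the post’s actual data format):

```python
from math import factorial, log

def log_likelihood(texts, F, friend_rate, nonfriend_rate):
    """Log-probability of the observed text counts given a friendship network.

    texts[i][j] is the number of texts between persons i and j; F[i][j] is 1
    if they are friends in the proposed network, else 0.
    """
    def log_poisson(k, rate):
        return k * log(rate) - rate - log(factorial(k))

    total = 0.0
    n = len(texts)
    for i in range(n):
        for j in range(i + 1, n):
            rate = friend_rate if F[i][j] else nonfriend_rate
            total += log_poisson(texts[i][j], rate)
    return total
```

A network that labels the heavy-texting pairs as friends will score a higher log-likelihood than one that labels them as strangers, which is exactly the signal we want to maximize.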
We can combine the above pieces and solve for the most likely values of our friends texting rate, our non-friends texting rate, and our friendship probability. I will not include the details in this post, because it’s about five pages of calculus and partial derivatives. The high-level idea is that we take the probability of a friendship network given observed texts and parameters, and take the partial derivative with respect to one of those parameters. We multiply this across the distribution of all possible friendship networks, weighted by the probability of each network occurring. We set the entire mess equal to zero and solve for our parameter of interest. When the derivative of a function is zero, it’s at either a local minimum or maximum, and for out-of-scope reasons we know that in this context it yields the global maximum. Ultimately, this gives us the most likely value of a parameter, given the probability that each pair of people are friends:
Where n is our number of participants, in this case 20.
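For reference, the standard maximum-likelihood updates for this two-rate Poisson model work out to weighted averages (my derivation’s notation may differ cosmetically): writing q_ij for the probability that persons i and j are friends, k_ij for the number of texts between them, λ and μ for the friend and non-friend texting rates, and ρ for the friendship probability,

$$\rho = \frac{2}{n(n-1)} \sum_{i<j} q_{ij}, \qquad \lambda = \frac{\sum_{i<j} q_{ij}\, k_{ij}}{\sum_{i<j} q_{ij}}, \qquad \mu = \frac{\sum_{i<j} (1 - q_{ij})\, k_{ij}}{\sum_{i<j} (1 - q_{ij})}$$

Each rate is an average of the observed text counts, weighted by how likely each pair is to belong to that distribution, and ρ is simply the average friendship probability across all pairs.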
But wait! Didn’t the probability that two people are friends depend on the texting rates? How can we solve for the most likely texting rate, based off of the texting rates? We’ll do it recursively:

1. Start with arbitrary guesses as to the values of our two texting rates and the friendship probability
2. Calculate the probability that each pair of people are friends, based on our three parameters
3. Calculate the most likely values of our three parameters, given the above friendship probabilities
4. Loop between steps 2 and 3 until our three parameters converge
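That loop, condensed into a sketch (with a tiny synthetic text-count matrix standing in for the real logs, and the standard expectation-maximization updates for this two-rate model):

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    return rate**k * exp(-rate) / factorial(k)

def em(texts, friend_rate=5.0, nonfriend_rate=1.0, rho=0.5, steps=50):
    """Alternate between estimating per-pair friendship probabilities (E-step)
    and re-fitting the three parameters from those probabilities (M-step)."""
    n = len(texts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for _ in range(steps):
        # E-step: posterior probability that each pair are friends
        q = {}
        for i, j in pairs:
            k = texts[i][j]
            pf = rho * poisson_pmf(k, friend_rate)
            pn = (1 - rho) * poisson_pmf(k, nonfriend_rate)
            q[i, j] = pf / (pf + pn)
        # M-step: most likely parameters given those probabilities
        total_q = sum(q.values())
        rho = total_q / len(pairs)
        friend_rate = sum(q[i, j] * texts[i][j] for i, j in pairs) / total_q
        nonfriend_rate = sum((1 - q[i, j]) * texts[i][j]
                             for i, j in pairs) / (len(pairs) - total_q)
    return friend_rate, nonfriend_rate, rho, q

# Four people: pairs (0,1) and (2,3) text heavily, everyone else rarely
texts = [[0, 12, 2, 1],
         [12, 0, 3, 2],
         [2, 3, 0, 9],
         [1, 2, 9, 0]]
friend_rate, nonfriend_rate, rho, q = em(texts)
```

One caveat: the two initial rate guesses must differ, or the E-step assigns every pair the same probability and the loop never breaks the symmetry.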
One quick python script to parse the log files and run our four math equations, and:
We’ve got an answer! An average of 10 texts per week among friends, 2 among non-friends, and a 20% chance that any two people will be friends. The parameters converge after only 25 steps or so, making this a quick computational optimization.
With our three parameters we can calculate the likelihood that any two individuals are friends based on observed texts, and plot those likelihoods graphically:
This is our “answer”, but it’s not easy to understand in its current form. We’d prefer to render this as a friendship network, where nodes represent people, with an edge between every two people who are friends. How do we translate from this probability matrix to a network? Here, it’s a little more appropriate to apply cutoffs: We can plot all friendships we’re at least 70% confident exist, 90%, and 98%:
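The cutoff step itself is simple; here is a sketch with a hypothetical 3-person probability matrix:

```python
def edges_at_cutoff(prob, cutoff):
    """Return the pairs whose friendship probability meets the cutoff."""
    n = len(prob)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if prob[i][j] >= cutoff]

# Hypothetical symmetric matrix of friendship probabilities
prob = [
    [0.00, 0.95, 0.40],
    [0.95, 0.00, 0.75],
    [0.40, 0.75, 0.00],
]
for cutoff in (0.7, 0.9, 0.98):
    print(cutoff, edges_at_cutoff(prob, cutoff))
```

Raising the cutoff prunes edges, so the 98% network is a subset of the 90% network, which is a subset of the 70% network.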
Unconvinced by the “detecting friendships from text messages” example? Let’s apply the exact same code to a similar problem with better defined ground truth: predicting the political party of senators.
We can take data on the voting records for session 1 of the 2021 U.S. Senate. For every pair of senators, we can count the number of times they voted the same way on bills (both voted “yea”, or both voted “nay”). We will assume that senators in the same political party vote the same way at one rate, and senators in different political parties vote together at a lower rate. The signal will be noisy because some senators are absent for some bills, or vote against party lines because of local state politics, occasional flickers of morality, etc.
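Building the input matrix for this problem is a small preprocessing step; a sketch, assuming votes are stored per senator as `'yea'`/`'nay'`/`None` (with `None` for absences):

```python
def agreement_counts(votes):
    """Count, for each pair of senators, the bills where both cast the
    same recorded vote. Absences (None) never count as agreement."""
    n = len(votes)
    counts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = sum(1 for a, b in zip(votes[i], votes[j])
                       if a is not None and a == b)
            counts[i][j] = counts[j][i] = same
    return counts
```

The resulting matrix plays exactly the role the text-count matrix did before, so the same optimization code runs unchanged on it.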
As in the texting problem, we’ll place an edge between two senators if we believe there is a high chance they are in the same party. We can also anticipate the value of rho: since the senate is roughly split between Democrats and Republicans, there should be close to a 50% chance that two randomly chosen senators will be in the same party. (This is an even better environment for our “random network” assumption than the texting friendship network, since all senators will have close to the same degree.)
First, the optimization:
As expected, the probability of any two senators being in the same party is 0.494 - close to 50%. Members of the same party agree on bills roughly 442 times per senate session, while members of opposing parties agree roughly 111 times. And now, the resulting network:
Nodes are colored according to their declared party (red for Republicans, blue for Democrats, green for independents) to confirm the network clusters we’ve found. I’ve also removed the Vice President (who only votes to break ties), and a senator that held a partial term after their predecessor resigned. Since both the Vice President and the short-term senator voted in far fewer bills than their peers, there were no edges between them and other senators.
The remaining results are accurate! Not counting the people I removed because of insufficient voting history, the algorithm struggled to classify two senators, who are shown in the center. These two senators, Collins (R-ME) and Murkowski (R-AK), are considered some of the most moderate Republicans in the senate, and are swing votes. All other senators are clearly placed in the correct cluster with their peers.
We’ve created a model for detecting social relationships from event data. We assume that events occur at a higher rate between people with a social relationship, and a lower rate between people without a relationship. This is general enough to describe a wide variety of scenarios.
But our model also assumes consistency: what if our assumption of a single friend texting rate and a single non-friend texting rate didn’t hold? For example, in a larger social graph some people may not text much. They still text friends more often than non-friends, but both rates are much lower than their peers’. Our current model would mark these less-active users as friendless, and create no edges to them. We could extend our model by switching from two global texting rates to two individual rates per person, but then we’d have 2N+1 variables to optimize instead of 3, and would need much more training data.
We also assumed that the underlying social network was a simple random graph: Every two members have an equal chance of being connected to one another. That assumption is appropriate for our senate network, where we’re trying to ascertain group membership. It works relatively well in the texting network because the population is very small. In many scenarios, however, we expect member degree to follow a power law, where a small number of participants are far more connected than others. We could switch our network model from random to many varieties of exponential or scale-free networks, but this will complicate the math and likely add more parameters to tune.
My main takeaway from this is the need to understand assumptions made by the model, which dictate where it can be meaningfully applied, and where it will produce deeply flawed results.