Posted 08/08/2023
I’ve submitted a new paper! Here’s the not-peer-reviewed pre-print. This post will discuss my work for non-network-scientist audiences.
There is broad disillusionment regarding the influence major tech companies have over our online interactions. Social media is largely governed by Meta (Facebook, Instagram, Whatsapp), Google (YouTube), and Twitter. In specific sub-communities, like open source software development, a single company like GitHub (owned by Microsoft) may have near-monopolistic control over online human collaboration. These companies define both the technology we use to communicate, and thereby the actions we can take to interact with one another, and the administrative policies regarding what actions and content are permissible on each platform.
In addition to debates over civic responsibility and regulation of online platforms, pushback to the centralized influence of these companies has taken two practical forms:
Alt-Tech. Communities that are excluded from mainstream platforms, often right-wing hate and conspiracy groups, have built an ecosystem of alternative platforms that mirrors their mainstream counterparts, but with administrations more supportive of their political objectives. These include Voat and the .Win-network (now defunct Reddit-clones), BitChute and Parler (YouTube-clones), Gab (Twitter-clone), and many others.
The Decentralized Web. Developers concerned about centralized control of content have built a number of decentralized platforms that aim to limit the control a single entity can have over human communication. These efforts include Mastodon, a Twitter alternative consisting of federated Twitter-like subcommunities, and ad-hoc communities like a loose network of self-hosted git servers. The decentralized web also encompasses much older decentralized networks like Usenet and email, and bears similarity to the motivations behind some Web3 technologies.
It is this second category, of ostensibly self-governed online communities, that interests me. Building a community-run platform is a laudable goal, but does the implementation of Mastodon and similar platforms fall short of those aspirations? How do we measure how ‘decentralized’ a platform is, or inversely, how much influence an oligarchy has over a platform?
One common approach to measuring social influence is to examine population size. The largest three Mastodon instances host over half of the entire Mastodon population. Therefore, the administrators of those three instances have disproportionate influence over permitted speech and users on Mastodon as a whole. Users who disagree with their decisions are free to make their own Mastodon instances, but if the operators of the big three instances refuse to connect to yours then half the Mastodon population will never see your posts.
A size disparity in community population is inevitable without intervention. Social networks follow rich-get-richer dynamics: new users are likely to join an existing vibrant community rather than a fledgling one, increasing its population and making it more appealing to future users. This is fundamentally a social pressure, but it is even further amplified by search engines, which are more likely to return links to larger and more active sites, funneling potential users towards the largest communities.
But is size disparity necessarily a failure of decentralization? Proponents of Mastodon have emphasized the importance of small communities that fit the needs of their members, and the Mastodon developers have stated that most Mastodon instances are small, topic-specific communities, with their mastodon.social
as a notable exception. If smaller communities operate comfortably under the shadow of larger ones, perhaps this is a healthy example of decentralized governance.
Before exploring alternative methods for measuring social centralization, let’s compare a few of these decentralized and alt-tech platforms using the lens of community size. Below is a plot of sub-community population sizes for five platforms.
The y-axis represents the population of each community as a fraction of the largest community’s size. In other words, the largest community on each platform has a size of “1”, while a community with a tenth as many users has a size of “0.1”. The x-axis is what fraction of communities have at least that large a population. This allows us to quickly show that about 2% of Mastodon instances are least 1% the size of the largest instance, or alternatively, 98% of Mastodon instances have fewer than 1% as many users as the largest instance.
This puts Mastodon in similar territory as two centralized platforms, BitChute and Voat. Specifically, the number of commenters on BitChute channels follows a similar distribution to Mastodon instance sizes, while the distribution of Voat “subverse” (analogous to “subreddits”) populations is even more skewed.
By contrast, the number of users on self-hosted Git servers (the Penumbra of Open-Source), and unique authors on Polish Usenet newsgroups, is far more equitable: around a third of git servers have at least 1% as many users as the largest, while the majority of newsgroups are within 1% of the largest.
If smaller communities exist largely independently of larger ones, then the actions of administrators on those large communities does not matter to the small community, and even in the face of a large population disparity a platform can be effectively decentralized. How can we measure this notion of “independence” in a platform-agnostic way such that we can compare across platforms?
Each of the five platforms examined above has some notion of cross-community activity. On Mastodon, users can follow other users on both their own instance and external instances. On the other four platforms, users can directly participate in multiple communities, by contributing to open source projects on multiple servers (Penumbra), or commenting on multiple channels (BitChute), subverses (Voat), or newsgroups (Usenet).
In network science terminology, we can create a bipartite graph, or a graph with two types of vertices: one for communities, and one for users. Edges between users and communities indicate that a user interacts with that community. For example, here’s a diagram of Mastodon relationships, where an edge of ‘3’ indicates that a user follows three accounts on a particular instance:
This allows us to simulate the disruption caused by removing an instance: if mastodon.social
went offline tomorrow, how many follow relationships from users on kolektiva.social
and scholar.social
would be disrupted? More globally, what percentage of all follow relationships by remaining users have just been pruned? If the disruption percentage is high, then lots of information flowed from the larger community to the smaller communities. Conversely, if the disruption percentage is low, then users of the smaller communities are largely unaffected.
Here is just such a plot, simulating removing the largest community from each platform, then the two largest, three largest, etcetera:
From this perspective on inter-community relationships, each platform looks a little different. Removing the largest three Mastodon instances has a severe effect on the remaining population, but removing further communities has a rapidly diminished effect. Removing Usenet newsgroups and BitChute channels has a similar pattern, but less pronounced.
Voat and the Penumbra require additional explanation. Voat, like Reddit, allowed users to subscribe to “subverses” to see posts from those communities on the front page of the site. New users were subscribed to a set of 27 subverses by default. While the two largest subverses by population (QRV
and 8chan
) were topic-specific and non-default, the third largest subverse, news
, was a default subverse with broad appeal and high overlap with all other communities. Therefore, removing the largest two communities would have had little impact on users uninvolved in QAnon discussions, but removing news
would impact almost every user on the site and cuts nearly 10% of interactions site-wide.
The Penumbra consists of independently operated git servers, only implicitly affiliated in that some developers contributed to projects hosted on multiple servers. Since servers are largely insular, most developers only contribute to projects on one, and so those developers are removed entirely along with the git server. If a user contributed to projects hosted on two servers then disruption will increase when the first server is removed, but will decrease when the second server is removed, and the developer along with it. This is shown as spiky oscillations, where one popular git server is removed and drives up disruption, before another overlapping git server is removed and severs the other side of those collaborations.
Sometimes you may be uninterested in the impact of removing the largest 2, 3, or 10 instances, and want a simple summary statistic for whether one platform is “more centralized” than another. One way to approximate this is to calculate the area under the curve for each of the above curves:
This scores Mastodon as the most centralized, because removing its largest instances has such a large effect on its peers. By contrast, while the Voat curve is visually striking, it’s such a sharp increase because removing the largest two communities doesn’t have a large impact on the population.
“Centralization” is an ill-defined term, and network scientists have a range of ways of measuring centralization for different scenarios. These metrics fall into three broad categories:
Scale | Description | Examples |
---|---|---|
Vertex | Measures how central a role a single node plays in the network | Betweenness centrality, Eigenvector centrality |
Cluster | Measures aspects of a particular group of vertices | Assortativity / Homophily, Modularity, Insularity / Border index |
Graph | A summary attribute of an entire graph | Diameter, Density, Cheeger numer |
These metrics can capture aspects of centrality like “this vertex is an important bridge connecting two regions of a graph” or “this vertex is an important hub because many shortest paths between vertices pass through it.” They can measure how tight a bottleneck a graph contains (or, phrased another way, how well a graph can be partitioned in two), they can measure how much more likely similar vertices are to connect with one another, or how skewed the degree distribution of a graph is.
However, these metrics are mostly intended for fully connected unipartite graphs, and do not always have clear parallels in disconnected or bipartite graphs. Consider the following examples:
Many would intuitively agree that the left-most graph is central: one community in the center is larger than the rest, and serves as a bridge connecting several other communities together. By contrast, the middle graph is decentralized, because while the communities aren’t all the same size, none are dramatically larger than one another, and none serve a critical structural role as a hub or bridge.
The graph on the right is harder to describe. One community is much larger than its peers, but the remaining graph is identical to the decentralized example. By degree distribution, the graph would appear to be centralized. If we add a single edge connecting the giant community to any user in the main graph, then the giant community’s betweenness centrality score would skyrocket because of its prominent role in so many shortest-paths between users. However, it would still be inappropriate to say that the largest community plays a pivotal role in the activity of the users in the rest of the graph - it’s hardly connected at all!
My disruption metric is a cluster-level or mesoscale measurement for bipartite graphs that measures the influence of each community on its peers, although you can calculate the area under the disruption curve to make a graph-scale summary statistic. Using this approach, the centralized community is decidedly centralized, and the decentralized and ambiguous graphs are decidedly not.
Community size disparity is natural. Some communities will have broader appeal, and benefit from more from rich-get-richer effects than their smaller, more focused peers. Therefore, even a thriving decentralized platform may have a highly skewed population distribution. To measure the influence of oligarchies on a platform, we need a more nuanced view of interconnection and information flow between communities.
I have introduced a ‘disruption’ metric that accounts for both the size of a community and its structural role in the rest of the graph, measuring its potential influence on its peers. While the disruption metric illustrates how population distributions can be deceptive, it is only a preliminary measurement. Follows across communities and co-participation in communities are a rough proxy for information flow, or a network of potential information flow. A more precise metric for observed information flow might measure the number of messages that are boosted (“retweeted”) from one Mastodon instance to another, or might measure how frequently a new discussion topic, term, or URL appears first in one community, and later appears in a “downstream” community.
Does population size correlate with these measurements of information flow and influence? Are some smaller communities more influential than their size would suggest? How much does the graph structure of potential information flow predict ‘social decentralization’ in practice? There are many more questions to explore in this domain - but this is a start!
]]>Posted 08/07/2023
Ant nests look kind of like networks - they have rooms, and tunnels between the rooms, analogous to vertices and edges on a graph. A graph representation of a nest might help us answer questions about different ant species like:
Do some species create more rooms than others?
Do some species have different room layouts, such as a star with a central room, or a main corridor rooms sprout off of, closer to a random network, or something like a small-world network?
Do some species dig their rooms deeper, perhaps to better insulate from cold weather, or with additional ‘U’ shaped bends to limit flooding in wetter climates?
I’m no entomologist, and I will not answer those questions today. I will however, start work on a tool that can take photos of ant farms and produce corresponding network diagrams. I don’t expect this tool to be practical for real world research: ant farms are constrained to two dimensions, while ants in the wild will dig in three, and this tool may miss critical information like the shapes of rooms. But it will be a fun exercise, and maybe it will inspire something better.
We’ll start with a photo of an ant farm, cropped to only include the dirt:
I want to reduce this image to a Boolean map of where the ants have and have not excavated. For a first step, I’ll flatten it to black and white, adjusting brightness and contrast to try to mark the tunnels as black, and the remaining dirt as white. Fortunately, ImageMagick makes this relatively easy:
convert -white-threshold 25% -brightness-contrast 30x100 -alpha off -threshold 50% original.png processed.png
Clearly this is a noisy representation. Some flecks of dirt are dark enough to flatten to ‘black,’ and the ants have left some debris in their tunnels that appear ‘white.’ The background color behind the ant farm is white, so some regions that are particularly well excavated appear bright instead of dark. We might be able to improve that last problem by coloring each pixel according to its distance from either extreme, so that dark tunnels and bright backgrounds are set to ‘black’ and the medium brown dirt is set to ‘white’ - but that’s more involved, and we’ll return to that optimization later if necessary.
In broad strokes, we’ve set excavated space to black and dirt to white. If we aggregate over regions of the image, maybe we can compensate for the noise.
My first thought was to overlay a square graph on the image. For each, say, 10x10 pixel region of the image, count the number of black pixels, and if they’re above a cutoff threshold then set the whole square to black, otherwise set it to white. This moves us from a messy image representation to a simpler tile representation, like a board game.
There are a few problems with this approach. Looking ahead, I want to identify rooms and tunnels based on clumps of adjacent black tiles. A square has only four neighbors - eight if we count diagonals, but diagonally-adjacent tiles don’t necessarily imply that the ants have dug a tunnel between the two spaces. So, we’ll use hexagons instead of squares: six-neighbors, no awkwardness about ‘corners,’ and we can still construct a regular lattice:
So far so good! A hexagonal coordinate system is a little different from a Cartesian grid, but fortunately I’ve worked with cube coordinates before. For simplicity, we’ll set the diameter of a hexagon to the diameter of a tunnel. This should help us distinguish between tunnels and rooms later on, because tunnels will be around one tile wide, while rooms will be much wider.
Unfortunately, a second problem still remains: there’s no good threshold for how many black pixels should be inside a hexagon before we set it to black. A hexagon smack in the middle of a tunnel should contain mostly black pixels. But what if the hexagons aren’t centered? In a worst-case scenario a tunnel will pass right between two hexagons, leaving them both with half as many black pixels. If we set the threshold too tight then we’ll set both tiles to white and lose a tunnel. If we set the threshold too loose then we’ll set both tiles to black and make a tunnel look twice as wide as is appropriate - perhaps conflating some tunnels with rooms.
So, I’m going to try dithering! This is a type of error propagation used in digital signal processing, typically in situations like converting color images to black and white. In our case, tiles close to white will still be set to white, and tiles close to black will still be darkened to black - but in an ambiguous case where two adjoining tiles are not-quite-dark-enough to be black, we’ll round one tile to white, and the other to black. The result is mostly okay:
We’re missing some of the regions in the upper right that the ants excavated so completely that the white background shone through. We’re also missing about two hexagons needed to connect the rooms and tunnels on the center-left with the rest of the nest. We might be able to correct both these issues by coloring pixels according to contrast and more carefully calibrating the dithering process, but we’ll circle back to that later.
So far we’ve reduced a messy color photograph to a much simpler black-and-white tile board, but we still need to identify rooms, junctions, and tunnels. I’m going to approach this with a depth first search:
Define a global set of explored tiles, and a local set of “neighborhood” tiles. Select an unexplored tile at random as a starting point.
Mark the current tile as explored, add it to the neighborhood, and make a list of unexplored neighbors
If the list is longer than three, recursively explore each neighbor starting at step 2
Once there are no more neighbors to explore, mark the neighborhood as a “room” if it contains at least ten tiles, and a “junction” if it contains at least four. Otherwise, the neighborhood is part of a tunnel, and should be discarded.
If any unexplored tiles remain, select one and go to step 2
Once all tiles have been explored, we have a list of “rooms” and a list of “junctions,” each of which are themselves lists of tiles. We can visualize this by painting the rooms blue and the junctions red:
Looking good so far!
We’re most of the way to a graph representation. We need to create a ‘vertex’ for each room or junction, with a size proportional to the number of tiles in the room, and a position based on the ‘center’ of the tiles.
Then we need to add edges. For this we’ll return to a depth-first flood fill algorithm. This time, however, we’ll recursively explore all tiles adjacent to a room that aren’t part of another room or junction, to see which other vertices are reachable. This won’t preserve the shape, length, or width of a tunnel, but it will identify which areas of the nest are reachable from which others:
And there we have it!
We’ve gone from a color photo of an ant farm to a network diagram, all using simple algorithms, no fancy machine learning. I think we have a decent result for a clumsy first attempt!
There are many caveats. We’re missing some excavated spaces because of the wall color behind the ant farm in our sample photo. The dithering needs finer calibration to identify some of the smaller tunnels. Most importantly, an enormous number of details need to be calibrated for each ant farm photo. The brightness and contrast adjustments and noise reduction, the hexagon size, the dithering thresholds, and the room and junction sizes for flood filling, may all vary between each colony photo.
For all these reasons, I’m pausing here. If I think of a good way to auto-calibrate those parameters and improve on the image flattening and dithering steps, then maybe I’ll write a part two. Otherwise, this has progressed beyond a couple-evening toy project, so I’ll leave the code as-is.
]]>Posted 7/28/2023
A recent research paper, Freaky Leaky SMS: Extracting User Locations by Analyzing SMS Timings (PDF), purports to geolocate phone numbers by texting them and analyzing response times. This is creepy, interesting, and hopefully a warning that can perhaps help phone companies to better protect their customers’ privacy in the future. Today I’m writing up a short summary, context, and some of my thoughts about the study. The original paper is intended for computer security and machine learning scientists, but I intend to write for a broader audience in this post.
When Alice sends Bob a text message, Bob’s phone sends back an acknowledgement automatically - “I received your text!” If Alice’s phone doesn’t receive that acknowledgement before a timeout, Alice gets a “Failed to deliver text” error.
If Alice is standing next to Bob in Chicago, that text should be delivered quickly, and the acknowledgement should arrive almost instantly. If Alice is in Chicago and Bob is in Hong Kong, it should take slightly longer for the round-trip text message and acknowledgement.
So, if the delay before a text acknowledgement correlates with the distance between the phones, can we text Bob from three different phones, and by analyzing the delays, triangulate his position? What level of precision can we obtain when tracking Bob in this way?
In reality, text message delays will be messy. If Alice’s texts travel through a telecommunications hub in Chicago, then there may be a delay related to the amount of congestion on that hub. If there are multiple paths between Alice and Bob across telecommunications equipment, then each path may incur a different delay. Finally, the routes of telecommunications equipment may not take birds-eye-view shortest paths between locations. For example, if Alice and Bob are on opposite sides of a mountain range, the phone switches connecting them may divert around the mountains or through a pass, rather than directly over.
However, “messy” does not mean random or uncorrelated. If we text Bob enough times from enough phones, and apply some kind of noise reduction (maybe taking the median delay from each test-phone?), we may be able to overcome these barriers and roughly identify Bob’s location.
The researchers set up a controlled experiment: they select 34 locations across Europe, the United States, and the United Arab Emirates, and place a phone at each. They assign three of these locations as “senders” and all 34 as “receivers.”
To gather training data, they send around 155K text messages, in short bursts every hour over the course of three days. This provides a baseline of round-trip texting time from the three senders to the 34 receivers during every time of day (and therefore, hopefully, across a variety of network congestion levels).
For testing, the researchers can text a phone number from their three senders, compare the acknowledgement times to their training data, and predict which of the 34 locations a target phone is at. The researchers compare the test and training data using a ‘multilayer perceptron’, but the specific machine learning model isn’t critical here. I’m curious whether a much simpler method, like k-nearest-neighbors or a decision-tree, might perform adequately, but that’s a side tangent.
The heart of the research paper consists of two results, in sections 5.1 and 5.2. First, they try to distinguish whether a target is ‘domestic’ or ‘abroad.’ For example, the sensors in the UAE can tell whether a phone number is also at one of the locations in the UAE with 96% accuracy. This is analogous to our starting example of distinguishing between a Chicago-Chicago text and a Chicago-Hong-Kong text, and is relatively easy, but a good baseline. They try distinguishing ‘domestic’ and ‘abroad’ phones from a variety of locations, and retain high accuracy so long as the two countries are far apart. Accuracy drops to between 75 and 62% accuracy when both the sensor and target are in nearby European countries, where timing differences will be much smaller. Still better than random guessing, but no longer extremely reliable.
Next, the researchers pivot to distinguishing between multiple target locations in a single country - more challenging both because the response times will be much closer, and because they must now predict from among four or more options rather than a simple “domestic” and “abroad”. Accuracy varies between countries and the distances between target locations, but generally, the technique ranges between 63% and 98% accurate.
The rest of the paper has some auxiliary results, like how stable the classifier accuracy is over time as congestion patterns change, how different phones have slightly different SMS acknowledgement delays, and how well the classifier functions if the target individual travels between locations. There’s also some good discussion on the cause of errors in the classifier, and comparisons to other types of SMS attacks.
These results are impressive, but it’s important to remember that they are distinguishing only between subsets of 34 predefined locations. This study is a far cry from “enter any phone number and get a latitude and longitude,” but clearly there’s a lot of signal in the SMS acknowledgement delay times.
So what can be done to fix this privacy leak? Unfortunately, I don’t see any easy answers. Phones must return SMS acknowledgements, or we’d never know if a text message was delivered successfully. Without acknowledgements, if someone’s phone battery dies, or they put it in airplane mode, or lose service while driving through a tunnel, text messages to them would disappear into the void.
Phones could add a random delay before sending an acknowledgement - or the telecommunications provider could add such a delay on their end. This seems appealing, but the delay would have to be short - wait too long to send an acknowledgement, and the other phones will time out and report that the text failed to deliver. If you add a short delay, chosen from, say, a uniform or normal distribution, then sending several texts and taking the median delay will ‘de-noise’ the acknowledgement time.
Right now there are two prominent “defenses” against this kind of attack. The first is that it’s a complicated mess to pull off. To generalize from the controlled test in the paper to finding the geolocation of any phone would require more ‘sending’ phones, lots more receiving phones for calibration, and a ton of training data, not to mention a data scientist to build a classifier around that data. The second is that the attack is “loud:” texting a target repeatedly to measure response times will bombard them with text messages. This doesn’t prevent the attack from functioning, but at least the victim receives some indication that something weird is happening to them. There is a type of diagnostic SMS ping called a silent SMS that does not notify the user, but these diagnostic messages can only be sent by a phone company, and are intended for things like confirming reception between a cell phone and tower.
Overall, a great paper on a disturbing topic. I often find side-channel timing attacks intriguing; the researchers haven’t identified a ‘bug’ exactly, the phone network is functioning exactly as intended, but this is a highly undesired consequence of acknowledgement messages, and a perhaps unavoidable information leak if we’re going to provide acknowledgement at all.
]]>Posted 7/15/2023
A recent excellent paper performs a sophisticated natural language processing task, usually solved using complicated deep-learning neural networks, using a shockingly simple algorithm and gzip. This post will contextualize and explain that paper for non-computer scientists, or for those who do not follow news in NLP and machine learning.
Text Classification is a common task in natural language processing (NLP). Here’s an example setting:
Provided are several thousand example questions from Yahoo! Answers, pre-categorized into bins like ‘science questions,’ ‘health questions,’ and ‘history questions.’ Now, given an arbitrary new question from Yahoo! Answers, which category does it belong in?
This kind of categorization is easy for humans, and traditionally much more challenging for computers. NLP researchers have spent many years working on variations of this problem, and regularly host text classification competitions at NLP conferences. There are a few broad strategies to solving such a task.
One of the oldest computational tools for analyzing text is the Bag of Words model, which dates back to the 1950s. In this approach, we typically discard all punctuation and capitalization and common “stop words” like “the,” “a,” and “is” that convey only structural information. Then we count the number of unique words in a sample of text, and how many times each occur, then normalize by the total number of words.
For example, we may take the sentence “One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them” and reduce it to a bag of words:
{
'one': 0.27,
'ring': 0.27,
'rule': 0.09,
'find': 0.09,
'bring': 0.09,
'darkness': 0.09,
'bind': 0.09
}
We could then take another passage of text, reduce it to a bag of words, and compare the bags to see how similar the word distributions are, or whether certain words have much more prominence in one bag than another. There are many tools for performing this kind of distribution comparison, and many ways of handling awkward edge cases like words that only appear in one bag.
The limitations of bags of words are obvious - we’re destroying all the context! Language is much more than just a list of words and how often they appear: the order of words, and their co-occurrence, conveys lots of information, and even structural elements like stop words and punctuation convey some information, or we wouldn’t use them. A bag of words distills language down to something that basic statistics can wrestle with, but in so doing boils away much of the humanity.
Natural Language Processing has moved away from bags of words in favor of word embeddings. The goal here is to capture exactly that context of word co-occurrence that a bag of words destroys. For a simple example, let’s start with Asimov’s laws of robotics:
A robot may not injure a human being or, through inaction, allow a human being to come to harm. A robot must obey orders given it by human beings except where such orders would conflict with the First Law. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law
Removing punctuation and lower-casing all terms, we could construct a window of size two, encompassing the two words before and after each term as context:
Term | Context |
---|---|
robot | a, may, not, harm, must, obey, law, protect |
human | injure, a, being, or, allow, to, it, by, except |
orders | must, obey, given, it, where, such, would, conflict |
… | … |
This gives us a small amount of context for each term. For example, we know that “orders” are things that can be “obeyed,” “given,” and may “conflict.” You can imagine that if we used a larger corpus for training, such as the complete text of English Wikipedia, we would get a lot more context for each word, and a much better sense of how frequently words appear in conjunction with one another.
Now let’s think of each word as a point in space. The word “robot” should appear in space close to other words that it frequently appears near, such as “harm,” “obey,” and “protect,” and should appear far away from words it never co-occurs with, such as “watermelon.” Implicitly, this means that “robot” will also appear relatively close to other words that share the same context - for example, while “robot” does not share context with “orders,” both “orders” and “robot” share context with “obey,” so the words “robot” and “orders” will not be too distant.
This mathematical space, where words are points with distance determined by co-occurrence and shared context, is called an embedding. The exact process for creating this embedding, including how many dimensions the space should use, how much context should be included, how points are initially projected into space, how words are tokenized, whether punctuation is included, and many finer details, vary between models. For more details on the training process, I recommend this Word Embedding tutorial from Dave Touretzky.
Once we have an embedding, we can ask a variety of questions, like word association: kitten is to cat as puppy is to X? Mathematically, we can draw a vector from kitten to cat, then transpose that vector to “puppy” and look for the closest point in the embedding to find “dog.” This works because “cat” and “dog” are in a similar region of the embedding, as they are both animals, and both pets. The words “kitten” and “puppy” will be close to their adult counterparts, and so also close to animal and pet associations, but will additionally be close to youth terms like “baby” and “infant”.
(Note that these embeddings can also contain undesired metadata: for example, “doctor” may be more closely associated with “man” than “woman”, and the inverse for “nurse”, if the training data used to create the embedding contains such a gender bias. Embeddings represent word adjacency and similar use in written text, and should not be mistaken for an understanding of language or a reflection of the true nature of the world.)
In addition to describing words as points in an embedding, we can now describe documents as a series of points, or as an average of those points. Given two documents, we can now calculate the average distance from points in one document to points in another document. Returning to the original problem of text classification, we can build categories of documents as clouds of points. For each new prompt, we can calculate its distance from each category, and place it in the closest category.
These embedding techniques allow us to build software that is impressively flexible: given an embedded representation of ‘context’ we can use vectors to categorize synonyms and associations, and build machines that appear to ‘understand’ and ‘reason’ about language much more than preceding Bag of Words models, simple approaches at representing context like Markov Chains, or attempts at formally parsing language and grammar. The trade-off is that these models are immensely complicated, and require enormous volumes of training data. Contemporary models like BERT have hundreds of millions of parameters, and can only be trained by corporations with vast resources like Google and IBM.
In modern Natural Language Processing, deep neural networks using word embeddings dominate. They produce the best results in a wide variety of tasks, from text classification to translation to prediction. While variations between NLP models are significant, the general consensus is that more parameters and more training data increase performance. This focuses most of the field on enormous models built by a handful of corporations, and has turned attention away from simpler or more easily understood techniques.
Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, and Jimmy Lin, did not use a deep neural network and a large embedding space. They did not use machine learning. They used gzip.
Their approach is simple: compression algorithms, like gzip, are very good at recognizing patterns and representing them succinctly. If two pieces of text are similar, such as sharing many words, or especially entire phrases, then compressing the two pieces of text together should be quite compact. If the two pieces of text have little in common, then their gzipped representation will be less compact.
Specifically, given texts A
, B
, and C
, if A
is more similar to B
than C
, then we can usually expect:
len(gzip(A+B)) - len(gzip(A)) < len(gzip(A+C)) - len(gzip(A))
So, given a series of pre-categorized texts in a training set, and given a series of uncategorized texts in a test set, the solution is clear: compress each test text along with each training text to find the ‘distance’ between the test text and each training text. Select the k
nearest neighbors, and find the most common category among them. Report this category as the predicted category for the test text.
Their complete algorithm is a fourteen line Python script:
import gzip import numpy as np for (x1,_) in test_set: Cx1 = len(gzip.compress(x1.encode())) distance_from_x1 = [] for (x2,_) in training_set: Cx2 = len(gzip.compress(x2.encode()) x1x2 = " ".join([x1,x2]) Cx1x2 = len(gzip.compress(x1x2.encode()) ncd = (Cx1x2 - min(Cx1,Cx2)) / max (Cx1,Cx2) distance_from_x1.append(ncd) sorted_idx = np.argsort(np.array(distance_from_x1)) top_k_class = training_set[sorted_idx[:k], 1] predict_class = max(set(top_k_class), key = top_k_class.count)
Shockingly, this performs on par with most modern NLP classifiers: it performs better than many for lots of common English classification data sets, and on most data sets it performs above average. BERT has higher accuracy on every data set, but not by much. A fourteen line Python script with gzip in lieu of machine learning performs almost as well as Google’s enormous embedded deep learning neural network. (See table 3 in the original paper, page 5)
A more recent variant on the classification challenge is to classify text in a language not included in the training data. For example, if we expended enormous resources training BERT on English text, is there any way to pivot that training and apply that knowledge to Swahili? Can we use embeddings from several languages to get some general cross-language fluency at text categorization in other languages? Or, if we do need to re-train, how little training data can we get away with to re-calibrate our embeddings and function on a new language? This is unsurprisingly a very difficult task. The gzip classifier outperformed all contemporary machine learning approaches that the authors compared to. (See table 5 in the original paper, page 6)
This paper is a great reminder that more complicated tools, like ever-larger machine-learning models, are not always better. In particular, I think their approach hits upon an interesting balance regarding complexity. Bag of words models discard context and punctuation, making computation simple, but at the cost of destroying invaluable information. However, keeping all of this information in the form of an embedding, and attempting to parse human language, incurs a heavy complexity cost. There’s a lot of “fluff” in language that we do not necessarily need for classification. The gzip approach keeps the extra context of word order and punctuation, but does not try to tackle the harder problem of understanding language in order to address the simpler problem of looking for similarities. In general, tools should be as simple as possible to complete their task, but no simpler.
It appears that the authors have a made an unusual choice in their accuracy calculations, which inflate their scores compared to contemporary techniques. In summary, they use a kNN classifier with k=2
, but rather than choosing a tie-breaking metric for when the two neighbors diverge, they mark their algorithm as correct if either neighbor has the correct label. This effectively makes their accuracy a “top 2” classifier rather than a kNN classifier, which misrepresents the performance of the algorithm. This isn’t necessarily an invalid way to measure accuracy, but it does need to be documented, and isn’t what we’d expect in traditional kNN. The gzip scores under a standard k=2 kNN remain impressive for such a simple approach and are still competitive - but they’re no longer beating deep neural network classifiers for non-English news datasets (table 5).
Here’s the problem in a little more detail:
The authors compress all training texts along with the test prompt, to find the gzip distance between the prompt and each possible category example
Rather than choosing the closest example and assuming the categories match, the authors choose the k
closest examples, take the mode of their categories, and predict that. This k-nearest-neighbors (kNN) strategy is common in machine learning, and protects against outliers
When the number of categories among neighbors are tied, one must have a tie-breaking strategy. A common choice is to pick the tie that is a closer neighbor. Another choice might be to expand the neighborhood, considering one additional neighbor until the tie is broken - or inversely, to shrink the neighborhood, using a smaller k until the tie is broken. Yet another choice might be to randomly choose one of the tied categories.
The authors use k=2
, meaning that they examine the two closest neighbors, which will either be of the same category, or will be a tie. Since they will encounter many ties, their choice of tie-breaking algorithm is very important
In the event of a tie between two neighbors, the authors report success if either neighbor has the correct label
The code in question can be found here. Further analysis by someone attempting to reproduce the results of the paper can be found here and is discussed in this GitHub issue. In conversations with the author this appears to be an intentional choice - but unfortunately it’s one that makes the gzip classifier appear to outperform BERT, when in reality if it gives you two potentially correct classes and you always pick the correct choice you will outperform BERT.
I’ve heard the first author is defending their thesis tomorrow - Congratulations, good luck, hope it goes great!
]]>Posted 4/8/2023
Sticking with a theme from my last post on HyperLogLog I’m writing about more probabilistic data structures! Today: Bloom Filters.
Bloom filters track sets: you can add elements to the set, and you can ask “is this element in the set?” and you can estimate the size of the set. That’s it. So what’s the big deal? Most languages have a set
in their standard library. You can build them with hash tables or trees pretty easily.
The magic is that Bloom filters can store a set in constant space, while traditional set
data structures scale linearly with the elements they contain. You allocate some space when you create the Bloom filter - say, 8 kilobytes of cache space - and the Bloom filter will use exactly 8 kilobytes no matter how many elements you add to it or how large those elements are.
There are two glaring limitations:
You cannot enumerate a Bloom filter and ask what elements are in the set, you can only ask whether a specific element may be in the set
Bloom filters are probabilistic: they can tell you that an element is not in the set with certainty (no false negatives), but they can only tell you that an element may be in the set, with uncertainty
When creating a Bloom filter, you tune two knobs that adjust their computational complexity and storage requirements, which in turn control their accuracy and the maximum number of unique elements they can track.
Why would we want a non-deterministic set that can’t tell us definitively what elements it includes? Even if constant-space storage is impressive, what use is a probabilistic set?
Your web-browser stores images, videos, CSS, and other web elements as you browse, so that if you navigate to multiple pages on a website that re-use elements, or you browse from one website to another and back again, it doesn’t need to re-download all those resources. However, spinning hard drives are slow, so checking an on-disk cache for every element of a website will add a significant delay, especially if we learn that we don’t have the element cached and then need to fetch it over the Internet anyway. One solution here is using a Bloom filter as a pre-cache: check whether the URL of a resource is in the Bloom filter, and if we get a “maybe” then we check the disk cache, but if we get a “no” then we definitely don’t have the asset cached and need to make a web request. Because the Bloom filter takes a small and fixed amount of space we can cache it in RAM, even if a webpage contains many thousands of assets.
Databases can use Bloom filters in a similar way. SQL databases are typically stored as binary trees (if indexed well) to faciliate fast lookup times in queries. However, if a table is large, and lots of data must be read from a spinning hard drive, then even a well-structured table can be slow to read through. If queries often return zero rows, then this is an expensive search for no data! We can use Bloom filters as a kind of LOSSY-compressed version of rows or columns in a table. Does a row containing the value the user is asking for exist in the table? If the Bloom filter returns “maybe” then evaluate the query. If the Bloom filter returns “no” then return an empty set immediately, without loading the table at all.
Social media sites may want to avoid recommending the same posts to users repeatedly in their timeline - but maintaining a list of every tweet that every user has ever seen would require an unreasonable amount of overhead. One possible solution is maintaining a Bloom filter for each user, which would use only a small and fixed amount of space and can identify posts that are definitely new to the user. False positives will lead to skipping some posts, but in an extremely high-volume setting this may be an acceptable tradeoff for guaranteeing novelty.
Bloom filters consist of an array of m
bits, initially all set to 0, and k
hash functions (or a single function with k
salts). To add an element to the set, you hash it with each hash function. You use each hash to choose a bucket from 0
to m-1
, and set that bucket to 1. In psuedocode:
def add(element) for i in 0..k bin = hash(element, i) % m Bloomfilter[bin] = 1
As a visual example, consider a ten-bit Bloom filter with three hash functions. Here we add two elements:
Querying the Bloom filter is similar to adding elements. We hash our element k
times, check the corresponding bits of the filter, and if any of the bits are zero then the element does not exist in the set.
def isMaybePresent(element) for i in 0..k bin = hash(element, i) % m if( Bloomfilter[bin] == 0 ) return false return true
For example, if we query ‘salmon’, we find that one of the corresponding bits is set, but the other two are not. Therefore, we are certain that ‘salmon’ has not been added to the Bloom filter:
If all of the corresponding bits are one then the element might exist in the set, or those bits could be the result of a full- or several partial-collisions with the hashes of other elements. For example, here’s the same search for ‘bowfin’:
While ‘bowfin’ hasn’t been added to the Bloom filter, and neither of the added fish have a complete hash collision, the partial hash collisions with ‘swordfish’ and ‘sunfish’ cover the same bits as ‘bowfin’. Therefore, we cannot be certain that ‘bowfin’ has or has not been added to the filter.
There are two ways to estimate the number of elements in the set. One is to maintain a counter: every time we add a new element to the set, if those bits were not all already set, then we’ve definitely added a new item. If all the bits were set, then we can’t distinguish between adding a duplicate element and an element with hash collisions.
Alternatively, we can retroactively estimate the number of elements based on the density of 1-bits, the number of total bits, and the number of hash functions used, as follows:
In other words, the density of 1-bits should correlate with the number of elements added, where we add k
or less (in the case of collision) bits with each element.
Both estimates will begin to undercount the number of elements as the Bloom filter “fills.” Once many bits are set to one, hash collisions will be increasingly common, and adding more elements will have little to no effect on the number of one-bits in the filter.
Increasing the number of hash functions lowers the chance of a complete collision. For example, switching from two hash functions to four means you need twice as many bits to be incidentally set by other elements of the set before a query returns a false positive. While I won’t include the full derivation, the optimal number of hash functions is mathematically determined by the desired false-positive collision rate (one in a hundred, one in a thousand, etc):
However, increasing the number of hash functions also fills the bits of the Bloom filter more quickly, decreasing the total number of elements that can be stored. We can compensate by storing more bits in the Bloom filter, but this increases memory usage. Therefore, the optimal number of bits in a Bloom filter will also be based on the false-positive rate, and on the number of unique elements we expect to store, which will determine how “full” the filter bits will be.
If we want to store more elements without increasing the error rate, then we need more bits to avoid further collisions. If we want to insert the same number of elements and a lower error-rate, then we need more bits to lower the number of collisions. If we deviate from this math by using too few bits or too many hash functions then we’ll quickly fill the filter and our error rate will skyrocket. If we use fewer hash functions then we’ll increase the error-rate through sensitivity to collisions, unless we also increase the number of bits, which can lower the error-rate at the cost of using more memory than necessary.
Note that this math isn’t quite right - we need an integer number of hash functions, and an integer number of bits, so we’ll round both to land close to the optimal configuration.
Let’s double-check the theoretical math with some simulations. I’ve inserted between one and five-thousand elements, and used the above equations to solve for optimal Bloom filter parameters for a desired error rate of 1%, 5%, and 10%.
Here’s the observed error rate, and the number of recommended hash functions, plotted using the mean result and a 95% confidence-interal:
As we can see, our results are almost spot-on, and become more reliable as the Bloom filter increases in size! Here are the same simulation results, where the hue represents the number of bits used rather than the number of hash functions:
Since the number of recommended bits changes with the number of inserted elements, I had to plot this as a scatter plot rather than a line plot. We can see that the number of bits needed steadily increases with the number of inserted elements, but especially with the error rate. While storing 5000 elements with a 5% error rate requires around 24-kilobits, maintaining a 1% error rate requires over 40 kilobits (5 kilobytes).
Put shortly, the math checks out.
I think I’m drawn to these probabilistic data structures because they loosen a constraint that I didn’t realize existed to do the “impossible.”
Computer scientists often discuss a trade-off between time and space. Some algorithms and data structures use a large workspace to speed computation, while others can fit in a small amount of space at the expense of more computation.
For example, inserting elements into a sorted array runs in O(n)
- it’s quick to find the right spot for the new element, but it takes a long time to scoot all the other elements over to make room. By contrast, a hash table can insert new elements in (amortized) O(1)
, meaning its performance scales much better. However, the array uses exactly as much memory as necessary to fit all its constituent elements, while the hash table must use several more times memory - and keep most of it empty - to avoid hash collisions. Similarly, compression algorithms pack data into more compact formats, but require additional computation to get useful results back out.
However, if we loosen accuracy and determinism, creating data structures like Bloom filters that can only answer set membership with a known degree of confidence, or algorithms like Hyperloglog that can count elements with some error, then we can create solutions that are both time and space efficient. Not just space efficient, but preposterously so: constant-space solutions to set membership and size seem fundamentally impossible. This trade-off in accuracy challenges my preconceptions about what kind of computation is possible, and that’s mind-blowingly cool.
]]>Posted 3/20/2023
I recently learned about HyperLogLog, which feels like cursed counter-intuitive magic, so I am eager to share.
We want to count unique items, like “how many unique words appear across all books at your local library?” or “how many unique Facebook users logged in over the past month?” For a small set of unique tokens, like counting the unique words in this blog post, you might store each word in a set or hash table as you read them, then count the length of your set when you’re done. This is simple, but means the amount of memory used will scale linearly with the number of unique tokens, making such an approach impractical when counting enormous sets of tokens. But what if I told you we could accurately estimate the number of unique words while storing only a single integer?
To start with, we want to hash each of our words. A hash function takes arbitrary data and translates it to a ‘random’ but consistent number. For example, we’ll use a hash function that takes any word and turns it into a number from zero to 2**64
, with a uniform probability across all possible numbers. A good hash function will be unpredictable, so changing a single letter in the word or swapping the order of letters will yield a completely different number.
Next, we take the resulting hash, treat it as binary, and count how many leading bits are zero. An example is shown below:
We repeat this process for every word, tracking only the highest number of leading zero-bits we’ve observed, which we’ll call n
. When we reach the end of our data, we return 2**n
as our estimate of how many unique words we’ve seen.
So how in the world does this work? The key is that a good hash function returns hashes uniformly across its range, so we have turned each unique word into random numbers. Since hashing functions are deterministic, duplicate words will return the same hash.
A uniformly random number of fixed bit-length (for example, a random 64-bit integer) will start with a zero-bit with a probability of 1/2
, and will start with a 1-bit with a probability of 1/2
. It will start with two zero-bits with a probability of 1/4
, three zero-bits with a probability of 1/8
, and so on. A probability tree for this might look like:
We can run this explanation in reverse: if you have observed a hash that starts with three zero-bits, then on average you will have observed about 8 unique hashes, because around 1 in 8 hashes start with three zero-bits.
This sounds great, but there are two problems. First, the words “on average” are pretty important here: if you only examine one word, and it happens to have a hash starting with four leading zeros, then our probabilistic counting algorithm will guess that you’ve examined sixteen words, rather than one. Over 6% of hashes will start with four leading zeros, so this is easily possible. We need some way to overcome these ‘outliers’ and get a more statistically representative count of leading zeros.
Second, our probabilistic counting function can only return integer powers of two as estimates. It can guess that you’ve observed 8, 256, or 1024 words, but it can never estimate that you’ve observed 800 words. We want an estimator with a higher precision.
One strategy for addressing both limitations of probabilistic counting is to use multiple hashes. If we hash each observed word using ten different hash functions (or one hash function with ten different salts, but that’s a technical tangent), then we can maintain ten different counts of the highest number of leading zeros observed. Then at the end, we return the average of the ten estimates.
The more hash functions we use, the less sensitive our algorithm will be to outliers. Additionally, averaging over multiple counts lets us produce non-integer estimates. For example, if half our hash functions yield a maximum of four leading zeros, and half yield a maximum of five leading zeros, then we could estimate 2**4.5
unique tokens, or around 23.
This approach solves both our problems, but at a severe cost: now we need to calculate ten times as many hashes! If we’re counting upwards of billions of words, then this approach requires calculating nine billion additional hashes. Clearly, this won’t scale well.
Fortunately, there is an alternative solution that requires no additional hashing, known as HyperLogLog. Instead of using multiple hash functions and averaging across the results, we can instead pre-divide our words into buckets, and average across those.
For example, we could make 16 buckets, assign incoming hashes to each bucket uniformly, and maintain a “most leading zero-bits observed” counter for each bucket. Then we calculate an estimated number of unique elements from each bucket, and average across all buckets to get a global estimate.
For an easy approach to assigning hashes to each bucket, we can use the first four bits of each hash as a bucket ID, then count the number of leading zeros after this ID.
Once again, averaging across several sets of “most leading zeros” will minimize the impact of outliers, and afford us greater precision, by allowing non-integer exponents for our powers of two. Unlike the multiple hash solution, however, this approach will scale nicely.
One downside to HyperLogLog is that the bucket-averaging process is a little complicated. Dividing hashes across multiple buckets diminishes the impact of outliers, as desired, but it also diminishes the impact of all our hashes. For example, say we have 64 hashes, spread across 16 buckets, so 4 hashes per bucket. With 64 hashes, we can expect, on average, one hash with six leading zeros. However, each bucket has only four hashes, and therefore an expected maximum of two leading zeros. So while one bucket probably has six, most have closer to two, and taking the arithmetic mean of the buckets would severely underestimate the number of unique hashes we’ve observed. Therefore, HyperLogLog has a more convoluted estimation algorithm, consisting of creating estimates from each bucket, taking their harmonic mean, multiplying by the number of buckets squared, and multiplying by a magic number derived from the number of buckets^{1}. This results in dampening outliers while boosting the estimate back into the appropriate range.
Here’s a plot comparing the accuracy of Probabilistic counting (count leading zeros, no compensation for outliers), Probabilistic-Med counting (run Probabilistic using ten hash functions, return median of results), and HyperLogLog (our fancy bucket solution):
I’ve generated random strings as input, and evaluate at 50 points on the x-axis, with 100 draws of random strings per x-axis point to create a distribution and error bars. The y-axis represents each estimation function’s guess as to the number of unique elements, with a 95% confidence interval.
Unsprisingly, plain probabilistic counting does not fare well. When we generate thousands of strings, the likelihood that at least one will have many leading zeros is enormous, and since our algorithm relies on counting the maximum observed leading zeros, it’s extremely outlier sensitive.
Taking the mean across ten hash algorithms is also outlier-sensitive when the outliers are large enough, which is why I’ve opted for the median in this plot. Probabilistic-Med performs much better, but it suffers the same problems over a larger time-scale: as we read more and more unique tokens, the likelihood goes up that all ten hash functions will see at least one hash with many leading zeros. Therefore, as the number of unique tokens increases, Probabilistic-Med steadily begins to over-estimate the number of unique tokens, with increasing error bars.
HyperLogLog reigns supreme. While error increases with the number of unique hashes, it remains more accurate, with tighter error bars, than the multi-hash strategy, while remaining computationally cheap. We can increase HyperLogLog’s error tolerance and accuracy in high-unique-token scenarios by increasing the number of buckets, although this lowers accuracy when the number of unique tokens is small.
This is so darn cool! Tracking the total number of unique elements without keeping a list of those elements seems impossible - and it is if you need absolute precision - but with some clever statistics we can get a shockingly close estimate.
If you’d like to see a working example, here’s the code I wrote for generating the accuracy plot, which includes implementations of Probabilistic counting, Probabilistic-Med, and HyperLogLog. This is toy code in Python that converts all the hashes to strings of one and zero characters for easy manipulation, so it is not efficient and shouldn’t be treated as anything like an ideal reference.
If you enjoyed this post, you may enjoy my other writing on dimensional analysis, network science for social modeling, or algorithmic complexity.
The derivation of this number is quite complex, so in practice it’s drawn from a lookup table or estimated ↩
Posted 3/6/2023
This is a post about Big-O notation and measuring algorithmic complexity; topics usually taught to computer science undergraduates in their second to fourth semester. It’s intended for curious people outside the field, or new students. There are many posts on this subject, but this one is mine.
In computer science we often care about whether an algorithm is an efficient solution to a problem, or whether one algorithm is more efficient than another approach. One might be tempted to measure efficiency in terms of microseconds it takes a process to run, or perhaps number of assembly instructions needed. However, these metrics will vary widely depending on what language an algorithm is implemented in, what hardware it’s run on, what other software is running on the system competing for resources, and a host of other factors. We’d prefer to think more abstractly, and compare one strategy to another rather than their implementations. In particular, computer scientists often examine how an algorithm scales, or how quickly it slows down as inputs grow very large.
Let’s start with a trivial example: given a list of numbers, return their sum. Looks something like:
def sum(list): total = 0 for item in list total += item end return total end
Since we need to read the entire list, this algorithm scales linearly with the length of the list - make the list a hundred times longer, and it will take roughly a hundred times longer to get a sum. We write this formally as O(n)
, meaning “scales linearly with n
, the size of the input.” We call this formal syntax “Big O notation,” where the ‘O’ stands for “order of approximation” (or in the original German, “Ordnung”).
Not all algorithms scale. If we were asked “return the third element in the list” then it wouldn’t matter whether the list is three elements long or three million elements long, we can get to the third element in a constant amount of time. This is written as O(1)
, indicating no reliance on the input size.
Search algorithms give us our first example problem with divergent solutions. Given a stack of papers with names on them, tell me whether “Rohan” is in the stack. A trivial solution might look like:
def hasName(list) for name in list if name == "Rohan" return true end end return false end
This scales linearly with the length of the list, just like summing the elements. If the list is in an unknown order then we have no choice but to examine every element. However, if we know the list is in alphabetical order then we can do better. Start in the middle of the list - if the name is Rohan, we’re done. If we’re after Rohan alphabetically, then discard the second half of the list, and repeat on the first half. If we’re before Rohan alphabetically, then discard the first half of the list and repeat on the second. If we exhaust the list, then Rohan’s not in it. This approach is called a binary search, and visually looks like:
In code, a binary search looks something like:
def hasName(list) if( list.length == 0 ) return false end middle = list.length / 2 if( list[middle] == "Rohan" ) return true elsif( list[middle] > "Rohan" ) # Search left half return hasName(list.first(middle)) else # Search right half return hasName(list[middle .. list.length] end end
With every step in the algorithm we discard half the list, so we look at far fewer than all the elements. Our binary search still gets slower as the input list grows longer - if we double the length of the list we need one extra search step - so the algorithm scales logarithmically rather than linearly, denoted O(log n)
.
We’ll end this section by looking at two sorting algorithms: insertion sort, and merge sort.
We want to sort a list, provided to us in random order. One simple approach is to build a new sorted list: one at a time, we take elements from the front of the main list, and find their correct position among the sorted list we’ve built so far. To find the correct position we just look at the value left of our new element, and check whether they should be swapped or not. Keep swapping left until the new element finds its correct position. This visually looks like:
One implementation might look like:
def insertionSort(list) for i in 0.upto(list.length-1) for j in (i-1).downto(0) if( list[j] > list[j+1] ) list[j], list[j+1] = list[j+1], list[j] else break # Done swapping, found the right spot! end end end return list end
Insertion sort is simple and easy to implement. If you were coming up with a sorting algorithm on the spot for something like sorting a deck of cards, you might invent something similar. So what’s the runtime?
In insertion sort, we walk the list from start to end, which is O(n)
. For every new element we examine, however, we walk the list backwards from our current position to the start. This operation also scales linearly with the length of the list, and so is also O(n)
. If we perform a backwards O(n)
walk for every step of the forwards O(n)
walk, that’s O(n) * O(n)
for a total of O(n^2)
. Can we do better?
An alternative approach to sorting is to think of it as a divide-and-conquer problem. Split the list in half, and hand the first half to one underling and the second half to another underling, and instruct them each to sort their lists. Each underling does the same, splitting their lists in half and handing them to two further underlings. Eventually, an underling receives a list of length one, which is by definition already sorted. This splitting stage looks something like:
Now we want to merge our results upwards. Each underling hands their sorted list back up to their superiors, who now have two sorted sub-lists. The superior combines the two sorted lists by first making a new empty “merged” list that’s twice as long. For every position in the merged list, the superior compares the top element of each sorted sub-list, and moves the lower element to the merged list. This process looks like:
Once all elements from the two sub-lists have been combined into a merged list, the superior hands their newly sorted list upwards to their superior. We continue this process until we reach the top of the tree, at which point our work is done. This merge step looks like:
In code, the full algorithm might look something like:
# Combine two sorted lists def merge(left, right) merged = [] while( left.length + right.length > 0 ) if( left.length == 0 ) # Left empty, take from right merged += right.shift(1) elsif( right.length == 0 ) # Right empty, take from left merged += left.shift(1) elsif( left[0] < right[0] ) # Top of left stack is less, take it merged += left.shift(1) else # Top of right stack is less, take it merged += right.shift(1) end end return merged end # Takes a single list, sub-divides it, sorts results def mergeSort(list) if( list.length <= 1 ) return list # Sorted already :) end middle = list.length / 2 left = list[0 .. middle-1] right = list[middle .. list.length-1] leftSorted = mergeSort(left) rightSorted = mergeSort(right) return merge(leftSorted, rightSorted) end
So what’s the runtime of merge sort? Well it takes log n
steps to divide the list in half down to one element. We do this division process for every element in the list. That gives us a runtime of n * log n
to break the list apart and create the full tree diagram.
Merging two sorted lists together scales linearly with the size of the lists, so the merge step is O(n)
. We need to perform a merge each time we move up a “level” of the tree, and there are log n
levels to this tree. Therefore, the full merge process also scales with O(n log n)
.
This gives us a total runtime of O(n log n + n log n)
or O(2n log n)
to create the tree and merge it back together. However, because we are concerned with how algorithms scale as the inputs become very large, we drop constants and all expressions but the dominant term - multiplying by 2 doesn’t mean much as n
approaches infinity - and simplify the run time to O(n log n)
. That’s a lot better than insertion sort’s O(n^2)
!
Big O notation typically describes an “average” or “expected” performance and not a “best case” or “worst-case”. For example, if a list is in a thoroughly random order, then insertion sort will have a performance of O(n^2)
. However, if the list is already sorted, or only one or two elements are out of place, then insertion sort’s best-case performance is O(n)
. That is, insertion sort will walk the list forwards, and if no elements are out of place, there will be no need to walk the list backwards to find a new position for any elements. By contrast, merge sort will always split the list into a tree and merge the branches back together, so even when handed a completely sorted list, merge sort’s best-case performance is still O(n log n)
.
Big O notation also does not describe memory complexity. The description of merge sort above creates a temporary merged
list during the merge step, meaning however long the input list is, merge sort needs at least twice as much memory space for its overhead. By contrast, insertion sort works “in place,” sorting the input list without creating a second list as a workspace. Many algorithms make a trade-off between time and space in this way.
Finally, Big O notation describes how an algorithm scales as n
gets very large. For small values of n
, insertion sort may outperform merge sort, because merge sort has some extra bookkeeping to allocate temporary space for merging and coordinate which minions are sorting which parts of the list.
In summary, Big O notation is a valuable tool for quickly comparing two algorithms, and can provide programmers with easy estimates as to which parts of a problem will be the most time-consuming. However, Big O notation is not the only metric that matters, and should not be treated as such.
All of the algorithms described above can be run in polynomial time. This means their scaling rate, or Big O value, can be upper-bounded by a polynomial of the form O(n^k)
. For example, while merge sort scales with O(n log n)
, and logarithms are not polynomials, n log n
is strictly less than n^2
, so merge sort is considered to run in polynomial time. By contrast, algorithms with runtimes like O(2^n)
or O(n!)
are not bounded by a polynomial, and perform abysmally slowly as n
grows large.
These definitions allow us to describe categories of algorithms. We describe algorithms that run in polynomial time as part of set P, and we typically describe P as a subset of NP - the algorithms where we can verify whether a solution is correct in polynomial time.
To illustrate the difference between running and verifying an algorithm, consider the graph coloring problem: given a particular map, and a set of three or more colors, can you color all the countries so that no two bordering countries have the same color? The known algorithms for this problem are tedious. Brute forcing all possible colorings scales with O(k^n)
for k-colors and n-countries, and the fastest known general algorithms run in O(n * 2^n)
. However, given a colored-in map, it’s easy to look at each country and its neighbors and verify that none violate the coloring rules. At worst, verifying takes O(n^2)
time if all countries border most others, but more realistically O(n)
if we assume that each country only borders a small number of neighbors rather than a significant fraction of all countries.
Next, we have NP-Hard: these are the set of problems at least as hard as the most computationally intensive NP problems, but maybe harder - some NP-Hard problems cannot even have their solutions verified in polynomial time. When we describe a problem as NP-Hard we are often referring to this last property, even though the most challenging NP problems are also NP-Hard.
One example of an NP-Hard problem without polynomial verification is the Traveling Salesman: given a list of cities and distances between cities, find the shortest path that travels through every city exactly once, ending with a return to the original city. Trying all paths through cities scales with O(n!)
. More clever dynamic programming solutions improve this to O(n^2 2^n)
. But if someone claims to have run a traveling salesman algorithm, and hands you a path, how do you know it’s the shortest possible path? The only way to be certain is to solve the traveling salesman problem yourself, and determine whether your solution has the same length as the provided answer.
Finally, we have NP-Complete. These are the most challenging problems in NP, meaning:
Solutions to these algorithms can be verified in polynomial time
There is no known polynomial-time solution to these algorithms
Any problem in NP can be translated into an input to an NP-Complete problem in polynomial time, and the result of the NP-Complete algorithm can be translated back, again in polynomial time
Here’s a visualization of these problem classes:
Broad consensus in computer science is that the NP problem space is larger than the P problem space. That is, there are some problems that cannot be solved in polynomial time, but can be verified in polynomial time. However, no one has been able to definitively prove this, in large part because making formal arguments about such abstract questions is exceedingly difficult. There are many problems we do not know how to solve in polynomial time, but how do we prove there isn’t a faster, more clever solution that we haven’t thought of?
Therefore, a minority of computer scientists hold that P = NP, or in other words, all problems that can be verified in polynomial time can also be solved in polynomial time. This would make our set of problem classes look more like:
To prove that P equals NP, all someone would need to do is find a polynomial-time solution to any NP-Complete problem. Since we know all NP problems can be translated back and forth to NP-Complete problems in polynomial time, a fast solution to any of these most challenging problems would be a fast solution to every poly-verifiable algorithm. No such solution has been found.
]]>Posted 2/10/2023
I recently needed to make a graph with a hex lattice shape, like this:
I needed to find distances and paths between different hexagonal tiles, which proved challenging in a cartesian coordinate system. I tried a few solutions, and it was a fun process, so let’s examine each option.
The most “obvious” way to index hexagonal tiles is to label each according to their row and column, like:
This feels familiar if we’re used to a rectangular grid and cartesian coordinate system. It also allows us to use integer coordinates. However, it has a few severe disadvantages:
Moving in the y-axis implies moving in the x-axis. For example, moving from (0,0) to (0,1) sounds like we’re only moving vertically, but additionally shifts us to the right!
Coordinates are not mirrored. Northwest of (0,0) is (-1,1), so we might expect that Southeast of (0,0) would be flipped across the vertical and horizontal, yielding (1,-1). But this is not the case! Southeast of (0,0) is (0,-1) instead, because by dropping two rows we’ve implicitly moved twice to the right already (see point one)
These issues make navigation challenging, because the offsets of neighboring tiles depend on the row. Southeast of (0,0) is (0,-1), but Southeast of (0,1) is (1,0), so the same relative direction sometimes requires changing the column, and sometimes does not.
Rather than using row and column coordinates we could re-index each tile by its “true” cartesian coordinates:
This makes the unintuitive aspects of offset coordinates intuitive:
It is now obvious that moving from (0,0) to (0.5,1) implies both a vertical and horizontal change
Coordinates now mirror nicely: Northwest of (0,0) is (-0.5,1), and Southeast of (0,0) is (0.5,-1).
Following from point 1, it’s now clear why the distance between (0,0) and (3,0) isn’t equal to the distance between (0,0) and (0.5,3).
But while cartesian coordinates are more “intuitive” than offset coordinates, they have a range of downsides:
We no longer have integer coordinates. We could compensate by doubling all the coordinates, but then (0,0) is adjacent to (2,0), and keeping a distance of one between adjacent tiles would be ideal.
While euclidean-distances are easy to calculate in cartesian space, it’s still difficult to calculate tile-distances using these indices. For example, if we want to find all tiles within two “steps” of (0,0) we need to use a maximum range of about 2.237, or the distance to (1,2).
Fortunately there is a third indexing scheme, with integer coordinates, coordinate mirroring, and easy distance calculations in terms of steps! It just requires thinking in three dimensions!
In a cartesian coordinate system we use two axes, since we can move up/down, and left/right. However, on a hexagonal grid, we have three degrees of freedom: we can move West/East, Northwest/Southeast, and Northeast/Southwest. We can define the coordinate of each tile in terms of the distance along each of these three directions, like so:
These “cube coordinates” have one special constraint: the sum of the coordinates is always zero. This allows us to maintain a canonical coordinate for each tile.
To understand why this is necessary, imagine a system where the three coordinates (typically referred to as (q,r,s) to distinguish between systems when we are converting to or from an (x,y) system) correspond directly with the three axes: q refers to distance West/East, r to Northwest/Southeast, and s to Northeast/Southwest. Here’s a visualization of such a scheme:
We could take several paths, such as (0,1,1) or (1,2,0) or (-1,0,2), and all get to the same tile! That would be a mess for comparing coordinates, and would make distance calculations almost impossible. With the addition of this “sum to zero” constraint, all paths to the tile yield the same coordinate of (-1,2,-1).
Distances in cube coordinates are also easy to calculate - just half the “Manhattan distance” between the two points:
def distance(q1, r1, s1, q2, r2, s2): return (abs(q1-q2) + abs(r1-r2) + abs(s1-s2)) / 2
We can add coordinates, multiply coordinates, calculate distances, and everything is simple so long as we remain in cube coordinates.
However, we will unavoidably sometimes need to convert from cube to cartesian coordinates. For example, while I built the above hex grids using cube coordinates, I plotted them in matplotlib, which wants cartesian coordinates to place each hex. Converting to cartesian coordinates will also allow us to find the distance between hex tiles “as the crow flies,” rather than in path-length, which may be desirable. So how do we convert back to xy coordinates?
First, we can disregard the s
coordinate. Since all coordinates sum to zero, s = -1 * (q + r)
, so it represents redundant information, and we can describe the positions of each tile solely using the first two coordinates.
We can also tell through the example above that changing the q
coordinate contributes only to changing the x-axis, while changing the r
coordinate shifts both the x- and y-axes. Let’s set aside the q
coordinate for the moment and focus on how much r
contributes to each cartesian dimension.
Let’s visualize the arrow from (0,0,0) to (0,1,-1) as the hypotenuse of a triangle:
We want to break down the vector of length r=1
into x
and y
components. You may recognize this as a 30-60-90 triangle, or you could use some geometric identities: the internal angles of a hexagon are 120-degrees, and this triangle will bisect one, so theta must be 60-degrees. Regardless of how you get there, we land at our triangle identities:
From here we can easily solve for the x
and y
components of r
, using 2a = r
:
We know that (0,1,-1) is halfway between (0,0) and (1,0,-1) on the x-axis, so q
must contribute twice as much to the x-axis as r
does. Therefore, we can solve for the full cartesian coordinates of a hex using the cube coordinates as follows:
This works great! But it leaves the hexagons with a radius of sqrt(3) / 3
, which may be inconvenient for some applications. For example, if you were physically manufacturing these hexagons, like making tiles for a board-game, they’d be much easier to cut to size if they had a radius of one. Therefore, you will often see the conversion math from cube to cartesian coordinates written with a constant multiple of sqrt(3)
, like:
Since this is a constant multiple, it just re-scales the graph, so all the distance measurements and convenient properties of the system remain the same, but hexagons now have an integer radius.
If you are also excited by these coordinate systems, and want to read more about the logic behind cube coordinates, path-finding, line-drawing, wrapping around the borders of a map, and so on, then I highly recommend the Red Blob Games Hexagon article, which goes into much more detail.
]]>Posted 1/17/2023
In my last post I demonstrated how to perform image dithering to convert colored images to black and white. This consists of converting each pixel to either black or white (whichever is closer), recording the amount of “error,” or the difference between the original luminoscity and the new black/white value, and propagating this error to adjoining pixels to brighten or darken them in compensation. This introduces local error (some pixels will be converted to white when their original value is closer to black, and vice versa), but globally lowers error, producing an image that appears much closer to the original.
I’m still playing with dithering, so in this post I will extend the idea to color images. Reducing the number of colors in an image used to be a common task: while digital cameras may be able to record photos with millions of unique colors, computers throughout the 90s often ran in “256 color” mode, where they could only display a small range of colors at once. This reduces the memory footprint of images significantly, since you only need 8-bits per pixel rather than 24 to represent their color. Some image compression algorithms still use palette compression today, announcing a palette of colors for a region of the image, then listing an 8- or 16-bit palette index for each pixel in the region rather than a full 24-bit color value.
Reducing a full color image to a limited palette presents a similar challenge to black-and-white image dithering: how do we choose what palette color to use for each pixel, and how do we avoid harsh color banding?
We’ll start with a photo of a hiking trail featuring a range of greens, browns, and whites:
Let’s reduce this to a harsh palette of 32 colors. First, we need to generate such a palette:
#!/usr/bin/env python3 import numpy as np def getPalette(palette_size=32): colors = [] values = np.linspace(0, 0xFFFFFF, palette_size, dtype=int) for val in values: r = val >> 16 g = (val & 0x00FF00) >> 8 b = val & 0x0000FF colors.append((r,g,b)) return colors
I don’t know much color theory, so this is far from an “ideal” spread of colors. However, it is 32 equally spaced values on the numeric range 0x000000 to 0xFFFFFF, which we can convert to RGB values. We can think of color as a three dimensional space, where the X, Y, and Z axes represent red, green, and blue. This lets us visualize our color palette as follows:
import matplotlib.pyplot as plt def plotPalette(palette): fig = plt.figure(figsize=(6,6)) ax = fig.add_subplot(111, projection='3d') r = [] g = [] b = [] c = [] for color in palette: r.append(color[0]) g.append(color[1]) b.append(color[2]) c.append("#%02x%02x%02x" % color) g = ax.scatter(r, g, b, c=c, marker='o', depthshade=False) ax.invert_xaxis() ax.set_xlabel('Red') ax.set_ylabel('Green') ax.set_zlabel('Blue') plt.show()
Which looks something like:
Just as in black-and-white image conversion, we can take each pixel and round it to the closest available color - but instead of two colors in our palette, we now have 32. Here’s a simple (and highly inefficient) conversion:
# Returns the closest rgb value on the palette, as (red,green,blue) def getClosest(color, palette): (r,g,b) = color closest = None #(color, distance) for p in palette: # A real distance should be sqrt(x^2 + y^2 + z^2), but # we only care about relative distance, so faster to leave it off distance = (r-p[0])**2 + (g-p[1])**2 + (b-p[2])**2 if( closest == None or distance < closest[1] ): closest = (p,distance) return closest[0] def reduceNoDither(img, palette, filename): pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): pixels[y,x] = getClosest(pixels[y,x], palette) reduced = Image.fromarray(pixels) reduced.save(filename) img = Image.open("bridge.png") palette = getPalette() reduceNoDither(img, palette, "bridge_32.png")
The results are predictably messy:
Our palette only contains four colors close to brown, and most are far too red. If we convert each pixel to the closest color on the palette, we massively over-emphasize red, drowning out our greens and yellows.
Dithering to the rescue! Where before we had an integer error for each pixel (representing how much we’d over or under-brightened the pixel when we rounded it to black/white), we now have an error vector, representing how much we’ve over or under emphasized red, green, and blue in our rounding.
As before, we can apply Atkinson dithering, with the twist of applying a vector error to three dimensional color points:
# Returns an error vector (delta red, delta green, delta blue) def getError(oldcolor, newcolor): dr = oldcolor[0] - newcolor[0] dg = oldcolor[1] - newcolor[1] db = oldcolor[2] - newcolor[2] return (dr, dg, db) def applyError(pixels, y, x, error, factor): if( y >= pixels.shape[0] or x >= pixels.shape[1] ): return # Don't run off edge of image er = error[0] * factor eg = error[1] * factor eb = error[2] * factor pixels[y,x,RED] += er pixels[y,x,GREEN] += eg pixels[y,x,BLUE] += eb def ditherAtkinson(img, palette, filename): pixels = np.array(img) total_pixels = pixels.shape[0] * pixels.shape[1] for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] # Returns reference new = getClosest(old, palette) quant_error = getError(old, new) pixels[y,x] = new applyError(pixels, y, x+1, quant_error, 1/8) applyError(pixels, y, x+2, quant_error, 1/8) applyError(pixels, y+1, x+1, quant_error, 1/8) applyError(pixels, y+1, x, quant_error, 1/8) applyError(pixels, y+1, x-1, quant_error, 1/8) applyError(pixels, y+2, x, quant_error, 1/8) dithered = Image.fromarray(pixels) dithered.save(filename)
Aaaaaand presto!
It’s far from perfect, but our dithered black and white images were facsimiles of their greyscale counterparts, too. Pretty good for only 32 colors! The image no longer appears too red, and the green pine needles stand out better. Interestingly, the dithered image now appears flecked with blue, with a blue glow in the shadows. This is especially striking on my old Linux laptop, but is more subtle on a newer screen with a better color profile, so your mileage may vary.
We might expect the image to be slightly blue-tinged, both because reducing red values will make green and blue stand out, and because we are using an extremely limited color palette. However, the human eye is also better at picking up some colors than others, so perhaps these blue changes stand out disproportionately. We can try compensating, by reducing blue error to one third:
That’s an arbitrary and unscientific compensation factor, but it’s removed the blue tint from the shadows in the image, and reduced the number of blue “snow” effects, suggesting there’s some merit to per-channel tuning. Here’s a side-by-side comparison of the original, palette reduction, and each dithering approach:
Especially at a smaller resolution, we can do a pretty good approximation with a color selection no wider than a big box of crayons. Cool!
]]>Posted 1/16/2023
Dithering means intentionally adding noise to a signal to reduce large artifacts like color banding. A classic example is reducing a color image to black and white. Take this magnificent photo of my neighbor’s cat:
To trivially convert this image to black and white we can take each pixel, decide which color it’s closest to, and set it to that:
#!/usr/bin/env python3 from PIL import Image import numpy as np # Load image as grayscale img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y, row in enumerate(pixels): for x,col in enumerate(row): if( pixels[y,x] >= 127 ): pixels[y,x] = 255 else: pixels[y,x] = 0 bw = Image.fromarray(pixels) bw.save("kacie_bw.png")
But the result is not very satisfying:
The cat is white. Every pixel will be closer to white than black, and we lose the whole cat except the eyes and nose, along with most of the background detail. But we can do better! What if we set the density of black pixels based on the brightness of a region? That is, black regions will receive all black pixels, white regions all white, but something that should be a mid-gray will get closer to a checkerboard of black and white pixels to approximate the correct brightness.
One particularly satisfying way to approach this regional checkerboarding is called error diffusion. For every pixel, when we set it to black or white, we record how far off the original color is from the new one. Then we adjust the color of the adjacent pixels based on this error. For example, if we set a gray pixel to black, then we record that we’ve made an error by making this pixel darker than it should be, and we’ll brighten the surrounding pixels we haven’t evaluated yet to make them more likely to be set to white. Similarly, if we round a gray pixel up to white, then we darken the nearby pixels to make them more likely to be rounded down to black.
In Floyd-Steinberg dithering we process pixels left to right, top to bottom, and propagate the error of each pixel to its neighbors with the following distribution:
That is, pass on 7/16 of the error to the pixel right of the one we’re examining. Pass on 5/16 of the error to the pixel below, and a little to the two diagonals we haven’t examined yet. We can implement Floyd-Steinberg dithering as follows:
def getClosest(color): if( color >= 127 ): return 255 # White return 0 # Black def setAdjacent(pixels, y, x, error): (rows,cols) = pixels.shape[0:2] if( y >= rows or x >= cols ): return # Don't run past edge of image pixels[y,x] += error # Load image as grayscale img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] new = getClosest(old) pixels[y,x] = new quant_error = old - new setAdjacent(pixels, y, x+1, quant_error*(7/16)) setAdjacent(pixels, y+1, x-1, quant_error*(3/16)) setAdjacent(pixels, y+1, x, quant_error*(5/16)) setAdjacent(pixels, y+1, x+1, quant_error*(1/16)) dithered = Image.fromarray(pixels) dithered.save("kacie_dithered_fs.png")
The results are a stunning improvement:
We’ve got the whole cat, ruffles on her fur, the asphalt and wood chips, details on rocks, gradients within shadows, the works! But what are those big black flecks across the cat’s fur? These flecks of “snow” impact the whole image, but they don’t stand out much on the background where we alternate between black and white pixels frequently. On the cat, even small errors setting near-white fur to white pixels build up, and we periodically set a clump of pixels to black.
We can try to reduce this snow by fiddling with the error propagation matrix. Rather than passing all of the error on to adjacent pixels, and mostly to the pixel to the right and below, what if we ‘discount’ the error, only passing on 75% of it? This is the diffusion matrix used in Atkinson dithering:
The code hardly needs a change:
img = Image.open("kacie_color.png").convert("L") pixels = np.array(img) for y,row in enumerate(pixels): for x,col in enumerate(row): old = pixels[y,x] new = getClosest(old) pixels[y,x] = new quant_error = old - new setAdjacent(pixels, y, x+1, quant_error*(1/8)) setAdjacent(pixels, y, x+2, quant_error*(1/8)) setAdjacent(pixels, y+1, x+1, quant_error*(1/8)) setAdjacent(pixels, y+1, x, quant_error*(1/8)) setAdjacent(pixels, y+1, x-1, quant_error*(1/8)) setAdjacent(pixels, y+2, x, quant_error*(1/8)) dithered = Image.fromarray(pixels) dithered.save("kacie_dithered_at.png")
And the snow vanishes:
This is a lot more pleasing to the eye, but it’s important to note that the change isn’t free: if you look closely, we’ve lost some detail on the cat’s fur, particularly where the edges of her legs and tail have been ‘washed out.’ After all, we’re now ignoring some of the error caused by our black and white conversion, so we’re no longer compensating for all our mistakes in nearby pixels. This is most noticeable in bright and dark areas where the errors are small.
I really like this idea of adding noise and propagating errors to reduce overall error. It’s a little counter-intuitive; by artificially brightening or darkening a pixel, we’re making an objectively worse local choice when converting a pixel to black or white. Globally, however, this preserves much more of the original structure and detail. This type of error diffusion is most often used in digital signal processing of images, video, and audio, but I am curious whether it has good applications in more distant domains.
If you enjoyed this post and want to read more about mucking with images and color, you may enjoy reading my post on color filter array forensics.
]]>Posted 12/03/2022
My lab group recently asked me to give a tutorial on using SQL databases in science. While we are all complex systems scientists, my background is in computer science and STS, and many of my colleagues come from physics, mathematics, and philosophy, so we learn a great deal from one another. I’ve turned my slides into a blog post here, like my last lab talk on using Git for scientific software development.
A database is a piece of software for storing and organizing your data. Most importantly, databases make it easy to query your data, asking for subsets of your data that match a specific pattern you are interested in.
If you currently store your data in formats like CSV or JSON, and write lots of code for reading this data and searching through it for pieces relevant to your research question, our goal will be to offload all of this logic from your own code to a database. It will run much faster, it will be faster to write, and it will help you avoid bugs while expressing complicated questions simply.
There are many types of databases, but for this post I’ll split them along two axis: do they run locally (as part of your research code, storing data in a single file) or remotely (running as an independent process you speak to over the network), and does the database use SQL (a language for expressing sophisticated data queries) or not. Here’s a small subset of databases along these axes:
Local | Remote | |
SQL | SQLite, DuckDB | Postgresql, MySQL, MariaDB, MSSQL, … |
NoSQL | Pandas (sorta), BerkeleyDB, … | Redis, MongoDB, Firebase, … |
In this post I’ll be focusing on SQLite and Postgresql as examples. I’ll briefly talk about NoSQL databases at the end, and the scenarios where they might be preferable to SQL databases.
SQLite stores all data in one file on your hard drive. SQLite is a library, so the database software runs inside of the software you write. It is trivial to set up, pretty fast (especially for queries), and has most database features we will want.
Critically, SQLite is ill-suited to concurrency. Since SQLite runs inside of your software, two different Python scripts can easily try to write to the same database file at the same time, risking catastrophic data corruption. You can build sophisticated locking mechanisms to ensure only one program accesses a database at once, but this adds a serious performance bottleneck. SQLite is really intended for a single piece of software to store data, not live setups where several applications write data at the same time.
Postgres runs as a software daemon; it runs all the time, storing data in a series of files and caches that it manages. Whether postgres is running on your own computer or another computer, your research software will communicate with postgresql over the network.
This difference in design means that postgres requires some additional bureaucracy to set up: users and passwords, databases and permissions and authentication. In return however, postgres is even faster than SQLite, and handles concurrent access from many applications trivially. Postgres also has a number of advanced features that are unavailable in SQLite.
Relational databases store data in tables (think spreadsheets), and in the relationships between tables.
userid | firstname | lastname | status |
zerocool | Dade | Murphy | Undergradute |
acidburn | Kate | Libby | Undergradute |
joey | Joey | Pardella | Undergradute |
cerealkiller | Emmanuel | Goldstein | Undergradute |
phantomphreak | Ramon | Sanchez | Undergradute |
lord_nikon | Paul | Cook | Graduate |
building | room | desk | userid |
Sage | 113 | 1 | zerocool |
Sage | 113 | 2 | acidburn |
Perkins | 208 | 7 | joey |
West | 302 | 4 | lord_nikon |
You request data from a table using a SELECT
statement of the form SELECT columns FROM table WHERE row matches condition
. For example:
SELECT userid FROM desks WHERE building='Innovation' AND room=413;
You can also combine multiple tables during a SELECT
to gather related information. Here we fetch the names of all graduate students with a desk assigned to them, by selecting rows from the desks table and combining them with matching entries from the user table where the user IDs of both rows match:
SELECT firstname,lastname FROM desks LEFT JOIN users ON desks.userid=users.userid WHERE status='Graduate';
The following are the main commands for interacting with a SQL database to create relations, add, remove, and update information in relations, and select information from relations:
Command | Description |
SELECT | Return some columns from a relation |
INSERT | Add data to a relation |
DELETE | Remove data from a relation |
UPDATE | Modify data in a relation |
CREATE | Create a new relation (table, view, index) |
DROP | Remove a relation |
EXPLAIN | Show how a query will access data |
So that’s all fine in theory, but how do we write software that actually uses a database?
To connect to a SQLite database from Python we first supply a database filename, open a cursor from the connection, then use the cursor to send a query and get back a 2D array of results.
import sqlite3 conn = sqlite3.connect("university.db") c = conn.cursor() c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid") for (f,l,b,r) in c.fetchall(): print("%s %s has a desk in %s %d" % (f,l,b,r)) conn.commit() # Save any CREATE/INSERT changes to the database conn.close()
You can think of the cursor as a finger tracking your position in the database. Multiple cursors allow us to make multiple queries from the same database and track which results were associated with which request.
Interacting with Postgres is similar to SQLite: we connect to the database, then open a cursor from the connection, and use the cursor to send queries and get results. However, Postgres is a daemon accessible over the network, so we’ll need to supply a hostname and port number where the SQL server can be found, the name of the database we want to reach, and a username and password authorized to connect to that database.
import psycopg2 try: conn = psycopg2.connect(host="127.0.0.1", port="5432", user="registrar", password="hunter2", database="university_users") c = conn.cursor() c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid") for (f,l,b,r) in c.fetchall(): print("%s %s has a desk in %s %d" % (f,l,b,r)) except (Exception, Error) as error: print("Error while connecting to PostgreSQL", error) finally: if( conn ): conn.commit() conn.close()
Often your SQL statements will depend on other variables, and can’t be written as constant strings ahead of time. It may be tempting to assemble the SQL statement using string concatenation to insert variables. Never do this.
Consider the following example:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '" + name + "'") matches = c.fetchall()
Given a student’s last name, look up all students with that name. You might find functionality like this on your university’s student directory. But what if a user enters input like ' OR 'a'='a
? The query now reads:
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR 'a'='a'
While a little clunky, this will return every user in the database. Worse yet, a malicious user might construct a query like:
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR password LIKE 'A%'
This would get them a list of all users whose password hashes start with ‘A’, then another query for ‘AA’, ‘AB’, and slowly an attacker can reconstruct the password hashes of every user at the university. This kind of attack is called a SQL Injection, and is a common vulnerability in websites. While scientific code is less likely to be directly attacked than a website, if you’re working with real-world data, especially web-scraped or user-gathered data, there can be all kinds of garbage in your input.
To avoid this vulnerability you can write your query first with placeholders for parameters, then tell SQL to complete the statement on your behalf. In SQLite this looks like:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '?'", [name])
Or in Postgresql:
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '%s'", [name])
In either case, the SQL engine will properly escape any text in the name
field, ensuring that it’s interpreted as a string, and never as a SQL statement.
Real-world data is messy. Maybe you assume that every office at a school is assigned to an employee or graduate student, but some were assigned to recent graduates or retirees and haven’t been re-assigned. Maybe you assume that all students have last names, but some international students come from cultures that use mononyms, and the last name field is empty. If you don’t check these underlying assumptions, you might not learn that you’ve made a mistake until hours of debugging later! Fortunately, SQL provides an easy way to describe and enforce your assumptions about data through constraints.
CREATE TABLE users( userid TEXT PRIMARY KEY, firstname TEXT NOT NULL, lastname TEXT, status TEXT NOT NULL );
This definition of the user table includes four text fields, three of which cannot be empty. Further, the userid
field must be unique: you can have two students with the same first and last name, but they must have different usernames. We can add more detailed restrictions to the desk-assignment table:
CREATE TABLE desks( building TEXT NOT NULL, room INT NOT NULL, desk INT NOT NULL, userid TEXT, FOREIGN KEY(userid) REFERENCES users(userid), UNIQUE(building,room,desk) );
Here, we’ve explicitly said that the userid
field must match some user ID in the users
table. We’ve also said that while there can be multiple rooms in a building, and multiple desks in a room, there cannot be multiple desk 4’s in room 112 of Sage hall: the combination of building name, room number, and desk number must be unique.
If we try to insert any data into these tables that violates the described constraints, SQL will throw an exception instead of adding the new rows. Like unit testing but for your input data, constraints can help you be confident that your data follows the logic you think it does.
Compared to parsing CSV or JSON in Python and searching for the data you want, SQL will run inconceivably fast. But if you’re storing several gigabytes or more in your tables, even SQL databases will slow down. With a little bit of forethought we can make SQL queries run much faster.
Let’s say you have all your email stored in a database, with a structure something like:
CREATE TABLE emails( msgid TEXT PRIMARY KEY, from_address TEXT NOT NULL, to_address TEXT NOT NULL, subject TEXT NOT NULL, sent_date INT NOT NULL );
If we want to search for all emails received in the last week then SQL needs to search through every email in the table to check their sent dates. This is obviously highly inefficient, but we can warn SQL that we’ll be making these kinds of queries:
CREATE INDEX time_index ON emails(sent_date);
Creating an index tells SQL to build a binary tree, sorting the emails by sent_date
to reduce lookups from O(n)
to O(log n)
, dramatically improving performance. We can also build indices on multiple columns at once:
CREATE INDEX from_time_index ON emails(from_address,sent_date);
And now we can look up emails from a particular user in a particular time window in O(log n)
- even better! Both SQLite and Postgresql will automatically create indices for primary keys and unique constraints, since they’ll need to perform lookups during every new insert to make sure the constraints aren’t violated. You’ll often be selecting data based on unique characteristics too, so in practice it isn’t always necessary to declare indices explicitly.
Many SQL functions aggregate information from multiple rows. For example, we can count the number of users with:
SELECT COUNT(*) FROM users;
There are a variety of aggregate functions, include AVG
, MAX
, MIN
, and SUM
.
Often we don’t want to apply aggregate functions to every row, but to a sub-group of rows. Imagine we have a table of course registrations, like:
CREATE TABLE course_registration( userid TEXT, coursecode INT, credits INT, FOREIGN KEY(userid) REFERENCES users(userid), UNIQUE(userid,coursecode) );
To ask how many credits each student is registered for we might query:
SELECT userid,SUM(credits) FROM course_registration GROUP BY userid;
The GROUP BY
clause clusters rows based on their userid
, then runs the aggregate function on each group rather than on all rows. We could also list the students in descending order by credit count like:
SELECT userid,SUM(credits) AS total_credits FROM course_registration GROUP BY userid ORDER BY total_credits DESC;
Pandas is a ubiquitous Python package in data science. It makes it easy to store a table or a sequence of values as a Python object and perform some data analysis. It also integrates well with Seaborn, a package for statistical data visualization built on top of matplotlib. The two also integrate well with SQL. In just a couple lines, we can plot a histogram of how many credits students have registered for, from SQL to Pandas to Seaborn:
import sqlite3 import pandas as pd import seaborn as sns import matplotlib.pyplot as plt conn = sqlite3.connect("registration.db") df = pd.read_sql("SELECT userid,SUM(credits) AS registered_credits FROM course_registration GROUP BY userid", conn) sns.histplot(data=df, x="registered_credits") plt.set_title("Credit load across student body") plt.savefig("credit_load.pdf", bbox_inches="tight") conn.close()
Pandas will run a query for us (against SQLite, Postgresql, or a variety of other database types), put the result in a table with appropriate column names, and hand it off to Seaborn, which understands those same column names. Data analysis made easy!
For all the awesomeness of SQL, there are some tasks it is ill-suited to. If all you need is a way to store a dictionary so it persists, and make that dictionary accessible to multiple programs, then SQL is way more complexity and overhead than you need. Redis fills this niche well, and is simple and fast as long as you use it for this purpose.
If you have an enormous amount of data, terabytes worth, and need to update that data continuously, then SQL is a poor fit: every SQL server can only make one change to a table at a time and will have some kind of internal locking mechanism to prevent multiple writes from conflicting. This would be disastrous if, for example, you wanted to store all tweets in a database and need to save data as millions of users tweet at once. Here, tools like MongoDB step up, offering multiple database “shards” that will periodically sync with one another. This setup offers “eventual consistency” where a new tweet might not be available to all users right away, but things propagate pretty quickly, and in return we can handle huge numbers of updates at once.
More generally, SQL is a poor choice for:
Storing large files: maybe store metadata about files in a table, along with a pathname to where the file can be found on disk?
Storing unstructured data: you need to know what your rows and columns will be to put information in a spreadsheet. If your data is not uniform enough to describe in this way, then a spreadsheet is inappropriate.
Storing arbitrarily structured or nested data: if your data comes in the form of deeply nested JSON or XML, then a spreadsheet may be a poor choice. This is not always the case: if you have some nested JSON representing a tree of comments and replies on a website, then you may be able to “flatten” the tree by making each comment into a unique row, and including a “parent commentID” as a column. However, if different levels of nesting can have a wide variety of tags and meanings, this conversion may not always make sense. If you find that you’re storing a blob of JSON as a column in your table, then a table may not be the best representation for you.
For very specific types of data, like GIS, or network/graph data, there are specialized databases that may offer more task-appropriate tools than SQL.
SQL databases are an invaluable tool for any data science. They allow researchers to organize a wide variety of data so that it is easily and quickly queried to identify patterns and answer questions. SQL can simplify your code, preclude nasty bugs via constraints, and integrates nicely with most programming languages, and especially with common data science software packages.
This post offers a brief tutorial on using SQL, but there is an enormous depth available to accomplish more complicated tasks. In particular, I have left out:
Views and Materialized Views
Subqueries and Common Table Expressions
Much more detail on joins, unions, grouping, and partitioning
Functions, stored procedures, and triggers
Hopefully this gives you enough of a foundation to start using SQL in scientific contexts, and look up more details as you need them.
]]>