I’m currently at a workshop on open practices across disciplines, and one topic of discussion is how to change the academic publishing process to be more accessible to both authors and readers. I’ve also had a few friends outside of academia ask me how publishing research papers works, so it’s a good opportunity to write a post about the messy world of academic publishing.
Academics conduct research, write an article about their findings, and submit their article to an appropriate journal for their subject. There it undergoes review by a committee of peer researchers qualified to assess the quality of the work, and upon acceptance, the article is included in the next issue of the journal. In a simple scenario, the process is illustrated by this flowchart:
Libraries and research labs typically pay journals a subscription fee to receive new issues. This fee traditionally covered publication expenses, including typesetting (tedious for papers with lots of equations and plots and diagrams), printing, and mail distribution, along with the salaries of journal staff like editors, who are responsible for soliciting volunteer peer-reviews from other academics. These subscription fees were long considered a necessary evil: they limit the ability of low-income academics to access published research, such as scientists at universities in developing countries, let alone allowing the public to access research, but we all agree that printing and distributing all these journal issues has some significant financial overhead.
In recent decades, all significant journals have switched to majority or exclusively digital distribution. Academics do most of the typesetting themselves with LaTeX or Microsoft Word templates provided by the journals, and there’s no printing and negligible distribution costs for hosting a PDF online, so publication fees now go largely to the profit margins of publishers. This has made academic publishing ludicrously profitable, with margins as high as 40% in a multi-billion dollar industry.
Academics complain bitterly that journal publishers are parasitic, charging exorbitant publication fees while providing almost no service. After all, research is conducted by academics and submitted to the publishers for free. Other academics review the research, also for free, as peer-review is considered expected community service within academia. Since academics are typically funded by government agencies (such as the National Science Foundation, Department of Energy, and Department of Defense in the United States), this is taxpayer-funded public research, whose distribution is being limited by publishers rather than facilitated by them.
As journal subscription costs grew, these complaints eventually evolved into threats by universities to cancel their journal subscriptions, and funding agencies like the NSF began to demand that work they fund be made publicly accessible. The publisher profit margins were endangered, and they needed to act quickly to suppress dissent.
Many publishers now offer or require an alternative publishing scheme: Open Access. Under Open Access, articles can be read for free, but academics must pay to have their work published in order to cover staff salaries and the burdensome cost of web-hosting PDFs. This not only protects the revenue stream of publishers, but can expand it dramatically when journals like Nature Neuroscience charge $11,690 per article.
While Open Access allows academics with fewer resources to read scholarly work from their peers, and allows the public to read academic papers, it also inhibits academics with less funding from publishing if they can’t afford the publication fees. Further, it provides an incentive for publishers to accept as many papers for publication as possible to maximize publication fees, even if these papers are of lower quality or do not pass rigorous peer-review. When journals are paid under a subscription model they make the same income whether a new issue has ten or a hundred articles in it, and so it is more profitable to be selective in order to maximize the ‘prestige’ of the journal and increase subscriptions.
Academic research remains constrained by publishers, who either charge a fortune before publication, or after, while providing minimal utility to academia. These costs disproportionately impact researchers with less funding, often those outside North America and Europe. The most obvious solution to this problem might be “replace the journals with lower-cost alternatives,” but this is easier said than done. Even if we could find staff to organize and run a series of lower-cost journals, there’s a lot of political momentum behind the established publishers. Academics obtain grant funding, job offers, and tenure through publishing. Successful publishing means publishing many papers in prestigious journals and getting many citations on those papers. A new unproven journal won’t replace a big name like Nature or Science any time soon in the eyes of funding agencies and tenure committees, and will take time to gather a loyal readership before papers in it receive many reads or citations. While I hope for eventual reform of journals and academic institutional practices at large, a more immediate solution is needed.
One option is to simply pressure existing journals into dropping fees. If enough universities threaten to cut their subscriptions to major journals, then publishers will have no choice but to lower subscription costs or Open Access publication fees and accept a lower profit margin. This strategy has seen some limited success - some universities are cutting their contracts with major publishers, perhaps most notably when the University of California system ended their subscription to all Elsevier journals in 2019. However, this strategy can only work if researchers have leverage. Elsevier is the worst offender, and so universities can cut ties and push their researchers to publish in competitor journals from Springer or SAGE, but the costs at those competitor publishers remains high.
Physicists popularized the idea of a “preprint.” Originally this consisted of astrophysicists emailing rough drafts of their papers to one another. This had less to do with publication fees and more to do with quickly sharing breakthroughs without the delays that peer-review and publication incurs. Over time, the practice shifted from mailing lists to centralized repositories, and grew to encompass physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. That preprint service is called arXiv. This effort has been replicated in other fields, including bioRxiv, ChemRxiv, medRxiv, and SocArXiv, although preprint usage is not common in all fields.
Papers submitted to preprint servers have not undergone peer-review, and often have little to no quality control - the moderators at arXiv will give a paper a quick glance to remove obvious spam submissions, but they have neither the resources nor the responsibility to confirm that research they host is of high quality or was conducted ethically. Preprint papers were always intended to be rough drafts before publication in real journals, not a substitution for publication. Nevertheless, it is common practice for scholars to bypass journal paywalls by looking for a preprint of the same research before it underwent peer-review, so in practice preprint servers already serve as an alternative to journal subscriptions.
The direct action counter to journal subscription fees is to simply pirate the articles. Sci-Hub and Library Genesis (URLs subject to frequent change) acquire research papers and books, respectively, and host them as PDFs for free, ignoring copyright. Both shadow libraries have been sued for copyright infringement in several jurisdictions, but have rotated operations between countries and have so far avoided law enforcement.
Use of Sci-Hub is ubiquitous in STEM-academia, and is often the only way that researchers can access articles if they have limited funding or operate out of sanctioned locations, such as Russia during the Russia-Ukraine war. Sci-Hub’s founder, Alexandra Elbakyan, considers the site’s operations to be a moral imperative under the Universal Declaration of Human Rights, which guarantees all human beings the right to freely share in scientific advancements and their benefits. Whether or not you agree with Elbakyan’s stance, it seems clear that a combination of shadow libraries and preprint services have undermined the business models of traditional academic publishers and made them more amenable to alternatives like Open Access, and more susceptible to threats by universities to end subscriptions.
Academic publishing is approaching a crisis point. Research funding in most disciplines is scarce, and journal subscription or publication fees are steadily increasing. The number of graduate and postgraduate researchers is growing, guaranteeing an accelerating rate of papers, straining publication fees and the peer-review system even further. Academics have tolerated the current system by using preprints and shadow libraries to share work without paying journals, but these are stopgaps with a range of shortcomings. If academic research is to flourish then we will see a change that lowers publication costs and perhaps relieves strain on peer reviewers, but what that change will look like or how soon it will come remains open to debate.
I’ve submitted a new paper! Here’s the not-peer-reviewed pre-print. This post will discuss my work for non-network-scientist audiences.
There is broad disillusionment regarding the influence major tech companies have over our online interactions. Social media is largely governed by Meta (Facebook, Instagram, Whatsapp), Google (YouTube), and Twitter. In specific sub-communities, like open source software development, a single company like GitHub (owned by Microsoft) may have near-monopolistic control over online human collaboration. These companies define both the technology we use to communicate, and thereby the actions we can take to interact with one another, and the administrative policies regarding what actions and content are permissible on each platform.
In addition to debates over civic responsibility and regulation of online platforms, pushback to the centralized influence of these companies has taken two practical forms:
Alt-Tech. Communities that are excluded from mainstream platforms, often right-wing hate and conspiracy groups, have built an ecosystem of alternative platforms that mirrors their mainstream counterparts, but with administrations more supportive of their political objectives. These include Voat and the .Win-network (now defunct Reddit-clones), BitChute and Parler (YouTube-clones), Gab (Twitter-clone), and many others.
The Decentralized Web. Developers concerned about centralized control of content have built a number of decentralized platforms that aim to limit the control a single entity can have over human communication. These efforts include Mastodon, a Twitter alternative consisting of federated Twitter-like subcommunities, and ad-hoc communities like a loose network of self-hosted git servers. The decentralized web also encompasses much older decentralized networks like Usenet and email, and bears similarity to the motivations behind some Web3 technologies.
It is this second category, of ostensibly self-governed online communities, that interests me. Building a community-run platform is a laudable goal, but does the implementation of Mastodon and similar platforms fall short of those aspirations? How do we measure how ‘decentralized’ a platform is, or inversely, how much influence an oligarchy has over a platform?
One common approach to measuring social influence is to examine population size. The largest three Mastodon instances host over half of the entire Mastodon population. Therefore, the administrators of those three instances have disproportionate influence over permitted speech and users on Mastodon as a whole. Users who disagree with their decisions are free to make their own Mastodon instances, but if the operators of the big three instances refuse to connect to yours then half the Mastodon population will never see your posts.
A size disparity in community population is inevitable without intervention. Social networks follow rich-get-richer dynamics: new users are likely to join an existing vibrant community rather than a fledgling one, increasing its population and making it more appealing to future users. This is fundamentally a social pressure, but it is even further amplified by search engines, which are more likely to return links to larger and more active sites, funneling potential users towards the largest communities.
But is size disparity necessarily a failure of decentralization? Proponents of Mastodon have emphasized the importance of small communities that fit the needs of their members, and the Mastodon developers have stated that most Mastodon instances are small, topic-specific communities, with their
mastodon.social as a notable exception. If smaller communities operate comfortably under the shadow of larger ones, perhaps this is a healthy example of decentralized governance.
Before exploring alternative methods for measuring social centralization, let’s compare a few of these decentralized and alt-tech platforms using the lens of community size. Below is a plot of sub-community population sizes for five platforms.
The y-axis represents the population of each community as a fraction of the largest community’s size. In other words, the largest community on each platform has a size of “1”, while a community with a tenth as many users has a size of “0.1”. The x-axis is what fraction of communities have at least that large a population. This allows us to quickly show that about 2% of Mastodon instances are least 1% the size of the largest instance, or alternatively, 98% of Mastodon instances have fewer than 1% as many users as the largest instance.
This puts Mastodon in similar territory as two centralized platforms, BitChute and Voat. Specifically, the number of commenters on BitChute channels follows a similar distribution to Mastodon instance sizes, while the distribution of Voat “subverse” (analogous to “subreddits”) populations is even more skewed.
By contrast, the number of users on self-hosted Git servers (the Penumbra of Open-Source), and unique authors on Polish Usenet newsgroups, is far more equitable: around a third of git servers have at least 1% as many users as the largest, while the majority of newsgroups are within 1% of the largest.
If smaller communities exist largely independently of larger ones, then the actions of administrators on those large communities does not matter to the small community, and even in the face of a large population disparity a platform can be effectively decentralized. How can we measure this notion of “independence” in a platform-agnostic way such that we can compare across platforms?
Each of the five platforms examined above has some notion of cross-community activity. On Mastodon, users can follow other users on both their own instance and external instances. On the other four platforms, users can directly participate in multiple communities, by contributing to open source projects on multiple servers (Penumbra), or commenting on multiple channels (BitChute), subverses (Voat), or newsgroups (Usenet).
In network science terminology, we can create a bipartite graph, or a graph with two types of vertices: one for communities, and one for users. Edges between users and communities indicate that a user interacts with that community. For example, here’s a diagram of Mastodon relationships, where an edge of ‘3’ indicates that a user follows three accounts on a particular instance:
This allows us to simulate the disruption caused by removing an instance: if
mastodon.social went offline tomorrow, how many follow relationships from users on
scholar.social would be disrupted? More globally, what percentage of all follow relationships by remaining users have just been pruned? If the disruption percentage is high, then lots of information flowed from the larger community to the smaller communities. Conversely, if the disruption percentage is low, then users of the smaller communities are largely unaffected.
Here is just such a plot, simulating removing the largest community from each platform, then the two largest, three largest, etcetera:
From this perspective on inter-community relationships, each platform looks a little different. Removing the largest three Mastodon instances has a severe effect on the remaining population, but removing further communities has a rapidly diminished effect. Removing Usenet newsgroups and BitChute channels has a similar pattern, but less pronounced.
Voat and the Penumbra require additional explanation. Voat, like Reddit, allowed users to subscribe to “subverses” to see posts from those communities on the front page of the site. New users were subscribed to a set of 27 subverses by default. While the two largest subverses by population (
8chan) were topic-specific and non-default, the third largest subverse,
news, was a default subverse with broad appeal and high overlap with all other communities. Therefore, removing the largest two communities would have had little impact on users uninvolved in QAnon discussions, but removing
news would impact almost every user on the site and cuts nearly 10% of interactions site-wide.
The Penumbra consists of independently operated git servers, only implicitly affiliated in that some developers contributed to projects hosted on multiple servers. Since servers are largely insular, most developers only contribute to projects on one, and so those developers are removed entirely along with the git server. If a user contributed to projects hosted on two servers then disruption will increase when the first server is removed, but will decrease when the second server is removed, and the developer along with it. This is shown as spiky oscillations, where one popular git server is removed and drives up disruption, before another overlapping git server is removed and severs the other side of those collaborations.
Sometimes you may be uninterested in the impact of removing the largest 2, 3, or 10 instances, and want a simple summary statistic for whether one platform is “more centralized” than another. One way to approximate this is to calculate the area under the curve for each of the above curves:
This scores Mastodon as the most centralized, because removing its largest instances has such a large effect on its peers. By contrast, while the Voat curve is visually striking, it’s such a sharp increase because removing the largest two communities doesn’t have a large impact on the population.
“Centralization” is an ill-defined term, and network scientists have a range of ways of measuring centralization for different scenarios. These metrics fall into three broad categories:
|Measures how central a role a single node plays in the network
|Betweenness centrality, Eigenvector centrality
|Measures aspects of a particular group of vertices
|Assortativity / Homophily, Modularity, Insularity / Border index
|A summary attribute of an entire graph
|Diameter, Density, Cheeger numer
These metrics can capture aspects of centrality like “this vertex is an important bridge connecting two regions of a graph” or “this vertex is an important hub because many shortest paths between vertices pass through it.” They can measure how tight a bottleneck a graph contains (or, phrased another way, how well a graph can be partitioned in two), they can measure how much more likely similar vertices are to connect with one another, or how skewed the degree distribution of a graph is.
However, these metrics are mostly intended for fully connected unipartite graphs, and do not always have clear parallels in disconnected or bipartite graphs. Consider the following examples:
Many would intuitively agree that the left-most graph is central: one community in the center is larger than the rest, and serves as a bridge connecting several other communities together. By contrast, the middle graph is decentralized, because while the communities aren’t all the same size, none are dramatically larger than one another, and none serve a critical structural role as a hub or bridge.
The graph on the right is harder to describe. One community is much larger than its peers, but the remaining graph is identical to the decentralized example. By degree distribution, the graph would appear to be centralized. If we add a single edge connecting the giant community to any user in the main graph, then the giant community’s betweenness centrality score would skyrocket because of its prominent role in so many shortest-paths between users. However, it would still be inappropriate to say that the largest community plays a pivotal role in the activity of the users in the rest of the graph - it’s hardly connected at all!
My disruption metric is a cluster-level or mesoscale measurement for bipartite graphs that measures the influence of each community on its peers, although you can calculate the area under the disruption curve to make a graph-scale summary statistic. Using this approach, the centralized community is decidedly centralized, and the decentralized and ambiguous graphs are decidedly not.
Community size disparity is natural. Some communities will have broader appeal, and benefit from more from rich-get-richer effects than their smaller, more focused peers. Therefore, even a thriving decentralized platform may have a highly skewed population distribution. To measure the influence of oligarchies on a platform, we need a more nuanced view of interconnection and information flow between communities.
I have introduced a ‘disruption’ metric that accounts for both the size of a community and its structural role in the rest of the graph, measuring its potential influence on its peers. While the disruption metric illustrates how population distributions can be deceptive, it is only a preliminary measurement. Follows across communities and co-participation in communities are a rough proxy for information flow, or a network of potential information flow. A more precise metric for observed information flow might measure the number of messages that are boosted (“retweeted”) from one Mastodon instance to another, or might measure how frequently a new discussion topic, term, or URL appears first in one community, and later appears in a “downstream” community.
Does population size correlate with these measurements of information flow and influence? Are some smaller communities more influential than their size would suggest? How much does the graph structure of potential information flow predict ‘social decentralization’ in practice? There are many more questions to explore in this domain - but this is a start!
Ant nests look kind of like networks - they have rooms, and tunnels between the rooms, analogous to vertices and edges on a graph. A graph representation of a nest might help us answer questions about different ant species like:
Do some species create more rooms than others?
Do some species have different room layouts, such as a star with a central room, or a main corridor rooms sprout off of, closer to a random network, or something like a small-world network?
Do some species dig their rooms deeper, perhaps to better insulate from cold weather, or with additional ‘U’ shaped bends to limit flooding in wetter climates?
I’m no entomologist, and I will not answer those questions today. I will however, start work on a tool that can take photos of ant farms and produce corresponding network diagrams. I don’t expect this tool to be practical for real world research: ant farms are constrained to two dimensions, while ants in the wild will dig in three, and this tool may miss critical information like the shapes of rooms. But it will be a fun exercise, and maybe it will inspire something better.
We’ll start with a photo of an ant farm, cropped to only include the dirt:
I want to reduce this image to a Boolean map of where the ants have and have not excavated. For a first step, I’ll flatten it to black and white, adjusting brightness and contrast to try to mark the tunnels as black, and the remaining dirt as white. Fortunately, ImageMagick makes this relatively easy:
convert -white-threshold 25% -brightness-contrast 30x100 -alpha off -threshold 50% original.png processed.png
Clearly this is a noisy representation. Some flecks of dirt are dark enough to flatten to ‘black,’ and the ants have left some debris in their tunnels that appear ‘white.’ The background color behind the ant farm is white, so some regions that are particularly well excavated appear bright instead of dark. We might be able to improve that last problem by coloring each pixel according to its distance from either extreme, so that dark tunnels and bright backgrounds are set to ‘black’ and the medium brown dirt is set to ‘white’ - but that’s more involved, and we’ll return to that optimization later if necessary.
In broad strokes, we’ve set excavated space to black and dirt to white. If we aggregate over regions of the image, maybe we can compensate for the noise.
My first thought was to overlay a square graph on the image. For each, say, 10x10 pixel region of the image, count the number of black pixels, and if they’re above a cutoff threshold then set the whole square to black, otherwise set it to white. This moves us from a messy image representation to a simpler tile representation, like a board game.
There are a few problems with this approach. Looking ahead, I want to identify rooms and tunnels based on clumps of adjacent black tiles. A square has only four neighbors - eight if we count diagonals, but diagonally-adjacent tiles don’t necessarily imply that the ants have dug a tunnel between the two spaces. So, we’ll use hexagons instead of squares: six-neighbors, no awkwardness about ‘corners,’ and we can still construct a regular lattice:
So far so good! A hexagonal coordinate system is a little different from a Cartesian grid, but fortunately I’ve worked with cube coordinates before. For simplicity, we’ll set the diameter of a hexagon to the diameter of a tunnel. This should help us distinguish between tunnels and rooms later on, because tunnels will be around one tile wide, while rooms will be much wider.
Unfortunately, a second problem still remains: there’s no good threshold for how many black pixels should be inside a hexagon before we set it to black. A hexagon smack in the middle of a tunnel should contain mostly black pixels. But what if the hexagons aren’t centered? In a worst-case scenario a tunnel will pass right between two hexagons, leaving them both with half as many black pixels. If we set the threshold too tight then we’ll set both tiles to white and lose a tunnel. If we set the threshold too loose then we’ll set both tiles to black and make a tunnel look twice as wide as is appropriate - perhaps conflating some tunnels with rooms.
So, I’m going to try dithering! This is a type of error propagation used in digital signal processing, typically in situations like converting color images to black and white. In our case, tiles close to white will still be set to white, and tiles close to black will still be darkened to black - but in an ambiguous case where two adjoining tiles are not-quite-dark-enough to be black, we’ll round one tile to white, and the other to black. The result is mostly okay:
We’re missing some of the regions in the upper right that the ants excavated so completely that the white background shone through. We’re also missing about two hexagons needed to connect the rooms and tunnels on the center-left with the rest of the nest. We might be able to correct both these issues by coloring pixels according to contrast and more carefully calibrating the dithering process, but we’ll circle back to that later.
So far we’ve reduced a messy color photograph to a much simpler black-and-white tile board, but we still need to identify rooms, junctions, and tunnels. I’m going to approach this with a depth first search:
Define a global set of explored tiles, and a local set of “neighborhood” tiles. Select an unexplored tile at random as a starting point.
Mark the current tile as explored, add it to the neighborhood, and make a list of unexplored neighbors
If the list is longer than three, recursively explore each neighbor starting at step 2
Once there are no more neighbors to explore, mark the neighborhood as a “room” if it contains at least ten tiles, and a “junction” if it contains at least four. Otherwise, the neighborhood is part of a tunnel, and should be discarded.
If any unexplored tiles remain, select one and go to step 2
Once all tiles have been explored, we have a list of “rooms” and a list of “junctions,” each of which are themselves lists of tiles. We can visualize this by painting the rooms blue and the junctions red:
Looking good so far!
We’re most of the way to a graph representation. We need to create a ‘vertex’ for each room or junction, with a size proportional to the number of tiles in the room, and a position based on the ‘center’ of the tiles.
Then we need to add edges. For this we’ll return to a depth-first flood fill algorithm. This time, however, we’ll recursively explore all tiles adjacent to a room that aren’t part of another room or junction, to see which other vertices are reachable. This won’t preserve the shape, length, or width of a tunnel, but it will identify which areas of the nest are reachable from which others:
And there we have it!
We’ve gone from a color photo of an ant farm to a network diagram, all using simple algorithms, no fancy machine learning. I think we have a decent result for a clumsy first attempt!
There are many caveats. We’re missing some excavated spaces because of the wall color behind the ant farm in our sample photo. The dithering needs finer calibration to identify some of the smaller tunnels. Most importantly, an enormous number of details need to be calibrated for each ant farm photo. The brightness and contrast adjustments and noise reduction, the hexagon size, the dithering thresholds, and the room and junction sizes for flood filling, may all vary between each colony photo.
For all these reasons, I’m pausing here. If I think of a good way to auto-calibrate those parameters and improve on the image flattening and dithering steps, then maybe I’ll write a part two. Otherwise, this has progressed beyond a couple-evening toy project, so I’ll leave the code as-is.
A recent research paper, Freaky Leaky SMS: Extracting User Locations by Analyzing SMS Timings (PDF), purports to geolocate phone numbers by texting them and analyzing response times. This is creepy, interesting, and hopefully a warning that can perhaps help phone companies to better protect their customers’ privacy in the future. Today I’m writing up a short summary, context, and some of my thoughts about the study. The original paper is intended for computer security and machine learning scientists, but I intend to write for a broader audience in this post.
When Alice sends Bob a text message, Bob’s phone sends back an acknowledgement automatically - “I received your text!” If Alice’s phone doesn’t receive that acknowledgement before a timeout, Alice gets a “Failed to deliver text” error.
If Alice is standing next to Bob in Chicago, that text should be delivered quickly, and the acknowledgement should arrive almost instantly. If Alice is in Chicago and Bob is in Hong Kong, it should take slightly longer for the round-trip text message and acknowledgement.
So, if the delay before a text acknowledgement correlates with the distance between the phones, can we text Bob from three different phones, and by analyzing the delays, triangulate his position? What level of precision can we obtain when tracking Bob in this way?
In reality, text message delays will be messy. If Alice’s texts travel through a telecommunications hub in Chicago, then there may be a delay related to the amount of congestion on that hub. If there are multiple paths between Alice and Bob across telecommunications equipment, then each path may incur a different delay. Finally, the routes of telecommunications equipment may not take birds-eye-view shortest paths between locations. For example, if Alice and Bob are on opposite sides of a mountain range, the phone switches connecting them may divert around the mountains or through a pass, rather than directly over.
However, “messy” does not mean random or uncorrelated. If we text Bob enough times from enough phones, and apply some kind of noise reduction (maybe taking the median delay from each test-phone?), we may be able to overcome these barriers and roughly identify Bob’s location.
The researchers set up a controlled experiment: they select 34 locations across Europe, the United States, and the United Arab Emirates, and place a phone at each. They assign three of these locations as “senders” and all 34 as “receivers.”
To gather training data, they send around 155K text messages, in short bursts every hour over the course of three days. This provides a baseline of round-trip texting time from the three senders to the 34 receivers during every time of day (and therefore, hopefully, across a variety of network congestion levels).
For testing, the researchers can text a phone number from their three senders, compare the acknowledgement times to their training data, and predict which of the 34 locations a target phone is at. The researchers compare the test and training data using a ‘multilayer perceptron’, but the specific machine learning model isn’t critical here. I’m curious whether a much simpler method, like k-nearest-neighbors or a decision-tree, might perform adequately, but that’s a side tangent.
The heart of the research paper consists of two results, in sections 5.1 and 5.2. First, they try to distinguish whether a target is ‘domestic’ or ‘abroad.’ For example, the sensors in the UAE can tell whether a phone number is also at one of the locations in the UAE with 96% accuracy. This is analogous to our starting example of distinguishing between a Chicago-Chicago text and a Chicago-Hong-Kong text, and is relatively easy, but a good baseline. They try distinguishing ‘domestic’ and ‘abroad’ phones from a variety of locations, and retain high accuracy so long as the two countries are far apart. Accuracy drops to between 75 and 62% accuracy when both the sensor and target are in nearby European countries, where timing differences will be much smaller. Still better than random guessing, but no longer extremely reliable.
Next, the researchers pivot to distinguishing between multiple target locations in a single country - more challenging both because the response times will be much closer, and because they must now predict from among four or more options rather than a simple “domestic” and “abroad”. Accuracy varies between countries and the distances between target locations, but generally, the technique ranges between 63% and 98% accurate.
The rest of the paper has some auxiliary results, like how stable the classifier accuracy is over time as congestion patterns change, how different phones have slightly different SMS acknowledgement delays, and how well the classifier functions if the target individual travels between locations. There’s also some good discussion on the cause of errors in the classifier, and comparisons to other types of SMS attacks.
These results are impressive, but it’s important to remember that they are distinguishing only between subsets of 34 predefined locations. This study is a far cry from “enter any phone number and get a latitude and longitude,” but clearly there’s a lot of signal in the SMS acknowledgement delay times.
So what can be done to fix this privacy leak? Unfortunately, I don’t see any easy answers. Phones must return SMS acknowledgements, or we’d never know if a text message was delivered successfully. Without acknowledgements, if someone’s phone battery dies, or they put it in airplane mode, or lose service while driving through a tunnel, text messages to them would disappear into the void.
Phones could add a random delay before sending an acknowledgement - or the telecommunications provider could add such a delay on their end. This seems appealing, but the delay would have to be short - wait too long to send an acknowledgement, and the other phones will time out and report that the text failed to deliver. If you add a short delay, chosen from, say, a uniform or normal distribution, then sending several texts and taking the median delay will ‘de-noise’ the acknowledgement time.
Right now there are two prominent “defenses” against this kind of attack. The first is that it’s a complicated mess to pull off. To generalize from the controlled test in the paper to finding the geolocation of any phone would require more ‘sending’ phones, lots more receiving phones for calibration, and a ton of training data, not to mention a data scientist to build a classifier around that data. The second is that the attack is “loud:” texting a target repeatedly to measure response times will bombard them with text messages. This doesn’t prevent the attack from functioning, but at least the victim receives some indication that something weird is happening to them. There is a type of diagnostic SMS ping called a silent SMS that does not notify the user, but these diagnostic messages can only be sent by a phone company, and are intended for things like confirming reception between a cell phone and tower.
Overall, a great paper on a disturbing topic. I often find side-channel timing attacks intriguing; the researchers haven’t identified a ‘bug’ exactly, the phone network is functioning exactly as intended, but this is a highly undesired consequence of acknowledgement messages, and a perhaps unavoidable information leak if we’re going to provide acknowledgement at all.
A recent excellent paper performs a sophisticated natural language processing task, usually solved using complicated deep-learning neural networks, using a shockingly simple algorithm and gzip. This post will contextualize and explain that paper for non-computer scientists, or for those who do not follow news in NLP and machine learning.
Text Classification is a common task in natural language processing (NLP). Here’s an example setting:
Provided are several thousand example questions from Yahoo! Answers, pre-categorized into bins like ‘science questions,’ ‘health questions,’ and ‘history questions.’ Now, given an arbitrary new question from Yahoo! Answers, which category does it belong in?
This kind of categorization is easy for humans, and traditionally much more challenging for computers. NLP researchers have spent many years working on variations of this problem, and regularly host text classification competitions at NLP conferences. There are a few broad strategies to solving such a task.
One of the oldest computational tools for analyzing text is the Bag of Words model, which dates back to the 1950s. In this approach, we typically discard all punctuation and capitalization and common “stop words” like “the,” “a,” and “is” that convey only structural information. Then we count the number of unique words in a sample of text, and how many times each occur, then normalize by the total number of words.
For example, we may take the sentence “One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them” and reduce it to a bag of words:
We could then take another passage of text, reduce it to a bag of words, and compare the bags to see how similar the word distributions are, or whether certain words have much more prominence in one bag than another. There are many tools for performing this kind of distribution comparison, and many ways of handling awkward edge cases like words that only appear in one bag.
The limitations of bags of words are obvious - we’re destroying all the context! Language is much more than just a list of words and how often they appear: the order of words, and their co-occurrence, conveys lots of information, and even structural elements like stop words and punctuation convey some information, or we wouldn’t use them. A bag of words distills language down to something that basic statistics can wrestle with, but in so doing boils away much of the humanity.
Natural Language Processing has moved away from bags of words in favor of word embeddings. The goal here is to capture exactly that context of word co-occurrence that a bag of words destroys. For a simple example, let’s start with Asimov’s laws of robotics:
A robot may not injure a human being or, through inaction, allow a human being to come to harm. A robot must obey orders given it by human beings except where such orders would conflict with the First Law. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law
Removing punctuation and lower-casing all terms, we could construct a window of size two, encompassing the two words before and after each term as context:
|a, may, not, harm, must, obey, law, protect
|injure, a, being, or, allow, to, it, by, except
|must, obey, given, it, where, such, would, conflict
This gives us a small amount of context for each term. For example, we know that “orders” are things that can be “obeyed,” “given,” and may “conflict.” You can imagine that if we used a larger corpus for training, such as the complete text of English Wikipedia, we would get a lot more context for each word, and a much better sense of how frequently words appear in conjunction with one another.
Now let’s think of each word as a point in space. The word “robot” should appear in space close to other words that it frequently appears near, such as “harm,” “obey,” and “protect,” and should appear far away from words it never co-occurs with, such as “watermelon.” Implicitly, this means that “robot” will also appear relatively close to other words that share the same context - for example, while “robot” does not share context with “orders,” both “orders” and “robot” share context with “obey,” so the words “robot” and “orders” will not be too distant.
This mathematical space, where words are points with distance determined by co-occurrence and shared context, is called an embedding. The exact process for creating this embedding, including how many dimensions the space should use, how much context should be included, how points are initially projected into space, how words are tokenized, whether punctuation is included, and many finer details, vary between models. For more details on the training process, I recommend this Word Embedding tutorial from Dave Touretzky.
Once we have an embedding, we can ask a variety of questions, like word association: kitten is to cat as puppy is to X? Mathematically, we can draw a vector from kitten to cat, then transpose that vector to “puppy” and look for the closest point in the embedding to find “dog.” This works because “cat” and “dog” are in a similar region of the embedding, as they are both animals, and both pets. The words “kitten” and “puppy” will be close to their adult counterparts, and so also close to animal and pet associations, but will additionally be close to youth terms like “baby” and “infant”.
(Note that these embeddings can also contain undesired metadata: for example, “doctor” may be more closely associated with “man” than “woman”, and the inverse for “nurse”, if the training data used to create the embedding contains such a gender bias. Embeddings represent word adjacency and similar use in written text, and should not be mistaken for an understanding of language or a reflection of the true nature of the world.)
In addition to describing words as points in an embedding, we can now describe documents as a series of points, or as an average of those points. Given two documents, we can now calculate the average distance from points in one document to points in another document. Returning to the original problem of text classification, we can build categories of documents as clouds of points. For each new prompt, we can calculate its distance from each category, and place it in the closest category.
These embedding techniques allow us to build software that is impressively flexible: given an embedded representation of ‘context’ we can use vectors to categorize synonyms and associations, and build machines that appear to ‘understand’ and ‘reason’ about language much more than preceding Bag of Words models, simple approaches at representing context like Markov Chains, or attempts at formally parsing language and grammar. The trade-off is that these models are immensely complicated, and require enormous volumes of training data. Contemporary models like BERT have hundreds of millions of parameters, and can only be trained by corporations with vast resources like Google and IBM.
In modern Natural Language Processing, deep neural networks using word embeddings dominate. They produce the best results in a wide variety of tasks, from text classification to translation to prediction. While variations between NLP models are significant, the general consensus is that more parameters and more training data increase performance. This focuses most of the field on enormous models built by a handful of corporations, and has turned attention away from simpler or more easily understood techniques.
Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, and Jimmy Lin, did not use a deep neural network and a large embedding space. They did not use machine learning. They used gzip.
Their approach is simple: compression algorithms, like gzip, are very good at recognizing patterns and representing them succinctly. If two pieces of text are similar, such as sharing many words, or especially entire phrases, then compressing the two pieces of text together should be quite compact. If the two pieces of text have little in common, then their gzipped representation will be less compact.
Specifically, given texts
A is more similar to
C, then we can usually expect:
len(gzip(A+B)) - len(gzip(A)) < len(gzip(A+C)) - len(gzip(A))
So, given a series of pre-categorized texts in a training set, and given a series of uncategorized texts in a test set, the solution is clear: compress each test text along with each training text to find the ‘distance’ between the test text and each training text. Select the
k nearest neighbors, and find the most common category among them. Report this category as the predicted category for the test text.
Their complete algorithm is a fourteen line Python script:
import gzip import numpy as np for (x1,_) in test_set: Cx1 = len(gzip.compress(x1.encode())) distance_from_x1 =  for (x2,_) in training_set: Cx2 = len(gzip.compress(x2.encode()) x1x2 = " ".join([x1,x2]) Cx1x2 = len(gzip.compress(x1x2.encode()) ncd = (Cx1x2 - min(Cx1,Cx2)) / max (Cx1,Cx2) distance_from_x1.append(ncd) sorted_idx = np.argsort(np.array(distance_from_x1)) top_k_class = training_set[sorted_idx[:k], 1] predict_class = max(set(top_k_class), key = top_k_class.count)
Shockingly, this performs on par with most modern NLP classifiers: it performs better than many for lots of common English classification data sets, and on most data sets it performs above average. BERT has higher accuracy on every data set, but not by much. A fourteen line Python script with gzip in lieu of machine learning performs almost as well as Google’s enormous embedded deep learning neural network. (See table 3 in the original paper, page 5)
A more recent variant on the classification challenge is to classify text in a language not included in the training data. For example, if we expended enormous resources training BERT on English text, is there any way to pivot that training and apply that knowledge to Swahili? Can we use embeddings from several languages to get some general cross-language fluency at text categorization in other languages? Or, if we do need to re-train, how little training data can we get away with to re-calibrate our embeddings and function on a new language? This is unsurprisingly a very difficult task. The gzip classifier outperformed all contemporary machine learning approaches that the authors compared to. (See table 5 in the original paper, page 6)
This paper is a great reminder that more complicated tools, like ever-larger machine-learning models, are not always better. In particular, I think their approach hits upon an interesting balance regarding complexity. Bag of words models discard context and punctuation, making computation simple, but at the cost of destroying invaluable information. However, keeping all of this information in the form of an embedding, and attempting to parse human language, incurs a heavy complexity cost. There’s a lot of “fluff” in language that we do not necessarily need for classification. The gzip approach keeps the extra context of word order and punctuation, but does not try to tackle the harder problem of understanding language in order to address the simpler problem of looking for similarities. In general, tools should be as simple as possible to complete their task, but no simpler.
It appears that the authors have a made an unusual choice in their accuracy calculations, which inflate their scores compared to contemporary techniques. In summary, they use a kNN classifier with
k=2, but rather than choosing a tie-breaking metric for when the two neighbors diverge, they mark their algorithm as correct if either neighbor has the correct label. This effectively makes their accuracy a “top 2” classifier rather than a kNN classifier, which misrepresents the performance of the algorithm. This isn’t necessarily an invalid way to measure accuracy, but it does need to be documented, and isn’t what we’d expect in traditional kNN. The gzip scores under a standard k=2 kNN remain impressive for such a simple approach and are still competitive - but they’re no longer beating deep neural network classifiers for non-English news datasets (table 5).
Here’s the problem in a little more detail:
The authors compress all training texts along with the test prompt, to find the gzip distance between the prompt and each possible category example
Rather than choosing the closest example and assuming the categories match, the authors choose the
k closest examples, take the mode of their categories, and predict that. This k-nearest-neighbors (kNN) strategy is common in machine learning, and protects against outliers
When the number of categories among neighbors are tied, one must have a tie-breaking strategy. A common choice is to pick the tie that is a closer neighbor. Another choice might be to expand the neighborhood, considering one additional neighbor until the tie is broken - or inversely, to shrink the neighborhood, using a smaller k until the tie is broken. Yet another choice might be to randomly choose one of the tied categories.
The authors use
k=2, meaning that they examine the two closest neighbors, which will either be of the same category, or will be a tie. Since they will encounter many ties, their choice of tie-breaking algorithm is very important
In the event of a tie between two neighbors, the authors report success if either neighbor has the correct label
The code in question can be found here. Further analysis by someone attempting to reproduce the results of the paper can be found here and is discussed in this GitHub issue. In conversations with the author this appears to be an intentional choice - but unfortunately it’s one that makes the gzip classifier appear to outperform BERT, when in reality if it gives you two potentially correct classes and you always pick the correct choice you will outperform BERT.
I’ve heard the first author is defending their thesis tomorrow - Congratulations, good luck, hope it goes great!