Image Dithering in Color!

Posted 1/17/2023

In my last post I demonstrated how to perform image dithering to convert color images to black and white. The process converts each pixel to either black or white (whichever is closer), records the amount of “error,” or the difference between the original luminosity and the new black/white value, and propagates this error to adjoining pixels to brighten or darken them in compensation. This introduces local error (some pixels are set to white when their original value is closer to black, and vice versa), but lowers global error, producing an image that appears much closer to the original.

I’m still playing with dithering, so in this post I will extend the idea to color images. Reducing the number of colors in an image used to be a common task: while digital cameras may be able to record photos with millions of unique colors, computers throughout the 90s often ran in “256 color” mode, where they could only display a small range of colors at once. This reduces the memory footprint of images significantly, since you only need 8 bits per pixel rather than 24 to represent their color. Some image compression algorithms still use palette compression today, announcing a palette of colors for a region of the image, then listing an 8- or 16-bit palette index for each pixel in the region rather than a full 24-bit color value.
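As an aside, Pillow can already do this sort of palette reduction for you; a minimal sketch using its quantize method (the filenames are the same ones used later in this post):

from PIL import Image

# Reduce the image to an adaptive 32-color palette; quantize() picks
# the palette and, by default, also dithers the result
img = Image.open("bridge.png").convert("RGB")
img.quantize(colors=32).save("bridge_pil_32.png")

We'll build the same pipeline by hand below, which makes the mechanics much easier to see.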

Reducing a full color image to a limited palette presents a similar challenge to black-and-white image dithering: how do we choose what palette color to use for each pixel, and how do we avoid harsh color banding?

We’ll start with a photo of a hiking trail featuring a range of greens, browns, and whites:

Photo of a snowy hiking trail

Let’s reduce this to a harsh palette of 32 colors. First, we need to generate such a palette:

#!/usr/bin/env python3
from PIL import Image
import numpy as np

def getPalette(palette_size=32):
    colors = []
    values = np.linspace(0, 0xFFFFFF, palette_size, dtype=int)
    for val in values:
        r = val >> 16
        g = (val & 0x00FF00) >> 8
        b = val & 0x0000FF
        colors.append((r,g,b))
    return colors

I don’t know much color theory, so this is far from an “ideal” spread of colors. However, it gives us 32 equally spaced values on the numeric range 0x000000 to 0xFFFFFF, which we can convert to RGB triples. We can think of color as a three-dimensional space, where the X, Y, and Z axes represent red, green, and blue. This lets us visualize our color palette as follows:

import matplotlib.pyplot as plt

def plotPalette(palette):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(111, projection='3d')
    r = []
    g = []
    b = []
    c = []
    for color in palette:
        r.append(color[0])
        g.append(color[1])
        b.append(color[2])
        c.append("#%02x%02x%02x" % color)
    ax.scatter(r, g, b, c=c, marker='o', depthshade=False)
    ax.invert_xaxis()
    ax.set_xlabel('Red')
    ax.set_ylabel('Green')
    ax.set_zlabel('Blue')
    plt.show()

Which looks something like:

32 colors represented in 3-space on a scatterplot

Just as in black-and-white image conversion, we can take each pixel and round it to the closest available color - but instead of two colors in our palette, we now have 32. Here’s a simple (and highly inefficient) conversion:

# Returns the closest rgb value on the palette, as (red,green,blue)
def getClosest(color, palette):
    # Cast to plain ints so the subtractions below can't wrap around
    # when the pixel values come from a uint8 numpy array
    (r,g,b) = (int(color[0]), int(color[1]), int(color[2]))
    closest = None #(color, distance)
    for p in palette:
        # True distance would be sqrt of the sum of squared differences, but
        # we only care about relative distance, so it's faster to skip the sqrt
        distance = (r-p[0])**2 + (g-p[1])**2 + (b-p[2])**2
        if( closest is None or distance < closest[1] ):
            closest = (p,distance)
    return closest[0]

def reduceNoDither(img, palette, filename):
    pixels = np.array(img)
    for y,row in enumerate(pixels):
        for x,col in enumerate(row):
            pixels[y,x] = getClosest(pixels[y,x], palette)
    reduced = Image.fromarray(pixels)
    reduced.save(filename)

img = Image.open("bridge.png").convert("RGB")
palette = getPalette()
reduceNoDither(img, palette, "bridge_32.png")

The results are predictably messy:

Hiking trail rendered in 32 colors by closest color conversion

Our palette only contains four colors close to brown, and most are far too red. If we convert each pixel to the closest color on the palette, we massively over-emphasize red, drowning out our greens and yellows.

Dithering to the rescue! Where before we had an integer error for each pixel (representing how much we’d over or under-brightened the pixel when we rounded it to black/white), we now have an error vector, representing how much we’ve over or under emphasized red, green, and blue in our rounding.

As before, we can apply Atkinson dithering, with the twist of applying a vector error to three dimensional color points:

# Channel indices within each pixel
RED, GREEN, BLUE = 0, 1, 2

# Returns an error vector (delta red, delta green, delta blue)
def getError(oldcolor, newcolor):
    dr = int(oldcolor[0]) - newcolor[0]
    dg = int(oldcolor[1]) - newcolor[1]
    db = int(oldcolor[2]) - newcolor[2]
    return (dr, dg, db)

def applyError(pixels, y, x, error, factor):
    if( y >= pixels.shape[0] or x >= pixels.shape[1] or x < 0 ):
        return # Don't run off the edges of the image
    er = error[0] * factor
    eg = error[1] * factor
    eb = error[2] * factor
    pixels[y,x,RED] += er
    pixels[y,x,GREEN] += eg
    pixels[y,x,BLUE] += eb

def ditherAtkinson(img, palette, filename):
    # Use a signed type so accumulated error can push values below 0 or above 255
    pixels = np.array(img, dtype=int)
    for y,row in enumerate(pixels):
        for x,col in enumerate(row):
            old = pixels[y,x] # Returns a reference into the array
            new = getClosest(old, palette)
            quant_error = getError(old, new)
            pixels[y,x] = new
            applyError(pixels, y,   x+1, quant_error, 1/8)
            applyError(pixels, y,   x+2, quant_error, 1/8)
            applyError(pixels, y+1, x+1, quant_error, 1/8)
            applyError(pixels, y+1, x,   quant_error, 1/8)
            applyError(pixels, y+1, x-1, quant_error, 1/8)
            applyError(pixels, y+2, x,   quant_error, 1/8)
    dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
    dithered.save(filename)

Aaaaaand presto!

Forest trail put through colored Atkinson dithering, looks closer to a correct shade of brown, but has blue flecks of snow on close inspection

It’s far from perfect, but our dithered black and white images were facsimiles of their greyscale counterparts, too. Pretty good for only 32 colors! The image no longer appears too red, and the green pine needles stand out better. Interestingly, the dithered image now appears flecked with blue, with a blue glow in the shadows. This is especially striking on my old Linux laptop, but is more subtle on a newer screen with a better color profile, so your mileage may vary.

We might expect the image to be slightly blue-tinged, both because reducing red values will make green and blue stand out, and because we are using an extremely limited color palette. However, the human eye is also better at picking up some colors than others, so perhaps these blue changes stand out disproportionately. We can try compensating by propagating only one third of the blue error.
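Only the error-propagation step needs to change for this experiment; a minimal sketch of the adjusted applyError (the one-third factor is just the arbitrary damping constant discussed below):

def applyError(pixels, y, x, error, factor):
    if( y >= pixels.shape[0] or x >= pixels.shape[1] or x < 0 ):
        return # Don't run off the edges of the image
    er = error[0] * factor
    eg = error[1] * factor
    eb = error[2] * factor / 3 # Dampen the blue error to one third
    pixels[y,x,RED] += er
    pixels[y,x,GREEN] += eg
    pixels[y,x,BLUE] += eb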

Forest trail put through colored Atkinson dithering, now with far fewer blue flecks

That’s an arbitrary and unscientific compensation factor, but it’s removed the blue tint from the shadows in the image, and reduced the number of blue “snow” effects, suggesting there’s some merit to per-channel tuning. Here’s a side-by-side comparison of the original, palette reduction, and each dithering approach:

Side by side of four images from earlier in the post

Especially at a smaller resolution, we can do a pretty good approximation with a color selection no wider than a big box of crayons. Cool!


Image Dithering

Posted 1/16/2023

Dithering means intentionally adding noise to a signal to reduce large artifacts like color banding. A classic example is reducing a color image to black and white. Take this magnificent photo of my neighbor’s cat:

Kacie asking for a bellyrub, in color

To trivially convert this image to black and white we can take each pixel, decide which color it’s closest to, and set it to that:

#!/usr/bin/env python3
from PIL import Image
import numpy as np

# Load image as grayscale
img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img)
for y, row in enumerate(pixels):
    for x,col in enumerate(row):
        if( pixels[y,x] >= 127 ):
            pixels[y,x] = 255
        else:
            pixels[y,x] = 0
bw = Image.fromarray(pixels)
bw.save("kacie_bw.png")

But the result is not very satisfying:

Kacie in black and white, looks like a white cloud

The cat is white. Every pixel will be closer to white than black, and we lose the whole cat except the eyes and nose, along with most of the background detail. But we can do better! What if we set the density of black pixels based on the brightness of a region? That is, black regions will receive all black pixels, white regions all white, but something that should be a mid-gray will get closer to a checkerboard of black and white pixels to approximate the correct brightness.

One particularly satisfying way to approach this regional checkerboarding is called error diffusion. For every pixel, when we set it to black or white, we record how far off the original color is from the new one. Then we adjust the color of the adjacent pixels based on this error. For example, if we set a gray pixel to black, then we record that we’ve made an error by making this pixel darker than it should be, and we’ll brighten the surrounding pixels we haven’t evaluated yet to make them more likely to be set to white. Similarly, if we round a gray pixel up to white, then we darken the nearby pixels to make them more likely to be rounded down to black.

In Floyd-Steinberg dithering we process pixels left to right, top to bottom, and propagate the error of each pixel to its neighbors with the following distribution:
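        *      7/16
 3/16   5/16   1/16

(The asterisk marks the pixel we just rounded; each fraction is the share of its error handed to that neighbor.)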

That is, pass on 7/16 of the error to the pixel right of the one we’re examining. Pass on 5/16 of the error to the pixel below, and a little to the two diagonals we haven’t examined yet. We can implement Floyd-Steinberg dithering as follows:

def getClosest(color):
    if( color >= 127 ):
        return 255 # White
    return 0 # Black

def setAdjacent(pixels, y, x, error):
    (rows,cols) = pixels.shape[0:2]
    if( y >= rows or x >= cols or x < 0 ):
        return # Don't run past the edges of the image
    pixels[y,x] += error

# Load image as grayscale, using a signed type so error arithmetic
# can go below 0 or above 255 without wrapping around
img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img, dtype=int)
for y,row in enumerate(pixels):
    for x,col in enumerate(row):
        old = pixels[y,x]
        new = getClosest(old)
        pixels[y,x] = new
        quant_error = old - new
        setAdjacent(pixels, y,   x+1, quant_error*(7/16))
        setAdjacent(pixels, y+1, x-1, quant_error*(3/16))
        setAdjacent(pixels, y+1, x,   quant_error*(5/16))
        setAdjacent(pixels, y+1, x+1, quant_error*(1/16))
dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
dithered.save("kacie_dithered_fs.png")

The results are a stunning improvement:

Kacie in black and white, dithered to maintain maximum detail, but with snow artifacts

We’ve got the whole cat, ruffles on her fur, the asphalt and wood chips, details on rocks, gradients within shadows, the works! But what are those big black flecks across the cat’s fur? These flecks of “snow” impact the whole image, but they don’t stand out much on the background where we alternate between black and white pixels frequently. On the cat, even small errors setting near-white fur to white pixels build up, and we periodically set a clump of pixels to black.

We can try to reduce this snow by fiddling with the error propagation matrix. Rather than passing all of the error on to adjacent pixels, and mostly to the pixel to the right and below, what if we ‘discount’ the error, only passing on 75% of it? This is the diffusion matrix used in Atkinson dithering:
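        *    1/8   1/8
 1/8   1/8   1/8
       1/8

(The asterisk again marks the current pixel; six neighbors each receive 1/8 of the error, so only 6/8 of it is passed on in total.)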

The code hardly needs a change:

img = Image.open("kacie_color.png").convert("L")
pixels = np.array(img, dtype=int) # Signed type again, so error can go negative
for y,row in enumerate(pixels):
    for x,col in enumerate(row):
        old = pixels[y,x]
        new = getClosest(old)
        pixels[y,x] = new
        quant_error = old - new
        setAdjacent(pixels, y,   x+1, quant_error*(1/8))
        setAdjacent(pixels, y,   x+2, quant_error*(1/8))
        setAdjacent(pixels, y+1, x+1, quant_error*(1/8))
        setAdjacent(pixels, y+1, x,   quant_error*(1/8))
        setAdjacent(pixels, y+1, x-1, quant_error*(1/8))
        setAdjacent(pixels, y+2, x,   quant_error*(1/8))
dithered = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
dithered.save("kacie_dithered_at.png")

And the snow vanishes:

Kacie in black and white, dithered to minimize snow, with some loss of detail in bright and dark regions

This is a lot more pleasing to the eye, but it’s important to note that the change isn’t free: if you look closely, we’ve lost some detail on the cat’s fur, particularly where the edges of her legs and tail have been ‘washed out.’ After all, we’re now ignoring some of the error caused by our black and white conversion, so we’re no longer compensating for all our mistakes in nearby pixels. This is most noticeable in bright and dark areas where the errors are small.

Closing Thoughts

I really like this idea of adding noise and propagating errors to reduce overall error. It’s a little counter-intuitive: by artificially brightening or darkening a pixel, we’re making an objectively worse local choice when converting it to black or white. Globally, however, this preserves much more of the original structure and detail. This type of error diffusion is most often used in digital signal processing of images, video, and audio, but I am curious whether it has good applications in more distant domains.

If you enjoyed this post and want to read more about mucking with images and color, you may enjoy reading my post on color filter array forensics.


SQL for Scientists

Posted 12/03/2022

My lab group recently asked me to give a tutorial on using SQL databases in science. While we are all complex systems scientists, my background is in computer science and STS, and many of my colleagues come from physics, mathematics, and philosophy, so we learn a great deal from one another. I’ve turned my slides into a blog post here, like my last lab talk on using Git for scientific software development.

What is a database?

A database is a piece of software for storing and organizing your data. Most importantly, databases make it easy to query your data, asking for subsets of your data that match a specific pattern you are interested in.

If you currently store your data in formats like CSV or JSON, and write lots of code for reading this data and searching through it for pieces relevant to your research question, our goal will be to offload all of this logic from your own code to a database. It will run much faster, it will be faster to write, and it will help you avoid bugs while expressing complicated questions simply.

There are many types of databases, but for this post I’ll split them along two axes: do they run locally (as part of your research code, storing data in a single file) or remotely (running as an independent process you speak to over the network), and does the database use SQL (a language for expressing sophisticated data queries) or not. Here’s a small subset of databases along these axes:

          Local                           Remote
SQL       SQLite, DuckDB                  PostgreSQL, MySQL, MariaDB, MSSQL, …
NoSQL     Pandas (sorta), BerkeleyDB, …   Redis, MongoDB, Firebase, …

In this post I’ll be focusing on SQLite and Postgresql as examples. I’ll briefly talk about NoSQL databases at the end, and the scenarios where they might be preferable to SQL databases.

SQLite

SQLite stores all data in one file on your hard drive. SQLite is a library, so the database software runs inside of the software you write. It is trivial to set up, pretty fast (especially for queries), and has most database features we will want.

Critically, SQLite is ill-suited to concurrency. Since SQLite runs inside of your software, two different Python scripts can easily try to write to the same database file at the same time, risking catastrophic data corruption. You can build sophisticated locking mechanisms to ensure only one program accesses a database at once, but this adds a serious performance bottleneck. SQLite is really intended for a single piece of software to store data, not live setups where several applications write data at the same time.

Postgresql

Postgres runs as a software daemon; it runs all the time, storing data in a series of files and caches that it manages. Whether postgres is running on your own computer or another computer, your research software will communicate with postgresql over the network.

This difference in design means that postgres requires some additional bureaucracy to set up: users and passwords, databases and permissions and authentication. In return however, postgres is even faster than SQLite, and handles concurrent access from many applications trivially. Postgres also has a number of advanced features that are unavailable in SQLite.

Relational Databases with SQL

Relational databases store data in tables (think spreadsheets), and in the relationships between tables.

userid          firstname   lastname    status
zerocool        Dade        Murphy      Undergraduate
acidburn        Kate        Libby       Undergraduate
joey            Joey        Pardella    Undergraduate
cerealkiller    Emmanuel    Goldstein   Undergraduate
phantomphreak   Ramon       Sanchez     Undergraduate
lord_nikon      Paul        Cook        Graduate

building   room   desk   userid
Sage       113    1      zerocool
Sage       113    2      acidburn
Perkins    208    7      joey
West       302    4      lord_nikon

You request data from a table using a SELECT statement of the form SELECT columns FROM table WHERE row matches condition. For example:

SELECT userid FROM desks WHERE building='Innovation' AND room=413;

You can also combine multiple tables during a SELECT to gather related information. Here we fetch the names of all graduate students with a desk assigned to them, by selecting rows from the desks table and combining them with matching entries from the users table where the user IDs of both rows match:

SELECT firstname,lastname FROM desks
        LEFT JOIN users ON desks.userid=users.userid
        WHERE status='Graduate';

The following are the main commands for interacting with a SQL database to create relations, add, remove, and update information in relations, and select information from relations:

Command   Description
SELECT    Return some columns from a relation
INSERT    Add data to a relation
DELETE    Remove data from a relation
UPDATE    Modify data in a relation
CREATE    Create a new relation (table, view, index)
DROP      Remove a relation
EXPLAIN   Show how a query will access data

So that’s all fine in theory, but how do we write software that actually uses a database?

Connecting to SQLite

To connect to a SQLite database from Python we create a connection by supplying a database filename, open a cursor from the connection, then use the cursor to send a query and get back a list of result rows.

import sqlite3

conn = sqlite3.connect("university.db")
c = conn.cursor()
c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
for (f,l,b,r) in c.fetchall():
        print("%s %s has a desk in %s %d" % (f,l,b,r))
conn.commit() # Save any CREATE/INSERT changes to the database
conn.close()

You can think of the cursor as a finger tracking your position in the database. Multiple cursors allow us to make multiple queries from the same database and track which results were associated with which request.

Connecting to Postgresql

Interacting with Postgres is similar to SQLite: we connect to the database, then open a cursor from the connection, and use the cursor to send queries and get results. However, Postgres is a daemon accessible over the network, so we’ll need to supply a hostname and port number where the SQL server can be found, the name of the database we want to reach, and a username and password authorized to connect to that database.

import psycopg2

conn = None
try:
        conn = psycopg2.connect(host="127.0.0.1", port="5432",
                user="registrar", password="hunter2",
                database="university_users")
        c = conn.cursor()
        c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
        for (f,l,b,r) in c.fetchall():
                print("%s %s has a desk in %s %d" % (f,l,b,r))
except (Exception, psycopg2.Error) as error:
        print("Error while connecting to PostgreSQL", error)
finally:
        if( conn ):
                conn.commit()
                conn.close()

Parameterized Queries

Often your SQL statements will depend on other variables, and can’t be written as constant strings ahead of time. It may be tempting to assemble the SQL statement using string concatenation to insert variables. Never do this.

Consider the following example:

c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '" + name + "'")
matches = c.fetchall()

Given a student’s last name, look up all students with that name. You might find functionality like this on your university’s student directory. But what if a user enters input like ' OR 'a'='a? The query now reads:

SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR 'a'='a'

While a little clunky, this will return every user in the database. Worse yet, a malicious user might construct a query like:

SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR password LIKE 'A%'

This would get them a list of all users whose password hashes start with ‘A’, then another query for ‘AA’, ‘AB’, and slowly an attacker can reconstruct the password hashes of every user at the university. This kind of attack is called a SQL Injection, and is a common vulnerability in websites. While scientific code is less likely to be directly attacked than a website, if you’re working with real-world data, especially web-scraped or user-gathered data, there can be all kinds of garbage in your input.

To avoid this vulnerability you can write your query first with placeholders for parameters, then tell SQL to complete the statement on your behalf. In SQLite this looks like:

c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '?'", [name])

Or in Postgresql:

c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '%s'", [name])

In either case, the SQL engine will properly escape any text in the name field, ensuring that it’s interpreted as a string, and never as a SQL statement.

Constraints

Real-world data is messy. Maybe you assume that every office at a school is assigned to an employee or graduate student, but some were assigned to recent graduates or retirees and haven’t been re-assigned. Maybe you assume that all students have last names, but some international students come from cultures that use mononyms, and the last name field is empty. If you don’t check these underlying assumptions, you might not learn that you’ve made a mistake until hours of debugging later! Fortunately, SQL provides an easy way to describe and enforce your assumptions about data through constraints.

CREATE TABLE users(
        userid TEXT PRIMARY KEY,
        firstname TEXT NOT NULL,
        lastname TEXT,
        status TEXT NOT NULL
);

This definition of the user table includes four text fields, three of which cannot be empty. Further, the userid field must be unique: you can have two students with the same first and last name, but they must have different usernames. We can add more detailed restrictions to the desk-assignment table:

CREATE TABLE desks(
        building TEXT NOT NULL,
        room INT NOT NULL,
        desk INT NOT NULL,
        userid TEXT,
        FOREIGN KEY(userid) REFERENCES users(userid),
        UNIQUE(building,room,desk)
);

Here, we’ve explicitly said that the userid field must match some user ID in the users table. We’ve also said that while there can be multiple rooms in a building, and multiple desks in a room, there cannot be multiple desk 4’s in room 112 of Sage hall: the combination of building name, room number, and desk number must be unique.

If we try to insert any data into these tables that violates the described constraints, SQL will throw an exception instead of adding the new rows. Like unit testing but for your input data, constraints can help you be confident that your data follows the logic you think it does.
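For example, using the tables above from Python with SQLite, an INSERT that references a nonexistent user raises an exception rather than silently storing bad data. A minimal sketch ('crashoverride' is a made-up user ID here, and note that SQLite only enforces foreign keys once the pragma is switched on):

import sqlite3

conn = sqlite3.connect("university.db")
conn.execute("PRAGMA foreign_keys = ON") # SQLite leaves FK enforcement off by default
c = conn.cursor()
try:
    # 'crashoverride' does not exist in the users table, so the
    # FOREIGN KEY constraint rejects the row
    c.execute("INSERT INTO desks(building, room, desk, userid) VALUES (?, ?, ?, ?)",
              ("Sage", 113, 3, "crashoverride"))
except sqlite3.IntegrityError as e:
    print("Constraint violation:", e)
conn.close()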

Indices, or how to make your database super fast

Compared to parsing CSV or JSON in Python and searching for the data you want, SQL will run inconceivably fast. But if you’re storing several gigabytes or more in your tables, even SQL databases will slow down. With a little bit of forethought we can make SQL queries run much faster.

Let’s say you have all your email stored in a database, with a structure something like:

CREATE TABLE emails(
        msgid TEXT PRIMARY KEY,
        from_address TEXT NOT NULL,
        to_address TEXT NOT NULL,
        subject TEXT NOT NULL,
        sent_date INT NOT NULL
);

If we want to search for all emails received in the last week then SQL needs to search through every email in the table to check their sent dates. This is obviously highly inefficient, but we can warn SQL that we’ll be making these kinds of queries:

CREATE INDEX time_index ON emails(sent_date);

Creating an index tells SQL to build a search tree (typically a B-tree) keyed on sent_date, reducing those lookups from O(n) to O(log n) and dramatically improving performance. We can also build indices on multiple columns at once:

CREATE INDEX from_time_index ON emails(from_address,sent_date);

And now we can look up emails from a particular user in a particular time window in O(log n) - even better! Both SQLite and Postgresql will automatically create indices for primary keys and unique constraints, since they’ll need to perform lookups during every new insert to make sure the constraints aren’t violated. You’ll often be selecting data based on unique characteristics too, so in practice it isn’t always necessary to declare indices explicitly.
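If you're ever unsure whether a query is actually using one of your indices, the EXPLAIN command from the table above will tell you. A minimal SQLite sketch (the database filename and query values here are placeholders):

import sqlite3

conn = sqlite3.connect("email.db")
c = conn.cursor()
# Ask SQLite how it plans to run the query; with from_time_index in place
# it should report a search using that index rather than a full table scan
c.execute("""EXPLAIN QUERY PLAN
             SELECT subject FROM emails
             WHERE from_address = ? AND sent_date > ?""",
          ("alice@example.com", 1672531200))
for row in c.fetchall():
    print(row)
conn.close()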

Grouping and Counting

Many SQL functions aggregate information from multiple rows. For example, we can count the number of users with:

SELECT COUNT(*) FROM users;

There are a variety of aggregate functions, including AVG, MAX, MIN, and SUM.

Often we don’t want to apply aggregate functions to every row, but to a sub-group of rows. Imagine we have a table of course registrations, like:

CREATE TABLE course_registration(
        userid TEXT,
        coursecode INT,
        credits INT,
        FOREIGN KEY(userid) REFERENCES users(userid),
        UNIQUE(userid,coursecode)
);

To ask how many credits each student is registered for we might query:

SELECT userid,SUM(credits) FROM course_registration GROUP BY userid;

The GROUP BY clause clusters rows based on their userid, then runs the aggregate function on each group rather than on all rows. We could also list the students in descending order by credit count like:

SELECT userid,SUM(credits) AS total_credits FROM course_registration GROUP BY userid ORDER BY total_credits DESC;

Pandas Integration

Pandas is a ubiquitous Python package in data science. It makes it easy to store a table or a sequence of values as a Python object and perform some data analysis. It also integrates well with Seaborn, a package for statistical data visualization built on top of matplotlib. The two also integrate well with SQL. In just a couple lines, we can plot a histogram of how many credits students have registered for, from SQL to Pandas to Seaborn:

import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

conn = sqlite3.connect("registration.db")
df = pd.read_sql("SELECT userid,SUM(credits) AS registered_credits FROM course_registration GROUP BY userid", conn)
sns.histplot(data=df, x="registered_credits")
plt.title("Credit load across student body")
plt.savefig("credit_load.pdf", bbox_inches="tight")
conn.close()

Pandas will run a query for us (against SQLite, Postgresql, or a variety of other database types), put the result in a table with appropriate column names, and hand it off to Seaborn, which understands those same column names. Data analysis made easy!

Limitations of SQL, and when to use other tools

For all the awesomeness of SQL, there are some tasks it is ill-suited to. If all you need is a way to store a dictionary so it persists, and make that dictionary accessible to multiple programs, then SQL is way more complexity and overhead than you need. Redis fills this niche well, and is simple and fast as long as you use it for this purpose.

If you have an enormous amount of data, terabytes worth, and need to update that data continuously, then SQL is a poor fit: every SQL server can only make one change to a table at a time and will have some kind of internal locking mechanism to prevent multiple writes from conflicting. This would be disastrous if, for example, you wanted to store all tweets in a database and need to save data as millions of users tweet at once. Here, tools like MongoDB step up, offering multiple database “shards” that will periodically sync with one another. This setup offers “eventual consistency” where a new tweet might not be available to all users right away, but things propagate pretty quickly, and in return we can handle huge numbers of updates at once.

More generally, SQL is a poor choice for:

  • Storing large files: maybe store metadata about files in a table, along with a pathname to where the file can be found on disk?

  • Storing unstructured data: you need to know what your rows and columns will be to put information in a spreadsheet. If your data is not uniform enough to describe in this way, then a spreadsheet is inappropriate.

  • Storing arbitrarily structured or nested data: if your data comes in the form of deeply nested JSON or XML, then a spreadsheet may be a poor choice. This is not always the case: if you have some nested JSON representing a tree of comments and replies on a website, then you may be able to “flatten” the tree by making each comment into a unique row, and including a “parent commentID” as a column. However, if different levels of nesting can have a wide variety of tags and meanings, this conversion may not always make sense. If you find that you’re storing a blob of JSON as a column in your table, then a table may not be the best representation for you.

For very specific types of data, like GIS, or network/graph data, there are specialized databases that may offer more task-appropriate tools than SQL.

Conclusion

SQL databases are an invaluable tool for data science. They allow researchers to organize a wide variety of data so that it can be easily and quickly queried to identify patterns and answer questions. SQL can simplify your code, preclude nasty bugs via constraints, and integrate nicely with most programming languages, especially with common data science software packages.

This post offers a brief tutorial on using SQL, but there is an enormous depth available to accomplish more complicated tasks. In particular, I have left out:

  • Views and Materialized Views

  • Subqueries and Common Table Expressions

  • Much more detail on joins, unions, grouping, and partitioning

  • Functions, stored procedures, and triggers

Hopefully this gives you enough of a foundation to start using SQL in scientific contexts, and look up more details as you need them.


Torrent Health Monitoring

Posted 8/21/2022

Distributed Denial of Secrets publishes most of our datasets via torrent. This minimizes infrastructural requirements for us: every time someone downloads a release, if they leave their torrent client running, they help us upload to other interested people. Once many people have mirrored our release it can remain available even if we stop seeding, completely self-hosted by the public. This is ideal, because with our budget we’re unable to maintain seed boxes for every release simultaneously; we keep offline backups, but virtual machine storage is more expensive.

This system typically works well, especially for rapidly distributing new releases. However, occasionally an older release will become unavailable, either because interest has waned and seeds have dropped offline, or because the trackers used by the torrent are no longer functional. If someone reports that a torrent is unavailable then we can pull the data from our backups and resume seeding, and issue a new magnet link containing an updated list of trackers. Unfortunately, that’s reactive, slow, and tedious. How can we proactively monitor the availability of all our torrents, to notify us when one requires attention?

Specifically, we want to build a dashboard that displays a list of torrents, and for each indicates how many trackers are online, how many peers those trackers are aware of, and how many peers can be found in the distributed hash table (DHT). It should track this information over the course of a day, a week, and a month, so we can distinguish between short-term and permanent loss of availability.

Every torrent client has the functionality to locate peers through trackers, and most modern clients can also find peers through the DHT. However, most clients do not provide a way to use that functionality without starting a download for the torrent, nor do they provide a way to export that peer information so we can plot availability over time. There are a few libraries for handling torrents, like libtorrent, but these also don’t easily expose peer-discovery independently from downloading. Fortunately, there are libraries for performing bittorrent DHT lookups, so our primary hurdle is implementing the client side of the bittorrent tracker protocol, described in BEP 0003, BEP 0015, and BEP 0023.

How do torrent trackers work?

Torrent trackers are conceptually simple:

  • A torrent or magnet link contains a list of trackers

  • Any client interested in downloading the torrent data contacts each tracker

  • The client announces the hash of the torrent they’re interested in, registering their interest with the tracker

  • The tracker returns a list of any other IP addresses that have recently registered interest in the same content

  • The client periodically re-registers its interest with the tracker, to identify any new peers, and ensure it remains discoverable to others

From there the client contacts each discovered peer directly, and negotiates a download. Since we’re only interested in peer discovery, we don’t have to follow along further than this.

Clients can communicate with trackers using two protocols: older trackers communicate using HTTP, but far more common is the newer, simpler, faster UDP-based protocol. In both protocols, clients can make announce requests, which announce their interest in a torrent, and scrape requests, which fetch some aggregated metadata about the number of clients interested in a torrent.

Unfortunately, scrape requests have little utility for our purposes: If one tracker says that it knows 7 peers, and another tracker says it knows 3, how many peers are there? 7? 10? Somewhere in-between? We can’t aggregate information across trackers without fetching the list of peer IP addresses from each tracker, which requires using an announce request.

The tracker HTTP API

The tracker HTTP protocol is deceptively simple. A tracker URL looks something like http://tracker.opentrackr.org:1337/announce. This contains the domain name of the tracker, the port number, and the resource for the request (typically “announce”). To send a request, the client adds several fields:

Field        Description
info_hash    A URL-encoded version of the torrent's 20-byte SHA-1 info hash
peer_id      A random string uniquely identifying the client
port         The port number on which the client can be reached
uploaded     The number of bytes the client has already uploaded
downloaded   The number of bytes the client has downloaded
left         How many bytes the client still needs to download

Therefore a full request to a tracker may look something like:

http://tracker.opentrackr.org:1337/announce?info_hash=%5Bg%03%95%28%0A%3F%3F**%0A%CFs%D4K%2C%CE%0F%E1%AE&peer_id=foo&port=1234&uploaded=0&downloaded=0&left=0

Note that the uploaded, downloaded, and left fields are required, but are only hints. If the client is downloading a magnet link, it may not know how large the torrent data is, and therefore how much is left to download. This self-reported metadata isn’t verified in any way, the tracker just uses it to report some analytics.
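Building that request in Python is mostly a matter of percent-encoding the raw 20-byte info hash; a minimal sketch (the peer ID is a stand-in, and the info hash is the one from the example URL above):

from urllib.parse import urlencode
import urllib.request

def announce_url(tracker, info_hash, peer_id, port=6881):
    # info_hash and peer_id are raw bytes; urlencode percent-escapes them for us
    params = {
        "info_hash": info_hash,
        "peer_id": peer_id,
        "port": port,
        "uploaded": 0,
        "downloaded": 0,
        "left": 0,
    }
    return tracker + "?" + urlencode(params)

url = announce_url("http://tracker.opentrackr.org:1337/announce",
                   bytes.fromhex("5b670395280a3f3f2a2a0acf73d44b2cce0fe1ae"),
                   b"-XX0001-000000000000")
response = urllib.request.urlopen(url, timeout=10).read() # bencoded bytes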

Once the client makes an announce request to a tracker, the tracker responds with either an HTTP error, or with a text-encoded dictionary describing available peer data for the torrent. Great, so does the tracker respond with some JSON? XML? YAML? No, it responds with Bencode! This is a custom text-encoding scheme made for bittorrent metadata that can encode:

Field type     Encoding rule                                                      Example
integers       Prefix with an i, then the integer in ascii base-10, then an e     7 becomes i7e
bytestrings    Length-prefixed, then a colon, then the string                     “foo” becomes 3:foo
lists          Start with an l, then the contents of the list, then an e          [2,3] becomes li2ei3ee
dictionaries   Start with a d, then the contents of the dictionary, then an e.    {"foo": 1, "bar": 2} becomes d3:fooi1e3:bari2ee
               Each entry is a string key followed immediately by its value
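As a concrete rendering of those rules, here's a minimal bencode encoder sketch (note that the spec also requires dictionary keys to be sorted, which is why the dictionary below encodes in a different order than the example in the table):

def bencode(value):
    # Encode ints, strings, lists, and dicts following the rules above
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode()
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(value.items())) + b"e"
    raise TypeError("Can't bencode %r" % type(value))

print(bencode(7))                    # b'i7e'
print(bencode([2, 3]))               # b'li2ei3ee'
print(bencode({"foo": 1, "bar": 2})) # b'd3:bari2e3:fooi1ee'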

The tracker may respond with a Bencoded dictionary with a key of failure reason and a value of some explanatory text string like “this tracker doesn’t have information on that torrent” or “you’ve been rate-limited”. Otherwise, it’ll respond in one of two ways:

Bencoded dictionaries

In the older bittorrent tracker standard 3, trackers respond with a dictionary containing the key peers and a value of a list, where each entry is a dictionary, containing contact information for that peer. For example (translated to json):

{
    "peers":
        [
            {"ip": "1.2.3.4", "port": 4567},
            {"ip": "2.3.4.5", "port": 5678}
        ]
}

Or in the glorious bencode:

d5:peersld2:ip7:1.2.3.44:porti4567eed2:ip7:2.3.4.54:porti5678eeee

There may be a variety of other keys (a “peer ID” in the peer dictionary, or metadata like “number of seeds, peers, and leeches” at the top level), but this is all we need for our purposes.

Bencoded compact bytestring

All this text encoding gets a little tedious, so in an amendment to the tracker spec (standard 23), trackers may now instead return a binary string in the “peers” field, like:

{
    "peers": "\x04\x03\x02\x01\x04\xD2\x05\x04\x03\x02\t)"
}

Or in bencode again:

d5:peers12:\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2Ee

This is equivalent to the dictionary above: the first four bytes are an integer IP address, followed by two bytes for a port, then another six bytes for the next IP address and port. The hex-escaping is added here for illustration purposes; the tracker would return those raw bytes.

While this string compression doesn’t save much in our two-peer example, it’s significantly more compact when handling dozens or hundreds of peers.
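Decoding the compact form is just a bit of struct arithmetic; a minimal sketch for IPv4 peers:

import struct

def parse_compact_peers(blob):
    # Each peer is 6 bytes: a 4-byte IPv4 address followed by a 2-byte big-endian port
    peers = []
    for offset in range(0, len(blob) - len(blob) % 6, 6):
        a, b, c, d, port = struct.unpack("!BBBBH", blob[offset:offset+6])
        peers.append(("%d.%d.%d.%d" % (a, b, c, d), port))
    return peers

print(parse_compact_peers(b"\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2E"))
# [('1.2.3.4', 4567), ('2.3.4.5', 5678)]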

The tracker UDP API

HTTP is unwieldy. It takes many packets, the server might use gzip compression, may require HTTPS, or may go through some redirects before responding. When the server finally answers, it might return a variety of HTTP errors, and while it should return bencoded data, servers often send back HTML instead. Even when they do return bencoded data, they sometimes follow the bencode spec incorrectly. In short, supporting HTTP in torrent clients is a complicated mess. But it doesn’t need to be this way! The information the client and server are exchanging is relatively simple, and we can communicate it in just a handful of UDP packets. So begins bittorrent specification 15.

First, we need to perform a handshake with the server:

The client sends a magic number confirming that they are using the torrent tracker protocol, as opposed to random Internet traffic like a port scan. Then they send an action (0: connect), and a random transaction ID to identify datagrams connected to this session.

If the tracker is online, it will respond to complete the handshake:

The tracker sends back action 0 (responding to the connect request), the same transaction ID the client sent, and a random connection ID. The client will include this connection ID in future datagrams. This handshake prevents IP address spoofing, as used in DNS amplification attacks where an attacker coerces a DNS server into flooding a third party with traffic.
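In Python, that handshake is one struct.pack and one recvfrom; a minimal sketch (the 0x41727101980 magic number is the protocol constant from BEP 15, and the tracker address in the usage comment is just the one from earlier in this post):

import socket, struct, random

def udp_connect(host, port, timeout=5):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    transaction_id = random.getrandbits(32)
    # Connect request: 64-bit protocol magic, 32-bit action (0), 32-bit transaction ID
    sock.sendto(struct.pack("!QII", 0x41727101980, 0, transaction_id), (host, port))
    data, _addr = sock.recvfrom(1024)
    # Connect response: 32-bit action, 32-bit transaction ID, 64-bit connection ID
    action, tid, connection_id = struct.unpack("!IIQ", data[:16])
    if action != 0 or tid != transaction_id:
        raise ValueError("Unexpected connect response")
    return sock, connection_id

# sock, conn_id = udp_connect("tracker.opentrackr.org", 1337)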

The client may now send its announce request (action code 1: announce):

This uses the connection ID from the previous step and a new transaction ID, followed by the info hash of the torrent and a peer ID representing this client. Then the client sends some metadata about how far along its download is (matching the downloaded, left, and uploaded fields from the HTTP spec). Finally, the client sends the IP address and port it can be reached at (although trackers will typically ignore the IP address field and use the address the request came from, again to prevent spoofing), a key identifying the client, and a num_want field, which we don’t make use of.

If the client has both an IPv4 and an IPv6 address, and is therefore looking for both v4 and v6 peers, then it must make two announce requests, over v4 and v6, using the same key. This allows the tracker to avoid “double-counting” the number of peers interested in a torrent.

Finally, the tracker responds with peer data:

Here, the action and transaction ID match the previous datagram, and the interval indicates how long the client should cache results for before polling the tracker again. The leechers and seeders counts are the tracker’s guess as to how many peers are mostly-downloading or mostly-uploading based on the downloaded, left, and uploaded fields from each announce request. These counts are not authoritative, or confirmed by the tracker in any way.

At the end of that same datagram, the tracker lists the peers themselves: a series of IP addresses and port numbers, 4 bytes per address (assuming IPv4; 16 bytes for IPv6) and two bytes per port number.

That’s all there is to the UDP protocol! Keep in mind that all values should follow network byte-order (big endian). While the diagrams make this protocol look complicated, there’s far less parsing or error handling needed than for the HTTP version, no external libraries required, and the entire exchange occurs in just 4 packets. No wonder the majority of torrents only use UDP trackers!

Creating the dashboard

With the tracker protocol implemented, we can take a list of torrents, extract their list of trackers, and look up peers from each tracker. We can also look up peers in the DHT using third party code. From here, it’s a simple process to make a SQL database to track all that information with timestamps, select results from those tables based on their age, and at last throw up an interface to peruse it:

Screenshot of the DDoSecrets torrent health dashboard
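Under the hood, the storage side of a dashboard like this can be as simple as two tables keyed by info hash and timestamp. A hypothetical sketch (not the actual schema from the project's repository):

import sqlite3

conn = sqlite3.connect("torrent_health.db") # hypothetical filename
conn.executescript("""
CREATE TABLE IF NOT EXISTS torrents(
        infohash TEXT PRIMARY KEY,
        name     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS peer_counts(
        infohash   TEXT NOT NULL,
        source     TEXT NOT NULL, -- tracker URL, or 'dht'
        peers      INT  NOT NULL,
        checked_at INT  NOT NULL, -- unix timestamp
        FOREIGN KEY(infohash) REFERENCES torrents(infohash)
);
CREATE INDEX IF NOT EXISTS peer_time ON peer_counts(infohash, checked_at);
""")

# The day/week/month views then reduce to a WHERE clause on checked_at,
# e.g. checked_at > now - 30*24*3600 for the last month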

In the hopes that this code might benefit others, it’s been released on GitHub.


Distinguishing In-Groups from Onlookers by Language Use

Posted 6/8/2022

This post is a non-academic summary of my most recent paper, which can be found here. It’s in a similar theme as a previous paper, which I discussed here, but this post can be read on its own. An enormous thank you to my fantastic co-authors Josh Minot, Sam Rosenblatt, Guillermo de Anda Jáuregui, Emily Moog, Briane Paul V. Samson, Laurent Hébert-Dufresne, and Allison M. Roth.

If you wanted to find QAnon believers on Twitter, YouTube, or Reddit, you might search for some of their flavorful unique vocabulary like WWG1WGA (“Where we go one, we go all”). To find cryptocurrency enthusiasts, you might search for in-group phrases like HODL or WAGMI, or “shitcoins”, or specific technologies like “NFT” or “ETH”. This works well for new, obscure communities, when no one else has picked up on their vocabulary. However, once a community reaches the limelight, the keyword-search strategy quickly deteriorates: a search for “WWG1WGA” is now as likely to find posts discussing QAnon, or ridiculing them, as it is to identify true believers.

Human observers with some contextual understanding of a community can quickly distinguish between participants in a group, and discussion about (or jokes about) a group. Training a computer to do the same is decidedly more complicated, but would allow us to examine exponentially more posts. This could be useful for tasks like identifying covid conspiracy communities (but distinguishing them from people talking about the conspiracists) or identifying a hate group (but distinguishing from people discussing hate groups). This, in turn, could help us to study the broad effects of deplatforming, by more systematically examining where communities migrate when they’re kicked off a major site. Those possibilities are a long way off, but distinguishing participants in a group from onlookers talking about the group is a step towards the nuance in language processing we need.

Setup

Our study focuses on a simple version of this problem: given a subreddit representing an in-group, and a subreddit dedicated to discussing the in-group, automatically label commenters as being part of the in-group or onlookers based on the text of their comments. We use the following list of subreddit pairs:

In-Group Onlooker Description
r/NoNewNormal r/CovIdiots NoNewNormal discussed perceived government overreach and fear-mongering related to Covid-19
r/TheRedPill r/TheBluePill TheRedPill is part of the “manosphere” of misogynistic anti-feminist communities
r/BigMouth r/BanBigMouth Big Mouth is a sitcom focusing on puberty; BanBigMouth claimed the show was associated with pedophilia and child-grooming, and petitioned for the show to be discontinued
r/SuperStraight r/SuperStraightPhobic SuperStraight was an anti-trans subreddit, SuperStraightPhobic antagonized its userbase and content
r/ProtectAndServe r/Bad_Cop_No_Donut ProtectAndServe is a subreddit of verified law-enforcement officers, while Bad_Cop_No_Donut documents law enforcement abuse of power and misconduct
r/LatterDaySaints r/ExMormon LatterDaySaints is an unofficial subreddit for Mormon practitioners, while ExMormon hosts typically critical discussion about experiences with the church
r/vegan r/antivegan Vegan discusses cooking tips, environmental impact, animal cruelty, and other vegan topics. AntiVegan is mostly satirical, making fun of “vegan activists”

Some of these subreddit pairs are directly related: r/TheBluePill is explicitly about r/TheRedPill. Other subreddit pairs are only conceptually connected: r/Bad_Cop_No_Donut is about law enforcement, but it’s not specifically about discussing r/ProtectAndServe. This variety should help illustrate under what conditions we can clearly distinguish in-groups from onlookers.

For each subreddit pair, we downloaded all comments made in each subreddit during the last year in which they were both active. In other words, if one or both subreddits have been banned, we grab the year of comments leading up to the first ban. If both subreddits are still active, we grab the comments from the last 365 days to present.

We discarded comments from bots, and comments from users with an in-subreddit average karma below one. This is to limit the effect of users from an onlooking subreddit “raiding” the in-group subreddit (or vice versa), and therefore muddying our understanding of how each subreddit typically writes.

What’s in the Data

Next, we want to identify the words used far more in the in-group than the onlooking group, or vice versa. There are a variety of ways of measuring changes in word-usage, including rank turbulence divergence (which words have changed the most in terms of their order of occurrence between one dataset and another) and Jensen-Shannon divergence (the difference in word frequency between each subreddit and a combination of the two subreddits).

For example, here’s a plot illustrating which words appear more prominently in r/NoNewNormal or r/CovIdiots, based on the words “rank”, where rank 1 is the most used word, and rank 10,000 is the 10,000th most-used word:

An allotaxonograph comparing r/NoNewNormal and r/CovIdiots

While we know both subreddits feature terms like “vaccine”, “mask”, and “covid”, this plot tells us that terms like “doomer”, “trump”, and “lockdown” are used disproportionately in our in-group, while disparaging terms like “idiot”, “stupid”, and “moron” are far more common in the onlooker group.

We can already see one limitation of this study: the most distinguishing term between our two subreddits is “covidiot”, a term developed on r/CovIdiots. We’re not just capturing some context around the in-group’s use of terminology, we’re identifying keywords specific to this community of onlookers, too.

Building a Classifier

Now that we’ve had a peek at the data, and have confirmed that there are terms that strongly distinguish one community from its onlookers, we want to build a classifier around these distinguishing terms. Specifically, for every user we want to get a big text string consisting of all of their comments, the classifier should take this comment string as input, and return whether the user is in the in-group or the onlooker group.

Since we know whether each user participates mostly in the in-group subreddit, or the onlooking subreddit, we’ll treat that as ground-truth to measure how well our classifier performs.

We built two classifiers: a very simple logistic-regression approach that’s easy to reverse-engineer and examine, and a “Longformer” transformer deep-learning model that’s much closer to state-of-the-art, but more challenging to interrogate. This is a common approach that allows us to examine and debug our results using our simple method, while showing the performance we can achieve with modern techniques.

We trained the logistic regression model on term frequency-inverse document frequency (TF-IDF) features; basically looking for words common in one subreddit and uncommon in another, just like in the plot above. We configured the Longformer model as a sequence classifier; effectively “given this sequence of words, classify which subreddit they came from, based on a sparse memory of prior comments from each subreddit.”
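For the classical half of that pipeline, scikit-learn gets you most of the way in a few lines. A rough sketch rather than the paper's actual code; the function arguments and the min_df cutoff are stand-ins, and the Matthews correlation used for scoring behaves like the -1/0/+1 scale described below, though it may not be the paper's exact statistic:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import matthews_corrcoef

def train_ingroup_classifier(train_texts, train_labels):
    # train_texts: one concatenated comment string per user
    # train_labels: 1 for in-group users, 0 for onlookers
    clf = make_pipeline(TfidfVectorizer(min_df=5),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return clf

def score(clf, test_texts, test_labels):
    # -1 = every label wrong, 0 = chance level, 1 = every label right
    return matthews_corrcoef(test_labels, clf.predict(test_texts))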

Results

Here’s our performance on a scale from -1 (labeled every user incorrectly) to 0 (did no better than proportional random guessing) to 1 (labeled every user correctly):

In-Group Onlooker Logistic Regression Performance Longformer Performance
r/NoNewNormal r/CovIdiots 0.41 0.48
r/TheRedPill r/TheBluePill 0.55 0.65
r/BigMouth r/BanBigMouth 0.64 0.80
r/SuperStraight r/SuperStraightPhobic 0.35 0.43
r/ProtectAndServe r/Bad_Cop_No_Donut 0.50 0.55
r/LatterDaySaints r/ExMormon 0.65 0.72
r/vegan r/antivegan 0.49 0.56

Or, visually:

Barplot of above table

Much better than guessing in all cases, and for some subreddits (BigMouth, LatterDaySaints, and TheRedPill) quite well!

If a user has barely commented, or their comments all consist of responses like “lol”, classification will be near-impossible. Therefore, we can re-run our analysis, this time only considering users who have made at least ten comments, with at least one hundred unique words.

In-Group Onlooker Logistic Regression Performance Longformer Performance
r/NoNewNormal r/CovIdiots 0.57 0.60
r/ProtectAndServe r/Bad_Cop_No_Donut 0.65 0.76
r/LatterDaySaints r/ExMormon 0.80 0.83
r/vegan r/antivegan 0.65 0.72

And visually again:

Barplot of above table

For a few subreddit pairs, the onlooking subreddit has too few comments left after filtering for the analysis to be meaningful. For the four pairs that remain, performance improves significantly when we ignore low-engagement users.

Similarly, we can examine what kinds of users the classifier labels correctly most-often:

Plot comparing labeling correctness to subreddit comments, total subreddit karma, and mean subreddit karma

The classifier performs better on users with more comments (and therefore more text to draw from), and more karma in the subreddit (which typically correlates with number of comments unless the user is immensely unpopular), but does not significantly differ with mean subreddit karma. In other words, popular users who receive lots of karma on many of their comments, and therefore might be more representative of the subreddit’s views, are not easier to classify.

Conclusions, Limitations, Next Steps

For a first attempt at solving a new problem, we have some promising results. We can consistently distinguish users from an in-group and users from a specific onlooking group, based on the language of users’ posts. Our study focuses on subreddits, which provide a best-case scenario for classification: comments are neatly partitioned into the in-group and onlooker subreddits. If we studied Twitter users, for example, we’d have no baseline to determine whether our classifier was guessing correctly, or even a good way to feed it training data, without human annotators labeling thousands of Twitter accounts by hand.

It’s also unclear how well this classifier would function in a cross-platform environment. For example, could we train the classifier on a subreddit, and then classify Twitter or Discord users based on their comments? Theoretically, the same community will discuss the same topics on multiple platforms, likely with similar keywords. However, the design of each platform (such as the short character limits on Tweets) may constrain authors enough to make classification harder.

Finally, it’s unclear how well this classification will hold up over time. Would a classifier trained on last year’s comments still perform well on users from this year? Or will the discussion topics of a community have drifted too far for those old word frequencies to be useful? This could be especially important when communities migrate between platforms, when we may for example have old Reddit data and new Discord data.

Lots more to do, but I’m excited about these first steps!


View older posts