Hosting Under Duress

I have an article in the latest issue of 2600: The Hacker Quarterly, written on behalf of Distributed Denial of Secrets. It’s reproduced below:

Hosting Under Duress

By Milo Trujillo (illegaldaydream@ddosecrets)

On June 19th, Distributed Denial of Secrets published BlueLeaks, approximately 270 gigabytes of internal documents from U.S. local-LEA/federal-agency fusion centers, municipal police departments, police training groups, and so on. The documents have revealed a range of abuses of power, from tracking protestors and treating journalists and activists like enemies, to willful inaction against the alt-right, with additional BlueLeaks-based stories emerging each week. Thank you, Anonymous, for leaking this data!

The retaliation against DDoSecrets has been significant. Twitter promptly banned @ddosecrets, followed by Reddit’s bans of /r/ddosecrets and /r/blueleaks, all for violating content policies regarding posting personal information and hacked material. Both blocked the ddosecrets.com domain name in posts, and Twitter went as far as blocking it in DMs, and blocking URL-shortened links by following them with a web spider before approving the message. German police seized a DDoSecrets server on behalf of U.S. authorities (our hosting providers are geographically scattered), and goons from Homeland Security Investigations paid a visit to some folks operating a mirror of DDoSecrets releases, asking questions about the BlueLeaks documents and the founder of DDoSecrets, ultimately attempting to recruit them as informants and offering money for info that led to arrests.

None of these actions have hindered distribution of the BlueLeaks documents, which were released by torrent, and all are directed at the publishers of the documents, not the hackers that leaked them. Wikileaks maintains an active Twitter account and has faced no such domain banning. What we have is a warning: publishing information on U.S. law enforcement, even when clearly in the public interest, will not be tolerated.

So how do you design server infrastructure to operate in this hostile space, where third party corporations will ban you and self-hosted servers are liable to be seized? Distribution, redundancy, and misdirection. All the documents published by DDoSecrets are distributed by torrent, so there is no central server to seize or account to ban to halt distribution, and data proliferates so long as there is public interest. But leaking data is only half of the DDoSecrets mission statement: raw documents aren’t valuable to the public, the ability to extract meaning from them is. Therefore, DDoSecrets works closely with journalists and academics to help them access and analyze data, and runs a number of services to make analyzing leaks easier, like Whispers (https://whispers.ddosecrets.com/), a search tool for Nazi chat logs, or X-Ray (https://xray.ddosecrets.com/), a crowd-sourced transcription tool for leaked business records with formats too challenging to OCR. These services have to be hosted somewhere.

Static services like Whispers or the homepage are easy: They’re set up with backups and Docker containers and Ansible scripts. If a server disappears, rent a new one from a different hosting provider and re-deploy with a couple lines in a terminal. A few services aren’t quite so easy to replicate, though. The Data server maintains a copy of every leak, available over direct HTTPS, mostly so we can give a URL to less technical journalists that “just works” in their browser, without walking them through using a torrent client. All the data is available by torrent and nothing unique is on the server, but finding a new hosting provider to spin up a 16-terabyte system (not counting redundant drives in the RAID) and then re-uploading all that data is, to say the least, inconvenient. The same goes for Hunter, the document-ingesting cross-analyzing leak search engine. It would be nice if we only had to migrate these servers infrequently.

The solution for these large servers is to hide them away forever, and make a repeat of the German seizure unlikely. These servers are now hosted only as Tor onion sites, and are only connected to, even for administration, via Tor. A tiny “frontend” virtual machine acts as a reverse-proxy, providing a public-facing “data.ddosecrets.com” that really connects via Tor to the much larger system. The reverse-proxy can be replaced in minutes, and doesn’t know anything about the source of the data it’s providing.

We’ll end with a call to action. None of the design outlined above is terribly complex, and with the exception of the Tor reverse-proxy, is pretty common IT practice in mid-sized companies that have outgrown “a single production server” and want scalable and replaceable infrastructure. The technical barrier for aiding the cause is low. Hacking has always been about challenging authority and authoritarianism, and that mindset is needed now in abundance, at DDoSecrets and beyond. No time to waste - Hack the Planet!

I submitted the article on January 3rd, so it predates a range of DDoSecrets releases including the Parler Scrape and the Gab Leak, which have drawn more data, attention, and significant operating costs to the collective. It also predates the Verkada hack and subsequent raid on Tillie Kottmann’s apartment, which culminated in charges that should frighten anyone connected to computer security in any form (Twitter thread).

We’ve seen this dynamic play out before, when hacktivist groups in 2010-2012 challenged corporations, governments, and pseudo-religious cults in a bid to make the world a better place. Emma Best has written an article on Hacktivism, Leaktivism, and the Future exploring these parallels, some of what we can expect from both hacktivism and State response going forward, and hopeful futures of what we can accomplish together.

If you believe in what we are doing, your help, either financial or by volunteering, would be greatly appreciated.

Posted 3/26/21


Color Filter Array Forensics

Digital Image Forensics is a field concerned with identifying whether images are original or have been edited, and if the latter, what kinds of edits have been made. I haven’t experimented much with digital forensics, but there’s overlap with steganography and neat encoding tricks like halftone QR codes. I’ve been reading about some forensic techniques in UC Berkeley’s “Tutorial on Digital Image Forensics” (200 page PDF), and Color Filter Array forensics is a fun one.

Digital cameras have hardware limitations that leave very specific patterns in the resulting images, and any photo edits will disrupt these patterns, unless the editor takes care to preserve them. Details follow.

Digital Camera Hardware

Digital images usually consist of three color values per pixel, for red, green, and blue. However, most digital cameras don’t have any color sensors in them. Instead, they have a grid of light/luminosity sensors, and they add a layer of filters in front of the sensors that filter out all but red, all but green, or all but blue light. This is much cheaper to manufacture! But there’s a serious drawback: Each pixel can only record a single red, green, or blue sample, instead of all three.

So cameras fake their color data! They use a light filter grid called a Color Filter Array (most commonly a Bayer Filter), like the following:

One row consists of red/green/red filters, the following row consists of green/blue/green filters, then red/green/red, and so on. The result is that each pixel now has a real value for one color channel, and has two or more neighbors with a real value for each other color channel. We can approximate the missing color channels as an average of our neighbors’ color channels. For example, a red pixel will calculate its “blue” channel as the average of the neighbors in the four corners diagonal from its position, and will calculate its “green” channel as the average of the neighbors above, below, left, and right. This approximation is called a “de-mosaicking” algorithm.

De-mosaicking works okay, because how much will the red value change over the distance of a single pixel? Usually not by very much, unless there’s a sharp edge with high color contrast, in which case this approximation will make colors “bleed” slightly over the sharp edge. Newer cameras try to auto-detect these high-contrast borders and only approximate color channels using the neighbors on the same side of the border, but let’s ignore that for now.
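
To make the averaging concrete, here is a rough sketch of such a de-mosaicking step in Python, assuming the RGGB layout described above (red/green rows alternating with green/blue rows) and ignoring the edge-aware refinements; raw is a 2D numpy array of sensor readings and all names are illustrative:

#!/usr/bin/env python3
# A rough sketch of the averaging de-mosaic described above, assuming an
# RGGB Bayer pattern: even rows alternate red/green, odd rows green/blue.
import numpy as np

RED, GREEN, BLUE = 0, 1, 2

def neighbor_mean(raw, i, j, offsets):
    # Average the in-bounds neighbors at the given (row, col) offsets
    rows, cols = raw.shape
    vals = [raw[i+di, j+dj] for di, dj in offsets
            if 0 <= i+di < rows and 0 <= j+dj < cols]
    return round(sum(int(v) for v in vals) / len(vals))

def demosaic(raw):
    rows, cols = raw.shape
    rgb = np.zeros((rows, cols, 3), dtype=np.uint8)
    sides   = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    horiz, vert = [(0, -1), (0, 1)], [(-1, 0), (1, 0)]
    for i in range(rows):
        for j in range(cols):
            if i % 2 == 0 and j % 2 == 0:    # red pixel
                rgb[i, j, RED]   = raw[i, j]
                rgb[i, j, GREEN] = neighbor_mean(raw, i, j, sides)
                rgb[i, j, BLUE]  = neighbor_mean(raw, i, j, corners)
            elif i % 2 == 1 and j % 2 == 1:  # blue pixel
                rgb[i, j, BLUE]  = raw[i, j]
                rgb[i, j, GREEN] = neighbor_mean(raw, i, j, sides)
                rgb[i, j, RED]   = neighbor_mean(raw, i, j, corners)
            elif i % 2 == 0:                 # green pixel in a red/green row
                rgb[i, j, GREEN] = raw[i, j]
                rgb[i, j, RED]   = neighbor_mean(raw, i, j, horiz)
                rgb[i, j, BLUE]  = neighbor_mean(raw, i, j, vert)
            else:                            # green pixel in a green/blue row
                rgb[i, j, GREEN] = raw[i, j]
                rgb[i, j, BLUE]  = neighbor_mean(raw, i, j, horiz)
                rgb[i, j, RED]   = neighbor_mean(raw, i, j, vert)
    return rgb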

Detecting De-Mosaicking Patterns

While the simulated color data looks mostly correct to the human eye, it leaves an unnatural pattern in the numeric color values for each channel. Specifically, we know that each pixel will have two “simulated” channels that are the average of the same channel in each neighboring pixel with a real value for that channel. This should be easy to check in Python, Matlab, or your image-analysis language of choice:

#!/usr/bin/env python3
from PIL import Image
import numpy as np
from statistics import mean

im = Image.open("bayer_filter_demosaicked.jpg")
pixels = np.array(im)
RED,GREEN,BLUE = [0,1,2]

# .X.
# ...
# .X.
def getVerticalAverage(pixels, i, j, channel):
        rows = pixels.shape[0]
        if( i == 0 ):
                return pixels[i+1,j,channel]
        if( i == rows-1 ):
                return pixels[i-1,j,channel]
        return round(mean([pixels[i-1,j,channel],pixels[i+1,j,channel]]))

# ...
# X.X
# ...
def getHorizontalAverage(pixels, i, j, channel):
        cols = pixels.shape[1]
        if( j == 0 ):
                return pixels[i,j+1,channel]
        if( j == cols-1 ):
                return pixels[i,j-1,channel]
        return round(mean([pixels[i,j-1,channel],pixels[i,j+1,channel]]))

# X.X
# ...
# X.X
def getDiagonalAverage(pixels, i, j, channel):
        rows = pixels.shape[0]
        cols = pixels.shape[1]
        corners = []
        if( i > 0 ):
                if( j > 0 ):
                        corners.append(pixels[i-1,j-1,channel])
                if( j < cols-1 ):
                        corners.append(pixels[i-1,j+1,channel])
        if( i < rows-1 ):
                if( j > 0 ):
                        corners.append(pixels[i+1,j-1,channel])
                if( j < cols-1 ):
                        corners.append(pixels[i+1,j+1,channel])
        return round(mean(corners))

def confirmEqual(i, j, color1, color2):
        if( color1 != color2 ):
                print("Anomaly detected at %d,%d (got %d, expected %d)" % (i,j, color1,color2))

# For every pixel, determine what 'real' color channel it has
# then confirm that its interpolated channels match what we get
# from de-mosaicking
for i,row in enumerate(pixels):
        for j,col in enumerate(row):
                if( i % 2 == 0 ): # Red/Green row
                        if( j % 2 == 0 ): # Red column
                                correctGreen = round(mean([getHorizontalAverage(pixels,i,j,GREEN),getVerticalAverage(pixels,i,j,GREEN)]))
                                correctBlue = getDiagonalAverage(pixels,i,j,BLUE)
                                confirmEqual(i, j, pixels[i,j,GREEN], correctGreen)
                                confirmEqual(i, j, pixels[i,j,BLUE], correctBlue)
                        else: # Green column
                                confirmEqual(i, j, pixels[i,j,RED], getHorizontalAverage(pixels,i,j,RED))
                                confirmEqual(i, j, pixels[i,j,BLUE], getVerticalAverage(pixels,i,j,BLUE))
                else: # Green/Blue row
                        if( j % 2 == 0 ): # Green column
                                confirmEqual(i, j, pixels[i,j,RED], getVerticalAverage(pixels,i,j,RED))
                                confirmEqual(i, j, pixels[i,j,BLUE], getHorizontalAverage(pixels,i,j,BLUE))
                        else: # Blue column
                                correctGreen = round(mean([getHorizontalAverage(pixels,i,j,GREEN),getVerticalAverage(pixels,i,j,GREEN)]))
                                correctRed = getDiagonalAverage(pixels,i,j,RED)
                                confirmEqual(i, j, pixels[i,j,RED], correctRed)
                                confirmEqual(i, j, pixels[i,j,GREEN], correctGreen)

Of course, this is only possible if you know both which Color Filter Array the camera model that took the photo uses, and the details of their de-mosaicking algorithm. For now we’ll assume the basic case of “red/green + green/blue” and “average neighboring color channels ignoring high-contrast borders”. For more on color filter arrays and better de-mosaicking approaches, read here. Let’s also assume the image has only lossless compression, which is often the case for the highest export quality straight off a digital camera.

Image Editing Footprints

If anyone opens our camera’s photos in editing software like Photoshop or GIMP, and makes any color adjustments, they’ll break the de-mosaic pattern. If they use the clone/stamp tool, the stamped portions of the image won’t have color channels averaging their neighbors outside the stamped region, and the de-mosaic pattern will be broken. If they copy a portion of a different image into this one, the pattern will be broken.

Not only can we detect when an image has been altered in this way, we can detect where anomalies occur, and potentially highlight the exact changes made. Amending the above script, we’ll replace “reporting” an anomaly with highlighting anomalies:

# Turn all correct pixels 'red', leaving anomalies for further examination
pixels2 = np.copy(pixels)
def confirmEqual(i, j, color1, color2):
        global pixels2
        if( color1 == color2 ):
                pixels2[i,j,RED] = 255
                pixels2[i,j,GREEN] = 0
                pixels2[i,j,BLUE] = 0
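
After re-running the detection loop with this new confirmEqual, the anomaly map can be written out for visual inspection (a small sketch; the output filename is arbitrary):

Image.fromarray(pixels2).save("cfa_anomaly_map.png")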

Since Photoshop/GIMP’s changes look “correct” to the human eye, the tools have done their job, and they have no incentive to make their changes undetectable to forensic analysis.

Defeating CFA Anomaly Detection

Unfortunately, this technique is far from flawless. There are two ways to defeat CFA anomaly detection:

  1. Delete the appropriate RGB channels from each pixel after editing the image, and re-run the de-mosaicking algorithm to recreate the de-mosaic pattern. This requires the forger have the same knowledge as a forensic analyst regarding exactly what Color Filter Array and de-mosaicking approach their camera uses.

  2. Export the image using a LOSSY compression algorithm, with the compression rate turned up high enough to perturb the channel values and destroy the de-mosaic pattern. This will make it obvious that the image has been re-saved since being exported from the camera, but will destroy the clear-cut evidence of which portions have been edited, if any.

All in all, a very cool forensic technique, and a fun introduction to the field!

Posted 2/1/21


Peer-to-Peer Network Models and their Implications

Let’s take the following as a given:

Mainstream social media is harmful because it puts a single company in control of the human social experience and places them in the role of cultural censor

If you agree and want to get on with addressing the problem, skip ahead to the next section.

How We Got Here

When we say “control of the human social experience” we refer to something like Elinor Ostrom’s Institutional Analysis and Development Framework (here’s a good paper on applying IAD to digital institutions if you want to dive in), which describes platforms in terms of the following rules:

  1. Operational Rules: Describe the interface and what actions a user can take, such as tweeting and retweeting and liking

  2. Collective Rules: Describe the higher-level context in which the operational rules are utilized, like how the Twitter content feed orders tweets based on what users you follow, how popular the tweet was in terms of retweets and likes, and how old the content is

  3. Constitutional Rules: Describe by what process the operational, collective, and constitutional rules can be changed

In a corporate-owned social media network, the corporation has complete control over the operational and collective rules, and most of the control over the constitutional rules. There may be some external influence, such as summoning the CEOs in front of Congress for questioning, or threatening to amend FCC regulations governing Internet platforms, or DMCA takedown requests. Regardless, the users of the platform, those most affected by operational and collective rules, have almost no say over those rules.

The only influence a user has over a platform is the option to leave, which en masse might impact the company’s bottom line. However, if we assume users on social media generally want to be social, then they’ll want to migrate to another social platform when they leave, and all the major social media companies have similarly tight control over their platforms with remarkably similar acceptable content and moderation policies.

When we describe platforms as cultural censors, we mean that they decide what content they will permit on their platform and what is a bannable offense. Across social sites like Tumblr and Facebook and infrastructure sites like Paypal and Patreon, we’ve repeatedly seen companies take a puritanical stance against LGBT+, sex-positive, and sex-work content. Violet Blue (fantastic!) has written lots about this. Simultaneously, platforms are extremely hesitant to censor white supremacist or neo-nazi content, because they do not want to be accused of political censorship or bias by Congress or the White House, and the aforementioned content is increasingly adjacent to Republican political talking points.

So, corporate-run social media implies a structure the users have no say in, with content limits the users have no say in, which favor harmful and icky views while inhibiting freedom of expression. There’s no way out by hopping to another corporate-run social media platform, because the next platform has the same problems and policies. “I’ll just build my own platform” leads to the same pattern with a different set of oligarchs, as we’ve seen (and I’ve written about before) with a range of “alt-tech” platforms like Voat, Parler, Gab, and BitChute that were created to host right-wing extremist content banned from their mainstream counterparts.

Decentralizing Governance

To address this problem we need a radical new kind of social media, with no central governors. This new media should not be susceptible to a small oligarchy enforcing social views, but ideally should have some conception of moderation, so community spaces can remove harassers, abusers, spammers, and hateful content.

The clear solution is a decentralized network, where content isn’t hosted on a single server hosted by a single entity. There can be collective storage of social media data, with collective agreement on what content to support and how.

Okay, great! So what does “decentralized” look like? A distributed hash table? A blockchain? A bunch of mirrored servers? Let’s look before we leap and consider common decentralized network models and their implications when used in social media of any sort.

Distributed Hash Tables

I recently described distributed hash tables in more depth, but the one line summary is that in a DHT users are connected directly to multiple peers, and data in the form of key-value pairs (like a Python dictionary) are distributed across all peers with some redundancy, so anyone can quickly look up information associated with a known key. Participants use a centralized introduction server (called a “tracker” for torrents) to find their initial peers, but this introduction point has little information and can be easily replaced or made redundant.

DHTs have two significant limitations. The first is that all content must be framed as a key-value pair, and it’s not always obvious how to design systems around this constraint. Nevertheless, a variety of projects use DHTs at their core, including Hyperswarm, the underlying peer-discovery system used by Hypercore (previously known as “DAT protocol”), which in turn is the peer-to-peer content sharing system powering chat applications like Cabal. DHTs are also at the heart of GNUnet (another generic p2p content sharing system used for file-sharing and chat), and a similar key-value routing technology is used in Freenet, which is aimed at distributed file hosting, microblogging, version-control systems, and other append-only shared data. Finally, DHTs are used for routing in the Invisible Internet Project, and for connecting to onion services in Tor, so it’s safe to say that DHTs are a common design pattern in decentralized spaces.

The larger limitation of DHTs is that they produce a singleton. While a distributed hash table is “distributed” in the sense that content can be scattered across a wide range of peers, it is centralized in the sense that there is only a single DHT network shared by all users, with a single shared namespace, storing content for all peers in the network. This may not always be desirable: users may not want to host content for peers outside their social circle, may not want to reveal their IP address to peers outside their circle, or may simply want to extend the DHT and change its functionality in a way incompatible with the larger community. While it is technically possible to run multiple competing DHTs with no connection to one another, utilizing separate introduction servers, this is strongly disincentivized, since DHTs gain performance, redundancy, reliability, and storage capacity with scale.

Blockchains (Sigh)

Blockchains are a form of data storage with two interesting attributes:

  1. They form an append-only log that cannot be rewritten (once a block has proliferated and several more blocks are added to “solidify” its presence in the chain)

  2. They are highly redundant

If those two attributes are not both valuable for your application, then a blockchain is the wrong choice for you. There are alternative solutions to append-only logs (like signed messages stored in a merkle tree, or Paxos), and to data redundancy (like a DHT, or client-stored messages that can be re-uploaded to peers later, as in sneakernets). But let’s look at the structure of a blockchain network.

Blockchain network traffic looks remarkably similar to DHT traffic, except that instead of using key-value pairs, every peer stores the complete data log:

This certainly provides more redundancy than a DHT, and maybe that level of redundancy is important if you’re building a ledger of all financial transactions ever and your entire economy relies on its stability. For most use-cases, the redundancy afforded by a DHT is sufficient, and requires far less storage for each peer. Blockchains also imply a significant computational cost to write any new data if the blockchain uses proof-of-work to ensure immutability. It’s an expensive and over-applied data structure.

We can address some of the limitations of blockchain using a sidechain, where the primary blockchain (or another data structure like a merkle tree) includes references to the heads of miniature blockchains that are smaller and can be updated more quickly. These sidechains can follow simpler rules than a public blockchain, such as allowing a user with a private key to craft new blocks instead of using a “proof of work” algorithm.

Signal, the centralized but end-to-end encrypted chat software, uses a chat protocol with some similarities to a blockchain. In the Signal protocol, each message in a conversation includes a reference to both the previous message that user sent, and the most recent messages that user has seen. This means Signal’s central server can’t omit messages without clients noticing, and at worst can deny all service to a user. However, there is no proof-of-work involved in this chain; the only requirement for adding a new block is that it must be signed by the user, eliminating one of the largest limitations of a full blockchain.

Sidechains are also at the heart of Keybase, a Slack-like encrypted messaging and file hosting system that maintains a chain for each user to immutably store information about the user’s identity. Keybase also maintains each shared folder as a blockchain of write operations. Notably, however, Keybase is a centralized system that uses blockchains and signature verification to keep the server honest. The blockchains serve as a tamper-evidence mechanism that makes it difficult for the server to revert or manipulate messages (along with aggressive client caching), but the Keybase server is a sole central data repository for the network.

As with DHTs, blockchain networks form singletons (even if not run on a central server like Keybase), and running parallel blockchains or forking a chain is frowned upon because it shrinks the network and dilutes the benefit of a single shared ground truth.

Federation

Federation is an interesting combination of centralization and decentralization. From the user’s perspective, federated social networks work much like mainstream networks, except that they have several Facebook- or Twitter-like servers to choose from. Each server operates as an independent miniature social community. Server operators, however, have a different experience. Each operator can choose to federate with another server, bidirectionally linking the two servers, exchanging messages to create a larger collaborative social network. Collections of federated servers are referred to as a fediverse.

The most well-known federated social network is Mastodon, a Twitter-like cluster of servers. Each server has its own content policies and moderators, and usually only federates with servers with similar content policies. This lessens the likelihood of the network implementing extremely puritanical social policies, and allows easy migration if moderators on a single server go power-mad. When Gab (an alt-right Twitter clone) abandoned their own software stack and became a Mastodon instance they were universally condemned, and most other server operators refused to federate with the Gab instance, isolating them on their own server and proving the effectiveness of Mastodon’s moderation strategy.

Unfortunately, Mastodon also suffers from the singleton problem. While servers can federate, it is difficult to convince an operator of a major server to create a bidirectional link with a minor one, and there is little incentive to run a Mastodon server and less incentive if it is unfederated with major servers. As a result, three Mastodon servers contain almost 60% of the known Mastodon population.

The Diaspora Network, a decentralized Facebook-like network, also follows a federation model. Of note, they popularized clustering contacts by “aspects” of your life, so you can easily share content with some categories of peers and not others.

It’s worth pointing out that federation is far from a new concept. Internet Relay Chat also provides functionality for federating servers, synchronizing chatrooms and messages between two servers, though the process is very rarely used since it grants operators on both servers extreme power over conversations that occur on the other. Similarly, NNTP (the network news transfer protocol underlying Usenet) allows servers to exchange news postings and comments to create a shared community. In Usenet’s case federation was commonplace, and resulted in a singleton network similar to Reddit with far less moderation.

Pubs

Pubs (in the “public house”, inn/bar sense of the word) invert the idea of federation. Instead of users connecting to a single server and allowing the servers to interlink, users now connect to several servers, and serve as the links. In a pub model, servers are reduced to the role of content caches and introduction points: They allow users to leave messages for other users to find, allowing two participants to communicate asynchronously without being online at the same time for a peer-to-peer connection. They also allow users to leave messages publicly on the server, making it possible to introduce themselves and meet other users.

Since users are connected to multiple servers at once, they can post the same message to multiple servers concurrently, and rely on clients to recognize and combine the duplicates. This means users are not bound to a single server where they host their content, as in a federated service, but can store their content feed on multiple servers, more akin to a distributed hash table. Since servers have no federation, there is little cost to running a pub server. Unlike in federated spaces, a small pub can be valuable by providing a closed but shared conversation space, representing a tighter group of friends, or colleagues with a similar and obscure interest. Users can choose to only post some of their messages to these private spaces, moving between more public and more private content feeds.

The most popular pub-model network in use today is Secure Scuttlebutt, which works both via pub servers and via syncing with peers over WiFi networks, exchanging both their own messages and cached messages from friends and friends-of-friends (gossip). Notably, Scuttlebutt is offline-first: the entire “content feed” is stored locally, so you can browse and reply to messages while offline, and then re-sync when you next connect to a pub or are on a LAN with another peer. The entire network can theoretically run without pubs, purely on local network exchanges, and place no authority on pubs at all. Without a reliance on central servers there is also no clear opportunity for community moderation. Scuttlebutt supports a peer blocking individual users and hiding them from their content feed, but this scales poorly on a larger network.

The Open Privacy Research Society is working on their own pub-based chat network, Cwtch, with the twist that pubs can’t read the messages they host. Cwtch is more like a Signal group chat, minus the reliance on the central Signal servers, caching messages across a variety of servers and storing them locally. Cwtch operates entirely over Tor, reaching pubs and other peers via Tor onion services. When two peers are online at the same time they can exchange their message logs from the group, filling in each other’s blanks, and using a server only to leave messages asynchronously when peers are offline.

Pub-style networks have the distinct advantage of only caching content relevant to the user (or close to them in a social graph). Servers need to store considerably more content, but can auto-delete older messages to limit the load, since messages will live on in users’ local storage.

The Takeaways

DHTs, Blockchains, Federation, and Pubs provide distinctly anti-capitalist models for sharing social content without a patron corporation facilitating discussion. Each decentralized model has unique characteristics shaping the kinds of information sharing that are possible, the opportunity (and dangers) for moderation, and the kind of clustering that is likely to result. I’m personally most excited about pubs, while acknowledging the utility of DHT networks, but all four paradigms have borne fruit already and should be pursued.

The folks at SimplySecure (plus some community members like me!) are exploring decentralized design concepts at decentralization off the shelf.

Posted 11/21/20


RAIDs: Combining Hard Drives for Fun and Profit

After writing about decentralized data storage and torrents, I’ve had data storage on my mind. I drafted this post while setting up a RAID for the MeLa research group today, and built RAIDs for DDoSecrets a few weeks ago, so there’s a ton of “data storage” in my life right now and it only seems natural to quickly write up how large data storage on a centralized server works.

Hard drives get more expensive as storage capacity increases, and traditional spinning-plate hard drives have a limited lifespan, because spinning pieces and motors wear out relatively quickly from friction. Hard drives can read and write data more quickly when they spin faster, but this also wears out the drive more quickly. Therefore, getting a very big hard drive with a long lifespan that’s also fast becomes prohibitively expensive.

What if instead we could slap together several cheap and unreliable hard drives, and out of the harmony of garbage get high-capacity, reliable, high-speed data storage? A Redundant Array of Inexpensive Disks, or RAID, does exactly this. But how?

Types of RAIDs used today

Linear RAID

The most obvious way to “combine” two hard drives is to (metaphorically) glue one to the end of another. When the first hard drive is filled, start writing data to the second. We can create a virtual hard drive consisting of two or more physical hard drives glued together in this way, and write data to the virtual drive as if it’s one big drive.

Alright, that’s easy, but it’s lacking in a few ways. First, there’s no redundancy: If one of the cheap hard drives in our stack of three fails, we’ve just lost a third of our data. Second, it seems slow: Accessing most files will only use one hard drive at a time, and couldn’t we get more performance by using multiple drives at once?

RAID 0: Striping

A striped RAID works the same way as a linear RAID, but it splits data across all drives equally. If you had two drives then you’d put the even bytes on the first drive and the odd bytes on the second drive. Then when reading or writing data you use both drives at once, for half as long, and so in theory get twice the speed!

In the real world we don’t write “even and odd bytes” but rather “even and odd chunks of bytes” called stripes, because that’s faster in practice. Same idea.

Redundancy is still a problem with stripes, perhaps even more than with linear RAIDs: If a hard drive dies we now lose “all the even chunks of every file”, which makes our remaining data just about worthless.

RAID 1: Mirroring

Mirroring creates a perfect backup of a drive. Every time we write data to one drive, we also write the data to all the backup drives. If one of the drives dies, we seamlessly continue using the rest. When we replace a dead drive, we copy all the data from the other drives in the mirror.

When reading data we can actually get performance similar to striping, by reading some chunks from one drive while reading other chunks from a second drive. Only data writes need to be synchronized across all the drives. Mirrors limit you to the size of a single drive (if you have three 1-TB drives that are all perfect copies of one another, you only get one terabyte of “real storage”), and the write speed of a single drive, but the combined read speed of all your drives.

RAID 10 (1+0): Striped Mirrors

If we have four hard-drives we can easily combine last two strategies: Create two sets of mirrors, and stripe data across them.

We get the storage capacity of two out of the four drives, the write speed of two drives, the read speed of four drives, and redundancy. The redundancy is slightly unintuitive: We lose nothing if any one drive fails, and we lose nothing if a second drive fails and it wasn’t the mirrored copy of the first drive that failed. In other words, as long as we’re lucky and we still have a full copy of the data across some combination of drives, then we’re okay.

With more hard drives comes more flexibility. Six hard drives can be organized as three mirrors of two drives each, or two mirrors of three drives each. The administrator chooses a trade off between more storage, more speed, and more redundancy.

RAID 01 (0+1): Mirrored Stripes

Don’t do this. If we reverse the order, striping two drives together and then mirroring the data to another two drives, we conceptually get the same result as a RAID 10. In practice, however, a RAID 01 is more fragile. In most implementations, when one drive in a striped pair fails, the other half of that stripe is disabled too: the metadata tracking which stripes were allocated to which drives lived on the now-dead drive, shattering the puzzle. So when one drive in a RAID 01 fails, its striped partner also shuts down, reducing the 01 to a RAID 0. Don’t use 01, use 10.

RAID 5 / RAID Z

A RAID 5 distributes chunks across three or more drives so that any one drive can be lost without losing data. For example, assuming we have three drives, we can store even chunks on drive 1, odd chunks on drive 2, and the XOR of the two chunks on drive 3. Given any two drives, the information on the third drive can be re-created.

This lets us keep two thirds of the storage from our three drives, along with the read and write speed of two drives. Mostly a better trade off than a striped mirror! With four or more drives the dividends are even better, since we’ll get the storage capacity and read and write speed of three drives, then four, etc.
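
As a toy illustration of the parity idea (not how a real RAID controller is implemented), here is the XOR reconstruction in Python, assuming three equally sized chunks standing in for three drives:

#!/usr/bin/env python3
# Toy sketch of RAID 5 style parity with three "drives" holding byte chunks

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

chunk_a = b"even chunk of data"        # stored on drive 1
chunk_b = b"odd chunk of data!"        # stored on drive 2
parity  = xor_bytes(chunk_a, chunk_b)  # stored on drive 3

# Drive 1 dies: recreate its chunk from the two surviving drives
recovered = xor_bytes(parity, chunk_b)
assert recovered == chunk_a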

RAID 6 / RAID Z2

Same idea as RAID 5, but each stripe now carries two parity blocks, computed independently rather than as two copies of the same XOR. This means any two drives can fail without losing data, and a RAID 6 is only possible with at least four drives.

Antique RAIDs

Alright, we covered RAID 0, 1, 5, and 6, what happened to 2, 3, and 4? They’re all poor designs that have been retired in favor of 5 and 6. Here’s a brief run-down:

RAID 2

Same idea as a RAID 5, except information is striped at the bit-level across all the drives, and the drives use Hamming codes to provide redundancy and error correction. This means that all the drives must spin in sync, so you can only access one file at a time, and reading and writing at a bit level makes the configuration slower.

RAID 3

Same as RAID 2, but stripes at a byte-level instead of bit, and stores XORs of the bytes on the last drive, which is a dedicated “parity drive”. Again requires that all drives spin in sync.

RAID 4

Same as RAID 3, but stripes at a block-level instead of byte. This means read performance is much better (you can frequently read blocks for one file from two drives while reading blocks for another file from a third drive), but write performance is still poor, since parity blocks are all stored on a single drive, which must be used for all writes.

Hardware or Software?

Traditionally to build a RAID you need a special “RAID card” on the computer, which connects to all the relevant drives and implements the RAID, presenting a single “virtual hard drive” to the motherboard and operating system. On more modern systems you can produce a “software RAID” where the operating system has access to the individual drives and produces a RAID on its own, using tools like mdadm or ZFS. This is sometimes more efficient, especially with ZFS, where the filesystem and RAID software are integrated and can read and write more efficiently than with a virtual disk.

Which type of RAID is right for me?

Choosing the type of RAID you want is a decision about how much redundancy you need, versus capacity and speed. Many servers have multiple RAIDs for different purposes. One additional consideration is that most computers can only boot off of mirrored RAIDs. This is because the BIOS, the code burned into the motherboard that initializes enough of the hardware to find the operating system and start it, is very small and not so clever. Stripes and RAID 5 clusters are complicated, but a drive from a mirror can be treated like a single independent drive. The BIOS finds one drive in the mirror and uses it to start the operating system, which then realizes it’s on a mirrored RAID and picks up the other drives.

Therefore, one common server configuration is to use two or more SSDs in a mirrored RAID for booting. These drives contain the operating system and all software, can be read at an absolutely blazing speed, and have redundancy because of the mirror. Then additional conventional drives are placed in a RAID 5 or 6 for a decent trade on performance and capacity, creating a larger pool of drives for data.

Posted 10/22/20


Distributed Hash Tables and Decentralized Data Storage

This post is mostly theoretical computer science (data structures, distributed systems) leading up to future posts that can talk about the design of decentralized communities with an emphasis on social theory (self-governance, trust, responsibility for content)

I’m at the (virtual) computer-supported collaborative work conference this week, and it’s stirring many ideas related to shared governance of decentralized communities. Before digging into those ideas, though, one more interlude about technical underpinnings of decentralized systems…

The Problem Space

We have information on a big central server, and we would like to spread it across many servers. This can be for a variety of technical reasons, including:

  • Redundancy, if the central server goes offline

  • Performance, if users can connect to a variety of servers then the average workload per server will be much lower

  • Resource limitations, if a central server with enough storage, processing, or bandwidth to support all users at once is infeasible

There may also be social reasons for desiring distribution, such as removing trust in a single central entity that could delete or modify data at will, preferring instead a solution where multiple parties have copies of data and can disagree on governance policy.

There are two broad ways of solving “distribution” that at first seem quite different, but are forced to tackle similar problems:

  1. Everyone has a copy of all of the data

  2. Everyone has a piece of the data

Mirrored Data Stores

Taking the “simple” case first, let’s assume we want to mirror data across multiple servers, such that each has an identical copy of all information. This is often appropriate for load-balancers and content-distribution-networks, where we really want “50 copies of the same website, hosted by servers across the entire planet.”

This is very easy if the content never changes! Just have a single “content provider” upload data to each server, and have users connect to the content distribution servers.

The problem is slightly more complicated, but still not too bad, if the single content provider can send out an update. We may have a chaotic transition period where some CDN servers have updated and some have not, but in practice all servers will have the new content “pretty soon.” If content is pulled rather than pushed, meaning that the CDN servers periodically connect to the main server and check for a new version of the data rather than the main server connecting to each CDN server to upload content, then we’ll need some marker to determine whether content is “new”. Some of the more obvious options are:

  1. Always download content from the server, assume the server has the “ground truth”. Works, but wasteful.

  2. Download content if it has a newer timestamp than the timestamp of the previous data. This works, but timestamps are generally messy because computer clocks can drift and need to be periodically re-synchronized via NTP.

  3. Download content if it has a newer version number than the previous data. Same idea as the timestamp, but without the messiness of dealing with real-world “time”

This “versioning” lets us implement some helpful optimizations, like having CDN servers download updates from one another. CDN server 1 can download an update from the “main server”, while CDN servers 2 and 3 download from server 1, allowing the system to run smoothly even if the main server goes offline before servers 2 and 3 can be updated. All the CDN servers are always in agreement about what data is the “newest”, because a single source of ground truth increments a version number to disambiguate.

Let’s move to a messier problem: Server content is no longer static. Imagine collaborative editing software like Google Docs or Overleaf. Multiple users are making changes to a shared document, but that document doesn’t exist on a single server, but is rather spread across a range of servers for performance and redundancy. We must combine users’ edits to synchronize the servers and create a consistent view of the document.

We’ve lost the idea of single linear incrementing versions: Two users can add changes “simultaneously” (a loose definition, where “simultaneously” can just mean that the users made changes on two different servers before those servers had a chance to sync), and we need to come up with a deterministic ordering. Notice that timestamps don’t matter nearly as much as relative ordering and awareness: If Alice added a sentence to a paragraph, and Bob deleted that paragraph, then to combine the changes we need to know which edit came first, and whether Bob was aware of Alice’s edit at the time.

Lamport Vector Clocks

We can address the above challenges using a vector clock, which is basically a version number for each server indicating both what iteration of content the server is on and what updates it’s aware of from other servers.

When server 1 writes to server 2 it includes a list of all messages it doesn’t think server 2 knows about yet, based on the vector clock it received from server 2 the last time server 2 sent a message. That is, if server 1 has received (2,1,2) from server 2, it knows server 2 has “seen two messages from server 1, sent one message of its own, and seen two messages from server 3”. If server 1 has also received (0,0,3) from server 3, then server 1 knows about a message from server 3 that server 2 doesn’t know about. Therefore, when server 1 is ready to send a new message to server 2 it will first include the (0,0,3) message from server 3, followed by the new (3,1,3) message. In this way, it is not possible to receive a message without first receiving all the messages it depends on, guaranteeing an intact history.
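
As a minimal sketch of this bookkeeping (reusing the three-server example above; the function names are illustrative), a clock is just a tuple of per-server counters, and a message is already known to a receiver if its clock is component-wise less than or equal to the receiver’s:

# A message is already known to a receiver if every counter in the message's
# clock is <= the corresponding counter in the receiver's clock
def already_seen(msg_clock, receiver_clock):
    return all(m <= r for m, r in zip(msg_clock, receiver_clock))

server2_clock = (2, 1, 2)   # the last clock server 1 received from server 2
msg_from_3    = (0, 0, 3)   # server 3's third message

# False: server 1 should forward this message before its own new (3, 1, 3) message
print(already_seen(msg_from_3, server2_clock))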

Vector clocks assume all participants are truthful. If a server can lie about message timestamps or send multiple messages with the same timestamp then the “consistent world view” model can be trivially broken.

Notice that while we can use vector clocks to produce an optimal ordering of messages, we cannot eliminate all conflicts. Sometimes two users will introduce two conflicting changes, and both make incompatible changes to the same sentence. By frequently synchronizing servers we can make this scenario infrequent, but we need a resolution protocol like one of the following:

  1. Manual intervention (as with git merge conflicts)

  2. Automatic consensus for deciding which change to keep (as with blockchain stabilization when two competing blocks are mined)

  3. A ranking system for selecting a change (for example, if a user replies to a tweet while the original poster deletes their tweet, either always delete the reply, or create an empty “deleted post” for the new tweet to reply to)

We now have a protocol for ordering changes from different participants and resolving conflicts. This is far from the only solution: We can also build consensus protocols like Paxos that only accept and proliferate one change from one participant at a time, guaranteeing zero conflicts even in the face of equipment failure at the cost of significant delays and overhead and the inability to work “offline” (like with git) and then merge in changes later when you’re online. There are many design trade-offs in this space.

Distributed Hash Tables

So far we have described decentralized systems for ensuring that all participants end up with the same data at the end. What about distributing data across participants so users can look up information they’re interested in, without having to store the complete dataset? This is where we introduce distributed hash tables, or DHTs.

The premise is simple: Take a hash table (an efficient way of implementing the more abstract “associative array”, also called a “dictionary” or “key-value table”), and sprinkle the key-value pairs across multiple participant servers, in an equal and deterministic way. With a traditional hash table you hash the key to determine the position the value should be stored at - in a distributed hash table we hash the key to determine which participant the key-value pair should be stored at.

In the trivial case, a client would maintain a network connection to every participant in a distributed hash table. When they want to GET or PUT a value for a key, they hash the key, determine which participant is responsible, and send the GET or PUT directly to the node.
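
A sketch of that trivial lookup rule, assuming a fixed, globally known list of participants (the node addresses here are placeholders):

import hashlib

participants = ["node-0.example", "node-1.example", "node-2.example", "node-3.example"]

def responsible_node(key):
    # Hash the key, then map the hash onto one of the participants
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return participants[digest % len(participants)]

print(responsible_node("some-document-id"))   # where to send the GET or PUT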

Unfortunately, this scales poorly. If a DHT contains hundreds of thousands or millions of participants, expecting a client (or even a participant) to maintain millions of concurrent network connections would be unwieldy. Instead, we’ll employ a finger-table. Each participant will maintain links to the nodes 2^0 through 2^j ahead, where 2^j is less than the total number of participants. In other words, each participant keeps a logarithmic number of links:

To dive all in on computer science terminology, this guarantees that all lookups are O(log n). In a DHT with millions of nodes, lookups will take a maximum of 20 or so hops. Much worse than the O(1) lookup of a traditional hash table, but… pretty good. This trade off means clients can connect to any participant in the DHT to submit a request, and the request will quickly bounce around to the correct destination. One network connection for a client, a handful for participants of the DHT.
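
A quick sketch of why the hop count stays logarithmic, under the simplifying assumption of n sequentially numbered participants where each node links to the nodes 2^0, 2^1, 2^2, … positions ahead; each hop greedily uses the largest link that doesn’t overshoot the destination:

n = 1_000_000   # participants in the DHT

def route(start, destination):
    # Count the hops taken when every hop uses the largest power-of-two
    # link that does not pass the destination
    hops, current = 0, start
    while current != destination:
        distance = (destination - current) % n
        step = 1
        while step * 2 <= distance:
            step *= 2
        current = (current + step) % n
        hops += 1
    return hops

print(route(0, 999_999))   # a handful of hops, roughly log2(n) at worst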

Alright, so that’s how we store data in a DHT with a static structure. What about redundancy? How do we handle adding and removing nodes? How do we deploy a DHT in a chaotic peer-to-peer network rather than a data center?

Data Redundancy

For data redundancy, we can just store the key-value pairs in two locations! Instead of storing only at hash(key) % participants we can store at the original location and at (hash(key) + 1) % participants. For additional redundancy, store at + 2, etc. If the “location” of a participant in the DHT ring is random, then there’s no harm in storing at + 1. This is also convenient from a lookup perspective: We’re looking for data stored in participant 6, but participant 6 is offline? Send the query to participant 7 instead!
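
Continuing the earlier lookup sketch, the backup location is just the next participant (or next few) along the ring:

def responsible_nodes(key, copies=2):
    # Primary location plus the next (copies - 1) participants on the ring
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    primary = digest % len(participants)
    return [participants[(primary + k) % len(participants)] for k in range(copies)]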

What if a participant and its backup get out of sync? How do we decide which value is “correct” for a key? Well, that’s what we have Lamport vector clocks for!

Adding and Removing Nodes: Dynamic Routing Tables

Replacing a node in a DHT is simple: Contact every participant that links to the dead-node, and give them contact information to update their reference. This is relatively painless: O(log(n)^2) steps to send a RELINK message to all log(n) nodes with a link to the dead one.

Growing and shrinking the DHT is more challenging. The trivial solution, adding the new edges, informing all nodes of the new DHT size, and re-hashing and re-introducing all keys, is obviously too inefficient to be practical.

Let’s revise the structure of a DHT. Instead of numbering all of the nodes sequentially, 0 to n, what if each node has a large random number associated with it? To start with, just add a few zeros, and assume the nodes are numbered “0”, “100”, “200”, …, “1500”.

Now our key lookup mechanism is broken! If we run hash(key) % 1600 the vast majority of keys will be assigned to non-existent nodes! Alright, so let’s re-define the assignment: Keys are now assigned to the closest node number that comes before the “ideal” position. This means keys assigned to nodes “1400” through “1499” will be assigned to node “1400”, keys assigned to “1500” through “1599” will be assigned to node “1500”, and keys for nodes “0” through “99” will be assigned to node “0”.
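
A sketch of this assignment rule in Python, assuming the sparse node numbers from the example and a ring of size 1600; bisect finds the closest node number at or before a key’s ideal position:

import bisect
import hashlib

ring_size = 1600
nodes = list(range(0, 1600, 100))   # node numbers "0", "100", ..., "1500"

def owner(key):
    position = int(hashlib.sha1(key.encode()).hexdigest(), 16) % ring_size
    index = bisect.bisect_right(nodes, position) - 1   # closest node <= position
    return nodes[index]

print(owner("some-key"))   # e.g. a position of 1437 is assigned to node 1400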

Each node is still responsible for propagating a message forward through the network, until either the correct position is found, or it’s determined that the key does not exist in the DHT.

We’ll also need to change the linking in the network. Instead of linking to “+1”, “+2”, “+4”, “+8”, we’ll instead allocate each participant some “buckets”. These buckets will let a participant track links to “many nodes 1 or 2 distant”, “a moderate number 8 or 10 distant”, “a few 50 or 100 distant”, and so on. The same concept as a finger-table, just non-deterministic. If a participant doesn’t know any participants “about 100 away” they can instead send a lookup request to the known neighbors “about 50 away”, who are more likely to know neighbors that are closer to them.

This bucketing system makes it easier to introduce new participants: We don’t have to calculate all the participants that “should” have links to the current node number, we just have to send out an introduction, and nearby nodes are likely to add the new participant to their buckets, while distant nodes are unlikely to add the participant to their buckets. The same bucketing system is ideal for redundancy, because if a nearby neighbor goes offline (which we can check using a periodic ping/heartbeat system), a participant will have many other nearby participants in their bucket, and can continue operating without loss of connectivity. If one of the few distant links is lost, then the participant needs to send out a new lookup to find other distant peers to add to their finger-table buckets.

Therefore, when we add a new participant, say node “1355”, we need to send out an announcement. Many nearby participants will add “1355” to their finger-tables, and a handful of more distant nodes will, too. Key-value pairs destined for “1355” through “1399” will be re-allocated from node “1300” to our new participant, but will also be kept in “1300” and “1200” for redundancy, depending on the fault tolerance of the network.

This structure is still recognizably a DHT if we squint at it, but it’s a lot fuzzier now, with non-deterministic positioning and linking. Lookups are still deterministic, in that key-value pairs that exist in the network can reliably be found. We can also stabilize the structure of the DHT by adding an age-based probability function: Nodes that have been active for longer in the DHT (and are therefore likely to be online in the future) are more likely to be added to buckets, and more likely to be recommended in response to “find me more neighbor” requests. This means a new node will be added to many of its nearby peers, who keep large lists of nearby neighbors, but only long-lived nodes will be added to distant buckets. This means long hops across the DHT are much more likely to be reliable and efficient, and only once a lookup gets close to its destination, where participants have large redundant buckets, do connections become more chaotic.

DHTs in the Real-World

With the additions in the “dynamic routing tables” section, we’ve got a very approximate description of Kademlia, a widely used Distributed Hash Table model. BitTorrent, described in a recent blog post, uses a modified Kademlia DHT in place of a tracker, using trackers primarily for bootstrapping by introducing clients to participants in the DHT. The Invisible Internet Project (I2P) uses a modified Kademlia to track routers and routes connected to the network. Many cryptocurrencies use a bucket structure similar to Kademlia to introduce participants to other peers in the network, but since the blockchain isn’t a key-value storage system they don’t use a DHT for data storage.

Now that we have an understanding of how to build a decentralized content-sharing system with peer introduction and routing, we can move on to more interesting topics: How to build useful systems and communities on top of this communication protocol, and how to build valuable social frameworks on top of those communities. But that’s for another post…

Posted 10/21/20

