Digital Image Forensics is a field concerned with identifying whether images are original or have been edited, and if the latter, what kinds of edits have been made. I haven’t experimented much with digital forensics, but there’s overlap with steganography and neat encoding tricks like halftone QR codes. I’ve been reading about some forensic techniques in UC Berkeley’s “Tutorial on Digital Image Forensics” (200 page PDF), and Color Filter Array forensics is a fun one.
Digital cameras have hardware limitations that leave very specific patterns in the resulting images, and any photo edits will disrupt these patterns, unless the editor takes care to preserve them. Details follow.
Digital images usually consist of three color values per pixel, for red, green, and blue. However, most digital cameras don’t have any color sensors in them. Instead, they have a grid of light/luminosity sensors, and they add a layer of filters in front of the sensors that filter out all but red, all but green, or all but blue light. This is much cheaper to manufacture! But there’s a serious drawback: Each pixel can only record a single red, green, or blue sample, instead of all three.
One row consists of red/green/red filters, the following row consists of green/blue/green filters, then red/green/red, and so on. The result is that each pixel now has a real value for one color channel, and has two or more neighbors with a real value for each other color channel. We can approximate the missing color channels as an average of our neighbors’ color channels. For example, a red pixel will calculate its “blue” channel as the average of the neighbors in the four corners diagonal from its position, and will calculate its “green” channel as the average of the neighbors above, below, left, and right. This approximation is called a “de-mosaicking” algorithm.
De-mosaicking works okay, because how much will the red value change over the distance of a single pixel? Usually not by very much, unless there’s a sharp edge with high color contrast, in which case this approximation will make colors “bleed” slightly over the sharp edge. Newer cameras try to auto-detect these high-contrast borders and only approximate color channels using the neighbors on the same side of the border, but let’s ignore that for now.
While the simulated color data looks mostly correct to the human eye, it leaves an unnatural pattern in the numeric color values for each channel. Specifically, we know that each pixel will have two “simulated” channels that are the average of the same channel in each neighboring pixel with a real value for that channel. This should be easy to check in Python, Matlab, or your image-analysis language of choice:
#!/usr/bin/env python3 from PIL import Image import numpy as np from statistics import mean im = Image.open("bayer_filter_demosaicked.jpg") pixels = np.array(im) RED,GREEN,BLUE = [0,1,2] # .X. # ... # .X. def getVerticalAverage(pixels, i, j, channel): rows = pixels.shape if( i == 0 ): return pixels[i+1,j,channel] if( i == rows-1 ): return pixels[i-1,j,channel] return round(mean([pixels[i-1,j,channel],pixels[i+1,j,channel]])) # ... # X.X # ... def getHorizontalAverage(pixels, i, j, channel): cols = pixels.shape if( j == 0 ): return pixels[i,j+1,channel] if( j == cols-1 ): return pixels[i,j-1,channel] return round(mean([pixels[i,j-1,channel],pixels[i,j+1,channel]])) # X.X # ... # X.X def getDiagonalAverage(pixels, i, j, channel): rows = pixels.shape cols = pixels.shape corners =  if( i > 0 ): if( j > 0 ): corners.append(pixels[i-1,j-1,channel]) if( j < cols-1 ): corners.append(pixels[i-1,j+1,channel]) if( i < rows-1 ): if( j > 0 ): corners.append(pixels[i+1,j-1,channel]) if( j < cols-1 ): corners.append(pixels[i+1,j+1,channel]) return round(mean(corners)) def confirmEqual(i, j, color1, color2): if( color1 != color2 ): print("Anomaly detected at %d,%d (got %d, expected %d)" % (i,j, color1,color2)) # For every pixel, determine what 'real' color channel it has # then confirm that its interpolated channels match what we get # from de-mosaicking for i,row in enumerate(pixels): for j,col in enumerate(row): if( i % 2 == 0 ): # Red/Green row if( j % 2 == 0 ): # Red column correctGreen = mean([getHorizontalAverage(pixels,i,j,GREEN),getVerticalAverage(pixels,i,j,GREEN)]) correctBlue = getDiagonalAverage(pixels,i,j,BLUE) confirmEqual(i, j, pixels[i,j,GREEN], correctGreen) confirmEqual(i, j, pixels[i,j,BLUE], correctBlue) else: # Green column confirmEqual(i, j, pixels[i,j,RED], getHorizontalAverage(pixels,i,j,RED)) confirmEqual(i, j, pixels[i,j,BLUE], getVerticalAverage(pixels,i,j,BLUE)) else: # Green/Blue row if( j % 2 == 0 ): # Green column confirmEqual(i, j, pixels[i,j,RED], getVerticalAverage(pixels,i,j,RED)) confirmEqual(i, j, pixels[i,j,BLUE], getHorizontalAverage(pixels,i,j,BLUE)) else: # Blue column correctGreen = mean([getHorizontalAverage(pixels,i,j,GREEN),getVerticalAverage(pixels,i,j,GREEN)]) correctRed = getDiagonalAverage(pixels,i,j,RED) confirmEqual(i, j, pixels[i,j,RED], correctRed) confirmEqual(i, j, pixels[i,j,GREEN], correctGreen)
Of course, this is only possible if you know both which Color Filter Array the camera model that took the photo uses, and the details of their de-mosaicking algorithm. For now we’ll assume the basic case of “red/green + green/blue” and “average neighboring color channels ignoring high-contrast borders”. For more on color filter arrays and better de-mosaicking approaches, read here. Let’s also assume the image has only lossless compression, which is often the case for the highest export quality straight off a digital camera.
If anyone opens our camera’s photos in editing software like Photoshop or GIMP, and makes any color adjustments, they’ll break the de-mosaic pattern. If they use the clone/stamp tool, the stamped portions of the image won’t have color channels averaging their neighbors outside the stamped region, and the de-mosaic pattern will be broken. If they copy a portion of a different image into this one, the pattern will be broken.
Not only can we detect when an image has been altered in this way, we can detect where anomalies occur, and potentially highlight the exact changes made. Amending the above script, we’ll replace “reporting” an anomaly with highlighting anomalies:
# Turn all correct pixels 'red', leaving anomalies for further examination pixels2 = np.copy(pixels) def confirmEqual(i, j, color1, color2): global pixels2 if( color1 == color2 ): pixels2[i,j,RED] = 255 pixels2[i,j,GREEN] = 0 pixels2[i,j,BLUE] = 0
Since Photoshop/GIMP’s changes look “correct” to the human eye, the tools have done their job, and they have no incentive to make their changes undetectable to forensic analysis.
Unfortunately, this technique is far from flawless. There are two ways to defeat CFA anomaly detection:
Delete the appropriate RGB channels from each pixel after editing the image, and re-run the de-mosaicking algorithm to recreate the de-mosaic pattern. This requires the forger have the same knowledge as a forensic analyst regarding exactly what Color Filter Array and de-mosaicking approach their camera uses.
Export the image using a LOSSY compression algorithm, with the compression rate turned up high enough to perturb the channel values and destroy the de-mosaic pattern. This will make it obvious that the image has been re-saved since being exported from the camera, but will destroy the clear-cut evidence of which portions have been edited, if any.
All in all, a very cool forensic technique, and a fun introduction to the field!
Let’s take the following as a given:
Mainstream social media is harmful because it puts a single company in control of the human social experience and places them in the role of cultural censor
If you agree and want to get on with addressing the problem, skip ahead to the next section.
When we say “control of the human social experience” we refer to something like Elinor Ostrom’s Institutional Analysis and Development Framework (here’s a good paper on applying IAD to digital institutions if you want to dive in), which describes platforms in terms of the following rules:
Operational Rules: Describe the interface and what actions a user can take, such as tweeting and retweeting and liking
Collective Rules: Describe the higher-level context in which the operational rules are utilized, like how the Twitter content feed orders tweets based on what users you follow, how popular the tweet was in terms of retweets and likes, and how old the content is
Constitutional Rules: Describe by what process the operational, collective, and constitutional rules can be changed
In a corporate-owned social media network, the corporation has complete control over the operational and collective rules, and most of the control over the constitutional rules. There may be some external influence, such as summoning the CEOs in front of Congress for questioning, or threatening to amend FCC regulations governing Internet platforms, or DMCA takedown requests. Regardless, the users of the platform, those most affected by operational and collective rules, have almost no say over those rules.
The only influence a user has over a platform is the option to leave, which en masse might impact the company’s bottom line. However, if we assume users on social media generally want to be social, then they’ll want to migrate to another social platform when they leave, and all the major social media companies have similarly tight control over their platforms with remarkably similar acceptable content and moderation policies.
When we describe platforms as cultural censors, we mean that they decide what content they will permit on their platform and what is a banable offense. Across social sites like Tumblr and Facebook and infrastructure sites like Paypal and Patreon, we’ve repeatedly seen companies take a puritanical stance against LGBT+, sex-positive, and sex-work content. Violet Blue (fantastic!) has written lots about this. Simultaneously, platforms are extremely hesitant to censor white supremacist or neo-nazi content, because they do not want to be accused of political censorship or bias by Congress or the White House, and the aforementioned content is increasing adjacent to Republican political talking points.
So, corporate-run social media implies a structure the users have no say in, with content limits the users have no say in, which favor harmful and icky views while inhibiting freedom of expression. There’s no way out by hopping to another corporate-run social media platform, because the next platform has the same problems and policies. “I’ll just build my own platform” leads to the same pattern with a different set of oligarchs, as we’ve seen (and I’ve written about before) with a range of “alt-tech” platforms like Voat, Parler, Gab, and BitChute that were created to host right-wing extremist content banned from their mainstream counterparts.
To address this problem we need a radical new kind of social media, with no central governors. This new media should not be susceptible to a small oligarchy enforcing social views, but ideally should have some conception of moderation, so community spaces can remove harassers, abusers, spammers, and hateful content.
The clear solution is a decentralized network, where content isn’t hosted on a single server hosted by a single entity. There can be collective storage of social media data, with collective agreement on what content to support and how.
Okay, great! So what does “decentralized” look like? A distributed hash table? A blockchain? A bunch of mirrored servers? Let’s look before we leap and consider common decentralized network models and their implications when used in social media of any sort.
I recently described distributed hash tables in more depth, but the one line summary is that in a DHT users are connected directly to multiple peers, and data in the form of key-value pairs (like a Python dictionary) are distributed across all peers with some redundancy, so anyone can quickly look up information associated with a known key. Participants use a centralized introduction server (called a “tracker” for torrents) to find their initial peers, but this introduction point has little information and can be easily replaced or made redundant.
DHTs have two significant limitations. The first is that all content must be framed as a key-value pair, and it’s not always obvious how to design systems around this constraint. Nevertheless, a variety of projects use DHTs at their core, including Hyperswarm, the underlying peer-discovery system used by Hypercore (previously known as “DAT protocol”), which in turn is the peer-to-peer content sharing system powering chat applications like Cabal. DHTs are also at the heart of GNUnet (another generic p2p content sharing system used for file-sharing and chat), and a similar key-value routing technology is used in Freenet, which is aimed at distributed file hosting, microblogging, version-control systems, and other append-only shared data. Finally, DHTs are used for routing in the Invisible Internet Project, and for connecting to onion services in Tor, so it’s safe to say that DHTs are a common design pattern in decentralized spaces.
The larger limitation of DHTs is that they produce a singleton. While a distributed hash table is “distributed” in the sense that content can be scattered across a wide range of peers, it is centralized in the sense that there is only a single DHT network shared by all users, with a single shared namespace, storing content for all peers in the network. This may not always be desirable: users may not want to host content for peers outside their social circle, may not want to reveal their IP address to peers outside their circle, or may simply want to extend the DHT and change its functionality in a way incompatible with the larger community. While it is technically possible to run multiple competing DHTs with no connection to one another, utilizing separate introduction servers, this is strongly disincentivized, since DHTs gain performance, redundancy, reliability, and storage capacity with scale.
Blockchains are a form of data storage with two interesting attributes:
They form an append-only log that cannot be rewritten (once a block has proliferated and several more blocks are added to “solidify” its presence in the chain)
They are highly redundant
If those two attributes are not both valuable for your application, then a blockchain is the wrong choice for you. There are alternative solutions to append-only logs (like signed messages stored in a merkle tree, or Paxos), and to data redundancy (like a DHT, or client-stored messages that can be re-uploaded to peers later, as in sneakernets). But let’s look at the structure of a blockchain network.
Blockchain network traffic looks remarkably similar to DHT traffic, except that instead of using key-value pairs, every peer stores the complete data log:
This certainly provides more redundancy than a DHT, and maybe that level of redundancy is important if you’re building a ledger of all financial transactions ever and your entire economy relies on its stability. For most use-cases, the redundancy afforded by a DHT is sufficient, and requires far less storage for each peer. Blockchains also imply a significant computational cost to write any new data if the blockchain uses proof-of-work to ensure immutability. It’s an expensive and over-applied data structure.
We can address some of the limitations of blockchain using a sidechain, where the primary blockchain (or another data structure like a merkle tree) includes references to the heads of miniature blockchains that are smaller and can be updated more quickly. These sidechains can follow simpler rules than a public blockchain, such as allowing a user with a private key to craft new blocks instead of using a “proof of work” algorithm.
Signal, the centralized but end-to-end encrypted chat software, uses a chat protocol with some similarities to a blockchain. In the Signal protocol, each message in a conversation includes a reference to both the previous message that user sent, and the most recent messages that user has seen. This means Signal’s central server can’t ommit messages without clients noticing, and at worst can deny all service to a user. However, there is no proof-of-work involved in this chain; the only requirement for adding a new block is that it must be signed by the user, eliminating one of the largest limitations of a full blockchain.
Sidechains are also at the heart of Keybase, a Slack-like encrypted messaging and file hosting system that maintains a chain for each user to immutably store information about the user’s identity. Keybase also maintains each shared folder as a blockchain of write operations. Notably, however, Keybase is a centralized system that uses blockchains and signature verification to keep the server honest. The blockchains serve as a tamper-evidence mechanism that makes it difficult for the server to revert or manipulate messages (along with aggressive client caching), but the Keybase server is a sole central data repository for the network.
As with DHTs, blockchain networks form singletons (even if not run on a central server like Keybase), and running parallel blockchains or forking a chain is frowned upon because it shrinks the network and dilutes the benefit of a single shared ground truth.
Federation is an interesting combination of centralization and decentralization. From the user’s perspective, federated social networks work much like mainstream networks, except that they have several Facebook- or Twitter-like servers to choose from. Each server operates as an independent miniature social community. Server operators, however, have a different experience. Each operator can choose to federate with another server, bidirectionally linking the two servers, exchanging messages to create a larger collaborative social network. Collections of federated servers are referred to as a fediverse.
The most well-known federated social network is Mastodon, a Twitter-like cluster of servers. Each server has its own content policies and moderators, and usually only federates with servers with similar content policies. This lessens the likelihood of the network implementing extremely puritanical social policies, and allows easy migration if moderators on a single server go power-mad. When Gab (an alt-right Twitter clone) abandoned their own software stack and became a Mastodon instance they were universally condemned, and most other server operators refused to federate with the Gab instance, isolating them on their own server and proving the effectiveness of Mastodon’s moderation strategy.
Unfortunately, Mastodon also suffers from the singleton problem. While servers can federate, it is difficult to convince an operator of a major server to create a bidirectional link with a minor one, and there is little incentive to run a Mastodon server and less incentive if it is unfederated with major servers. As a result, three Mastodon servers contain almost 60% of the known Mastodon population.
The Diaspora Network, a decentralized Facebook-like network, also follows a federation model. Of note, they popularized clustering contacts by “aspects” of your life, so you can easily share contact with some categories of peers and not others.
It’s worth pointing out that federation is far from a new concept. Internet Relay Chat also provides functionality for federating servers, synchronizing chatrooms and messages between two servers, though the process is very rarely used since it grants operators on both servers extreme power over conversations that occur on the other. Similarly, NNTP (the network news transfer protocol underlying Usenet) allows servers to exchange news postings and comments to create a shared community. In Usenet’s case federation was commonplace, and resulted in a singleton network similar to Reddit with far less moderation.
Pubs (in the “public house”, inn/bar sense of the word) invert the idea of federation. Instead of users connecting to a single server and allowing the servers to interlink, users now connect to several servers, and serve as the links. In a pub model, servers are reduced to the role of content caches and introduction points: They allow users to leave messages for other users to find, allowing two participants to communicate asynchronously without being online at the same time for a peer-to-peer connection. They also allow users to leave messages publicly on the server, making it possible to introduce themselves and meet other users.
Since users are connected to multiple servers at once, they can post the same message to multiple servers concurrently, and rely on clients to recognize and combine the duplicates. This means users are not bound to a single server where they host their content, as in a federated service, but can store their content feed on multiple servers, more akin to a distributed hash table. Since servers have no federation, there is little cost to running a pub server. Unlike in federated spaces, a small pub can be valuable by providing a closed but shared conversation space, representing a tighter group of friends, or colleagues with a similar and obscure interest. Users can choose to only post some of their messages to these private spaces, moving between more public and more private content feeds.
The most popular pub-model network in use today is Secure Scuttlebutt, which works both via pub servers and via syncing with peers over WiFi networks, exchanging both their own messages and cached messages from friends and friends-of-friends (gossip). Notably, Scuttlebutt is offline-first: the entire “content feed” is stored locally, so you can browse and reply to messages while offline, and then re-sync when you next connect to a pub or are on a LAN with another peer. The entire network can theoretically run without pubs, purely on local network exchanges, and place no authority on pubs at all. Without a reliance on central servers there is also no clear opportunity for community moderation. Scuttlebutt supports a peer blocking individual users and hiding them from their content feed, but this scales poorly on a larger network.
The Open Privacy Research Society is working on their own pub-based chat network, Cwtch, with the twist that pubs can’t read the messages they host. Cwtch is more like a signal group chat, sans the reliance on the central Signal servers, by caching messages in a variety of servers and storing them locally. Cwtch operates entirely over Tor, reaching pubs and other peers via Tor onion services. When two peers are online at the same time they can exchange their message logs from the group, filling in each-other’s blanks, and using a server only to leave messages asynchronously when peers are offline.
Pub-style networks have the distinct advantage of only caching content relevant to the user (or close to them in a social graph). Servers need to store considerably more content, but can auto-delete older messages to limit the load, since messages will live on in users’ local storage.
DHTs, Blockchains, Federation, and Pubs provide distinctly anti-capitalist models for sharing social content without a patron corporation facilitating discussion. Each decentralized model has unique characteristics shaping the kinds of information sharing that are possible, the opportunity (and dangers) for moderation, and the kind of clustering that is likely to result. I’m personally most excited about pubs, while acknowledging the utility of DHT networks, but all four paradigms have borne fruit already and should be pursued.
After writing about decentralized data storage and torrents, I’ve had data storage on my mind. I drafted this post while setting up a RAID for the MeLa research group today, and built RAIDs for DDoSecrets a few weeks ago, so there’s a ton of “data storage” in my life right now and it only seems natural to quickly write up how large data storage on a centralized server works.
Hard drives get more expensive as storage capacity increases, and traditional spinning-plate hard drives have a limited lifespan, because spinning pieces and motors wear out relatively quickly from friction. Hard drives can read and write data more quickly when they spin faster, but this also wears out the drive more quickly. Therefore, getting a very big hard drive with a long lifespan that’s also fast becomes prohibitively expensive.
What if instead we could slap together several cheap and unreliable hard drives, and out of the harmony of garbage get high-capacity, reliable, high-speed data storage? A Redundant Array of Inexpensive Disks, or RAID, does exactly this. But how?
The most obvious way to “combine” two hard drives is to (metaphorically) glue one to the end of another. When the first hard drive is filled, start writing data to the second. We can create a virtual hard drive consisting of two or more physical hard drives glued together in this way, and write data to the virtual drive as if it’s one big drive.
Alright, that’s easy, but it’s lacking in a few ways. First, there’s no redundancy: If one of the cheap hard drives in our stack of three fails, we’ve just lost a third of our data. Second, it seems slow: Accessing most files will only use one hard drive at a time, and couldn’t we get more performance by using multiple drives at once?
A striped RAID works the same way as a linear RAID, but it splits data across all drives equally. If you had two drives then you’d put the even bytes on the first drive and the odd bytes on the second drive. Then when reading or writing data you use both drives at once, for half as long, and so in theory get twice the speed!
In the real world we don’t write “even and odd bytes” but rather “even and odd chunks of bytes” called stripes, because that’s faster in practice. Same idea.
Redundancy is still a problem with stripes, perhaps even more than with linear RAIDs: If a hard drive dies we now lose “all the even chunks of every file”, which makes our remaining data just about worthless.
Mirroring creates a perfect backup of a drive. Every time we write data to one drive, we also write the data to all the backup drives. If one of the drives dies, we seamlessly continue using the rest. When we replace a dead drive, we copy all the data from the other drives in the mirror.
When reading data we can actually get performance similar to striping, by reading some chunks from one drive while reading other chunks from a second drive. Only data writes need to be synchronized across all the drives. Mirrors limit you to the size of a single drive (if you have three 1-TB drives that are all perfect copies of one another, you only get one terabyte of “real storage”), and the write speed of a single drive, but the combined read speed of all your drives.
If we have four hard-drives we can easily combine last two strategies: Create two sets of mirrors, and stripe data across them.
We get the storage capacity of two out of the four drives, the write speed of two drives, the read speed of four drives, and redundancy. The redundancy is slightly unintuitive: We lose nothing if any one drive fails, and we lose nothing if a second drive fails and it wasn’t the mirrored copy of the first drive that failed. In other words, as long as we’re lucky and we still have a full copy of the data across some combination of drives, then we’re okay.
With more hard drives comes more flexibility. Six hard drives can be organized as three mirrors of two drives each, or two mirrors of three drives each. The administrator chooses a trade off between more storage, more speed, and more redundancy.
Don’t do this. If we reverse the order, striping two drives together and then mirroring the data to another two drives, we conceptually get the same result as a RAID 10. In practice however, a RAID 01 is more fragile. In most implementations, when one half of a stripe fails, the other half is disabled, too. Critical metadata, which tracks what stripes were allocated and placed on which drives, was kept on the now-dead drive, shattering the puzzle. Therefore when one drive in a RAID 01 fails, its striped partner also shuts down, reducing the 01 to a RAID 0. Don’t use 01, use 10.
A RAID 5 distributes chunks across three or more drives so that any one drive can be lost without losing data. For example, assuming we have three drives, we can store even chunks on drive 1, odd chunks on drive 2, and the XOR of the two chunks on drive 3. Given any two drives, the information on the third drive can be re-created.
This lets us keep two thirds of the storage from our three drives, along with the read and write speed of two drives. Mostly a better trade off than a striped mirror! With four or more drives the dividends are even better, since we’ll get the storage capacity and read and write speed of three drives, then four, etc.
Same idea as RAID 5, but the parity blocks (the XOR of the data blocks) are written to at least two drives. This means two drives can fail without losing data, and a RAID 6 is only possible with at least four drives.
Alright, we covered RAID 0, 1, 5, and 6, what happened to 2, 3, and 4? They’re all poor designs that have been retired in favor of 5 and 6. Here’s a brief run-down:
Same idea as a RAID 5, except information is striped at the bit-level across all the drives, and the drives use Hamming codes to provide redundancy and error correction. This means that all the drives must spin in sync, so you can only access one file at a time, and reading and writing at a bit level makes the configuration slower.
Same as RAID 2, but stripes at a byte-level instead of bit, and stores XORs of the bytes on the last drive, which is a dedicated “parity drive”. Again requires that all drives spin in sync.
Same as RAID 3, but stripes at a block-level instead of byte. This means read performance is much better (you can frequently read blocks for one file from two drives while reading blocks for another file from a third drive), but write performance is still poor, since parity blocks are all stored on a single drive, which must be used for all writes.
Traditionally to build a RAID you need a special “RAID card” on the computer, which connects to all the relevant drives and implements the RAID, presenting a single “virtual hard drive” to the motherboard and operating system. On more modern systems you can produce a “software RAID” where the operating system has access to the individual drives and produces a RAID on its own, using tools like mdadm or ZFS. This is sometimes more efficient, especially with ZFS, where the filesystem and RAID software are integrated and can read and write more efficiently than with a virtual disk.
Choosing the type of RAID you want is a decision about how much redundancy you need, versus capacity and speed. Many servers have multiple RAIDs for different purposes. One additional consideration is that most computers can only boot off of mirrored RAIDs. This is because the BIOS, the code burned into the motherboard that initializes enough of the hardware to find the operating system and start it, is very small and not so clever. Stripes and RAID 5 clusters are complicated, but a drive from a mirror can be treated like a single independent drive. The BIOS finds one drive in the mirror and uses it to start the operating system, which then realizes it’s on a mirrored RAID and picks up the other drives.
Therefore, one common server configuration is to use two or more SSDs in a mirrored RAID for booting. These drives contain the operating system and all software, can be read at an absolutely blazing speed, and have redundancy because of the mirror. Then additional conventional drives are placed in a RAID 5 or 6 for a decent trade on performance and capacity, creating a larger pool of drives for data.
This post is mostly theoretical computer science (data structures, distributed systems) leading up to future posts that can talk about the design of decentralized communities with an emphasis on social theory (self-governance, trust, responsibility for content)
I’m at the (virtual) computer-supported collaborative work conference this week, and it’s stirring many ideas related to shared governance of decentralized communities. Before digging into those ideas, though, one more interlude about technical underpinnings of decentralized systems…
We have information on a big central server, and we would like to spread it across many servers. This can be for a variety of technical reasons, including:
Redundancy, if the central server goes offline
Performance, if users can connect to a variety of servers then the average workload per server will be much lower
Resource limitations, if a central server with enough storage, processing, or bandwidth to support all users at once is infeasible
There may also be social reasons for desiring distribution, such as removing trust in a single central entity that could delete or modify data at will, preferring instead a solution where multiple parties have copies of data and can disagree on governance policy.
There are two broad ways of solving “distribution” that at first seem quite different, but are forced to tackle similar problems:
Everyone has a copy of all of the data
Everyone has a piece of the data
Taking the “simple” case first, let’s assume we want to mirror data across multiple servers, such that each has an identical copy of all information. This is often appropriate for load-balancers and content-distribution-networks, where we really want “50 copies of the same website, hosted by servers across the entire planet.”
This is very easy if the content never changes! Just have a single “content provider” upload data to each server, and have users connect to the content distribution servers.
The problem is slightly more complicated, but still not too bad, if the single content provider can send out an update. We may have a chaotic transition period where some CDN servers have updated and some have not, but in practice all servers will have the new content “pretty soon.” If content is pulled rather than pushed, meaning that the CDN servers periodically connect to the main server and check for a new version of the data rather than the main server connecting to each CDN server to upload content, then we’ll need some marker to determine whether content is “new”. Some of the more obvious options are:
Always download content from the server, assume the server has the “ground truth”. Works, but wasteful.
Download content if it has a newer timestamp than the timestamp of the previous data. This works, but timestamps are generally messy because computers clocks can drift and need to be periodically re-synchronized via NTP.
Download content if it has a newer version number than the previous data. Same idea as the timestamp, but without the messiness of dealing with real-world “time”
This “versioning” lets us implement some helpful optimizations, like having CDN servers download updates from one another. CDN server 1 can download an update from the “main server”, while CDN servers 2 and 3 download from server 1, allowing the system to run smoothly even if the main server goes offline before servers 2 and 3 can be updated. All the CDN servers are always in agreement about what data is the “newest”, because a single source of ground truth increments a version number to disambiguate.
Let’s move to a messier problem: Server content is no longer static. Imagine collaborative editing software like Google Docs or Overleaf. Multiple users are making changes to a shared document, but that document doesn’t exist on a single server, but is rather spread across a range of servers for performance and redundancy. We must combine users’ edits to synchronize the servers and create a consistent view of the document.
We’ve lost the idea of single linear incrementing versions: Two users can add changes “simultaneously” (a loose definition, where “simultaneously” can just mean that the users made changes on two different servers before those servers had a chance to sync), and we need to come up with a deterministic ordering. Notice that timestamps don’t matter nearly as much as relative ordering and awareness: If Alice added a sentence to a paragraph, and Bob deleted that paragraph, then to combine the changes we need to know which edit came first, and whether Bob was aware of Alice’s edit at the time.
We can address the above challenges using a vector clock, which is basically a version number for each server indicating both what iteration of content the server is on and what updates it’s aware of from other servers.
When server 1 writes to server 2 it includes a list of all messages it doesn’t think server 2 knows about yet, based on the vector clock it received from server 2 the last time server 2 sent a message. That is, if server 1 has received (2,1,2) from server 2, it knows server 2 has “seen two messages from server 1, sent one message of its own, and seen two messages from server 3”. If server 1 has also received (0,0,3) from server 3, then server 1 knows about a message from server 3 that server 2 doesn’t know about. Therefore, when server 1 is ready to send a new message to server 2 it will first include the (0,0,3) message from server 3, followed by the new (3,1,3) message. In this way, it is not possible to receive a message without first receiving all the messages it depends on, guaranteeing an intact history.
Vector clocks assume all participants are truthful. If a server can lie about message timestamps or send multiple messages with the same timestamp then the “consistent world view” model can be trivially broken.
Notice that while we can use vector clocks to produce an optimal ordering of messages, we cannot eliminate all conflicts. Sometimes two users will introduce two conflicting changes, and both make incompatible changes to the same sentence. By frequently synchronizing servers we can make this scenario infrequent, but we need a resolution protocol like one of the following:
Manual intervention (as with git merge conflicts)
Automatic consensus for deciding which change to keep (as with blockchain stabilization when two competing blocks are mined)
A ranking system for selecting a change (for example, if a user replies to a tweet while the original poster deletes their tweet, either always delete the reply, or create an empty “deleted post” for the new tweet to reply to)
We now have a protocol for ordering changes from different participants and resolving conflicts. This is far from the only solution: We can also build consensus protocols like Paxos that only accept and proliferate one change from one participant at a time, guaranteeing zero conflicts even in the face of equipment failure at the cost of significant delays and overhead and the inability to work “offline” (like with git) and then merge in changes later when you’re online. There are many design trade-offs in this space.
So far we have described decentralized systems for ensuring that all participants end up with the same data at the end. What about distributing data across participants so users can look up information they’re interested in, without having to store the complete dataset? This is where we introduce distributed hash tables, or DHTs.
The premise is simple: Take a hash table (an efficient way of implementing the more abstract “associative array”, also called a “dictionary” or “key-value table”), and sprinkle the key-value pairs across multiple participant servers, in an equal and deterministic way. With a traditional hash table you hash the key to determine the position the value should be stored at - in a distributed hash table we hash the key to determine which participant the key-value pair should be stored at.
In the trivial case, a client would maintain a network connection to every participant in a distributed hash table. When they want to GET or PUT a value for a key, they hash the key, determine which participant is responsible, and send the GET or PUT directly to the node.
Unfortunately, this scales poorly. If a DHT contains hundreds of thousands or millions of participants, expecting a client (or even a participant) to maintain millions of concurrent network connections would be unwieldy. Instead, we’ll employ a finger-table. Each participant will maintain links to the nodes 2^0 through 2^j ahead, where 2^j is less than the total number of participants. In other words, a logarithmic number of hops:
To dive all in on computer science terminology, this guarantees that all lookups are
O(log n). In a DHT with millions of nodes, lookups will take a maximum of 20 or so hops. Much worse than the
O(1) lookup of a traditional hash table, but… pretty good. This trade off means clients can connect to any participant in the DHT to submit a request, and the request will quickly bounce around to the correct destination. One network connection for a client, a handful for participants of the DHT.
Alright, so that’s how we store data in a DHT with a static structure. What about redundancy? How do we handle adding and removing nodes? How do we deploy a DHT in a chaotic peer-to-peer network rather than a data center?
For data redundancy, we can just store the key-value pairs in two locations! Instead of storing in
hash(key) % participants we can store in the original location and in
hash(key) + 1 % participants. For additional redundancy, store in
+ 2, etc. If the “location” of a participant in the DHT ring is random, then there’s no harm in storing in
+ 1. This is also convenient from a lookup perspective: We’re looking for data stored in participant 6, but participant 6 is offline? Send the query to participant 7 instead!
What if a participant and its backup get out of sync? How do we decide which value is “correct” for a key? Well, that’s what we have lamport vector clocks for!
Replacing a node in a DHT is simple: Contact every participant that links to the dead-node, and give them contact information to update their reference. This is relatively painless:
O(log(n)^2) steps to send a RELINK message to all
log(n) nodes with a link to the dead one.
Growing and shrinking the DHT is more challenging. The trivial solution, adding the new edges, informing all nodes of the new DHT size, and re-hashing and re-introducing all keys, is obviously too inefficient to be practical.
Let’s revise the structure of a DHT. Instead of numbering all of the nodes sequentially,
n, what if each node has a large random number associated with it? To start with, just add a few zeros, and assume the nodes are numbered “0”, “100”, “200”, …, “1500”.
Now our key lookup mechanism is broken! If we run
hash(key) % 1600 the vast majority of keys will be assigned to non-existent nodes! Alright, so let’s re-define the assignment: Keys are now assigned to the closest node number that comes before the “ideal” position. This means keys assigned to nodes “1400” through “1499” will be assigned to node “1400”, keys assigned to “1500” through “1599” will be assigned to node “1500”, and keys for nodes “0” through “99” will be assigned to node “0”.
Each node is still responsible for propagating a message forward through the network, until either the correct position is found, or it’s determined that the key does not exist in the DHT.
We’ll also need to change the linking in the network. Instead of linking to “+1”, “+2”, “+4”, “+8”, we’ll instead allocate each participant some “buckets”. These buckets will let a participant track links to “many nodes 1 or 2 distant”, “a moderate number 8 or 10 distant”, “a few 50 or 100 distant”, and so on. The same concept as a finger-table, just non-deterministic. If a participant doesn’t know any participants “about 100 away” they can instead send a lookup request to the known neighbors “about 50 away”, who are more likely to know neighbors that are closer to them.
This bucketing system makes it easier to introduce new participants: We don’t have to calculate all the participants that “should” have links to the current node number, we just have to send out an introduction, and nearby nodes are likely to add the new participant to their buckets, while distant nodes are unlikely to add the participant to their buckets. The same bucketing system is ideal for redundancy, because if a nearby neighbor goes offline (which we can check using a periodic ping/heartbeat system), a participant will have many other nearby participants in their bucket, and can continue operating without loss of connectivity. If one of the few distant links is lost, then the participant needs to send out a new lookup to find other distant peers to add to their finger-table buckets.
Therefore, when we add a new participant, say node “1355”, we need to send out an announcement. Many nearby participants will add “1355” to their finger-tables, and a handful of more distant nodes will, too. Key-value pairs destined for “1355” through “1399” will be re-allocated from node “1300” to our new participant, but will also be kept in “1300” and “1200” for redundancy, depending on the fault tolerance of the network.
This structure is still recognizably a DHT if we squint at it, but it’s a lot fuzzier now, with non-deterministic positioning and linking. Lookups are still deterministic, in that key-value pairs that exist in the network can reliably be found. We can also stabilize the structure of the DHT by adding an age-based probability function: Nodes that have been active for longer in the DHT (and are therefore likely to be online in the future) are more likely to be added to buckets, and more likely to be recommended in response to “find me more neighbor” requests. This means a new node will be added to many of its nearby peers, who keep large lists of nearby neighbors, but only long-lived nodes will be added to distant buckets. This means long hops across the DHT are much more likely to be reliable and efficient, and only once a lookup gets close to its destination, where participants have large redundant buckets, do connections become more chaotic.
With the additions in the “dynamic routing tables” section, we’ve got a very approximate description of Kademlia, a widely used Distributed Hash Table model. BitTorrent, described in a recent blog post, uses a modified Kademlia DHT in place of a tracker, using trackers primarily for bootstrapping by introducing clients to participants in the DHT. The Invisible Internet Protocol, I2P uses a modified Kademlia to track routers and routes connected to the network. Many cryptocurrencies use a bucket structure similar to Kademlia to introduce participants to other peers in the network, but since the blockchain isn’t a key-value storage system they don’t use a DHT for data storage.
Now that we have an understanding of how to build a decentralized content-sharing system with peer introduction and routing, we can move on to more interesting topics: How to build useful systems and communities on top of this communication protocol, and how to build valuable social frameworks on top of those communities. But that’s for another post…
This post is meant to be very approachable and an academic background is not expected. However, it has an academic flavor and deals with applied theory, in the same vein as previous posts on lambda calculus and parallel computing.
This is a post about graphing social relationships. These techniques are used for advertisement, propaganda, predicting disease spread, predicting future relationships (as in LinkedIn “you might know” suggestions), predicting ideology or opinion, and a variety of other tasks. Network science is widely used in academia, in “big data” corporations, and by governments. This post will serve as a brief crash course into network science through the lens of social media scraping, with an emphasis on different models for representing relationships, and their use-cases and shortcomings.
Say we want to identify the most influential members of a community. “Influential” is a ambiguous term, and we could be referring to “the most well-connected individual”, or “the people that bring in ideas from outside the community”, or some notion of a “trend-setter”. To explore any of those definitions, our first task is to identify the members of the community and how they are inter-related.
To start we’ll draw a node (or a vertex if you come from a math/graph-theory background instead of computer/network-science) for Bob. We’ll identify all of Bob’s peers: their friends on Facebook, mutuals on Twitter, contacts on LinkedIn, or whatever parallel makes sense for the platform we’re studying. We’ll create a node for each peer, and we’ll draw an edge between Bob’s node and their friends:
We have our first network (or graph in graph-theory terminology). We can use this network to identify important nodes for the community, which usually involves some of the following characteristics:
The nodes with the most connections (the highest degree)
The bridge nodes that connect two communities together (Bob connects Alice to Dave and Carol)
The central nodes (Betweenness Centrality is a score based on how many fastest routes between any two nodes pass through this node)
We can create a larger network with more meaningful data by looking at Alice, Carol, and Dave’s peers, and building out further and further. The only limits are time and available information from the platform we’re studying.
Not all relationships can be described bidirectionally, like a friendship. For example, Twitter allows one user to follow another without reciprocity. Retweets, likes, and replies are all a form of connection that may or may not be bidirectional. To capture this distinction, we need to add direction to the edges of our graph to create a directed graph or digraph:
This changes the attributes we can measure, but only a little. Instead of degree to indicate how many connections a node has, we now have indegree and outdegree to indicate how many edges lead into and out of the node. This is usually even better, since we can now distinguish users that many people listen to from users that follow many people. Our measurements of bridges and centrality can also utilize direction, tracing only the paths a message can flow from one user to the next through a community.
There may be several attributes that can be thought of as a “relationship”. Returning to the Twitter example again, we have the following relationships:
All of these could be represented as edges on a social graph, but each relationship type has a different implication. Retweets (not including quote-retweets) most strongly indicate “support” and a positive relationship, since one user is rebroadcasting the message of another without commentary. Mentions and replies, on the other hand, could be positive or could indicate arguments and distrust. Follows are also ambiguous, since many users will follow individuals for news purposes, like their politicians, regardless of whether they support those individuals. Volume may also vary significantly between relationship types: Users may “like” messages widely, but retweet or reply to more select content.
Therefore, while we could draw one graph with an edge indicating any of the above relationships, we probably want to separate them. This could mean creating a separate graph for each category of relationship, or it could mean adding edge attributes that indicate which type of relationship each edge refers to. We can also use edge attributes to encode data like the number of retweets, or the age of a response. Comparing the attributes can lead to interesting discoveries, such as identifying a universally despised user that’s frequently mentioned by members of a community but never uncritically retweeted, or a user that used to be regularly retweeted, but has fallen from grace and is no longer amplified.
In addition to metrics for individual nodes, we can take measurements of an entire community. For example the degree distribution illustrates whether a community has about equal engagement, or whether a minority of users massively stand out as being followed more, mentioned more, retweeted more, depending on what the edges represent. We can also define group-level measurements like insularity, indicating what percentage of retweets by users inside of a group are retweeting other members of the group versus retweeting people outside of the group.
Most of these measurements only make sense if we take a much larger network sample, growing from our example diagrams of four users above to tens or hundreds of thousands. The following is one such network graph from Twitter data, created with SocMap:
Of course, community-level analysis requires a clear definition of who is “in” a community and who is not. Sometimes there’s a convenient external data point: If our community is “MIT students and graduates on LinkedIn” then we can define our in-group based on users with MIT in the education section of their profiles with a low degree of error. If our community is “right-wing users” on a platform like Twitter or Facebook then maybe we can create a fuzzy metric that scores users that repeatedly link to right-wing websites or frequently post specific right-wing-affiliated phrases. Highly scored users are likely to be part of the in-group.
Given solely network data there are algorithms for trying to “autodetect” communities based on the assumption that people in a community tend to be linked to other members of the community, but these algorithms are never as reliable as using external data, and frequently depend on analyst-supplied information like the number of communities to split users into.
Networks are constrained by what information is available, and it’s important not to overstate their accuracy. For example, not every friend will be friends on Facebook or connections on LinkedIn, or several users may know one another through a mutual friend that isn’t on the platform. There will almost always be nodes and edges “missing” from a social network graph. Sometimes this missing data is of primary interest! For example, “You may know” suggested connections on LinkedIn are based on a simple algorithm:
Identify 2nd and 3rd degree connections (that is, connections of your connections who you are not connected to)
Sort the potential connections by a combination of:
Shared peers (more connections in common)
Shared place of employment
Shared skills and interests
Determining which attributes are most accurate predictors of a connection to optimize the above algorithm is a more difficult problem, and one LinkedIn has no doubt spent a great deal of time studying.
While networks are valuable tools for predicting patterns of behavior, it’s critical to remember that these network graphs represent only a slice of real-world connections. A snapshot of Twitter misses that many users may connect over Instagram, Facebook, or SMS, and messages spread across these “invisible” edges frequently.
The biggest limitation we’ve seen with graphs so far is that it assumes all relationships involve only two parties. This is frequently appropriate, and accurately describes most phone calls, emails, and text messages. Unfortunately, it’s just as frequently inappropriate: A group chat between three people is not the same as as three two-party conversations between each participant. There may be topics you would discuss in private that you wouldn’t discuss in shared company, or conversely information you would dismiss as a rumor if it were shared with you privately, but seems more believable if it’s shared with your full peer-group. The context of a conversation is critical for understanding how information will be shared or accepted. Further, we can’t even assume that a group context implies members can speak individually: members of a group project may only speak together and never independently.
The simplest way to model these group contexts is to extend our definition of a graph. What if an edge can connect three or more nodes together? We’ll call this a hyperedge to distinguish from traditional edges, and we’ll call graphs containing hyperedges hypergraphs. For now, we can represent a hyperedge as a dotted line encompassing all nodes within it:
Obviously this will be messy to draw with many intersecting hyperedges, but we can perform a lot of mathematical and computer-sciency analysis without visualizing the network we’re working with, so that’s far from a show-stopper.
Note that our example includes only undirected hyperedges. We may also desire a concept of directed hyperedges to represent multicast messages. For example, an emergency hurricane alert broadcast to every cellphone in a city represents a shared message context, but only the emergency service can send messages in the group. Alternatively, consider a Telegram group chat configured as an announcement service, so a dozen or so administrators can write to the group, but thousands can listen. For some types of analysis it may be appropriate to represent these “broadcasts” as many directed edges from the broadcaster to every listener, but if the group context is important to preserve then we need a directed hyperedge to model the conversation space.
Even directed hypergraphs have limitations, but to demonstrate them we’ll need to back up and explain an alternative solution to modeling group context.
The simplicial set represents group relationships with geometry. For example, if a triangle of edges represents three users with independent relationships with one another, then a filled triangle represents three users with a shared relationship with each-other:
If we want to represent a shared relationship between four individuals, we can switch to a tetrahedron (three sided pyramid, or a 3-dimensional version of a triangle). For five individuals, we create a 5-cell, the 4-dimensional equivalent of a triangle, and so on. Higher-dimensionality shapes rapidly become difficult to visualize, but it’s conceptually sound.
Multiple shapes in this geometric space can interact with one another. For example, consider two adjoining triangles:
We can describe DE in two ways. DE can be the shared edge between CDE and DEF, indicating a shared context in that DE is a sub-group that bridges the two larger triangles. However, we can also add an edge between D and E, indicating that they have a relationship outside of this shared bridge space.
Similarly, we can describe a tetrahedron either as the three-dimensional space encompassing four nodes, or as a union of three triangles, or as a combination of triangles and three-space. The difference in phrasing can represent a group of four people or a collaboration between multiple sub-groups.
Sub-grouping and intersection is extremely difficult to describe in a hypergraph. We can create a concept of a hyper-hyperedge which links two hyperedges together to simulate a metagroup, but this is at best an awkward substitute. A hyper-hyperedge still leaves great ambiguity distinguishing separate teams that communicate versus intersections between teams, and making a group that consists of some individuals and some other groups becomes messy very quickly. If we stick to hypergraphs we must content ourselves with representing many group dynamics as footnotes outside of the graph itself, which makes analysis extremely difficult.
Finally, simplicial sets are always directional. We can have multiple congruent but distinct triangles, ABC, BCA, CAB, ACB, and so on, which represent distinct social contexts involving the same three people. We can easily simulate undirected groups using simplicial sets (by sorting all participants before describing a group), but if directionality is desired to represent social hierarchy or multicast communication then the distinction is already built into simplicial group definitions.
Unfortunately, moving from theory to practice is more challenging. Simplicial sets are based on category theory and algebraic geometry, and the math involved reflects that. While there are well-developed software tools for working with undirected and directed graphs, there are few for hypergraphs, and almost none for simplicial sets, limiting present adoption outside of theoretical mathematical spaces.
This post provides an extremely high-level overview of network science within the context of building relationship maps from social media. It’s a flexible discipline, and network analysts spend as much time (if not more) determining which measurements are appropriate and what metrics mean in terms of real-world behavior as they do working through math and code. Because nodes and edges can represent such diverse phenomenon (and this post only scratches the surface without mentioning multi-layer and bipartite networks) most network analysis tools require significant configuration and code from analysts to produce meaningful results.
With that said, some of the versatile libraries used for network analysis in Python include NetworkX, iGraph, and graph-tool. While each library has a limited ability to render networks for visual inspection, most analysts turn to Gephi or (my personal favorite) Cytoscape to explore their networks and display them for publication.
For more on hypergraphs and simplicial sets, I found this paper to be approachable despite lacking any category theory background.