Peer-to-Peer Network Models and their Implications

Let’s take the following as a given:

Mainstream social media is harmful because it puts a single company in control of the human social experience and places them in the role of cultural censor

If you agree and want to get on with addressing the problem, skip ahead to the next section.

How We Got Here

When we say “control of the human social experience” we refer to something like Elinor Ostrom’s Institutional Analysis and Development Framework (here’s a good paper on applying IAD to digital institutions if you want to dive in), which describes platforms in terms of the following rules:

  1. Operational Rules: Describe the interface and what actions a user can take, such as tweeting and retweeting and liking

  2. Collective Rules: Describe the higher-level context in which the operational rules are utilized, like how the Twitter content feed orders tweets based on what users you follow, how popular the tweet was in terms of retweets and likes, and how old the content is

  3. Constitutional Rules: Describe by what process the operational, collective, and constitutional rules can be changed

In a corporate-owned social media network, the corporation has complete control over the operational and collective rules, and most of the control over the constitutional rules. There may be some external influence, such as summoning the CEOs in front of Congress for questioning, or threatening to amend FCC regulations governing Internet platforms, or DMCA takedown requests. Regardless, the users of the platform, those most affected by operational and collective rules, have almost no say over those rules.

The only influence a user has over a platform is the option to leave, which en masse might impact the company’s bottom line. However, if we assume users on social media generally want to be social, then they’ll want to migrate to another social platform when they leave, and all the major social media companies have similarly tight control over their platforms with remarkably similar acceptable content and moderation policies.

When we describe platforms as cultural censors, we mean that they decide what content they will permit on their platform and what is a bannable offense. Across social sites like Tumblr and Facebook and infrastructure sites like Paypal and Patreon, we’ve repeatedly seen companies take a puritanical stance against LGBT+, sex-positive, and sex-work content. Violet Blue (fantastic!) has written lots about this. Simultaneously, platforms are extremely hesitant to censor white supremacist or neo-nazi content, because they do not want to be accused of political censorship or bias by Congress or the White House, and the aforementioned content is increasingly adjacent to Republican political talking points.

So, corporate-run social media implies a structure the users have no say in, with content limits the users have no say in, which favor harmful and icky views while inhibiting freedom of expression. There’s no way out by hopping to another corporate-run social media platform, because the next platform has the same problems and policies. “I’ll just build my own platform” leads to the same pattern with a different set of oligarchs, as we’ve seen (and I’ve written about before) with a range of “alt-tech” platforms like Voat, Parler, Gab, and BitChute that were created to host right-wing extremist content banned from their mainstream counterparts.

Decentralizing Governance

To address this problem we need a radical new kind of social media, with no central governors. This new media should not be susceptible to a small oligarchy enforcing social views, but ideally should have some conception of moderation, so community spaces can remove harassers, abusers, spammers, and hateful content.

The clear solution is a decentralized network, where content isn’t hosted on a single server run by a single entity. There can be collective storage of social media data, with collective agreement on what content to support and how.

Okay, great! So what does “decentralized” look like? A distributed hash table? A blockchain? A bunch of mirrored servers? Let’s look before we leap and consider common decentralized network models and their implications when used in social media of any sort.

Distributed Hash Tables

I recently described distributed hash tables in more depth, but the one line summary is that in a DHT users are connected directly to multiple peers, and data in the form of key-value pairs (like a Python dictionary) are distributed across all peers with some redundancy, so anyone can quickly look up information associated with a known key. Participants use a centralized introduction server (called a “tracker” for torrents) to find their initial peers, but this introduction point has little information and can be easily replaced or made redundant.

DHTs have two significant limitations. The first is that all content must be framed as a key-value pair, and it’s not always obvious how to design systems around this constraint. Nevertheless, a variety of projects use DHTs at their core, including Hyperswarm, the underlying peer-discovery system used by Hypercore (previously known as “DAT protocol”), which in turn is the peer-to-peer content sharing system powering chat applications like Cabal. DHTs are also at the heart of GNUnet (another generic p2p content sharing system used for file-sharing and chat), and a similar key-value routing technology is used in Freenet, which is aimed at distributed file hosting, microblogging, version-control systems, and other append-only shared data. Finally, DHTs are used for routing in the Invisible Internet Project, and for connecting to onion services in Tor, so it’s safe to say that DHTs are a common design pattern in decentralized spaces.

The larger limitation of DHTs is that they produce a singleton. While a distributed hash table is “distributed” in the sense that content can be scattered across a wide range of peers, it is centralized in the sense that there is only a single DHT network shared by all users, with a single shared namespace, storing content for all peers in the network. This may not always be desirable: users may not want to host content for peers outside their social circle, may not want to reveal their IP address to peers outside their circle, or may simply want to extend the DHT and change its functionality in a way incompatible with the larger community. While it is technically possible to run multiple competing DHTs with no connection to one another, utilizing separate introduction servers, this is strongly disincentivized, since DHTs gain performance, redundancy, reliability, and storage capacity with scale.

Blockchains (Sigh)

Blockchains are a form of data storage with two interesting attributes:

  1. They form an append-only log that cannot be rewritten (once a block has proliferated and several more blocks are added to “solidify” its presence in the chain)

  2. They are highly redundant

If those two attributes are not both valuable for your application, then a blockchain is the wrong choice for you. There are alternative solutions to append-only logs (like signed messages stored in a merkle tree, or Paxos), and to data redundancy (like a DHT, or client-stored messages that can be re-uploaded to peers later, as in sneakernets). But let’s look at the structure of a blockchain network.

Blockchain network traffic looks remarkably similar to DHT traffic, except that instead of using key-value pairs, every peer stores the complete data log:

This certainly provides more redundancy than a DHT, and maybe that level of redundancy is important if you’re building a ledger of all financial transactions ever and your entire economy relies on its stability. For most use-cases, the redundancy afforded by a DHT is sufficient, and requires far less storage for each peer. Blockchains also imply a significant computational cost to write any new data if the blockchain uses proof-of-work to ensure immutability. It’s an expensive and over-applied data structure.

We can address some of the limitations of blockchain using a sidechain, where the primary blockchain (or another data structure like a merkle tree) includes references to the heads of miniature blockchains that are smaller and can be updated more quickly. These sidechains can follow simpler rules than a public blockchain, such as allowing a user with a private key to craft new blocks instead of using a “proof of work” algorithm.

Signal, the centralized but end-to-end encrypted chat software, uses a chat protocol with some similarities to a blockchain. In the Signal protocol, each message in a conversation includes a reference to both the previous message that user sent, and the most recent messages that user has seen. This means Signal’s central server can’t omit messages without clients noticing, and at worst can deny all service to a user. However, there is no proof-of-work involved in this chain; the only requirement for adding a new block is that it must be signed by the user, eliminating one of the largest limitations of a full blockchain.
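For intuition, here’s a minimal Python sketch of a hash-linked message log in the spirit described above. It is not Signal’s actual protocol (real messages are also signed and encrypted, and reference messages seen from other users), just an illustration of how chaining hashes makes dropped or rewritten messages detectable:

```python
import hashlib, json

def append_message(log, text):
    # Each new message commits to the hash of the previous message.
    prev_hash = log[-1]["hash"] if log else None
    body = {"prev": prev_hash, "text": text}
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)

def verify(log):
    # Recompute every hash; any dropped or altered message breaks the chain.
    prev = None
    for msg in log:
        body = {"prev": msg["prev"], "text": msg["text"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if msg["prev"] != prev or msg["hash"] != expected:
            return False
        prev = msg["hash"]
    return True

log = []
append_message(log, "hello")
append_message(log, "how are you?")
print(verify(log))   # True
del log[0]           # a server silently dropping a message...
print(verify(log))   # ...breaks the chain: False
```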

Sidechains are also at the heart of Keybase, a Slack-like encrypted messaging and file hosting system that maintains a chain for each user to immutably store information about the user’s identity. Keybase also maintains each shared folder as a blockchain of write operations. Notably, however, Keybase is a centralized system that uses blockchains and signature verification to keep the server honest. The blockchains serve as a tamper-evidence mechanism that makes it difficult for the server to revert or manipulate messages (along with aggressive client caching), but the Keybase server is a sole central data repository for the network.

As with DHTs, blockchain networks form singletons (even if not run on a central server like Keybase), and running parallel blockchains or forking a chain is frowned upon because it shrinks the network and dilutes the benefit of a single shared ground truth.

Federation

Federation is an interesting combination of centralization and decentralization. From the user’s perspective, federated social networks work much like mainstream networks, except that they have several Facebook- or Twitter-like servers to choose from. Each server operates as an independent miniature social community. Server operators, however, have a different experience. Each operator can choose to federate with another server, bidirectionally linking the two servers, exchanging messages to create a larger collaborative social network. Collections of federated servers are referred to as a fediverse.

The most well-known federated social network is Mastodon, a Twitter-like cluster of servers. Each server has its own content policies and moderators, and usually only federates with servers with similar content policies. This lessens the likelihood of the network implementing extremely puritanical social policies, and allows easy migration if moderators on a single server go power-mad. When Gab (an alt-right Twitter clone) abandoned their own software stack and became a Mastodon instance they were universally condemned, and most other server operators refused to federate with the Gab instance, isolating them on their own server and proving the effectiveness of Mastodon’s moderation strategy.

Unfortunately, Mastodon also suffers from the singleton problem. While servers can federate, it is difficult to convince an operator of a major server to create a bidirectional link with a minor one, and there is little incentive to run a Mastodon server and less incentive if it is unfederated with major servers. As a result, three Mastodon servers contain almost 60% of the known Mastodon population.

The Diaspora Network, a decentralized Facebook-like network, also follows a federation model. Of note, they popularized clustering contacts by “aspects” of your life, so you can easily share content with some categories of peers and not others.

It’s worth pointing out that federation is far from a new concept. Internet Relay Chat also provides functionality for federating servers, synchronizing chatrooms and messages between two servers, though the process is very rarely used since it grants operators on both servers extreme power over conversations that occur on the other. Similarly, NNTP (the network news transfer protocol underlying Usenet) allows servers to exchange news postings and comments to create a shared community. In Usenet’s case federation was commonplace, and resulted in a singleton network similar to Reddit with far less moderation.

Pubs

Pubs (in the “public house”, inn/bar sense of the word) invert the idea of federation. Instead of users connecting to a single server and allowing the servers to interlink, users now connect to several servers, and serve as the links. In a pub model, servers are reduced to the role of content caches and introduction points: They let users leave messages for other users to find, so two participants can communicate asynchronously without being online at the same time for a peer-to-peer connection. They also let users post messages publicly on the server, making it possible to introduce themselves and meet other users.

Since users are connected to multiple servers at once, they can post the same message to multiple servers concurrently, and rely on clients to recognize and combine the duplicates. This means users are not bound to a single server where they host their content, as in a federated service, but can store their content feed on multiple servers, more akin to a distributed hash table. Since servers have no federation, there is little cost to running a pub server. Unlike in federated spaces, a small pub can be valuable by providing a closed but shared conversation space, representing a tighter group of friends, or colleagues with a similar and obscure interest. Users can choose to only post some of their messages to these private spaces, moving between more public and more private content feeds.

The most popular pub-model network in use today is Secure Scuttlebutt, which works both via pub servers and via syncing with peers over WiFi networks, exchanging both their own messages and cached messages from friends and friends-of-friends (gossip). Notably, Scuttlebutt is offline-first: the entire “content feed” is stored locally, so you can browse and reply to messages while offline, and then re-sync when you next connect to a pub or are on a LAN with another peer. The entire network can theoretically run without pubs, purely on local network exchanges, and place no authority on pubs at all. Without a reliance on central servers there is also no clear opportunity for community moderation. Scuttlebutt supports a peer blocking individual users and hiding them from their content feed, but this scales poorly on a larger network.
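As a toy illustration of the gossip idea (not Scuttlebutt’s actual wire protocol), two peers can sync by comparing how much of each author’s append-only feed they hold and exchanging the missing tail. This sketch assumes feeds are strictly append-only, so a longer copy always contains the shorter one as a prefix:

```python
def sync(peer_a, peer_b):
    # peer_a and peer_b map author -> list of that author's messages in order.
    for author in set(peer_a) | set(peer_b):
        feed_a = peer_a.setdefault(author, [])
        feed_b = peer_b.setdefault(author, [])
        if len(feed_a) < len(feed_b):
            feed_a.extend(feed_b[len(feed_a):])   # catch peer A up
        else:
            feed_b.extend(feed_a[len(feed_b):])   # catch peer B up

alice = {"alice": ["hello", "offline thoughts"]}
bob = {"bob": ["hi all"], "alice": ["hello"]}
sync(alice, bob)
print(bob["alice"])   # ['hello', 'offline thoughts'] -- Bob caught up via gossip
```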

The Open Privacy Research Society is working on their own pub-based chat network, Cwtch, with the twist that pubs can’t read the messages they host. Cwtch is more like a Signal group chat, sans the reliance on the central Signal servers, by caching messages in a variety of servers and storing them locally. Cwtch operates entirely over Tor, reaching pubs and other peers via Tor onion services. When two peers are online at the same time they can exchange their message logs from the group, filling in each other’s blanks, and using a server only to leave messages asynchronously when peers are offline.

Pub-style networks have the distinct advantage of only caching content relevant to the user (or close to them in a social graph). Servers need to store considerably more content, but can auto-delete older messages to limit the load, since messages will live on in users’ local storage.

The Takeaways

DHTs, Blockchains, Federation, and Pubs provide distinctly anti-capitalist models for sharing social content without a patron corporation facilitating discussion. Each decentralized model has unique characteristics shaping the kinds of information sharing that are possible, the opportunity (and dangers) for moderation, and the kind of clustering that is likely to result. I’m personally most excited about pubs, while acknowledging the utility of DHT networks, but all four paradigms have borne fruit already and should be pursued.

The folks at SimplySecure (plus some community members like me!) are exploring decentralized design concepts at decentralization off the shelf.

Posted 11/21/20


RAIDs: Combining Hard Drives for Fun and Profit

After writing about decentralized data storage and torrents, I’ve had data storage on my mind. I drafted this post while setting up a RAID for the MeLa research group today, and built RAIDs for DDoSecrets a few weeks ago, so there’s a ton of “data storage” in my life right now and it only seems natural to quickly write up how large data storage on a centralized server works.

Hard drives get more expensive as storage capacity increases, and traditional spinning-platter hard drives have a limited lifespan, because spinning pieces and motors wear out relatively quickly from friction. Hard drives can read and write data more quickly when they spin faster, but this also wears out the drive more quickly. Therefore, getting a very big hard drive with a long lifespan that’s also fast becomes prohibitively expensive.

What if instead we could slap together several cheap and unreliable hard drives, and out of the harmony of garbage get high-capacity, reliable, high-speed data storage? A Redundant Array of Inexpensive Disks, or RAID, does exactly this. But how?

Types of RAIDs used today

Linear RAID

The most obvious way to “combine” two hard drives is to (metaphorically) glue one to the end of another. When the first hard drive is filled, start writing data to the second. We can create a virtual hard drive consisting of two or more physical hard drives glued together in this way, and write data to the virtual drive as if it’s one big drive.

Alright, that’s easy, but it’s lacking in a few ways. First, there’s no redundancy: If one of the cheap hard drives in our stack of three fails, we’ve just lost a third of our data. Second, it seems slow: Accessing most files will only use one hard drive at a time, and couldn’t we get more performance by using multiple drives at once?

RAID 0: Striping

A striped RAID works the same way as a linear RAID, but it splits data across all drives equally. If you had two drives then you’d put the even bytes on the first drive and the odd bytes on the second drive. Then when reading or writing data you use both drives at once, for half as long, and so in theory get twice the speed!

In the real world we don’t write “even and odd bytes” but rather “even and odd chunks of bytes” called stripes, because that’s faster in practice. Same idea.
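Here’s a rough sketch of the idea in Python, with an arbitrary, unrealistically small stripe size so the chunking is visible:

```python
STRIPE = 4  # bytes per stripe, tiny for illustration only

def stripe_write(data, num_drives):
    # Deal fixed-size chunks round-robin across the drives.
    drives = [bytearray() for _ in range(num_drives)]
    for i in range(0, len(data), STRIPE):
        drives[(i // STRIPE) % num_drives] += data[i:i + STRIPE]
    return drives

def stripe_read(drives):
    # Re-interleave the chunks in the same round-robin order.
    chunks, offsets, drive = [], [0] * len(drives), 0
    while offsets[drive] < len(drives[drive]):
        chunks.append(drives[drive][offsets[drive]:offsets[drive] + STRIPE])
        offsets[drive] += STRIPE
        drive = (drive + 1) % len(drives)
    return b"".join(chunks)

drives = stripe_write(b"hello raid zero striping!", 2)
print(drives)                 # even chunks on drive 0, odd chunks on drive 1
print(stripe_read(drives))    # original data reassembled from both drives
```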

Redundancy is still a problem with stripes, perhaps even more than with linear RAIDs: If a hard drive dies we now lose “all the even chunks of every file”, which makes our remaining data just about worthless.

RAID 1: Mirroring

Mirroring creates a perfect backup of a drive. Every time we write data to one drive, we also write the data to all the backup drives. If one of the drives dies, we seamlessly continue using the rest. When we replace a dead drive, we copy all the data from the other drives in the mirror.

When reading data we can actually get performance similar to striping, by reading some chunks from one drive while reading other chunks from a second drive. Only data writes need to be synchronized across all the drives. Mirrors limit you to the size of a single drive (if you have three 1-TB drives that are all perfect copies of one another, you only get one terabyte of “real storage”), and the write speed of a single drive, but the combined read speed of all your drives.

RAID 10 (1+0): Striped Mirrors

If we have four hard drives we can easily combine the last two strategies: Create two sets of mirrors, and stripe data across them.

We get the storage capacity of two out of the four drives, the write speed of two drives, the read speed of four drives, and redundancy. The redundancy is slightly unintuitive: We lose nothing if any one drive fails, and we lose nothing if a second drive fails and it wasn’t the mirrored copy of the first drive that failed. In other words, as long as we’re lucky and we still have a full copy of the data across some combination of drives, then we’re okay.

With more hard drives comes more flexibility. Six hard drives can be organized as three mirrors of two drives each, or two mirrors of three drives each. The administrator chooses a trade off between more storage, more speed, and more redundancy.

RAID 01 (0+1): Mirrored Stripes

Don’t do this. If we reverse the order, striping two drives together and then mirroring the data to another two drives, we conceptually get the same result as a RAID 10. In practice, however, a RAID 01 is more fragile: in most implementations, when one drive in a stripe fails, the other half of that stripe is disabled too, because the critical metadata tracking which stripes were allocated to which drives lived on the now-dead drive, shattering the puzzle. When one drive in a RAID 01 fails, its striped partner therefore shuts down with it, reducing the 01 to a plain RAID 0. Don’t use 01, use 10.

RAID 5 / RAID Z

A RAID 5 distributes chunks across three or more drives so that any one drive can be lost without losing data. For example, assuming we have three drives, we can store even chunks on drive 1, odd chunks on drive 2, and the XOR of the two chunks on drive 3. Given any two drives, the information on the third drive can be re-created.

This lets us keep two thirds of the storage from our three drives, along with the read and write speed of two drives. Mostly a better trade off than a striped mirror! With four or more drives the dividends are even better, since we’ll get the storage capacity and read and write speed of three drives, then four, etc.
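Here’s a minimal sketch of the parity trick with three pretend “drives”: two data chunks and their XOR. The chunk contents are invented for illustration, and any one missing chunk can be rebuilt from the other two:

```python
def xor_bytes(a, b):
    # XOR two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"even chunks "            # drive 1
data2 = b"odd chunks  "            # drive 2 (padded to the same length)
parity = xor_bytes(data1, data2)   # drive 3 stores the parity

# Drive 2 dies: rebuild its contents from drive 1 and the parity drive.
rebuilt = xor_bytes(data1, parity)
print(rebuilt == data2)            # True
```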

RAID 6 / RAID Z2

Same idea as RAID 5, but each stripe gets two parity blocks, stored on different drives (the second parity is computed differently from the simple XOR, so the two aren’t just copies of one another). This means two drives can fail without losing data, and a RAID 6 is only possible with at least four drives.

Antique RAIDs

Alright, we covered RAID 0, 1, 5, and 6, what happened to 2, 3, and 4? They’re all poor designs that have been retired in favor of 5 and 6. Here’s a brief run-down:

RAID 2

Same idea as a RAID 5, except information is striped at the bit-level across all the drives, and the drives use Hamming codes to provide redundancy and error correction. This means that all the drives must spin in sync, so you can only access one file at a time, and reading and writing at a bit level makes the configuration slower.

RAID 3

Same as RAID 2, but stripes at a byte-level instead of bit, and stores XORs of the bytes on the last drive, which is a dedicated “parity drive”. Again requires that all drives spin in sync.

RAID 4

Same as RAID 3, but stripes at a block-level instead of byte. This means read performance is much better (you can frequently read blocks for one file from two drives while reading blocks for another file from a third drive), but write performance is still poor, since parity blocks are all stored on a single drive, which must be used for all writes.

Hardware or Software?

Traditionally to build a RAID you need a special “RAID card” on the computer, which connects to all the relevant drives and implements the RAID, presenting a single “virtual hard drive” to the motherboard and operating system. On more modern systems you can produce a “software RAID” where the operating system has access to the individual drives and produces a RAID on its own, using tools like mdadm or ZFS. This is sometimes more efficient, especially with ZFS, where the filesystem and RAID software are integrated and can read and write more efficiently than with a virtual disk.

Which type of RAID is right for me?

Choosing the type of RAID you want is a decision about how much redundancy you need, versus capacity and speed. Many servers have multiple RAIDs for different purposes. One additional consideration is that most computers can only boot off of mirrored RAIDs. This is because the BIOS, the code burned into the motherboard that initializes enough of the hardware to find the operating system and start it, is very small and not so clever. Stripes and RAID 5 clusters are complicated, but a drive from a mirror can be treated like a single independent drive. The BIOS finds one drive in the mirror and uses it to start the operating system, which then realizes it’s on a mirrored RAID and picks up the other drives.

Therefore, one common server configuration is to use two or more SSDs in a mirrored RAID for booting. These drives contain the operating system and all software, can be read at an absolutely blazing speed, and have redundancy because of the mirror. Then additional conventional drives are placed in a RAID 5 or 6 for a decent trade on performance and capacity, creating a larger pool of drives for data.

Posted 10/22/20


Distributed Hash Tables and Decentralized Data Storage

This post is mostly theoretical computer science (data structures, distributed systems) leading up to future posts that can talk about the design of decentralized communities with an emphasis on social theory (self-governance, trust, responsibility for content)

I’m at the (virtual) computer-supported collaborative work conference this week, and it’s stirring many ideas related to shared governance of decentralized communities. Before digging into those ideas, though, one more interlude about technical underpinnings of decentralized systems…

The Problem Space

We have information on a big central server, and we would like to spread it across many servers. This can be for a variety of technical reasons, including:

  • Redundancy, if the central server goes offline

  • Performance, if users can connect to a variety of servers then the average workload per server will be much lower

  • Resource limitations, if a central server with enough storage, processing, or bandwidth to support all users at once is infeasible

There may also be social reasons for desiring distribution, such as removing trust in a single central entity that could delete or modify data at will, preferring instead a solution where multiple parties have copies of data and can disagree on governance policy.

There are two broad ways of solving “distribution” that at first seem quite different, but are forced to tackle similar problems:

  1. Everyone has a copy of all of the data

  2. Everyone has a piece of the data

Mirrored Data Stores

Taking the “simple” case first, let’s assume we want to mirror data across multiple servers, such that each has an identical copy of all information. This is often appropriate for load-balancers and content-distribution-networks, where we really want “50 copies of the same website, hosted by servers across the entire planet.”

This is very easy if the content never changes! Just have a single “content provider” upload data to each server, and have users connect to the content distribution servers.

The problem is slightly more complicated, but still not too bad, if the single content provider can send out an update. We may have a chaotic transition period where some CDN servers have updated and some have not, but in practice all servers will have the new content “pretty soon.” If content is pulled rather than pushed, meaning that the CDN servers periodically connect to the main server and check for a new version of the data rather than the main server connecting to each CDN server to upload content, then we’ll need some marker to determine whether content is “new”. Some of the more obvious options are:

  1. Always download content from the server, assume the server has the “ground truth”. Works, but wasteful.

  2. Download content if it has a newer timestamp than the timestamp of the previous data. This works, but timestamps are generally messy because computer clocks can drift and need to be periodically re-synchronized via NTP.

  3. Download content if it has a newer version number than the previous data. Same idea as the timestamp, but without the messiness of dealing with real-world “time”

This “versioning” lets us implement some helpful optimizations, like having CDN servers download updates from one another. CDN server 1 can download an update from the “main server”, while CDN servers 2 and 3 download from server 1, allowing the system to run smoothly even if the main server goes offline before servers 2 and 3 can be updated. All the CDN servers are always in agreement about what data is the “newest”, because a single source of ground truth increments a version number to disambiguate.

Let’s move to a messier problem: Server content is no longer static. Imagine collaborative editing software like Google Docs or Overleaf. Multiple users are making changes to a shared document, but that document doesn’t exist on a single server, but is rather spread across a range of servers for performance and redundancy. We must combine users’ edits to synchronize the servers and create a consistent view of the document.

We’ve lost the idea of single linear incrementing versions: Two users can add changes “simultaneously” (a loose definition, where “simultaneously” can just mean that the users made changes on two different servers before those servers had a chance to sync), and we need to come up with a deterministic ordering. Notice that timestamps don’t matter nearly as much as relative ordering and awareness: If Alice added a sentence to a paragraph, and Bob deleted that paragraph, then to combine the changes we need to know which edit came first, and whether Bob was aware of Alice’s edit at the time.

Lamport Vector Clocks

We can address the above challenges using a vector clock, which is basically a version number for each server indicating both what iteration of content the server is on and what updates it’s aware of from other servers.

When server 1 writes to server 2 it includes a list of all messages it doesn’t think server 2 knows about yet, based on the vector clock it received from server 2 the last time server 2 sent a message. That is, if server 1 has received (2,1,2) from server 2, it knows server 2 has “seen two messages from server 1, sent one message of its own, and seen two messages from server 3”. If server 1 has also received (0,0,3) from server 3, then server 1 knows about a message from server 3 that server 2 doesn’t know about. Therefore, when server 1 is ready to send a new message to server 2 it will first include the (0,0,3) message from server 3, followed by the new (3,1,3) message. In this way, it is not possible to receive a message without first receiving all the messages it depends on, guaranteeing an intact history.
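As a small illustration, here’s a sketch of the two core vector-clock operations, comparison and merge, using the clocks from the example above (the function names are just for illustration):

```python
def happened_before(a, b):
    # True if the event with clock `a` causally precedes the event with clock `b`.
    return all(x <= y for x, y in zip(a, b)) and a != b

def merge(a, b):
    # Combine knowledge from two clocks: element-wise maximum.
    return tuple(max(x, y) for x, y in zip(a, b))

seen_from_2 = (2, 1, 2)   # server 2's last known state, per the text
seen_from_3 = (0, 0, 3)   # server 3 sent something server 2 hasn't seen

print(happened_before(seen_from_3, seen_from_2))   # False: the 3rd component is newer
print(merge(seen_from_2, seen_from_3))             # (2, 1, 3) -- server 1's combined view
# Server 1's own next message then carries (3, 1, 3), as described above.
```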

Vector clocks assume all participants are truthful. If a server can lie about message timestamps or send multiple messages with the same timestamp then the “consistent world view” model can be trivially broken.

Notice that while we can use vector clocks to produce an optimal ordering of messages, we cannot eliminate all conflicts. Sometimes two users will introduce changes that are incompatible with one another, such as both editing the same sentence in different ways. By frequently synchronizing servers we can make this scenario infrequent, but we need a resolution protocol like one of the following:

  1. Manual intervention (as with git merge conflicts)

  2. Automatic consensus for deciding which change to keep (as with blockchain stabilization when two competing blocks are mined)

  3. A ranking system for selecting a change (for example, if a user replies to a tweet while the original poster deletes their tweet, either always delete the reply, or create an empty “deleted post” for the new tweet to reply to)

We now have a protocol for ordering changes from different participants and resolving conflicts. This is far from the only solution: We can also build consensus protocols like Paxos that only accept and proliferate one change from one participant at a time, guaranteeing zero conflicts even in the face of equipment failure at the cost of significant delays and overhead and the inability to work “offline” (like with git) and then merge in changes later when you’re online. There are many design trade-offs in this space.

Distributed Hash Tables

So far we have described decentralized systems for ensuring that all participants end up with the same data at the end. What about distributing data across participants so users can look up information they’re interested in, without having to store the complete dataset? This is where we introduce distributed hash tables, or DHTs.

The premise is simple: Take a hash table (an efficient way of implementing the more abstract “associative array”, also called a “dictionary” or “key-value table”), and sprinkle the key-value pairs across multiple participant servers, in an equal and deterministic way. With a traditional hash table you hash the key to determine the position the value should be stored at - in a distributed hash table we hash the key to determine which participant the key-value pair should be stored at.

In the trivial case, a client would maintain a network connection to every participant in a distributed hash table. When they want to GET or PUT a value for a key, they hash the key, determine which participant is responsible, and send the GET or PUT directly to the node.
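A minimal sketch of that trivial placement rule (the key name and participant count here are invented for the example):

```python
import hashlib

def peer_for(key, participants):
    # Hash the key and map it onto one of the participant servers.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % participants

participants = 8
print(peer_for("alice/profile", participants))   # which of the 8 servers holds this key
```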

Unfortunately, this scales poorly. If a DHT contains hundreds of thousands or millions of participants, expecting a client (or even a participant) to maintain millions of concurrent network connections would be unwieldy. Instead, we’ll employ a finger-table. Each participant will maintain links to the nodes 2^0 through 2^j ahead, where 2^j is less than the total number of participants. In other words, a logarithmic number of hops:

To dive all in on computer science terminology, this guarantees that all lookups are O(log n). In a DHT with millions of nodes, lookups will take a maximum of 20 or so hops. Much worse than the O(1) lookup of a traditional hash table, but… pretty good. This trade off means clients can connect to any participant in the DHT to submit a request, and the request will quickly bounce around to the correct destination. One network connection for a client, a handful for participants of the DHT.
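Here’s a rough sketch of that greedy, power-of-two routing over sequentially numbered nodes; it illustrates the idea rather than any particular DHT’s exact algorithm:

```python
def route(start, target, n):
    # Hop toward `target` using the largest power-of-two jump that doesn't overshoot.
    hops, current = [start], start
    while current != target:
        distance = (target - current) % n   # clockwise distance remaining
        step = 1
        while step * 2 <= distance:
            step *= 2
        current = (current + step) % n
        hops.append(current)
    return hops

print(route(0, 13, 16))   # e.g. [0, 8, 12, 13] -- O(log n) hops across 16 nodes
```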

Alright, so that’s how we store data in a DHT with a static structure. What about redundancy? How do we handle adding and removing nodes? How do we deploy a DHT in a chaotic peer-to-peer network rather than a data center?

Data Redundancy

For data redundancy, we can just store the key-value pairs in two locations! In addition to the original location at hash(key) % participants, we also store a copy at (hash(key) + 1) % participants. For additional redundancy, store at + 2, etc. If the “location” of a participant in the DHT ring is random, then there’s no harm in storing at + 1. This is also convenient from a lookup perspective: We’re looking for data stored in participant 6, but participant 6 is offline? Send the query to participant 7 instead!

What if a participant and its backup get out of sync? How do we decide which value is “correct” for a key? Well, that’s what we have Lamport vector clocks for!

Adding and Removing Nodes: Dynamic Routing Tables

Replacing a node in a DHT is simple: Contact every participant that links to the dead-node, and give them contact information to update their reference. This is relatively painless: O(log(n)^2) steps to send a RELINK message to all log(n) nodes with a link to the dead one.

Growing and shrinking the DHT is more challenging. The trivial solution, adding the new edges, informing all nodes of the new DHT size, and re-hashing and re-introducing all keys, is obviously too inefficient to be practical.

Let’s revise the structure of a DHT. Instead of numbering all of the nodes sequentially, 0 to n, what if each node has a large random number associated with it? To start with, just add a few zeros, and assume the nodes are numbered “0”, “100”, “200”, …, “1500”.

Now our key lookup mechanism is broken! If we run hash(key) % 1600 the vast majority of keys will be assigned to non-existent nodes! Alright, so let’s re-define the assignment: Keys are now assigned to the closest node number that comes before the “ideal” position. This means keys assigned to nodes “1400” through “1499” will be assigned to node “1400”, keys assigned to “1500” through “1599” will be assigned to node “1500”, and keys for nodes “0” through “99” will be assigned to node “0”.
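A sketch of that “closest preceding node” rule, using the hypothetical sparse node numbers above:

```python
import bisect, hashlib

node_ids = list(range(0, 1600, 100))   # the hypothetical nodes "0", "100", ..., "1500"

def responsible_node(key, node_ids, space=1600):
    # Assign the key to the closest node number at or before its ideal slot.
    slot = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % space
    i = bisect.bisect_right(node_ids, slot) - 1
    return node_ids[i]   # i == -1 would wrap around to the highest-numbered node

print(responsible_node("alice/profile", node_ids))   # some node at or below the key's slot
```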

Each node is still responsible for propagating a message forward through the network, until either the correct position is found, or it’s determined that the key does not exist in the DHT.

We’ll also need to change the linking in the network. Instead of linking to “+1”, “+2”, “+4”, “+8”, we’ll instead allocate each participant some “buckets”. These buckets will let a participant track links to “many nodes 1 or 2 distant”, “a moderate number 8 or 10 distant”, “a few 50 or 100 distant”, and so on. The same concept as a finger-table, just non-deterministic. If a participant doesn’t know any participants “about 100 away” they can instead send a lookup request to the known neighbors “about 50 away”, who are more likely to know neighbors that are closer to them.

This bucketing system makes it easier to introduce new participants: We don’t have to calculate all the participants that “should” have links to the current node number, we just have to send out an introduction, and nearby nodes are likely to add the new participant to their buckets, while distant nodes are unlikely to add the participant to their buckets. The same bucketing system is ideal for redundancy, because if a nearby neighbor goes offline (which we can check using a periodic ping/heartbeat system), a participant will have many other nearby participants in their bucket, and can continue operating without loss of connectivity. If one of the few distant links is lost, then the participant needs to send out a new lookup to find other distant peers to add to their finger-table buckets.

Therefore, when we add a new participant, say node “1355”, we need to send out an announcement. Many nearby participants will add “1355” to their finger-tables, and a handful of more distant nodes will, too. Key-value pairs destined for “1355” through “1399” will be re-allocated from node “1300” to our new participant, but will also be kept in “1300” and “1200” for redundancy, depending on the fault tolerance of the network.

This structure is still recognizably a DHT if we squint at it, but it’s a lot fuzzier now, with non-deterministic positioning and linking. Lookups are still deterministic, in that key-value pairs that exist in the network can reliably be found. We can also stabilize the structure of the DHT by adding an age-based probability function: Nodes that have been active for longer in the DHT (and are therefore likely to be online in the future) are more likely to be added to buckets, and more likely to be recommended in response to “find me more neighbor” requests. This means a new node will be added to many of its nearby peers, who keep large lists of nearby neighbors, but only long-lived nodes will be added to distant buckets. This means long hops across the DHT are much more likely to be reliable and efficient, and only once a lookup gets close to its destination, where participants have large redundant buckets, do connections become more chaotic.

DHTs in the Real-World

With the additions in the “dynamic routing tables” section, we’ve got a very approximate description of Kademlia, a widely used Distributed Hash Table model. BitTorrent, described in a recent blog post, uses a modified Kademlia DHT in place of a tracker, using trackers primarily for bootstrapping by introducing clients to participants in the DHT. The Invisible Internet Project (I2P) uses a modified Kademlia to track routers and routes connected to the network. Many cryptocurrencies use a bucket structure similar to Kademlia to introduce participants to other peers in the network, but since the blockchain isn’t a key-value storage system they don’t use a DHT for data storage.

Now that we have an understanding of how to build a decentralized content-sharing system with peer introduction and routing, we can move on to more interesting topics: How to build useful systems and communities on top of this communication protocol, and how to build valuable social frameworks on top of those communities. But that’s for another post…

Posted 10/21/20


Network Science for Social Modeling

This post is meant to be very approachable and an academic background is not expected. However, it has an academic flavor and deals with applied theory, in the same vein as previous posts on lambda calculus and parallel computing.

This is a post about graphing social relationships. These techniques are used for advertisement, propaganda, predicting disease spread, predicting future relationships (as in LinkedIn “you might know” suggestions), predicting ideology or opinion, and a variety of other tasks. Network science is widely used in academia, in “big data” corporations, and by governments. This post will serve as a brief crash course into network science through the lens of social media scraping, with an emphasis on different models for representing relationships, and their use-cases and shortcomings.

Network Science and Social Scraping

Say we want to identify the most influential members of a community. “Influential” is an ambiguous term, and we could be referring to “the most well-connected individual”, or “the people that bring in ideas from outside the community”, or some notion of a “trend-setter”. To explore any of those definitions, our first task is to identify the members of the community and how they are inter-related.

To start we’ll draw a node (or a vertex if you come from a math/graph-theory background instead of computer/network-science) for Bob. We’ll identify all of Bob’s peers: their friends on Facebook, mutuals on Twitter, contacts on LinkedIn, or whatever parallel makes sense for the platform we’re studying. We’ll create a node for each peer, and we’ll draw an edge between Bob’s node and their friends:

We have our first network (or graph in graph-theory terminology). We can use this network to identify important nodes for the community, which usually involves some of the following characteristics:

  • The nodes with the most connections (the highest degree)

  • The bridge nodes that connect two communities together (Bob connects Alice to Dave and Carol)

  • The central nodes (Betweenness Centrality is a score based on how many fastest routes between any two nodes pass through this node)

We can create a larger network with more meaningful data by looking at Alice, Carol, and Dave’s peers, and building out further and further. The only limits are time and available information from the platform we’re studying.
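As a concrete illustration, here’s how a couple of those measurements might look with NetworkX (one of the libraries mentioned at the end of this post) on the toy Bob/Alice/Carol/Dave network; the edges are invented for the example:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Bob", "Alice"), ("Bob", "Carol"), ("Bob", "Dave"),
                  ("Carol", "Dave")])

print(dict(G.degree()))                # most-connected nodes (highest degree)
print(nx.betweenness_centrality(G))    # central nodes that bridge the others
```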

Directional Relationships

Not all relationships can be described bidirectionally, like a friendship. For example, Twitter allows one user to follow another without reciprocity. Retweets, likes, and replies are all a form of connection that may or may not be bidirectional. To capture this distinction, we need to add direction to the edges of our graph to create a directed graph or digraph:

This changes the attributes we can measure, but only a little. Instead of degree to indicate how many connections a node has, we now have indegree and outdegree to indicate how many edges lead into and out of the node. This is usually even better, since we can now distinguish users that many people listen to from users that follow many people. Our measurements of bridges and centrality can also utilize direction, tracing only the paths a message can flow from one user to the next through a community.

There may be several attributes that can be thought of as a “relationship”. Returning to the Twitter example again, we have the following relationships:

  1. Follows

  2. Mentions

  3. Retweets

  4. Replies

All of these could be represented as edges on a social graph, but each relationship type has a different implication. Retweets (not including quote-retweets) most strongly indicate “support” and a positive relationship, since one user is rebroadcasting the message of another without commentary. Mentions and replies, on the other hand, could be positive or could indicate arguments and distrust. Follows are also ambiguous, since many users will follow individuals for news purposes, like their politicians, regardless of whether they support those individuals. Volume may also vary significantly between relationship types: Users may “like” messages widely, but retweet or reply to more select content.

Therefore, while we could draw one graph with an edge indicating any of the above relationships, we probably want to separate them. This could mean creating a separate graph for each category of relationship, or it could mean adding edge attributes that indicate which type of relationship each edge refers to. We can also use edge attributes to encode data like the number of retweets, or the age of a response. Comparing the attributes can lead to interesting discoveries, such as identifying a universally despised user that’s frequently mentioned by members of a community but never uncritically retweeted, or a user that used to be regularly retweeted, but has fallen from grace and is no longer amplified.
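One way to sketch the edge-attribute approach, again with NetworkX and invented users and counts, is a directed multigraph where each edge carries its relationship type:

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("Alice", "Bob", kind="retweet", count=3)
G.add_edge("Alice", "Bob", kind="reply", count=1)
G.add_edge("Carol", "Bob", kind="follow")

# Pull out only one category of relationship for analysis.
retweets = [(u, v) for u, v, d in G.edges(data=True) if d["kind"] == "retweet"]
print(retweets)               # [('Alice', 'Bob')]
print(G.in_degree("Bob"))     # 3 -- Bob is listened to more than he listens
```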

Community-Level Analysis

In addition to metrics for individual nodes, we can take measurements of an entire community. For example the degree distribution illustrates whether a community has about equal engagement, or whether a minority of users massively stand out as being followed more, mentioned more, retweeted more, depending on what the edges represent. We can also define group-level measurements like insularity, indicating what percentage of retweets by users inside of a group are retweeting other members of the group versus retweeting people outside of the group.

Most of these measurements only make sense if we take a much larger network sample, growing from our example diagrams of four users above to tens or hundreds of thousands. The following is one such network graph from Twitter data, created with SocMap:

Screenshot of a Twitter network graph produced with SocMap

Of course, community-level analysis requires a clear definition of who is “in” a community and who is not. Sometimes there’s a convenient external data point: If our community is “MIT students and graduates on LinkedIn” then we can define our in-group based on users with MIT in the education section of their profiles with a low degree of error. If our community is “right-wing users” on a platform like Twitter or Facebook then maybe we can create a fuzzy metric that scores users that repeatedly link to right-wing websites or frequently post specific right-wing-affiliated phrases. Highly scored users are likely to be part of the in-group.

Given solely network data there are algorithms for trying to “autodetect” communities based on the assumption that people in a community tend to be linked to other members of the community, but these algorithms are never as reliable as using external data, and frequently depend on analyst-supplied information like the number of communities to split users into.

Missing Data

Networks are constrained by what information is available, and it’s important not to overstate their accuracy. For example, not every friend will be friends on Facebook or connections on LinkedIn, or several users may know one another through a mutual friend that isn’t on the platform. There will almost always be nodes and edges “missing” from a social network graph. Sometimes this missing data is of primary interest! For example, “You may know” suggested connections on LinkedIn are based on a simple algorithm:

  1. Identify 2nd and 3rd degree connections (that is, connections of your connections who you are not connected to)

  2. Sort the potential connections by a combination of:

  • Shared peers (more connections in common)

  • Shared place of employment

  • Shared education

  • Shared skills and interests

Determining which attributes are most accurate predictors of a connection to optimize the above algorithm is a more difficult problem, and one LinkedIn has no doubt spent a great deal of time studying.
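A toy sketch of the second-degree step, ranking candidates only by shared peers (the other signals in the list above would be folded into the score in practice; the graph is invented):

```python
def suggestions(graph, user):
    # Score every friend-of-a-friend by how many direct connections we share.
    direct = graph[user]
    scores = {}
    for friend in direct:
        for candidate in graph[friend]:
            if candidate != user and candidate not in direct:
                scores[candidate] = scores.get(candidate, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

graph = {
    "you": {"ann", "raj"},
    "ann": {"you", "raj", "kim"},
    "raj": {"you", "ann", "kim", "lee"},
    "kim": {"ann", "raj"},
    "lee": {"raj"},
}
print(suggestions(graph, "you"))   # ['kim', 'lee'] -- kim shares two peers with you
```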

While networks are valuable tools for predicting patterns of behavior, it’s critical to remember that these network graphs represent only a slice of real-world connections. A snapshot of Twitter misses that many users may connect over Instagram, Facebook, or SMS, and messages spread across these “invisible” edges frequently.

Group Context and Hypergraphs

The biggest limitation we’ve seen with graphs so far is that they assume all relationships involve only two parties. This is frequently appropriate, and accurately describes most phone calls, emails, and text messages. Unfortunately, it’s just as frequently inappropriate: A group chat between three people is not the same as three two-party conversations between each participant. There may be topics you would discuss in private that you wouldn’t discuss in shared company, or conversely information you would dismiss as a rumor if it were shared with you privately, but seems more believable if it’s shared with your full peer-group. The context of a conversation is critical for understanding how information will be shared or accepted. Further, we can’t even assume that a group context implies members can speak individually: members of a group project may only speak together and never independently.

The simplest way to model these group contexts is to extend our definition of a graph. What if an edge can connect three or more nodes together? We’ll call this a hyperedge to distinguish from traditional edges, and we’ll call graphs containing hyperedges hypergraphs. For now, we can represent a hyperedge as a dotted line encompassing all nodes within it:

Obviously this will be messy to draw with many intersecting hyperedges, but we can perform a lot of mathematical and computer-sciency analysis without visualizing the network we’re working with, so that’s far from a show-stopper.

Note that our example includes only undirected hyperedges. We may also desire a concept of directed hyperedges to represent multicast messages. For example, an emergency hurricane alert broadcast to every cellphone in a city represents a shared message context, but only the emergency service can send messages in the group. Alternatively, consider a Telegram group chat configured as an announcement service, so a dozen or so administrators can write to the group, but thousands can listen. For some types of analysis it may be appropriate to represent these “broadcasts” as many directed edges from the broadcaster to every listener, but if the group context is important to preserve then we need a directed hyperedge to model the conversation space.
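For concreteness, a bare-bones way to represent hyperedges in code is simply as sets of members, with an optional sender set for the directed/broadcast case. This is a sketch with invented names, not a standard library API:

```python
hyperedges = [
    {"members": {"Alice", "Bob", "Carol"}, "senders": None},          # undirected group chat
    {"members": {"City", "Ann", "Raj", "Kim"}, "senders": {"City"}},  # broadcast-only alert channel
]

def groups_containing(person):
    # All shared contexts this person belongs to.
    return [e for e in hyperedges if person in e["members"]]

def can_send(person, edge):
    # In an undirected hyperedge anyone can speak; otherwise only listed senders.
    return edge["senders"] is None or person in edge["senders"]

print(len(groups_containing("Ann")))                      # 1
print(can_send("Ann", hyperedges[1]))                     # False -- listen-only member
```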

Complex Group Relationships and Simplicial Sets

Even directed hypergraphs have limitations, but to demonstrate them we’ll need to back up and explain an alternative solution to modeling group context.

The simplicial set represents group relationships with geometry. For example, if a triangle of edges represents three users with independent relationships with one another, then a filled triangle represents three users with a shared relationship with each-other:

If we want to represent a shared relationship between four individuals, we can switch to a tetrahedron (three sided pyramid, or a 3-dimensional version of a triangle). For five individuals, we create a 5-cell, the 4-dimensional equivalent of a triangle, and so on. Higher-dimensionality shapes rapidly become difficult to visualize, but it’s conceptually sound.

Multiple shapes in this geometric space can interact with one another. For example, consider two adjoining triangles:

We can describe DE in two ways. DE can be the shared edge between CDE and DEF, indicating a shared context in that DE is a sub-group that bridges the two larger triangles. However, we can also add an edge between D and E, indicating that they have a relationship outside of this shared bridge space.

Similarly, we can describe a tetrahedron either as the three-dimensional space encompassing four nodes, or as a union of three triangles, or as a combination of triangles and three-space. The difference in phrasing can represent a group of four people or a collaboration between multiple sub-groups.

Sub-grouping and intersection is extremely difficult to describe in a hypergraph. We can create a concept of a hyper-hyperedge which links two hyperedges together to simulate a metagroup, but this is at best an awkward substitute. A hyper-hyperedge still leaves great ambiguity distinguishing separate teams that communicate versus intersections between teams, and making a group that consists of some individuals and some other groups becomes messy very quickly. If we stick to hypergraphs we must content ourselves with representing many group dynamics as footnotes outside of the graph itself, which makes analysis extremely difficult.

Finally, simplicial sets are always directional. We can have multiple congruent but distinct triangles, ABC, BCA, CAB, ACB, and so on, which represent distinct social contexts involving the same three people. We can easily simulate undirected groups using simplicial sets (by sorting all participants before describing a group), but if directionality is desired to represent social hierarchy or multicast communication then the distinction is already built into simplicial group definitions.

Unfortunately, moving from theory to practice is more challenging. Simplicial sets are based on category theory and algebraic geometry, and the math involved reflects that. While there are well-developed software tools for working with undirected and directed graphs, there are few for hypergraphs, and almost none for simplicial sets, limiting present adoption outside of theoretical mathematical spaces.

Conclusion and Real-World Tools

This post provides an extremely high-level overview of network science within the context of building relationship maps from social media. It’s a flexible discipline, and network analysts spend as much time (if not more) determining which measurements are appropriate and what metrics mean in terms of real-world behavior as they do working through math and code. Because nodes and edges can represent such diverse phenomena (and this post only scratches the surface without mentioning multi-layer and bipartite networks) most network analysis tools require significant configuration and code from analysts to produce meaningful results.

With that said, some of the versatile libraries used for network analysis in Python include NetworkX, iGraph, and graph-tool. While each library has a limited ability to render networks for visual inspection, most analysts turn to Gephi or (my personal favorite) Cytoscape to explore their networks and display them for publication.

For more on hypergraphs and simplicial sets, I found this paper to be approachable despite lacking any category theory background.

Posted 9/24/20


What is a Supercomputer?

This will be another introductory academic post like the last post explaining how torrents work.

We’ve all seen references to “supercomputers” in popular culture, run by institutions like NASA, the Chinese government, Bond villains, and other nefarious groups. But what is a supercomputer, and what distinguishes one from a “normal” computer? Surprisingly, this isn’t even discussed in the curriculums of many computer science programs unless you happen to take electives in parallel computing.

Wittgenstein, the greatest supercomputer ever

The Basics

Supercomputers, better called cluster computers and often referred to as high performance computing (HPC), consist of racks of conventional computers, tied together with special interlinks to share information as quickly as possible, and loaded with software to run pieces of a program across each of the computers in the racks. Whereas most desktop and laptop computers have a single processor, allowing them to do only one thing at once (or, with a 4-core or 8-core processor, to almost do 4 things or 8 things at once), a supercomputer consists of dozens to tens of thousands of CPUs, and up to millions of cores, allowing it to run many tasks concurrently. Notably, the processors inside aren’t any different than the ones in a desktop, and certainly aren’t any faster: Many of the computers on the Top500 High Performance Computers list run Intel Xeons, and some clusters are clocked as low as 1.45 Gigahertz. If you could somehow run the latest Halo game on a supercomputer there’d be no meaningful speed-up over your home computer. Code must be written specifically to take advantage of the enormous parallelism available on a cluster computer to achieve any performance gain.

What workloads benefit from this kind of parallelism? Mostly large simulation work: weather prediction, epidemic spread, economic impact estimation, industrial engineering to design boxes that can be moved quickly on an assembly line without tipping over, and so on. These are usually simulations with a large number of variables, where it is desirable to run a hundred thousand slightly different configurations of the model and determine the optimal, average, or worst-case outcomes. In other words, these are all problems that require an enormous number of calculations that mostly do not depend on one another, and so do not have to be run sequentially.
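As a sketch of why these workloads parallelize so well, here's a hypothetical parameter sweep in Python: every configuration runs independently, so the runs can be farmed out to as many cores (or, with message passing, as many cluster nodes) as are available:

```python
from multiprocessing import Pool
import random

def run_simulation(config):
    # Stand-in for an expensive model run that depends only on its own config
    random.seed(config["seed"])
    return config["growth_rate"] * random.random()

# A hundred thousand slightly different configurations of the model
configs = [{"seed": i, "growth_rate": 1.0 + i * 1e-5} for i in range(100_000)]

if __name__ == "__main__":
    with Pool() as pool:  # one worker process per local CPU core
        results = pool.map(run_simulation, configs)
    print(min(results), sum(results) / len(results), max(results))
```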

The Hardware

We alluded earlier to the hardware interlinks in clusters being a "magic sauce" that makes everything so quick. Before discussing the software written for these magic interlinks, we should dig deeper into how they work.

Most cluster systems include some kind of peer-to-peer network with very unusual properties: usually the network hardware can write directly into userspace memory, the network itself can handle operations like receiving multiple messages and adding them together before delivery, and it all runs very quickly, with as much networking logic implemented in hardware as possible. For those familiar with Internet networking, these networks are usually similar to UDP in that there's no need for fault tolerance, guaranteed delivery, or checksumming if the cables are high enough quality to ensure zero data loss, and routing is much simpler since the entire network topology is fixed and predefined.

So that’s the hardware link, but equally important is the network topology, or which computers are linked to which others. This networking hardware is extraordinarily expensive, so linking every node to every other is infeasible, and for most programs wouldn’t give much of a performance boost anyway. Supercomputer designers must make tradeoffs to allow information to be distributed through the cluster efficiently using as few links as possible.

Some supercomputers use a simple Fat Tree topology where high level routers forward messages to “pods” of compute nodes:

This is appropriate for simple workloads where each node in the cluster needs to receive information at the start and then works independently until results are combined at the end. However, for any workload where nodes regularly need to share data with one another this puts a great deal of strain on the switches, and introduces latency in larger trees.

Some cluster systems, like the now-retired IBM Blue Gene series, use a Torus topology that organizes nodes into a rectangular prism with links along every axis, wrapping around each row and column. The Blue Gene systems used 3-dimensional and 5-dimensional torus networks, but we've limited ourselves to two dimensions to simplify the diagram:

Other supercomputers use radically different topologies, like the Cray butterfly network, which lacks the wrap-around flexibility of a Torus but can quickly distribute and re-combine top-level results using few links:

Each of these network structures changes the number of hops required to send information from one node to another, and whether there are local “groupings” of compute nodes that can communicate quickly without sending messages to distant nodes.
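To make "number of hops" concrete, here's a small NetworkX sketch (toy-sized, with a balanced tree standing in for a fat tree) comparing how far messages travel on average in two topologies:

```python
import networkx as nx

# 64 compute nodes arranged as an 8x8 torus: links along each axis,
# wrapping around every row and column
torus = nx.grid_2d_graph(8, 8, periodic=True)

# A toy tree: one root switch, 4 "pod" switches, and 16 compute nodes (4 per pod)
tree = nx.balanced_tree(r=4, h=2)

print(nx.average_shortest_path_length(torus))  # roughly 4 hops between nodes
print(nx.average_shortest_path_length(tree))   # pairs in different pods must cross the root
```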

The Software

Now we have a cluster of computers, wired into an elaborate communications network using custom, very high-performance interlinks. Cool, but how do we write code that actually uses that architecture? Most supercomputers use some variant of the Message Passing Interface (MPI), such as OpenMPI, to describe parallel operations.

From the programmer's perspective, an identical copy of their program runs on every compute node in the cluster, except that each copy is aware of both how many nodes exist and the number of its own node in the cluster (its "rank", in MPI terms). For anyone used to systems programming, think "the program has been forked once for each node before the first line of main".

The program then loads data into each node, either by loading all the data into one node and distributing it, or by using a networked file system so that each node can directly read the starting data relevant to its work.
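Using the mpi4py bindings as a stand-in for whatever MPI variant a given cluster provides, that "identical copy on every node" model looks roughly like this (the data path is hypothetical):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this copy's node number, 0 through size - 1
size = comm.Get_size()  # how many copies of the program are running

# Hypothetical networked-filesystem layout: each node reads only its own chunk
# input_path = f"/shared/input/chunk_{rank:04d}.dat"
print(f"Hello from rank {rank} of {size}")
```

Launched with something like `mpirun -n 64 python simulation.py`, the same script starts 64 times, once per requested process, and each copy immediately knows which piece of the work is its own.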

The Message Passing Interface defines a small set of basic operations that form the building blocks of parallel programming:

  • Scatter: Take an array and send a subset of the array to each node in a list

  • Gather: Take a small array from each node in a list and combine into a single large array on the gathering node

  • Send / Recv: Send a single message directly to another node, or block while waiting to receive a message from another node

  • Barrier: Similar to a multi-process breakpoint, all processes must reach this line in the code before they can proceed, synchronizing the nodes for scatter and gather operations

Since each node is a separate process with independent memory, there are few shared resources between nodes, and usually none of the complexities around threading, mutexes, and race conditions unless a process uses multithreading internally. Data sharing between nodes happens entirely via send and receive calls or synchronized scatters and gathers, making it (relatively) easy to track data dependencies and avoid collisions.
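Here's a hedged sketch of those operations, again using mpi4py (the doubling "work" is just a placeholder): rank 0 scatters chunks of an array, every rank processes its own chunk, and the results are gathered back on rank 0:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(size * 4))
    # Split the array into one chunk per node
    chunks = [data[i * 4:(i + 1) * 4] for i in range(size)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)   # Scatter: each rank receives one chunk
result = [x * 2 for x in chunk]        # Placeholder for the real per-node work
comm.Barrier()                         # Barrier: wait for every rank to finish
results = comm.gather(result, root=0)  # Gather: collect all chunks on rank 0

if rank == 0:
    print(results)
```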

Message passing performance is closely tied to the network structure of the cluster computer. For more complex simulations with frequent message passing, the programmer must therefore be familiar with the configuration of their particular cluster system, so they can break up work in a way that places tasks with data dependencies on "close" nodes within the cluster. This also means that programs written for one cluster computer must be re-tuned before they can be effectively deployed on another cluster, or risk massive slow-downs from inefficient message passing and network congestion.

The Interface

We’ve described how a supercomputer is built, and how code is written for it. The last piece is how to interact with it. You can’t exactly ssh into a cluster system, because it isn’t a singular computer: Each compute node is running its own operating system (usually a thoroughly tuned Linux distribution), and the only applications that cross between nodes are ones written specifically for use with the messaging interconnect system.

Instead, one or more nodes in the cluster are designated as “I/O nodes” that can be sshed into. The user can upload or compile their software on these landing pads, and from these systems can submit their executable as a job. Then, much like a mainframe system in the 1970s, a batch scheduling system will decide which jobs will run on which nodes in what order to maximize use of the cluster and potentially ensure fair sharing of resources between users.

What about Graphics Cards?

While general-purpose Central Processing Units (CPUs) usually have only four to sixteen cores, the more specialized Graphics Processing Units (GPUs) on graphics cards typically have hundreds to tens of thousands of cores in a single computer! Why don't we use these for massive parallelism? The answer is "we do when we can" and "it's very hard".

The reason graphics cards can have so many more cores than a CPU is that graphics processors are simpler and can do far less, which means the cores are physically smaller and draw less power, so many more can fit on a chip. Many GPU operations work on vectors: for example, you can multiply a vector of a thousand elements by a scalar in one step by using a thousand cores to manipulate the vector in parallel, but you cannot direct those thousand cores to run independent operations in that single step. If and when a program can be expressed in terms of the limited operations available on a graphics card, we can take advantage of the massive parallelism there.
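As a small sketch of that vector-at-a-time model, here's the thousand-element scalar multiply written with CuPy, a NumPy-like library that runs on CUDA GPUs (this assumes a CUDA-capable card and the cupy package; the same lines work on the CPU if you swap in numpy):

```python
import cupy as cp  # NumPy-like array API that executes on the GPU

x = cp.arange(1000, dtype=cp.float32)  # a thousand-element vector on the GPU

# One logical step: the GPU applies the same multiplication to every
# element in parallel, but every element gets the *same* operation
y = 3.0 * x

print(y[:5])  # [ 0.  3.  6.  9. 12.]
```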

Most recently-built cluster systems include graphics cards in each node, so that complex work can be distributed across compute nodes, with the abstract tasks handled by the CPUs, and the rote mathematics handled by each graphics card using APIs like CUDA and OpenCL when possible.

Posted 8/19/20

