Network Science for Social Modeling

This post is meant to be very approachable and an academic background is not expected. However, it has an academic flavor and deals with applied theory, in the same vein as previous posts on lambda calculus and parallel computing.

This is a post about graphing social relationships. These techniques are used for advertisement, propaganda, predicting disease spread, predicting future relationships (as in LinkedIn “you might know” suggestions), predicting ideology or opinion, and a variety of other tasks. Network science is widely used in academia, in “big data” corporations, and by governments. This post will serve as a brief crash course into network science through the lens of social media scraping, with an emphasis on different models for representing relationships, and their use-cases and shortcomings.

Network Science and Social Scraping

Say we want to identify the most influential members of a community. “Influential” is an ambiguous term, and we could be referring to “the most well-connected individual”, or “the people that bring in ideas from outside the community”, or some notion of a “trend-setter”. To explore any of those definitions, our first task is to identify the members of the community and how they are inter-related.

To start we’ll draw a node (or a vertex if you come from a math/graph-theory background instead of computer/network-science) for Bob. We’ll identify all of Bob’s peers: their friends on Facebook, mutuals on Twitter, contacts on LinkedIn, or whatever parallel makes sense for the platform we’re studying. We’ll create a node for each peer, and we’ll draw an edge between Bob’s node and each peer’s node:

We have our first network (or graph in graph-theory terminology). We can use this network to identify important nodes for the community, which usually involves some of the following characteristics:

  • The nodes with the most connections (the highest degree)

  • The bridge nodes that connect two communities together (Bob connects Alice to Dave and Carol)

  • The central nodes (Betweenness Centrality is a score based on how many fastest routes between any two nodes pass through this node)

We can create a larger network with more meaningful data by looking at Alice, Carol, and Dave’s peers, and building out further and further. The only limits are time and available information from the platform we’re studying.
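
To make these measurements concrete, here is a minimal sketch using the Python library NetworkX (discussed again at the end of this post). The four-person network and its edges are invented for illustration:

    import networkx as nx

    # Build the small example network: Bob knows Alice, Carol, and Dave,
    # and Carol and Dave also know each other.
    G = nx.Graph()
    G.add_edges_from([
        ("Bob", "Alice"), ("Bob", "Carol"), ("Bob", "Dave"), ("Carol", "Dave"),
    ])

    print(dict(G.degree()))                 # connection counts per node
    print(nx.betweenness_centrality(G))     # fraction of shortest paths through each node
    print(list(nx.articulation_points(G)))  # "bridge" nodes whose removal splits the network

In this tiny network Bob has the highest degree, the highest betweenness centrality, and is the only bridge node; on a network of thousands of users the same three calls do the heavy lifting.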

Directional Relationships

Not all relationships are bidirectional the way a friendship is. For example, Twitter allows one user to follow another without reciprocity. Retweets, likes, and replies are all forms of connection that may or may not be reciprocated. To capture this distinction, we need to add direction to the edges of our graph, creating a directed graph or digraph:

This changes the attributes we can measure, but only a little. Instead of degree to indicate how many connections a node has, we now have indegree and outdegree to indicate how many edges lead into and out of the node. This is usually even better, since we can now distinguish users that many people listen to from users that follow many people. Our measurements of bridges and centrality can also utilize direction, tracing only the paths a message can flow from one user to the next through a community.

There may be several attributes that can be thought of as a “relationship”. Returning to the Twitter example again, we have the following relationships:

  1. Follows

  2. Mentions

  3. Retweets

  4. Replies

All of these could be represented as edges on a social graph, but each relationship type has a different implication. Retweets (not including quote-retweets) most strongly indicate “support” and a positive relationship, since one user is rebroadcasting the message of another without commentary. Mentions and replies, on the other hand, could be positive or could indicate arguments and distrust. Follows are also ambiguous, since many users will follow individuals for news purposes, like their politicians, regardless of whether they support those individuals. Volume may also vary significantly between relationship types: Users may “like” messages widely, but retweet or reply to more select content.

Therefore, while we could draw one graph with an edge indicating any of the above relationships, we probably want to separate them. This could mean creating a separate graph for each category of relationship, or it could mean adding edge attributes that indicate which type of relationship each edge refers to. We can also use edge attributes to encode data like the number of retweets, or the age of a response. Comparing the attributes can lead to interesting discoveries, such as identifying a universally despised user that’s frequently mentioned by members of a community but never uncritically retweeted, or a user that used to be regularly retweeted, but has fallen from grace and is no longer amplified.
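
As a sketch of how this can be stored (NetworkX again, with invented users and counts), a directed multigraph lets every edge carry a relationship type and other attributes, which we can then filter on:

    import networkx as nx

    G = nx.MultiDiGraph()  # directed, and allows several edges between the same pair
    G.add_edge("Alice", "Bob", kind="retweet", count=12)
    G.add_edge("Alice", "Bob", kind="reply", count=3)
    G.add_edge("Carol", "Bob", kind="mention", count=7)

    # Build a retweet-only view to measure "support" separately from arguments
    retweets = nx.DiGraph()
    retweets.add_edges_from(
        (u, v, data) for u, v, data in G.edges(data=True) if data["kind"] == "retweet"
    )

    print(retweets.in_degree("Bob"))  # distinct users who retweet Bob
    print(G.in_degree("Bob"))         # inbound relationships of any kind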

Community-Level Analysis

In addition to metrics for individual nodes, we can take measurements of an entire community. For example, the degree distribution illustrates whether a community has roughly equal engagement, or whether a minority of users massively stand out as being followed, mentioned, or retweeted far more than the rest, depending on what the edges represent. We can also define group-level measurements like insularity, indicating what percentage of retweets by users inside a group are retweeting other members of the group versus retweeting people outside of the group.
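
A sketch of both measurements, assuming we already have a NetworkX graph G of retweet relationships and a Python set named group listing the users we consider members of the community:

    from collections import Counter

    # Degree distribution: how many users have each number of connections
    degree_distribution = Counter(degree for _, degree in G.degree())

    # Insularity: what fraction of the group's retweet edges point back
    # at other group members rather than at outsiders
    internal = sum(1 for u, v in G.edges() if u in group and v in group)
    outgoing = sum(1 for u, v in G.edges() if u in group)
    insularity = internal / outgoing if outgoing else 0.0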

Most of these measurements only make sense if we take a much larger network sample, growing from our example diagrams of four users above to tens or hundreds of thousands. The following is one such network graph from Twitter data, created with SocMap:

Screenshot of a Twitter network graph produced with SocMap

Of course, community-level analysis requires a clear definition of who is “in” a community and who is not. Sometimes there’s a convenient external data point: If our community is “MIT students and graduates on LinkedIn” then we can define our in-group based on users with MIT in the education section of their profiles with a low degree of error. If our community is “right-wing users” on a platform like Twitter or Facebook then maybe we can create a fuzzy metric that scores users that repeatedly link to right-wing websites or frequently post specific right-wing-affiliated phrases. Highly scored users are likely to be part of the in-group.

Given solely network data there are algorithms for trying to “autodetect” communities based on the assumption that people in a community tend to be linked to other members of the community, but these algorithms are never as reliable as using external data, and frequently depend on analyst-supplied information like the number of communities to split users into.
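
NetworkX ships with several of these algorithms. Here is a sketch using greedy modularity maximization on an undirected graph G; this particular method picks the number of communities itself, while others need that number supplied by the analyst, and the caveat above about reliability still applies:

    from networkx.algorithms import community

    # Partition the nodes by greedily maximizing modularity
    communities = community.greedy_modularity_communities(G)
    for i, members in enumerate(communities):
        print(f"Community {i}: {sorted(members)}")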

Missing Data

Networks are constrained by what information is available, and it’s important not to overstate their accuracy. For example, not every friend will be friends on Facebook or connections on LinkedIn, or several users may know one another through a mutual friend that isn’t on the platform. There will almost always be nodes and edges “missing” from a social network graph. Sometimes this missing data is of primary interest! For example, “You may know” suggested connections on LinkedIn are based on a simple algorithm:

  1. Identify 2nd and 3rd degree connections (that is, connections of your connections who you are not connected to)

  2. Sort the potential connections by a combination of:

  • Shared peers (more connections in common)

  • Shared place of employment

  • Shared education

  • Shared skills and interests

Determining which attributes are most accurate predictors of a connection to optimize the above algorithm is a more difficult problem, and one LinkedIn has no doubt spent a great deal of time studying.
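
A rough sketch of step 1 and the “shared peers” portion of step 2, assuming an undirected NetworkX graph G of existing connections (employer, education, and skills would be additional weighted terms in a real scoring function):

    import networkx as nx

    def suggest_connections(G, user, max_suggestions=10):
        # 2nd- and 3rd-degree connections: two or three hops away, not yet connected
        distances = nx.single_source_shortest_path_length(G, user, cutoff=3)
        candidates = [node for node, dist in distances.items() if dist in (2, 3)]

        # Rank candidates by how many peers they share with the user
        def shared_peers(candidate):
            return len(set(G.neighbors(user)) & set(G.neighbors(candidate)))

        return sorted(candidates, key=shared_peers, reverse=True)[:max_suggestions]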

While networks are valuable tools for predicting patterns of behavior, it’s critical to remember that these network graphs represent only a slice of real-world connections. A snapshot of Twitter misses that many users may connect over Instagram, Facebook, or SMS, and messages spread across these “invisible” edges frequently.

Group Context and Hypergraphs

The biggest limitation we’ve seen with graphs so far is that they assume all relationships involve only two parties. This is frequently appropriate, and accurately describes most phone calls, emails, and text messages. Unfortunately, it’s just as frequently inappropriate: A group chat between three people is not the same as three two-party conversations between each pair of participants. There may be topics you would discuss in private that you wouldn’t discuss in shared company, or conversely information you would dismiss as a rumor if it were shared with you privately, but seems more believable if it’s shared with your full peer-group. The context of a conversation is critical for understanding how information will be shared or accepted. Further, we can’t even assume that a group context implies members can speak individually: members of a group project may only speak together and never independently.

The simplest way to model these group contexts is to extend our definition of a graph. What if an edge can connect three or more nodes together? We’ll call this a hyperedge to distinguish from traditional edges, and we’ll call graphs containing hyperedges hypergraphs. For now, we can represent a hyperedge as a dotted line encompassing all nodes within it:

Obviously this will be messy to draw with many intersecting hyperedges, but we can perform a lot of mathematical and computer-sciency analysis without visualizing the network we’re working with, so that’s far from a show-stopper.

Note that our example includes only undirected hyperedges. We may also desire a concept of directed hyperedges to represent multicast messages. For example, an emergency hurricane alert broadcast to every cellphone in a city represents a shared message context, but only the emergency service can send messages in the group. Alternatively, consider a Telegram group chat configured as an announcement service, so a dozen or so administrators can write to the group, but thousands can listen. For some types of analysis it may be appropriate to represent these “broadcasts” as many directed edges from the broadcaster to every listener, but if the group context is important to preserve then we need a directed hyperedge to model the conversation space.
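
NetworkX has no built-in hyperedge type, so here is a sketch in plain Python: an undirected hyperedge is simply a set of members, and a directed hyperedge can be modeled as a pair of sender and receiver sets (the users are invented):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DirectedHyperedge:
        senders: frozenset    # who can write into this shared context
        receivers: frozenset  # who receives messages in it

    # A three-person group chat: everyone can both send and receive
    group_chat = DirectedHyperedge(
        senders=frozenset({"Alice", "Bob", "Carol"}),
        receivers=frozenset({"Alice", "Bob", "Carol"}),
    )

    # An announcement channel: a few administrators broadcast to thousands
    announcements = DirectedHyperedge(
        senders=frozenset({"admin1", "admin2"}),
        receivers=frozenset(f"user{i}" for i in range(5000)),
    )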

Complex Group Relationships and Simplicial Sets

Even directed hypergraphs have limitations, but to demonstrate them we’ll need to back up and explain an alternative solution to modeling group context.

The simplicial set represents group relationships with geometry. For example, if a triangle of edges represents three users with independent relationships with one another, then a filled triangle represents three users with a shared relationship with each-other:

If we want to represent a shared relationship between four individuals, we can switch to a tetrahedron (a triangular pyramid, or the 3-dimensional analogue of a triangle). For five individuals, we create a 5-cell, the 4-dimensional equivalent of a triangle, and so on. Higher-dimensionality shapes rapidly become difficult to visualize, but it’s conceptually sound.

Multiple shapes in this geometric space can interact with one another. For example, consider two adjoining triangles:

We can describe DE in two ways. DE can be the shared edge between CDE and DEF, indicating a shared context in that DE is a sub-group that bridges the two larger triangles. However, we can also add an edge between D and E, indicating that they have a relationship outside of this shared bridge space.

Similarly, we can describe a tetrahedron either as the three-dimensional space encompassing four nodes, or as a union of four triangles, or as a combination of triangles and three-space. The difference in phrasing can represent a group of four people or a collaboration between multiple sub-groups.

Sub-grouping and intersection is extremely difficult to describe in a hypergraph. We can create a concept of a hyper-hyperedge, which links two hyperedges together to simulate a metagroup, but this is at best an awkward substitute. A hyper-hyperedge still leaves great ambiguity in distinguishing separate teams that communicate versus intersections between teams, and making a group that consists of some individuals and some other groups becomes messy very quickly. If we stick to hypergraphs we must content ourselves with representing many group dynamics as footnotes outside of the graph itself, which makes analysis extremely difficult.

Finally, simplicial sets are always directional. We can have multiple congruent but distinct triangles, ABC, BCA, CAB, ACB, and so on, which represent distinct social contexts involving the same three people. We can easily simulate undirected groups using simplicial sets (by sorting all participants before describing a group), but if directionality is desired to represent social hierarchy or multicast communication then the distinction is already built into simplicial group definitions.
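
A minimal sketch of that ordering idea in plain Python: store each group context as a tuple, so the same three people in a different order form a distinct context, and recover sub-groups by dropping one participant at a time:

    # Each simplex is an ordered tuple of participants; the same people in a
    # different order is a distinct social context.
    contexts = {
        ("Alice", "Bob", "Carol"),
        ("Bob", "Carol", "Alice"),   # congruent but distinct
    }

    def undirected(simplex):
        # Simulate an undirected group by canonically sorting the participants
        return tuple(sorted(simplex))

    def faces(simplex):
        # The sub-groups of a simplex: drop one participant at a time
        return [simplex[:i] + simplex[i + 1:] for i in range(len(simplex))]

    print(faces(("Carol", "Dave", "Eve")))  # the three edges of triangle CDE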

Unfortunately, moving from theory to practice is more challenging. Simplicial sets come from category theory and algebraic topology, and the math involved reflects that. While there are well-developed software tools for working with undirected and directed graphs, there are few for hypergraphs, and almost none for simplicial sets, limiting present adoption outside of theoretical mathematical spaces.

Conclusion and Real-World Tools

This post provides an extremely high-level overview of network science within the context of building relationship maps from social media. It’s a flexible discipline, and network analysts spend as much time (if not more) determining which measurements are appropriate and what metrics mean in terms of real-world behavior as they do working through math and code. Because nodes and edges can represent such diverse phenomena (and this post only scratches the surface without mentioning multi-layer and bipartite networks), most network analysis tools require significant configuration and code from analysts to produce meaningful results.

With that said, some of the versatile libraries used for network analysis in Python include NetworkX, iGraph, and graph-tool. While each library has a limited ability to render networks for visual inspection, most analysts turn to Gephi or (my personal favorite) Cytoscape to explore their networks and display them for publication.

For more on hypergraphs and simplicial sets, I found this paper to be approachable despite lacking any category theory background.

Posted 9/24/20


What is a Supercomputer?

This will be another introductory academic post like the last post explaining how torrents work.

We’ve all seen references to “supercomputers” in popular culture, run by institutions like NASA, the Chinese government, Bond villains, and other nefarious groups. But what is a supercomputer, and what distinguishes one from a “normal” computer? Surprisingly, this isn’t even discussed in the curriculums of many computer science programs unless you happen to take electives in parallel computing.

Wittgenstein, the greatest supercomputer ever

The Basics

Supercomputers, better called cluster computers and often referred to as high performance computing (HPC), consist of racks of conventional computers, tied together with special interlinks to share information as quickly as possible, and loaded with software to run pieces of a program across each of the computers in the racks. Whereas most desktop and laptop computers have a single processor, allowing them to do only one thing at once (or, with a 4-core or 8-core processor, to almost do 4 things or 8 things at once), a supercomputer consists of dozens to tens of thousands of CPUs, and up to millions of cores, allowing it to run many tasks concurrently. Notably, the processors inside aren’t any different than the ones in a desktop, and certainly aren’t any faster: Many of the computers on the Top500 High Performance Computers list run Intel Xeons, and some clusters are clocked as low as 1.45 Gigahertz. If you could somehow run the latest Halo game on a supercomputer there’d be no meaningful speed-up over your home computer. Code must be written specifically to take advantage of the enormous parallelism available on a cluster computer to achieve any performance gain.

What workloads benefit from this kind of parallelism? Mostly large simulation work: weather prediction, epidemic spread, economic impact estimation, industrial engineering to design boxes that can be moved quickly on an assembly line without tipping over, etc. These are usually simulations with a large number of variables, where it is desirable to run a hundred thousand slightly different configurations of the model and determine the optimal, average, or worst-case outcome. These are all problems that require an enormous number of calculations which mostly do not depend on one another, and so do not have to be run sequentially.

The Hardware

We alluded to the hardware interlinks in clusters as a “magic sauce” that makes everything so quick. Before discussing the software written for these magic interlinks, we should dig deeper into how they work.

Most cluster systems include some kind of peer-to-peer network with highly customized features: usually it can write directly to memory in userspace, the network itself can handle operations like receiving multiple messages and adding them together before delivery, and it all runs very quickly, with as much networking logic implemented in hardware as possible. For those familiar with Internet networking, these networks are usually similar to UDP in that there’s no need for fault tolerance, guaranteed delivery, or checksumming if the cables are high enough quality to ensure zero data loss, and routing is much simpler since the entire network topology is fixed and predefined.

So that’s the hardware link, but equally important is the network topology, or which computers are linked to which others. This networking hardware is extraordinarily expensive, so linking every node to every other is infeasible, and for most programs wouldn’t give much of a performance boost anyway. Supercomputer designers must make tradeoffs to allow information to be distributed through the cluster efficiently using as few links as possible.

Some supercomputers use a simple Fat Tree topology where high level routers forward messages to “pods” of compute nodes:

This is appropriate for simple workloads where each node in the cluster needs to receive information at the start and then works independently until results are combined at the end. However, for any workload where nodes regularly need to share data with one another this puts a great deal of strain on the switches, and introduces latency in larger trees.

Some cluster systems, like the now-retired IBM Blue Gene series, use a Torus topology that organizes nodes into a rectangular prism with links along every axis and wrapping around each row and column. The Blue Gene systems used 3-dimensional and 5-dimensional torus networks, but we’ve limited ourselves to two dimensions to simplify the diagram:

Other supercomputers use radically different topologies, like the Cray butterfly network, which lacks the wrap-around flexibility of a Torus but can quickly distribute and re-combine top-level results using few links:

Each of these network structures changes the number of hops required to send information from one node to another, and whether there are local “groupings” of compute nodes that can communicate quickly without sending messages to distant nodes.

The Software

Now we have a cluster of computers, wired in an elaborate communications network using custom very high-performance interlinks. Cool, but how do we write code that actually uses that architecture? Most supercomputers use some variant of the Message Passing Interface, like OpenMPI, to describe parallel operations.

From the programmer’s perspective, an identical copy of their program runs on every compute node in the cluster, except that each copy knows both how many nodes exist and the number of its own node in the cluster. For anyone used to systems programming, think “the program has been forked once for each node before the first line of main”.

The program then loads data into each node, either by loading all the data into one node and distributing it, or by using a networked file system so that each node can directly read the starting data relevant to its work.

The message passing interface defines a number of basic operations that form the basis of parallel programming:

  • Scatter: Take an array and send a subset of the array to each node in a list

  • Gather: Take a small array from each node in a list and combine into a single large array on the gathering node

  • Send / Recv: Send a single message directly to another node, or block on receiving a message from another node

  • Barrier: Similar to a multi-process breakpoint, all processes must reach this line in the code before they can proceed, synchronizing the nodes for scatter and gather operations

Since each node is a separate process with independent memory, there are few shared resources between nodes, and usually no complexities around threading, mutexes, and race conditions unless a process uses multithreading internally. Data sharing between nodes is entirely via send and receive calls or synchronized scatters and gathers, making it (relatively) easy to track data dependencies and avoid collisions.
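
Here is a sketch of these operations using mpi4py, a Python wrapper around MPI (production HPC code is more often C or Fortran, but the structure is the same). Each copy of the program discovers its own node number, works on its slice of the data, and synchronizes only at the gather; it would be launched with a command along the lines of mpirun -n 4 python sum.py:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this copy's node number
    size = comm.Get_size()   # how many copies are running in total

    # Node 0 prepares the full dataset and scatters one chunk to every node
    if rank == 0:
        data = [list(range(i * 1000, (i + 1) * 1000)) for i in range(size)]
    else:
        data = None
    chunk = comm.scatter(data, root=0)

    # Every node works on its own chunk independently
    partial = sum(chunk)

    # Wait for all nodes, then gather the partial results back onto node 0
    comm.Barrier()
    results = comm.gather(partial, root=0)
    if rank == 0:
        print("Total:", sum(results))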

Message passing performance is closely tied with the network structure of the cluster computer. Therefore, for more complex simulations with frequent message passing the programmer must be familiar with the configuration of their particular cluster system, so they can break up work in a way that places tasks with data dependencies on “close” nodes within the cluster. This also means that programs written for one cluster computer must be re-tuned before they can be effectively deployed on another cluster, or risk massive slow-downs from inefficient message passing and network clogging.

The Interface

We’ve described how a supercomputer is built, and how code is written for it. The last piece is how to interact with it. You can’t exactly ssh into a cluster system, because it isn’t a singular computer: Each compute node is running its own operating system (usually a thoroughly tuned Linux distribution), and the only applications that cross between nodes are ones written specifically for use with the messaging interconnect system.

Instead, one or more nodes in the cluster are designated as “I/O nodes” that can be sshed into. The user can upload or compile their software on these landing pads, and from these systems can submit their executable as a job. Then, much like a mainframe system in the 1970s, a batch scheduling system will decide which jobs will run on which nodes in what order to maximize use of the cluster and potentially ensure fair sharing of resources between users.

What about Graphics Cards?

While general-purpose Central Processing Units (CPUs) usually have only four to sixteen cores, the more specialized Graphics Processing Units (GPUs) in graphics cards typically have hundreds to tens of thousands of cores in a single computer! Why don’t we use these for massive parallelism? The answer is “we do when we can” and “it’s very hard”.

The reason graphics cards can have so many more cores than a CPU is that graphics processors are simpler and can do far less, which means the cores are physically smaller and require less power, so many more can fit on a chip. Many GPU operations involve working on vectors: for example, you can multiply a vector of a thousand elements by a scalar in one step by using a thousand cores to manipulate the vector in parallel, but you cannot direct those thousand cores to run independent operations in that single step. If and when programs can be expressed in terms of the limited operations possible on a graphics card then we can take advantage of the massive parallelism available there.
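
A sketch of that vector style of programming using the CuPy library (assuming an NVIDIA GPU with CuPy installed): each line below is a single uniform operation spread across many GPU cores at once, not thousands of independent instructions:

    import cupy as cp

    x = cp.arange(1_000_000, dtype=cp.float32)  # a vector living in GPU memory
    y = 3.0 * x + 1.0                           # one multiply-add applied element-wise across the cores
    total = float(y.sum())                      # the reduction also runs on the GPU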

Most recently-built cluster systems include graphics cards in each node, so that complex work can be distributed across compute nodes, with the abstract tasks handled by the CPUs, and the rote mathematics handled by each graphics card using APIs like CUDA and OpenCL when possible.

Posted 8/19/20


Torrents: Decentralized Data Storage

This post is explanatory in the vein of past posts on memory allocation and lambda calculus, rather than introducing new social or technical theory

Torrents allow your computer to download files from many computers simultaneously rather than from a single web or file server. BitTorrent is frequently associated with piracy, but is on its own a benign technology, used for distributing Linux installation files and World of Warcraft updates. But how do torrents work, and how can that architecture be re-purposed for other applications?

The Motivation: Consumer ISP Bottlenecks

To understand the design of BitTorrent we’ll first look at the exact problem it was built to solve. Home ISPs sell consumers large download speeds and comparatively minimal upload speeds. This is because most users do very little uploading: Maybe they send an email or upload a photo to Facebook, but most of their “upload” bandwidth is just used to send HTTP requests to download more things. Therefore, ISPs can maximize the use of their infrastructure by designating more equipment for bringing data into a neighborhood and very little of it to bringing data out. This allocation also means consumers can’t effectively run a website or other service from their home without paying through the nose for a “commercial Internet plan”. How much of this bottleneck is a true technical limitation and how much is purely a money-grab is hard to answer since ISPs are usually not forthcoming about the resources they have available. Regardless of the reason, this is the common reality in U.S. home networks.

So, when some home users do want to send files to one another, they face a dilemma: They can download files quickly, but sending files to their friends takes ages. For example, Alice may have a download speed of 30mbps but an upload speed of 1mbps. If only they could work together with 29 friends, then they could share files at a full 30mbps…

What would home users want to distribute this way? Anything bigger than an image you could text to a friend. Music, movies, operating systems, leaked documents, libraries of scientific papers. Torrents present a community-driven approach to disseminating information, opposed to the top-down centralized paradigm of “files are distributed by companies like Netflix and Google that have the wealth and means to send data.”

Technical Details

Alright, so how does that work in practice? Well to start with we need to break the target file into small parts that can be downloaded independently. Each part will require an identifier so a downloader can ask for a specific piece, and will also need a hash so the downloader can know they’ve successfully downloaded the piece and it hasn’t been corrupted or tampered with.
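
A sketch of that piece-splitting step in Python; BitTorrent uses SHA-1 piece hashes, and 256 KiB is a common piece size (the filename is invented):

    import hashlib

    PIECE_LENGTH = 256 * 1024  # 256 KiB per piece, a common choice

    def hash_pieces(path):
        piece_hashes = []
        with open(path, "rb") as f:
            while True:
                piece = f.read(PIECE_LENGTH)
                if not piece:
                    break
                piece_hashes.append(hashlib.sha1(piece).digest())
        return piece_hashes  # a piece's identifier is its index in this list

    print(len(hash_pieces("example-dataset.tar")), "pieces")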

Next we’ll need a way of distributing this information to potential downloaders, and most critically, we’ll need a way to put the downloader in touch with all the users that may have parts of the file to upload.

Torrent Files

Torrent files solve the first problem. A torrent file contains the metadata for a download. Specifically, it includes:

  • The torrent name

  • A hash of the completed file

  • The number and size of each part

  • A list of files

  • A list of parts and part hashes for each file

  • Some optional metadata (creation date and torrent software, comments from the author, etc)

  • A list of trackers

Most of these fields are self-explanatory, and for now let’s just say “trackers” solve the second problem of putting the downloader in touch with uploaders. Now to distribute data we only need to send this .torrent file to the downloader, and they can use it to bootstrap the download and gather data from everyone. Torrent files are tiny, at most a few megabytes to represent hundreds of gigabytes or even terabytes of data, so sending a torrent file via direct download is not a demanding requirement.
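
For the curious, a .torrent file is a “bencoded” dictionary. Here is a sketch of its shape as a Python dictionary, with field names following the actual metainfo format and all values invented:

    torrent = {
        "announce": "udp://tracker.example.org:6969",          # primary tracker
        "announce-list": [["udp://tracker.example.org:6969"],
                          ["udp://backup.example.net:1337"]],   # fallback trackers
        "creation date": 1600000000,                            # optional metadata
        "created by": "ExampleClient 1.0",
        "comment": "for illustration only",
        "info": {
            "name": "example-dataset",
            "piece length": 262144,                             # bytes per piece
            "pieces": b"<20-byte SHA-1 hash of every piece, concatenated>",
            "files": [{"path": ["readme.txt"], "length": 1024}],
        },
    }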

Trackers

Clearly the magic sauce here is the “trackers” field of the torrent file. A tracker acts as a rendezvous point between uploaders (“seeders” in torrent terminology) and downloaders (“leechers” in torrent-speak, at least until they begin helping upload data and become seeders themselves). The process is surprisingly simple:

  1. A user with torrent data to share or download connects to the tracker and announces the file hash it is interested in

  2. The user then submits its IP address and some port numbers it is reachable at

  3. The tracker responds with a list of IP addresses and ports to reach every other user that’s recently indicated interest in the same file

Users periodically repeat this process, both to confirm with the tracker that they remain active and interested in the data, and to potentially find other users that have registered with the tracker since the last time they checked.

That’s it. The tracker doesn’t act as any kind of proxy, it doesn’t have any information about the torrent file or what pieces of the file each user possesses, and it can’t distinguish between seeders and leeches. Just hashes and contact information.
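
Here is a sketch of one announce over HTTP using the requests library (many trackers also speak a lighter UDP variant). The tracker URL is invented, the hash and file length are borrowed from the magnet link example later in this post, and parsing the bencoded response is left out:

    import requests

    params = {
        "info_hash": bytes.fromhex("c8dd895fbc6cd38850205bf09c76a9b716b2cd87"),
        "peer_id": b"-XX0001-123456789012",  # a 20-byte identifier for this client
        "port": 6881,                        # where this peer accepts connections
        "uploaded": 0,
        "downloaded": 0,
        "left": 7911599,                     # bytes still needed
    }
    resp = requests.get("http://tracker.example.org:6969/announce", params=params)
    # resp.content is a bencoded dictionary whose "peers" entry lists the IP
    # addresses and ports of every other peer recently interested in this hash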

So, doesn’t this violate the entire “decentralized” nature of torrents? There’s a single central server maintaining all the contact information that this entire system relies upon! Well, yes, but actually no. Most torrent files include a list of several trackers, instructing seeders “Please share your contact information with these five servers”, providing a good deal of redundancy. If for some reason all five servers (or however many were added to the torrent) go offline or ban the file hash and refuse to offer rendezvous services, then the data itself is still safe on the computers of all the seeders. Anyone with technical know-how can add new trackers to the torrent file, hopefully allowing the seeders to reconnect.

Nevertheless, trackers remain a point of failure in BitTorrent networks, and modern versions of the BitTorrent protocol further minimize the role of trackers using a distributed hash table, discussed below.

Indexes

What we’ve described so far is enough for the BitTorrent network to function if I email you a .torrent file to start you off. The only piece of the puzzle we’re missing is content discovery: If we want to spread torrent data further than a group of friends and make it a true community then we need a website that hosts torrent files and makes them searchable so that users can find what they’re interested in. These websites are called “indexes” and include notorious websites like The Pirate Bay. Note that indexes have an even smaller role in the network than trackers: Indexes are how new users discover new torrents, but they don’t distribute any data, and don’t even distribute contact information to reach seeders that do distribute data. If an index like The Pirate Bay goes offline all existing torrents will be completely unaffected, and the torrent files will usually be quickly reposted to an alternative index website.

An Example

Pulling it all together, to download data using a torrent, a user must:

  1. Find the torrent file on an index site and download it to open in their favorite torrent software

  2. Use their torrent client to connect to a set of trackers and locate other peers for the torrent data

  3. Connect directly to those peers and ask for any pieces the user doesn’t yet have

Finally, the user downloads the file pieces from each seeder, distributing the load to maximize download speed:

Firewalls

In order for all of this to work, torrent peers must be able to directly connect to one another to request pieces of torrent data. Don’t firewalls (including the Network Address Translation used in almost every home network) prevent such direct connections? Yes! Fortunately most torrent clients support “Universal Plug and Play”, a protocol that allows software within the local network to speak with the router and temporarily add port forwarding rules to open connections to itself. The torrent client will open a handful of ports for data transfer (usually TCP ports 6881-6889) and then announce these ports to the tracker (often over UDP ports 6969 or 1337).

If the user is behind carrier-grade NAT, or is otherwise unable to use UPnP to automatically open ports, then the user will either need to manually open and forward the ports (if their cgNAT ISP allows them to), or will be unable to upload data using BitTorrent.

Rewarding Helpfulness

The entire BitTorrent network relies on users contributing their own bandwidth to support the broader community, but what’s to prevent users from skimming off the top and downloading data without sharing anything back? In fact, nothing prevents this, and unsupportive leechers are common, but BitTorrent does have a means of mitigating the harm leeches can cause.

Every BitTorrent node allocates its upload bandwidth proportional to the upload speed it receives from other peers. In other words, if Alice uploads lots of data quickly to Bob and the data passes hash verification, then Bob will consider Alice a “good peer” and will prioritize sending them data in return.

This reward system emphasizes sharing data with peers that will share the data with others. It can alternatively be seen as a punitive system: Peers that do not enthusiastically share their data with others will be given the lowest priority. For a popular torrent with hundreds or thousands of peers, this becomes a kind of “tier” system, where high-speed uploaders are mostly paired with high-speed peers, and low-speed uploaders are mostly paired with low-speed peers.
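
A toy sketch of that prioritization (real clients re-evaluate this every few seconds and reserve an “optimistic” slot for new peers; the rates are invented):

    # Observed download rates from each peer, in KiB/s
    peer_rates = {"Alice": 800, "Bob": 120, "Carol": 5, "Dave": 0}

    UPLOAD_SLOTS = 3

    # Reciprocate with the peers uploading to us fastest ("unchoke" them)
    unchoked = sorted(peer_rates, key=peer_rates.get, reverse=True)[:UPLOAD_SLOTS]
    choked = [peer for peer in peer_rates if peer not in unchoked]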

In a modern torrent system the network is even more thoroughly decentralized, and trackers are relegated to a tiny (but still critical) role of making first introductions. The idea is to distribute the previous role the trackers held across every participating peer: By using a distributed hash table (DHT), peers can each cache part of the “hash -> contact information” dataset the trackers held, and users can now find peers for torrents they’re interested in by asking for introductions from peers they’ve met through previous torrents.

The effect is that once you’re “in” the peer-to-peer BitTorrent network, you never have to leave again, and can perform all BitTorrent operations from within the peer-to-peer space. The only time a tracker is needed is when a new peer who doesn’t know anyone else in the network yet wants to join, or an old peer has been disconnected for long enough that its peering information is out of date and it can no longer reach anyone. In these cases, it remains necessary to contact a tracker and acquire an initial set of peers to enter the peer-to-peer network.

So what do torrents look like in this new DHT network? They no longer require torrent files at all. All the fields stored in the .torrent file can be stored in the distributed hash table instead, so the information a peer needs to start their download is reduced to a single string containing the hash, called a magnet link: magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87

From that string alone, the torrent client can identify the “exact topic” (xt), which is a “uniform resource name” (urn) consisting of a “BitTorrent info-hash” (btih), which is just a hex-encoded SHA-1 hash of the torrent’s “info” dictionary. The torrent client knows to make a query to the distributed hash table for metadata and peers for the torrent with the hash c8dd895fbc6cd38850205bf09c76a9b716b2cd87, and from there begins the download.

We can optionally include additional information in the magnet link to make it more useful, including the filename (with the “dn” field), file length (with the “exact length” or “xl” field), and trackers to fall back to if the client doesn’t have DHT peers already (with the “tr” field). Therefore a more informative magnet link might look like:

magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87&dn=Stuxnet.zip&xl=7911599&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.blackunicorn.xyz%3A6969

If the tracker information is left off then the magnet link is only usable by clients connected to the DHT, and anyone else must first download some other torrents using trackers to become established, and then try again.
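
Pulling those fields back out requires nothing more than Python’s standard library; here is a sketch using the example link above:

    from urllib.parse import urlparse, parse_qs

    magnet = ("magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87"
              "&dn=Stuxnet.zip&xl=7911599")

    fields = parse_qs(urlparse(magnet).query)
    info_hash = fields["xt"][0].split(":")[-1]  # the bare hex SHA-1 digest
    name = fields.get("dn", ["(unnamed)"])[0]
    length = int(fields.get("xl", ["0"])[0])
    trackers = fields.get("tr", [])             # optional fallback trackers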

These magnet links are tiny because no “piece” information is stored in them, and are therefore convenient for texting or emailing. The smaller size significantly reduces the storage space needed to run a torrent index, and so further supports decentralization and redundancy in the torrent ecosystem.

Architectural Takeaways

BitTorrent provides an excellent example of how to run any peer-to-peer system: using a set of central servers to make introductions, then switching to direct connections to exchange information. Ideally a distributed hash table means the central servers are only needed to bootstrap new users, who can rely solely on direct connections from that point on. While BitTorrent is used for file sharing, there’s no reason the same architecture can’t be used for other distributed systems.

Indeed, Bitcoin uses a similar network for storing its distributed blockchain, except that the Bitcoin software ships with a hard-coded list of starting peers and relies on those peers (and addresses learned from them) instead of trackers. The Tor Project also uses a similar system, where their ten or so hard-coded directory servers provide contact information for Tor nodes, but once the contact list is downloaded a Tor client acts independently and connects directly to all nodes. The Inter-Planetary File System stores files in a DHT as a kind of hybrid between the way we think of the World Wide Web and torrents, and similar to Bitcoin uses a list of “bootstrap peers” included in the software for identifying other members of the DHT.

Posted 8/13/20


Trustless Pursuance: Decentralized Architecture

In a rapid series of events, Distributed Denial of Secrets was banned from Twitter for our BlueLeaks release, along with any links to our website, the leaked documents in question, or magnet links to the leaked documents in question (including blocking URL-shortened links, forwarding domains, and a ROT13-encoded version of the magnet link). Shortly after, one of the main DDoSecrets servers was seized by the German authorities, rendering the document archive and search engine unreachable. The raw documents were hosted with torrents and have all been mirrored elsewhere, but the loss of Twitter as a soapbox, as well as the index and search engine, are unfortunate setbacks.

Following the trend of other groups deplatformed from mainstream social-media, DDoSecrets has moved to Telegram and is looking at alternative media options for making announcements. On the hosted infrastructure side, we have dedicated servers for specific projects that remain online, but a shoestring budget limits our ability to operate a separate server for each task with many redundancies.

All of this is to emphasize the importance of decentralized systems, avoiding central points of failure and single-party censorship (as opposed to cooperative censorship). When considering the design of the Pursuance Project, we want a system that can outlive Pursuance as an organization, and can survive without the Pursuance Project’s servers if they’re lost due to lack of funding or censorship or equipment failure.

This post proposes a model for Pursuance without reliance on a permanent and trusted server, for use by groups like DDoSecrets that cannot afford such centralized dependencies. We begin by looking at peer-to-peer chat software, and building from there.

Ricochet

Ricochet is an anonymous encrypted “instant” messaging program. It works by hosting a Tor Onion Service on each user’s computer. To write to another user you need to know their onion service address, at which point both computers can connect to each-other’s onion service and send messages. This eliminates the need for any centralized chat server, requires no network configuration (such as domain registration and port forwarding), and hides every user’s IP address from one another.

Unfortunately, Ricochet has a range of shortcomings. Messages cannot be delivered while a user is offline, allowing only synchronous communication. There is no possibility for group-chats, and the trivial solution of “send a message to each user in the group” would require everyone in the group-chat to be online at once and is completely unscalable.

Cwtch

The good people at Open Privacy have been working on a more sophisticated chat system built on top of Ricochet, known as Cwtch (pronounced “kutch”, translating roughly to “a hug that creates a safe space”). The basic premise is to use Ricochet for peer-to-peer messaging, but to add an untrusted server that can cache messages during offline periods and facilitates group-chat. The server cannot read any messages, and will relay messages to any user that connects and asks. Each message is signed, preventing tampering, and includes a hash of the previous message from the user (blockchain-style), making omission easily detectable. Therefore the server cannot act maliciously without detection, and because it cannot identify users or distinguish their messages, it cannot target specific users and can only act maliciously at random.

Users create a group chat by direct-messaging another user (via Ricochet), establishing a shared symmetric key, and then relaying messages through the untrusted server encrypted with this key. Users are added to the group chat by providing them with the onion address of the relay and the symmetric key. Forward secrecy is achieved by periodically rotating the key and providing the update to all members of the chat via direct-message on Ricochet (and posting all messages using the old and new keys until rotation is complete). Backward secrecy is achieved by not providing these updated keys to compromised users. Users can be removed from a group by updating the shared key and providing it to all but the removed user. More technical details in the Cwtch whitepaper.

It’s important to note how minimal the role of the relay server is in this design. The majority of the work is performed by the client, which discards messages from the relay that aren’t encrypted with a group key the user is in, discards duplicate messages, and parses the contents of messages. The client can be connected to multiple groups on multiple servers concurrently, and a group can be moved from one server to another seamlessly, or even use multiple servers at once if messages are relayed through each server. Since the server is just a relay, and connects via Tor, it requires virtually no configuration and limited resources, and can be trivially deployed.

Cwtch has a few drawbacks (aside from being alpha software not ready for public use), but they are relatively small:

  1. The user that created the group has the responsibility of periodically rotating keys and distributing those keys to users via Ricochet. This can be automated, but it requires that the founder periodically log in, and if they abandon the group then key rotation will cease and forward and backward secrecy are lost. This places the founding user in an “administrative” role, which is reflected by their ability to add and remove users, unlike in Signal where no user is an administrator and the group can only be added to.

  2. There is no system for initial key exchange. Users sign their messages to mathematically prove a lack of tampering, but this is only effective if users exchange keys ahead of time over a trusted platform. This can be alleviated with something like Keybase proof integration to demonstrate that a Cwtch user has a well-established identity on many platforms.

Cwtch is intended to be used as an open source platform for peer-to-peer messaging with groups, allowing others to build more complex software and interactions on top of that foundation. So, let’s do just that…

Pursuance on top of Cwtch

What would it take to decentralize Pursuance so that it exists entirely client-side, with no storage except as messages sent via Cwtch, eliminating the need for a central, Pursuance-specific, semi-trusted server?

For basic functionality, not much is required. We can describe a pursuance as a Cwtch group-chat with a ton of meta-messages. The first message in a pursuance declares the rules of the pursuance (describing what roles exist and how they interact), and metadata like the pursuance name. Subsequent messages take one of the following formats:

  1. Changing Rules (Creating or destroying roles or amending how roles interact)

  2. Changing Roles (Assigning or de-assigning a role from a user)

  3. Tasks (Creating, closing, assigning, and commenting on tasks)

Under this system, every user in the pursuance has all data in the pursuance, including data they do not have permission to access. Therefore, all tasks must be encrypted so that they are only readable by roles with permission to read the data. A role exists not only as a designator, but also as a shared key allowing the user to read messages encrypted for that role. Adding and removing users both fall under the category of “changing roles”.

Clients have access to all pursuance rules, and a list of all users and their roles, and so clients can reject illegal messages, such as a user closing a task they have no access to. Since all content in a pursuance consists of tasks and discussions associated with those tasks, this messaging system describes the entire infrastructure of a pursuance.
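
As a purely speculative sketch, the client-side message formats might look something like the following, mirroring the three categories above (names and fields are invented, not an existing Pursuance or Cwtch API):

    from dataclasses import dataclass

    @dataclass
    class RuleChange:
        description: str          # create or destroy a role, or amend how roles interact

    @dataclass
    class RoleChange:
        user: str
        role: str
        assign: bool              # True to assign the role, False to remove it

    @dataclass
    class TaskMessage:
        task_id: str
        action: str               # "create", "close", "assign", or "comment"
        ciphertext: bytes         # encrypted so only permitted roles can read it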

Cwtch handles “desynchronizing” in a decentralized manner, allowing one user to contact another and get caught up on messages they’ve missed. This is intended to elegantly handle Cwtch relay servers that only cache recent messages, and to allow a smooth transition if the group has moved between servers (because, perhaps, the original server has gone offline). Pursuance inherits this re-synchronization, and allows users to catch up if the pursuance has moved servers.

So to summarize, the advantages of this change include:

  • Encrypted messaging provided by Ricochet and Cwtch instead of reinventing the wheel in Pursuance

  • No central and trusted server required, easily moved in case of server failure

  • Automatic recovery in case of data outage

  • Only have to maintain one piece of software, the Pursuance client, instead of a client and a server

However, there are a number of hurdles remaining (besides Cwtch itself, which remains in Alpha):

  • Synchronizing keys between different pursuances to cross-assign tasks can be complicated

  • Sending tasks between pursuances is challenging: the task must be sent by a user, and the receiving pursuance must somehow verify that the sending-user is part of an approved role in the sending-pursuance

  • All of this becomes harder if a pursuance can move from one server to another and disrupt lines of communication with its neighbors

  • Pursuance discovery remains challenging, and it is unclear what “joining” a pursuance looks like unless new users are invited from inside

  • The “group leader” role from Cwtch must be inherited by Pursuance, probably as the pursuance founder; There must be an elegant process for migrating the pursuance and underlying Cwtch chat if the leader becomes inactive

Cwtch isn’t the only option for decentralized data storage and messaging, and alternatives should be considered. One notable alternative is Tox, which stores messages in a decentralized hash table, much like torrents. Another is the InterPlanetary File System, which acts somewhat like an HTTP and git replacement, storing web data with version history in a distributed filestore. A more complete Pursuance implementation may combine several of these technologies - for example storing encrypted tasks on IPFS so they can be accessed by multiple pursuances, but running metadata for a pursuance and exchanging keys within Cwtch. This post should be considered an early abstract for what distributed Pursuance software might look like.

Posted 7/7/20


Information Paradigms and Radical Interfaces

One of my non-social-org interests is the design of interfaces for viewing and interacting with information that can change our perception and thought-processes surrounding that information. I haven’t written much on this subject yet (except tangentially when discussing Halftone QR Codes to create dual-purpose links/images), so this post will be an overview of information paradigms and exciting ideas in the topic space.

Conventional Model: Folder Hierarchies

When computers were imagined as information storage and retrieval systems, they first replaced their physical predecessors: Filing cabinets filled with nested folders. This hierarchy describes every major “filesystem”, including directory trees on Unix/Linux and Windows:

However, a simple hierarchy fails to describe information that belongs in multiple places at once: What if a file could be categorized under “Research” or under “Gradschool”? We can address this problem with links or aliases (so the file is in the “Research” folder but has a reference in the “Gradschool” folder), but it’s an imperfect solution to an awkward shortcoming.
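
On a Unix-like system such a reference is a symbolic link; a one-line sketch with invented paths:

    import os

    # The file lives in Research/ but is also reachable from Gradschool/
    os.symlink("/home/user/Research/thesis-data.csv",
               "/home/user/Gradschool/thesis-data.csv")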

Predominant Internet Model: Hypertext

While a folder hierarchy allows one file to reference another, it does not allow for more sophisticated referencing, like a file referring to several other files, or to a specific portion of another file.

Both of these problems can be improved with hypertext, which introduces a concept of a “link” into the file contents itself. Links can refer to other documents, or to anchors within documents, jumping to a specific region.

Hypertext is most notoriously used by the World Wide Web to link between web pages, but it can also be used by PDFs, SVG files, and protocols like Gopher. Still, there are more sophisticated relations that are beyond hypertext, like referring to a region of another document rather than a specific point.

Xanadu

Ted Nelson’s Project Xanadu aimed to extend hypertext dramatically. Xanadu departed from the idea of static documents referencing one another, and proposed dynamically creating “documents” out of references to information fragments. This embedding allows bidirectional linking, such that you can both view a piece of information in context, and view what information relies upon the current fragment. The goal was to allow multiple non-sequential views for information, and to incorporate a version control system, so that an information fragment could be updated and thereby update every document that embedded the fragment. Information storage becomes a “living document”, or a thousand living documents sharing the same information in different sequences and formats.

Perhaps unsurprisingly, Xanadu’s scope has hindered development for decades. A subset of Xanadu allowing embed-able fragments can be seen in a demo here. The bidirectional-reference model struggles to mesh with the decentralized reality of the Internet and World Wide Web, and a central database for tracking such links is impossible at any significant scale. HTTP “Referer” headers ask browsers to notify web servers when another document links to one of its pages, but the system was never widely deployed as a way of creating bidirectional links. Tumblr and Twitter “reblogging” and “quote retweeting” comes closer by creating bidirectional references, but falls short of the living-document paradigm. Wikipedia accomplishes bidirectional links within its own platform, and allows anyone to edit most pages as they read them, but still considers each page to be independent and inter-referential rather than dynamically including content from one another to create shifting encyclopedic entries.

Some projects like Roam Research continue developing the Xanadu dream, and support embedding pieces of one document in another, and visualizing a network diagram of references and embedded documents. There’s a good writeup of Xanadu’s history and Roam’s accomplishments here.

Explorable Explanations

Linking information together between different fragments from different authors is an insightful and revolutionary development in information design, but it’s not the only one. Bret Victor’s essay Explorable Explanations provides an alternate view of information conveyance. Critically, it introduces the concept of a reactive document, which works something like a research paper with a simulation built in. Any parameters the author sets or assumptions they make can be edited by the reader, and the included figures and tables adjust themselves accordingly. This provides an opportunity for the reader to play and explore with the findings, and develop their own intuition about how the pieces fit together. There’s an example of this kind of interactive analysis at Ten Brighter Ideas.

Visual Exploration: XCruiser

All the models so far have focused on articles: organizing articles, moving between articles, synthesizing articles out of other articles, and manipulating the contents of articles. Other paradigms move more radically from simulating sheets of paper.

XCruiser flying through a filesystem

One such paradigm is XCruiser, which renders filesystems as three dimensional space. Directories are concentric galaxies, while files are planets. The user literally “flies” through their computer to explore and can see parent, child, and parallel directories in the distance.

It is, perhaps, not an intuitive or particularly useful model. Nevertheless, it demonstrates that our classic “hierarchical” view is far from the only possibility. A similar 3-D filesystem renderer is FSV, the File System Visualizer, featured in Jurassic Park.

Kinetic Exploration: Dynamicland

Finally, interfaces need not be purely visual, reached through a monitor, keyboard, and mouse. Bret Victor’s most recent and daring project, Dynamicland seeks to reimagine the boundaries of a physical computer. Terribly oversimplified, the idea is:

  1. Strap a number of projectors and cameras to the ceiling, aimed at the floors and walls

  2. Give the computer OpenCV to recognize objects and read their orientations and text

  3. Program the computer to display information via any projector on any surface defined by physical objects

  4. The entire room is now your computer interface

Ideally, the entire system can be developed from this point using only physical objects, code written on paper, a pencil on the table turned into a spin-dial referenced by a simulation on the wall. No screens, no individual computing.

The main takeaway is that the computer is now a shared and collaborative space. Two computer-users can stand at a wall or desk together, manipulating the system in unison.

Almost all our computing technology assumes a single user on a laptop, or a smartphone. We put great effort into bridging the gaps between separate users on individual machines, with “collaboration” tools like Google Docs and git, communication tools like E-Mail, IRC, instant messaging, slack, and discord. All of these tools fight the separation and allow us to compute individually and share our efforts between computers, but Dynamicland proposes a future where a shared space is the starting point, and collaboration emerges organically.

Posted 6/16/20

