Torrents: Decentralized Data Storage

Posted 8/13/20

This post is explanatory, in the vein of past posts on memory allocation and lambda calculus, rather than introducing new social or technical theory.

Torrents allow your computer to download files from many computers simultaneously rather than from a single web or file server. BitTorrent is frequently associated with piracy, but is on its own a benign technology, used for distributing Linux installation files and World of Warcraft updates. But how do torrents work, and how can that architecture be re-purposed for other applications?

The Motivation: Consumer ISP Bottlenecks

To understand the design of BitTorrent we’ll first look at the exact problem it was built to solve. Home ISPs sell consumers high download speeds and comparatively low upload speeds. This is because most users do very little uploading: maybe they send an email or upload a photo to Facebook, but most of their “upload” bandwidth is just used to send HTTP requests to download more things. Therefore, ISPs can maximize the use of their infrastructure by dedicating most of their equipment to bringing data into a neighborhood and very little of it to bringing data out. This allocation also means consumers can’t effectively run a website or other service from their home without paying through the nose for a “commercial Internet plan”. How much of this bottleneck is a true technical limitation and how much is purely a money-grab is hard to answer, since ISPs are usually not forthcoming about the resources they have available. Regardless of the reason, this is the common reality in U.S. home networks.

So, when some home users do want to send files to one another, they face a dilemma: They can download files quickly, but sending files to their friends takes ages. For example, Alice may have a download speed of 30 Mbps but an upload speed of 1 Mbps. If only they could work together with 29 friends, then they could share files at a full 30 Mbps…
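To make the arithmetic concrete, here is a rough sketch in Python. The 700 MB file size is an assumption picked for illustration; the speeds are the ones from the example above, and protocol overhead is ignored entirely.

```python
def transfer_seconds(size_megabytes: float, speed_mbps: float) -> float:
    """Seconds needed to move a file over a link measured in megabits per second."""
    megabits = size_megabytes * 8  # 1 byte = 8 bits
    return megabits / speed_mbps


FILE_MB = 700        # hypothetical file size, chosen for illustration
UPLOAD_MBPS = 1      # upload speed of each home user
DOWNLOAD_MBPS = 30   # download speed of the receiving user
UPLOADERS = 30       # Alice plus 29 friends

# One uploader: the transfer is limited by that single upload link.
single = transfer_seconds(FILE_MB, UPLOAD_MBPS)

# Many uploaders: their upload speeds add up, capped by the download link.
swarm_speed = min(UPLOADERS * UPLOAD_MBPS, DOWNLOAD_MBPS)
swarm = transfer_seconds(FILE_MB, swarm_speed)

print(f"one uploader: {single / 60:.0f} min, {UPLOADERS} uploaders: {swarm / 60:.1f} min")
```

Under these assumptions a single uploader needs roughly an hour and a half, while thirty cooperating uploaders need about three minutes.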

What would home users want to distribute this way? Anything bigger than an image you could text to a friend. Music, movies, operating systems, leaked documents, libraries of scientific papers. Torrents present a community-driven approach to disseminating information, as opposed to the top-down centralized paradigm of “files are distributed by companies like Netflix and Google that have the wealth and means to send data.”

Technical Details

Alright, so how does that work in practice? Well, to start with, we need to break the target file into small parts that can be downloaded independently. Each part will require an identifier so a downloader can ask for a specific piece, and will also need a hash so the downloader can know they’ve successfully downloaded the piece and it hasn’t been corrupted or tampered with.
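As a minimal sketch of that idea, the snippet below splits some bytes into fixed-size pieces, uses the piece index as the identifier, and hashes each piece with SHA-1 (the hash used by the original BitTorrent protocol). The function names and the tiny 1 KiB piece size are assumptions for illustration; real torrents use much larger pieces.

```python
import hashlib


def split_into_pieces(data: bytes, piece_size: int) -> list[bytes]:
    """Cut the data into consecutive pieces; the last piece may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """One SHA-1 digest per piece, in piece order."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


def verify_piece(index: int, payload: bytes, hashes: list[bytes]) -> bool:
    """Check a downloaded piece against the hash published for that index."""
    return hashlib.sha1(payload).digest() == hashes[index]


data = bytes(range(256)) * 10              # 2560 bytes of stand-in file content
pieces = split_into_pieces(data, 1024)     # hypothetical piece size
hashes = piece_hashes(pieces)

print(len(pieces), "pieces")
print("piece 0 ok:", verify_piece(0, pieces[0], hashes))
print("tampered piece ok:", verify_piece(0, b"x" + pieces[0][1:], hashes))
```

Because each piece is verified on its own, a downloader can accept pieces from strangers in any order and simply discard and re-request anything that fails the check.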

Next we’ll need a way of distributing this information to potential downloaders, and most critically, we’ll need a way to put the downloader in touch with all the users that may have parts of the file to upload.

Torrent Files

Torrent files solve the first problem. A torrent file contains the metadata for a download. Specifically, it includes:

The torrent name
A hash of the completed file
The number and size of each part
A list of files
A list of parts and part hashes for each file
Some optional metadata (creation date and torrent software, comments from the author, etc)
A list of trackers

Most of these fields are self-explanatory, and for now let’s just say “trackers” solve the second problem of putting the downloader in touch with uploaders. Now to distribute data we only need to send this .torrent file to the downloader, and they can use it to bootstrap the download and gather data from everyone. Torrent files are tiny, at most a few megabytes to represent hundreds of gigabytes or even terabytes of data, so sending a torrent file via direct download is not a demanding requirement.
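For the curious, .torrent files are serialized in a simple format called bencoding. The sketch below is a minimal encoder plus a toy single-file torrent; the tracker URL, file name, and 8-byte piece size are made up for illustration, and a real client would handle more fields and edge cases than this does.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists, and dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


payload = b"hello torrent world"   # stand-in for the file being shared
piece_length = 8                   # absurdly small, for illustration only
pieces = b"".join(
    hashlib.sha1(payload[i:i + piece_length]).digest()
    for i in range(0, len(payload), piece_length)
)

torrent = {
    "announce": "udp://tracker.example.org:6969",   # hypothetical tracker
    "info": {
        "name": "hello.txt",                        # hypothetical file name
        "length": len(payload),
        "piece length": piece_length,
        "pieces": pieces,                           # concatenated 20-byte hashes
    },
}

encoded = bencode(torrent)
# The "info-hash" that identifies a torrent is the SHA-1 of the encoded info dictionary.
info_hash = hashlib.sha1(bencode(torrent["info"])).hexdigest()
print(len(encoded), "bytes of torrent metadata, info-hash", info_hash)
```

Note how the metadata grows by only 20 bytes per piece, which is why even very large downloads have small torrent files.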

Trackers

Clearly the magic sauce here is the “trackers” field of the torrent file. A tracker acts as a rendezvous point between uploaders (“seeders” in torrent terminology) and downloaders (“leechers” in torrent-speak, at least until they begin helping upload data and become seeders themselves). The process is surprisingly simple:

A user with torrent data to share or download connects to the tracker and announces the file hash it is interested in
The user then submits its IP address and some port numbers it is reachable at
The tracker responds with a list of IP addresses and ports to reach every other user that’s recently indicated interest in the same file

Users periodically repeat this process, both to confirm with the tracker that they remain active and interested in the data, and to potentially find other users that have registered with the tracker since the last time they checked.

That’s it. The tracker doesn’t act as any kind of proxy, and it doesn’t have any information about the contents of the torrent or which pieces of the file each user possesses. It deals only in hashes and contact information.

So, doesn’t this violate the entire “decentralized” nature of torrents? There’s a single central server maintaining all the contact information that this entire system relies upon! Well, yes, but actually no. Most torrent files include a list of several trackers, instructing seeders “Please share your contact information with these five servers”, providing a good deal of redundancy. If for some reason all five servers (or however many were added to the torrent) go offline or ban the file hash and refuse to offer rendezvous services, then the data itself is still safe on the computers of all the seeders. Anyone with technical know-how can add new trackers to the torrent file, hopefully allowing the seeders to reconnect.

Nevertheless, trackers remain a point of failure in BitTorrent networks, and modern versions of the BitTorrent protocol further minimize the role of trackers using a distributed hash table, discussed below.

Indexes

What we’ve described so far is enough for the BitTorrent network to function if I email you a .torrent file to start you off. The only piece of the puzzle we’re missing is content discovery: If we want to spread torrent data further than a group of friends and make it a true community then we need a website that hosts torrent files and makes them searchable so that users can find what they’re interested in. These websites are called “indexes” and include notorious websites like The Pirate Bay. Note that indexes have an even smaller role in the network than trackers: Indexes are how new users discover new torrents, but they don’t distribute any data, and don’t even distribute contact information to reach seeders that do distribute data. If an index like The Pirate Bay goes offline all existing torrents will be completely unaffected, and the torrent files will usually be quickly reposted to an alternative index website.

An Example

Pulling it all together, to download data using a torrent, a user must:

Find the torrent file on an index site and download it to open in their favorite torrent software
Use their torrent client to connect to a set of trackers and locate other peers for the torrent data
Connect directly to those peers and ask for any pieces the user doesn’t yet have

Finally, the user downloads the file pieces from each seeder, distributing the load to maximize download speed.
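One naive way to picture that load distribution is sketched below: each missing piece is requested from whichever peer has it and currently has the fewest requests queued. This is a simplification written for this post, not the piece-selection strategy real clients use, and the peer names and piece sets are invented.

```python
def assign_pieces(missing: set[int], peer_has: dict[str, set[int]]) -> dict[str, list[int]]:
    """Spread requests for missing pieces across the peers that hold them."""
    plan: dict[str, list[int]] = {peer: [] for peer in peer_has}
    for piece in sorted(missing):
        candidates = [peer for peer, have in peer_has.items() if piece in have]
        if not candidates:
            continue  # nobody we know has this piece yet; try again later
        # Pick the candidate with the shortest queue so far.
        least_loaded = min(candidates, key=lambda peer: len(plan[peer]))
        plan[least_loaded].append(piece)
    return plan


peers = {
    "seeder-a": {0, 1, 2, 3},   # has the whole file
    "peer-b": {0, 1},           # partial download
    "peer-c": {3},              # partial download
}
print(assign_pieces({0, 1, 2, 3}, peers))
```

Even this crude version shows the point of the design: partial downloaders take requests off the seeder's hands, so the seeder's scarce upload bandwidth goes to the pieces only it can provide.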

Firewalls

In order for all of this to work, torrent peers must be able to directly connect to one another to request pieces of torrent data. Don’t firewalls (including the Network Address Translation used in almost every home network) prevent such direct connections? Yes! Fortunately most torrent clients support “Universal Plug and Play”, a protocol that allows software within the local network to speak with the router and temporarily add port forwarding rules to open connections to itself. The torrent client will open a handful of ports for data transfer (usually TCP ports 6881-6889) and then announce these ports to the tracker (often over UDP ports 6969 or 1337).

If the user is behind carrier-grade NAT (CGNAT), or is otherwise unable to use UPnP to automatically open ports, then the user will either need to manually open and forward the ports (if their ISP allows them to), or will be unable to upload data using BitTorrent.

Rewarding Helpfulness

The entire BitTorrent network relies on users contributing their own bandwidth to support the broader community, but what’s to prevent users from skimming off the top and downloading data without sharing anything back? In fact, nothing prevents this, and unsupportive leechers are common, but BitTorrent does have a means of mitigating the harm they can cause.

Every BitTorrent node allocates its upload bandwidth in proportion to the speed at which it receives data from each of its peers. In other words, if Alice uploads lots of data quickly to Bob and the data passes hash verification, then Bob will consider Alice a “good peer” and will prioritize sending them data in return.

This reward system emphasizes sharing data with peers that will share the data with others. It can alternatively be seen as a punitive system: Peers that do not enthusiastically share their data with others will be given the lowest priority. For a popular torrent with hundreds or thousands of peers, this becomes a kind of “tier” system, where high-speed uploaders are mostly paired with high-speed peers, and low-speed uploaders are mostly paired with low-speed peers.
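The proportional idea can be sketched in a few lines. Real clients implement reciprocity with a more involved scheme than this, so treat the function below as an illustration of the principle only; the function name, peer names, and rates are hypothetical.

```python
def allocate_upload(upload_capacity: float, received_rates: dict[str, float]) -> dict[str, float]:
    """Split our upload capacity among peers in proportion to what each sends us."""
    if not received_rates:
        return {}
    total = sum(received_rates.values())
    if total == 0:
        # Nobody has sent us anything yet, so share evenly to get things started.
        even_share = upload_capacity / len(received_rates)
        return {peer: even_share for peer in received_rates}
    return {
        peer: upload_capacity * rate / total
        for peer, rate in received_rates.items()
    }


# Bob has 100 KB/s of upload to give; rates are what each peer currently sends Bob.
print(allocate_upload(100, {"alice": 30, "carol": 10, "dave": 0}))
```

In this toy run Alice, who sends Bob the most, receives three quarters of his upload capacity, while Dave, who sends nothing, receives nothing.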

Magnet Links and the Distributed Hash Table

In a modern torrent system the network is even more thoroughly decentralized, and trackers are relegated to a tiny (but still critical) role of making first introductions. The idea is to distribute the previous role the trackers held across every participating peer: By using a distributed hash table (DHT), peers can each cache part of the “hash -> contact information” dataset the trackers held, and users can now find peers for torrents they’re interested in by asking for introductions from peers they’ve met through previous torrents.

The effect is that once you’re “in” the peer-to-peer BitTorrent network, you never have to leave again, and can perform all BitTorrent operations from within the peer-to-peer space. The only time a tracker is needed is when a new peer who doesn’t know anyone else in the network yet wants to join, or an old peer has been disconnected for long enough that its peering information is out of date and it can no longer reach anyone. In these cases, it remains necessary to contact a tracker and acquire an initial set of peers to enter the peer-to-peer network.
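How does a peer know which other peers are caching the contact information for a given hash? BitTorrent's DHT is based on a design called Kademlia, in which node IDs and torrent hashes share one 160-bit number space and "closeness" is measured by XOR. The sketch below shows only that distance rule; the node names are invented, and deriving IDs by hashing names is an assumption made for repeatability (real nodes pick their own IDs).

```python
import hashlib


def sha1_int(data: bytes) -> int:
    """Map arbitrary bytes onto the 160-bit ID space as an integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Distance between two IDs: their bitwise XOR, read as a number."""
    return a ^ b


def closest_nodes(target: int, node_ids: list[int], k: int = 3) -> list[int]:
    """The k node IDs nearest to the target under XOR distance."""
    return sorted(node_ids, key=lambda node: xor_distance(node, target))[:k]


# Hypothetical nodes, keyed by ID so we can print readable names.
nodes = {sha1_int(name.encode()): name for name in ["alice", "bob", "carol", "dave", "erin"]}
info_hash = int("c8dd895fbc6cd38850205bf09c76a9b716b2cd87", 16)

# The nodes closest to the hash are the ones asked to store, and later
# return, the list of peers interested in that torrent.
responsible = closest_nodes(info_hash, list(nodes), k=2)
print([nodes[node_id] for node_id in responsible])
```

Because every participant applies the same distance rule, anyone holding the hash can work out whom to ask without a central directory.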

So what do torrents look like in this new DHT network? They no longer require torrent files at all. All the fields stored in the .torrent file can be stored in the distributed hash table instead, so the information a peer needs to start their download is reduced to a single string containing the hash, called a magnet link: magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87

From that string alone, the torrent client can identify the “exact topic” (xt), which is a “uniform resource name” (urn) consisting of a “BitTorrent info-hash” (btih), which is just a hex-encoded SHA-1 hash of the “info” section of the torrent file (the part describing the files and their pieces). The torrent client knows to make a query to the distributed hash table for metadata and peers for the torrent with the hash c8dd895fbc6cd38850205bf09c76a9b716b2cd87, and from there begins the download.

We can optionally include additional information in the magnet link to make it more useful, including the filename (with the “dn” field), file length (with the “exact length” or “xl” field), and trackers to fall back to if the client doesn’t have DHT peers already (with the “tr” field). Therefore a more informative magnet link might look like:

magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87&dn=Stuxnet.zip&xl=7911599&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.blackunicorn.xyz%3A6969

If the tracker information is left off then the magnet link is only usable by clients connected to the DHT, and anyone else must first download some other torrents using trackers to become established, and then try again.
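Since a magnet link is just a URI with a query string, Python's standard library can take the longer example above apart. The `parse_magnet` helper and the keys of the dictionary it returns are my own naming, not part of any standard.

```python
from urllib.parse import parse_qs, urlsplit


def parse_magnet(link: str) -> dict:
    """Extract the info-hash, display name, length, and trackers from a magnet link."""
    parts = urlsplit(link)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet link")

    params = parse_qs(parts.query)   # also decodes %3A, %2F, and friends
    prefix = "urn:btih:"
    xt = params.get("xt", [""])[0]
    if not xt.startswith(prefix):
        raise ValueError("missing BitTorrent info-hash")

    return {
        "info_hash": xt[len(prefix):].lower(),
        "name": params.get("dn", [None])[0],
        "length": int(params["xl"][0]) if "xl" in params else None,
        "trackers": params.get("tr", []),   # "tr" may repeat, so keep a list
    }


link = (
    "magnet:?xt=urn:btih:c8dd895fbc6cd38850205bf09c76a9b716b2cd87"
    "&dn=Stuxnet.zip&xl=7911599"
    "&tr=udp%3A%2F%2Fopen.demonii.com%3A1337"
    "&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969"
    "&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969"
    "&tr=udp%3A%2F%2Ftracker.blackunicorn.xyz%3A6969"
)
info = parse_magnet(link)
print(info["info_hash"], info["name"], info["length"])
for tracker_url in info["trackers"]:
    print("fallback tracker:", tracker_url)
```

A client with an empty `trackers` list and no DHT contacts has nowhere to start, which is the situation described above.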

These magnet links are tiny because no “piece” information is stored in them, and are therefore convenient for texting or emailing. The smaller size significantly reduces the storage space needed to run a torrent index, and so further supports decentralization and redundancy in the torrent ecosystem.

Architectural Takeaways

BitTorrent provides an excellent example of how to run any peer-to-peer system: using a set of central servers to make introductions, then switching to direct connections to exchange information. Ideally a distributed hash table means the central servers are only needed to bootstrap new users, who can rely solely on direct connections from that point on. While BitTorrent is used for file sharing, there’s no reason the same architecture can’t be used for other distributed systems.

Indeed, Bitcoin uses a similar network for storing its distributed blockchain, except that it has a hard-coded list of starting peers in the Bitcoin software and relies on these peers, and on peers passing along the addresses of other peers, instead of trackers. The Tor Project also uses a similar system, where its ten or so hard-coded directory servers provide contact information for Tor nodes, but once the contact list is downloaded a Tor client acts independently and connects directly to the nodes it selects. The Inter-Planetary File System stores files in a DHT as a kind of hybrid between the way we think of the World Wide Web and torrents, and similar to Bitcoin uses a list of “bootstrap peers” included in the software for identifying other members of the DHT.