Torrent Health Monitoring

Posted 8/21/2022

Distributed Denial of Secrets publishes most of our datasets via torrent. This minimizes infrastructural requirements for us: every time someone downloads a release, if they leave their torrent client running, they help us upload to other interested people. Once many people have mirrored our release it can remain available even if we stop seeding, completely self-hosted by the public. This is ideal, because with our budget we’re unable to maintain seed boxes for every release simultaneously; we keep offline backups, but virtual machine storage is more expensive.

This system typically works well, especially for rapidly distributing new releases. However, occasionally an older release will become unavailable, either because interest has waned and seeds have dropped offline, or because the trackers used by the torrent are no longer functional. If someone reports that a torrent is unavailable then we can pull the data from our backups and resume seeding, and issue a new magnet link containing an updated list of trackers. Unfortunately, that’s reactive, slow, and tedious. How can we proactively monitor the availability of all our torrents, to notify us when one requires attention?

Specifically, we want to build a dashboard that displays a list of torrents, and for each indicates how many trackers are online, how many peers those trackers are aware of, and how many peers can be found in the distributed hash table (DHT). It should track this information over the course of a day, a week, and a month, so we can distinguish between short-term and permanent loss of availability.

Every torrent client has the functionality to locate peers through trackers, and most modern clients can also find peers through the DHT. However, most clients do not provide a way to use that functionality without starting a download for the torrent, nor do they provide a way to export that peer information so we can plot availability over time. There are a few libraries for handling torrents, like libtorrent, but these also don’t easily expose peer-discovery independently from downloading. Fortunately, there are libraries for performing bittorrent DHT lookups, so our primary hurdle is implementing the client side of the bittorrent tracker protocol, described in BEP 0003, BEP 0015, and BEP 0023.

How do torrent trackers work?

Torrent trackers are conceptually simple:

1. A torrent or magnet link contains a list of trackers
2. Any client interested in downloading the torrent data contacts each tracker
3. The client announces the hash of the torrent they’re interested in, registering their interest with the tracker
4. The tracker returns a list of any other IP addresses that have recently registered interest in the same content
5. The client periodically re-registers its interest with the tracker, to identify any new peers, and ensure it remains discoverable to others

From there the client contacts each discovered peer directly, and negotiates a download. Since we’re only interested in peer discovery, we don’t have to follow along further than this.

Clients can communicate with trackers using two protocols: older trackers communicate using HTTP, but far more common is the newer, simpler, faster UDP-based protocol. In both protocols, clients can make announce requests, which announce their interest in a torrent, and scrape requests, which fetch some aggregated metadata about the number of clients interested in a torrent.

Unfortunately, scrape requests have little utility for our purposes: If one tracker says that it knows 7 peers, and another tracker says it knows 3, how many peers are there? 7? 10? Somewhere in-between? We can’t aggregate information across trackers without fetching the list of peer IP addresses from each tracker, which requires using an announce request.
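To make the aggregation problem concrete, here is a minimal sketch with made-up data. The tracker URLs and peer addresses are hypothetical; the point is only that peer lists can be de-duplicated with a set union, while bare counts cannot.

```python
# Hypothetical announce results: each tracker maps to the set of
# (ip, port) peers it returned. None of these are real trackers or peers.
peers_by_tracker = {
    "udp://tracker-a.example:1337": {
        ("1.2.3.4", 4567),
        ("2.3.4.5", 5678),
        ("3.4.5.6", 6789),
    },
    "udp://tracker-b.example:6969": {
        ("2.3.4.5", 5678),  # also known to tracker-a
        ("9.8.7.6", 1111),
    },
}


def count_unique_peers(peers_by_tracker):
    """Union the per-tracker sets so a peer seen by several trackers counts once."""
    unique = set()
    for peers in peers_by_tracker.values():
        unique |= peers
    return len(unique)


# What we would get by naively adding up scrape-style counts instead.
scrape_style_sum = sum(len(peers) for peers in peers_by_tracker.values())

print(count_unique_peers(peers_by_tracker), "unique peers;",
      scrape_style_sum, "if counts are simply added")
```

With only the per-tracker counts (3 and 2) there is no way to tell that one peer is shared, which is why the rest of this post relies on announce requests.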

The tracker HTTP API

The tracker HTTP protocol is deceptively simple. A tracker URL looks something like http://tracker.opentrackr.org:1337/announce. This contains the domain name of the tracker, the port number, and the resource for the request (typically “announce”). To send a request, the client adds several fields:

Field	Description
info_hash	A URL-encoded version of the torrent’s 20-byte SHA-1 hash
peer_id	A random string uniquely identifying the client
port	The port number on which the client can be reached
uploaded	The number of bytes the client has already uploaded
downloaded	The number of bytes the client has downloaded
left	How many bytes the client still needs to download

Therefore a full request to a tracker may look something like:

http://tracker.opentrackr.org:1337/announce?info_hash=%5Bg%03%95%28%0A%3F%3F**%0A%CFs%D4K%2C%CE%0F%E1%AE&peer_id=foo&port=1234&uploaded=0&downloaded=0&left=0

Note that the uploaded, downloaded, and left fields are required, but are only hints. If the client is downloading a magnet link, it may not know how large the torrent data is, and therefore how much is left to download. This self-reported metadata isn’t verified in any way; the tracker just uses it to report some analytics.
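As a rough sketch, an announce URL like the one above can be assembled with Python’s standard library. The function name build_announce_url is made up for illustration, and the info hash is simply the 20 bytes from the example URL written out in hex.

```python
from urllib.parse import urlencode


def build_announce_url(tracker_url, info_hash, peer_id, port):
    """Append the announce fields to a tracker URL as a query string.

    info_hash is passed as raw bytes; urlencode percent-encodes every byte
    that is not an unreserved URL character.
    """
    params = {
        "info_hash": info_hash,
        "peer_id": peer_id,
        "port": port,
        # We never download anything, so these hints are all zero.
        "uploaded": 0,
        "downloaded": 0,
        "left": 0,
    }
    return tracker_url + "?" + urlencode(params)


# The same 20-byte hash as in the example request above.
info_hash = bytes.fromhex("5b670395280a3f3f2a2a0acf73d44b2cce0fe1ae")
url = build_announce_url(
    "http://tracker.opentrackr.org:1337/announce", info_hash, "foo", 1234
)
print(url)
```

The output differs from the example only in that `*` is written as `%2A`, which decodes to the same byte.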

Once the client makes an announce request to a tracker, the tracker responds with either an HTTP error, or with a text-encoded dictionary describing available peer data for the torrent. Great, so does the tracker respond with some JSON? XML? YAML? No, it responds with Bencode! This is a custom text-encoding scheme made for bittorrent metadata that can encode:

Field type	Encoding rule	Example
integers	Prefix with an `i`, then the integer in ascii-base10, then an `e`	`7` becomes `i7e`
bytestrings	Length-prefixed, then a colon, then the string	“foo” becomes `3:foo`
lists	Start with an `l`, then the contents of the list, then an `e`	`[2,3]` becomes `li2ei3ee`
dictionaries	Start with a `d`, then the contents of the dictionary, then an `e`. Each entry consists of a string key, followed immediately by a value, with keys in sorted order	`{"foo": 1, "bar": 2}` becomes `d3:bari2e3:fooi1ee`
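The four rules in the table are enough to write a small decoder. The following is a minimal sketch rather than a hardened parser: the names bdecode and _decode_at are made up here, bytestrings are returned as Python bytes, and it deliberately accepts dictionaries whose keys are out of order, since real trackers do not always follow the spec.

```python
def bdecode(data: bytes):
    """Decode exactly one bencoded value; raise ValueError on malformed input."""
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


def _decode_at(data, i):
    """Decode the value starting at offset i; return (value, offset after it)."""
    if i >= len(data):
        raise ValueError("unexpected end of data")
    lead = data[i:i + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":  # list: l<items>e
        items, i = [], i + 1
        while data[i:i + 1] != b"e":
            item, i = _decode_at(data, i)
            items.append(item)
        return items, i + 1
    if lead == b"d":  # dictionary: d<key><value>...e
        result, i = {}, i + 1
        while data[i:i + 1] != b"e":
            key, i = _decode_at(data, i)
            if not isinstance(key, bytes):
                raise ValueError("dictionary keys must be bytestrings")
            value, i = _decode_at(data, i)
            result[key] = value
        return result, i + 1
    if lead.isdigit():  # bytestring: <length>:<bytes>
        colon = data.index(b":", i)
        length = int(data[i:colon])
        start = colon + 1
        if start + length > len(data):
            raise ValueError("string length runs past end of data")
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {i}")


print(bdecode(b"li2ei3ee"))
```

A failure response such as `d14:failure reason12:rate-limitede` decodes to a dictionary with the single key `b"failure reason"`.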

The tracker may respond with a Bencoded dictionary containing the key “failure reason” and a value of some explanatory text string like “this tracker doesn’t have information on that torrent” or “you’ve been rate-limited”. Otherwise, it’ll respond in one of two ways:

Bencoded dictionaries

In the older bittorrent tracker standard 3, trackers respond with a dictionary containing the key peers and a value of a list, where each entry is a dictionary, containing contact information for that peer. For example (translated to json):

{
    "peers":
        [
            {"ip": "1.2.3.4", "port": 4567},
            {"ip": "2.3.4.5", "port": 5678}
        ]
}

Or in the glorious bencode:

d5:peersld2:ip7:1.2.3.44:porti4567eed2:ip7:2.3.4.54:porti5678eeee

There may be a variety of other keys (a “peer ID” in the peer dictionary, or metadata like “number of seeds, peers, and leeches” at the top level), but this is all we need for our purposes.

Bencoded compact bytestring

All this text encoding gets a little tedious, so in an amendment to the tracker spec (standard 23), trackers may now instead return a binary string in the “peers” field, like:

{
    "peers": "\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2E"
}

Or in bencode again:

d5:peers12:\x01\x02\x03\x04\x11\xD7\x02\x03\x04\x05\x16\x2Ee

This is equivalent to the dictionary above: the first four bytes are an integer IP address (1.2.3.4), followed by two bytes for a port (0x11D7 is 4567), then another six bytes for the next IP address and port. All values are big-endian. The hex-escaping is added here for illustration purposes; the tracker would return those raw bytes.

While this string compression doesn’t save much in our two-peer example, it’s significantly more compact when handling dozens or hundreds of peers.
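Since a tracker may answer in either form, a client has to accept both. Below is a minimal sketch, assuming the response has already been bencode-decoded into a Python dict with bytes keys; extract_peers is a made-up helper name, and only IPv4 compact entries are handled.

```python
import ipaddress
import struct


def extract_peers(response: dict):
    """Return a list of (ip, port) tuples from a decoded announce response."""
    peers = response[b"peers"]
    if isinstance(peers, bytes):
        # Compact form: 6 bytes per peer, a 4-byte IPv4 address followed by
        # a 2-byte port, both in network byte order.
        if len(peers) % 6 != 0:
            raise ValueError("compact peer string is not a multiple of 6 bytes")
        result = []
        for offset in range(0, len(peers), 6):
            raw_ip, port = struct.unpack_from("!4sH", peers, offset)
            result.append((str(ipaddress.IPv4Address(raw_ip)), port))
        return result
    # Older form: a list of dictionaries with "ip" and "port" keys.
    return [(peer[b"ip"].decode("ascii"), peer[b"port"]) for peer in peers]


compact = bytes([1, 2, 3, 4, 0x11, 0xD7, 2, 3, 4, 5, 0x16, 0x2E])
print(extract_peers({b"peers": compact}))
```

Returning plain tuples from both branches means the caller can drop the results into a set without caring which format the tracker chose.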

The tracker UDP API

HTTP is unwieldy. It takes many packets, the server might use gzip compression, maybe the server requires HTTPS, or goes through some redirects before responding. Once the server responds, it might respond with a variety of HTTP errors, and while it should respond with bencoded data, servers often return HTML in error. Even when they return bencoded data, they sometimes follow the bencode spec incorrectly. In short, supporting HTTP in torrent clients is a complicated mess. But it doesn’t need to be this way! The information the client and server are exchanging is relatively simple, and we can communicate it in just a handful of UDP packets. So begins bittorrent specification 15.

First, we need to perform a handshake with the server:

The client sends a magic number confirming that they are using the torrent tracker protocol, as opposed to random Internet traffic like a port scan. Then they send an action (0: connect), and a random transaction ID to identify datagrams connected to this session.

If the tracker is online, it will respond to complete the handshake:

The tracker sends back action 0 (responding to the connect request), the same transaction ID the client sent, and a random connection ID. The client will include this connection ID in future datagrams. This handshake prevents IP address spoofing, as used in DNS amplification attacks where an attacker coerces a DNS server into flooding a third party with traffic.
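The two handshake datagrams can be sketched with Python’s struct module. The magic number 0x41727101980 and the 16-byte layouts come from BEP 15; the function names are made up for illustration, and actually sending the datagram over a UDP socket (with timeouts and retries) is left out.

```python
import os
import struct

PROTOCOL_ID = 0x41727101980  # magic constant defined in BEP 15
ACTION_CONNECT = 0


def new_transaction_id() -> int:
    """Pick a random 32-bit transaction ID."""
    return int.from_bytes(os.urandom(4), "big")


def build_connect_request(transaction_id: int) -> bytes:
    """64-bit magic number, 32-bit action, 32-bit transaction ID (big-endian)."""
    return struct.pack("!QII", PROTOCOL_ID, ACTION_CONNECT, transaction_id)


def parse_connect_response(data: bytes, transaction_id: int) -> int:
    """Validate the tracker's reply and return the 64-bit connection ID."""
    if len(data) < 16:
        raise ValueError("connect response too short")
    action, received_id, connection_id = struct.unpack_from("!IIQ", data)
    if action != ACTION_CONNECT:
        raise ValueError(f"unexpected action {action}")
    if received_id != transaction_id:
        raise ValueError("transaction ID mismatch")
    return connection_id


request = build_connect_request(new_transaction_id())
print(len(request), "byte connect request")
```

Checking the echoed transaction ID is what lets the client discard stray or stale datagrams that arrive on the same socket.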

The client may now send its announce request (action code 1: announce):

This uses the connection ID from the previous step and a new transaction ID, followed by the info hash of the torrent, and a peer ID representing this client. Then the client sends some metadata regarding how far along its download is (matching the downloaded, left, and uploaded fields in the HTTP spec), along with an event code (such as started or stopped). Finally, the client sends the IP address and port it can be reached at, a key identifying the client, and a num_want field giving how many peers it would like back (-1 for the tracker’s default). Trackers will typically ignore the IP address field and use the IP that sent the request (again to prevent spoofing).
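For reference, here is a sketch of packing the 98-byte announce datagram in the field order BEP 15 defines. The function name and defaults are assumptions made for illustration; since we only want peer lists, the progress counters and event are all sent as zero.

```python
import struct

ACTION_ANNOUNCE = 1


def build_announce_request(connection_id, transaction_id, info_hash, peer_id,
                           port, key=0, num_want=-1):
    """Pack a BEP 15 announce request (98 bytes, big-endian)."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must each be 20 bytes")
    return struct.pack(
        "!QII20s20sQQQIIIiH",
        connection_id,    # 64-bit ID returned by the connect handshake
        ACTION_ANNOUNCE,  # 32-bit action
        transaction_id,   # 32-bit ID, fresh for this request
        info_hash,        # 20-byte torrent hash
        peer_id,          # 20-byte client identifier
        0,                # downloaded (bytes)
        0,                # left (bytes)
        0,                # uploaded (bytes)
        0,                # event: 0 means none
        0,                # IP address: 0 lets the tracker use the sender's
        key,              # 32-bit key identifying the client
        num_want,         # peers wanted, -1 for the tracker's default
        port,             # 16-bit port the client listens on
    )


packet = build_announce_request(1, 2, b"\x00" * 20, b"-XX0001-abcdefghijkl", 6881)
print(len(packet), "byte announce request")
```

The peer ID shown is an arbitrary 20-byte placeholder, not one belonging to any real client.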

If the client has both an IPv4 and an IPv6 address, and is therefore looking for both v4 and v6 peers, then it must make two announce requests, over v4 and v6, using the same key. This allows the tracker to avoid “double-counting” the number of peers interested in a torrent.

Finally, the tracker responds with peer data:

Here, the action and transaction ID match the previous datagram, and the interval indicates how long the client should cache results for before polling the tracker again. The leechers and seeders counts are the tracker’s guess as to how many peers are mostly-downloading or mostly-uploading based on the downloaded, left, and uploaded fields from each announce request. These counts are not authoritative, or confirmed by the tracker in any way.

And at last, the tracker responds with a series of IP addresses and port numbers: 4 bytes per address (assuming IPv4, 16 bytes for IPv6), and two bytes per port number.
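Parsing the reply is the mirror image of building the request. This is a minimal sketch that assumes an IPv4 response (6 bytes per peer); parse_announce_response is a made-up name, and a real client would also want to handle the tracker’s error action (3) more gracefully than raising.

```python
import ipaddress
import struct

ACTION_ANNOUNCE = 1


def parse_announce_response(data: bytes, transaction_id: int):
    """Return (interval, leechers, seeders, peers) from an announce reply."""
    if len(data) < 20:
        raise ValueError("announce response too short")
    action, received_id, interval, leechers, seeders = struct.unpack_from(
        "!IIIII", data
    )
    if action != ACTION_ANNOUNCE:
        raise ValueError(f"unexpected action {action}")
    if received_id != transaction_id:
        raise ValueError("transaction ID mismatch")
    peer_bytes = data[20:]
    peers = []
    # Walk the remainder six bytes at a time, ignoring any partial trailing entry.
    for offset in range(0, len(peer_bytes) - len(peer_bytes) % 6, 6):
        raw_ip, port = struct.unpack_from("!4sH", peer_bytes, offset)
        peers.append((str(ipaddress.IPv4Address(raw_ip)), port))
    return interval, leechers, seeders, peers


sample = struct.pack("!IIIII", 1, 7, 1800, 3, 2) + bytes(
    [1, 2, 3, 4, 0x11, 0xD7, 2, 3, 4, 5, 0x16, 0x2E]
)
print(parse_announce_response(sample, 7))
```

The sample datagram is hand-built for the example; it carries the same two peers used in the HTTP examples earlier.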

That’s all there is to the UDP protocol! Keep in mind that all values should follow network byte-order (big endian). While the diagrams make this protocol look complicated, there’s far less parsing or error handling needed than for the HTTP version, no external libraries required, and the entire exchange occurs in just 4 packets. No wonder the majority of torrents only use UDP trackers!

Creating the dashboard

With the tracker protocol implemented, we can take a list of torrents, extract their list of trackers, and look up peers from each tracker. We can also look up peers in the DHT using third party code. From here, it’s a simple process to make a SQL database to track all that information with timestamps, select results from those tables based on their age, and at last throw up an interface to peruse it:
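As an illustration of the storage side, here is a minimal sketch using SQLite from Python’s standard library. The table layout, column names, and function names are all assumptions made for this example and are not taken from the released code.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) observations table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS observations (
               info_hash   TEXT    NOT NULL,
               source      TEXT    NOT NULL,  -- a tracker URL, or 'dht'
               peer_count  INTEGER NOT NULL,
               observed_at INTEGER NOT NULL   -- Unix timestamp
           )"""
    )
    return db


def record(db, info_hash, source, peer_count, observed_at=None):
    """Store one peer-count observation, timestamped now unless told otherwise."""
    if observed_at is None:
        observed_at = int(time.time())
    db.execute(
        "INSERT INTO observations VALUES (?, ?, ?, ?)",
        (info_hash, source, peer_count, observed_at),
    )
    db.commit()


def max_peers_since(db, info_hash, max_age_seconds, now=None):
    """Highest peer count seen for a torrent within the given time window."""
    now = int(time.time()) if now is None else now
    row = db.execute(
        "SELECT MAX(peer_count) FROM observations "
        "WHERE info_hash = ? AND observed_at >= ?",
        (info_hash, now - max_age_seconds),
    ).fetchone()
    return row[0] or 0


db = open_db()
record(db, "5b670395280a3f3f2a2a0acf73d44b2cce0fe1ae", "dht", 12)
print(max_peers_since(db, "5b670395280a3f3f2a2a0acf73d44b2cce0fe1ae", 86400))
```

Calling max_peers_since with windows of a day, a week, and a month gives the three views described at the start: a torrent with zero peers today but several this week is probably a short-term outage rather than a permanent loss.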

Screenshot of the DDoSecrets torrent health dashboard

In the hopes that this code might benefit others, it’s been released on GitHub.