Posted 11/14/16
Puddle is a prototype anonymity network designed and implemented by a friend (Taylor Dahlin) and me. It lacks any cryptography, but explores how to perform anonymous routing and decentralized information lookups. Below is an adaptation of our research paper describing the system.
Information online is shared in two distinct styles. The first is in terms of specific data: a single email, or a URL to an exact webpage. This form is convenient for storing information, but makes finding information tedious. The second style, exemplified by search engines, deals in non-specific information: a user requests data on a topic, and the search engine returns all relevant specific data.
Information in anonymity networks, or “darknets”, is almost exclusively shared in the specific-data style. Search engines seem to be the antithesis of anonymity: a traditional search engine maintains an index of all information on a network and where it is located, and is also in a perfect position to track which data users are interested in. Unfortunately, this makes discovering information on networks like Tor and I2P frustrating and unapproachable, even for technical users.
Puddle is an attempt at a distributed and anonymous search engine. As with an engine like Google, a client can send out a request for information on a particular subject, and the network will return relevant files. However, unlike public search engines, Puddle has no central index of information, nor any bottleneck where requests can easily be traced back to their origin.
Puddle is implemented as an HTTP API, which provides a simple framework for requests and responses: GET requests represent requests for information, while PUT requests represent uploads of data related to a GET's topic.
HTTP requests are rippled through connected nodes using a “time-to-live” (TTL), ensuring that requests neither bounce indefinitely nor require a specific route.
Each information request is sent with two time-to-live values, formatted as follows:
```
GET /relay/:ttlOrig/:ttlCurrent/:topic
```
The “current” TTL acts like a normal TTL: it is decremented at each hop through the network, and the message is discarded if the TTL reaches zero.
The “original” TTL is used for responses, telling each relay how high a TTL it must set to guarantee that its response will reach the source. Both time-to-lives are slightly randomized, so it is impossible to determine where a message originated unless an adversary controls every neighbor of the originating relay.
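A minimal sketch of this logic in Ruby looks something like the following; the names, the default TTL, the jitter range, and the `forward_to` helper are all illustrative, not the exact values or interfaces from our implementation:

```ruby
DEFAULT_TTL = 6 # illustrative default, not the value used in practice

# Slightly randomize both TTLs at origination so a neighbor cannot tell
# whether a request started here or is merely passing through.
def new_request_ttls
  ttl = DEFAULT_TTL + rand(-1..1)
  [ttl, ttl] # [ttl_orig, ttl_current]
end

# Relay an incoming request: decrement the current TTL, drop the message
# once it hits zero, and otherwise forward it to every connected peer.
# forward_to is a hypothetical helper standing in for the HTTP call.
def relay(ttl_orig, ttl_current, topic, peers)
  ttl_current -= 1
  return if ttl_current <= 0
  peers.each { |peer| forward_to(peer, ttl_orig, ttl_current, topic) }
end
```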
Responses to information requests are formatted as:
```
PUT /relay/:ttl/:topic/:filename
```
The response uses only a single time-to-live, responds to a specific topic, and specifies the appropriate filename. The body of the PUT request is the file data itself.
In both requests the topic and filename are base64-encoded and then URL-encoded, ensuring that maliciously chosen filenames or topics cannot malform the URL.
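In Ruby this encoding takes two standard-library calls; a sketch, with helper names of our own choosing:

```ruby
require "base64"
require "cgi"

# Base64-encode, then URL-encode, so arbitrary topics and filenames
# cannot break the /relay/... URL structure.
def encode_component(raw)
  CGI.escape(Base64.strict_encode64(raw))
end

def decode_component(encoded)
  Base64.strict_decode64(CGI.unescape(encoded))
end
```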
We implemented Puddle in Ruby for rapid prototyping in a loosely typed language. Our implementation is broken into five segments:
Request Processing - Receives HTTP requests, both requests for information and responses carrying it
File Processing - Manages files being shared, downloaded, or cached
Signalling - Sends new requests and forwards requests from other relays
Client - Handles interaction with the user via a web interface
Peering - Manages adding new peers, removing defunct ones, and integrating into the larger network
We use Sinatra (a lightweight Ruby web framework) to register handlers for each type of request. These handlers then pass work off to the file or signalling modules as needed.
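A stripped-down sketch of what those handlers might look like; the FileProcessor and Signalling module names are stand-ins for the real modules:

```ruby
require "sinatra"

get "/relay/:ttlOrig/:ttlCurrent/:topic" do
  # Answer from local files if we have anything relevant, then ripple
  # the request onward to our peers.
  FileProcessor.respond_if_relevant(params[:topic], params[:ttlOrig].to_i)
  Signalling.forward_request(params[:ttlOrig].to_i,
                             params[:ttlCurrent].to_i,
                             params[:topic])
  200
end

put "/relay/:ttl/:topic/:filename" do
  # Cache the file locally and ripple the response onward.
  FileProcessor.cache(params[:topic], params[:filename], request.body.read)
  Signalling.forward_response(params[:ttl].to_i, params[:topic], params[:filename])
  200
end
```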
The file processing module reads the files in a “data” folder and determines what topics they relate to. When the module receives a request it checks whether it has relevant data and, if so, sends those files to the signalling module. It also manages a cache of files that have passed through the relay recently, ensuring that information frequently requested on the network propagates and becomes faster to fetch in the future.
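The lookup itself can be as simple as a scan over the data folder; a sketch, where the exact matching rule (a case-insensitive substring check on filenames) is our illustration:

```ruby
DATA_DIR = "data"

# Return the shared files whose names mention the topic.
def relevant_files(topic)
  Dir.glob(File.join(DATA_DIR, "*")).select do |path|
    File.basename(path).downcase.include?(topic.downcase)
  end
end
```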
The signalling module sends HTTP requests to other relays. It maintains a thread pool so that requests can pile up and be sent incrementally; otherwise it is mostly a wrapper around the different HTTP requests. We use the “patron” library to send HTTP requests for us, simplifying the networking code substantially.
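A sketch of that worker pattern with patron, where the pool size, timeout, and names are illustrative:

```ruby
require "patron"

OUTBOX  = Queue.new
WORKERS = 4.times.map do
  Thread.new do
    loop do
      peer, path, body = OUTBOX.pop
      begin
        session = Patron::Session.new
        session.base_url = peer # e.g. "http://peer.example:4567"
        session.timeout  = 10
        body ? session.put(path, body) : session.get(path)
      rescue Patron::Error
        # A dead or slow peer; drop the message and let the peering
        # module prune the peer later.
      end
    end
  end
end

# Queue an outgoing request for delivery; returns immediately.
def enqueue(peer, path, body = nil)
  OUTBOX << [peer, path, body]
end
```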
The client module is similar to the “request processing” module except that it maintains an internal website for human users to input requests, retrieve results, and view the current state of the node. This part of the website is restricted so it can only be accessed from localhost, providing a modicum of authentication.
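In Sinatra that restriction is a short guard; a sketch, assuming an illustrative /client route prefix:

```ruby
# Reject any client-interface request that does not come from this machine.
before "/client*" do
  halt 403 unless request.ip == "127.0.0.1"
end
```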
Our implementation of Puddle is only a proof of concept and has many shortcomings. However, it achieves its goal of creating a decentralized and anonymous search system. We hope that other researchers will see our design and build on it to create a more complete network.
There are several areas our implementation does not yet touch on. First, we do not use cryptography. Encryption is not strictly necessary for anonymity in Puddle, since privacy is protected by the randomized time-to-live values that conceal the origins of messages. However, cryptography is necessary to protect against Sybil attacks: we would need a method similar to Pisces’ cryptographically signed routing tables to detect malicious relays.
Puddle is also vulnerable to denial-of-service attacks. Since each message ripples from a relay to all of its peers, a great deal of bandwidth is used, and tightly linked networks see a high level of message duplication. One potential solution is to use random walks through a small number of neighboring relays rather than broadcasting messages to everyone. This limits the bandwidth used, but also limits how many relays are reached, potentially missing hosts with relevant content. Relays would need to re-transmit requests for as long as the user remains interested, to keep the likelihood of reaching a relevant relay high.
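A random walk would replace the broadcast in the relay sketch above with a small random sample of peers; WALK_WIDTH here is an illustrative parameter, and forward_to is the same hypothetical helper as before:

```ruby
WALK_WIDTH = 2 # how many neighbors each relay forwards to

# Instead of rippling to every peer, forward down a few random walks.
def forward_random_walk(ttl_orig, ttl_current, topic, peers)
  peers.sample(WALK_WIDTH).each do |peer|
    forward_to(peer, ttl_orig, ttl_current - 1, topic)
  end
end
```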
Finally, our implementation uses a trivial definition of “relevant” content, determining which files to upload solely from their filenames. A next step would be implementing some type of tagging system, or applying natural language processing to files to automatically determine what content is relevant to a request. Such a solution would also need to account for filename collisions and for extremely large files that could clog the network.