Distinguishing In-Groups from Onlookers by Language Use

Posted 6/8/2022

This post is a non-academic summary of my most recent paper, which can be found here. It’s in a similar theme as a previous paper, which I discussed here, but this post can be read on its own. An enormous thank you to my fantastic co-authors Josh Minot, Sam Rosenblatt, Guillermo de Anda Jáuregui, Emily Moog, Briane Paul V. Samson, Laurent Hébert-Dufresne, and Allison M. Roth.

If you wanted to find QAnon believers on Twitter, YouTube, or Reddit, you might search for some of their flavorful unique vocabulary like WWG1WGA (“Where we go one, we go all”). To find cryptocurrency enthusiasts, you might search for in-group phrases like HODL or WAGMI, or “shitcoins”, or specific technologies like “NFT” or “ETH”. This works well for new, obscure communities, when no one else has picked up on their vocabulary. However, once a community reaches the limelight, the keyword-search strategy quickly deteriorates: a search for “WWG1WGA” is now as likely to find posts discussing QAnon, or ridiculing them, as it is to identify true believers.

Human observers with some contextual understanding of a community can quickly distinguish between participants in a group, and discussion about (or jokes about) a group. Training a computer to do the same is decidedly more complicated, but would allow us to examine exponentially more posts. This could be useful for tasks like identifying covid conspiracy communities (but distinguishing them from people talking about the conspiracists) or identifying a hate group (but distinguishing from people discussing hate groups). This, in turn, could help us to study the broad effects of deplatforming, by more systematically examining where communities migrate when they’re kicked off a major site. Those possibilities are a long way off, but distinguishing participants in a group from onlookers talking about the group is a step towards the nuance in language processing we need.

Setup

Our study focuses on a simple version of this problem: given a subreddit representing an in-group, and a subreddit dedicated to discussing the in-group, automatically label commenters as being part of the in-group or onlookers based on the text of their comments. We use the following list of subreddit pairs:

In-Group Onlooker Description
r/NoNewNormal r/CovIdiots NoNewNormal discussed perceived government overreach and fear-mongering related to Covid-19
r/TheRedPill r/TheBluePill TheRedPill is part of the “manosphere” of misogynistic anti-feminist communities
r/BigMouth r/BanBigMouth Big Mouth is a sitcom focusing on puberty; BanBigMouth claimed the show was associated with pedophilia and child-grooming, and petitioned for the show to be discontinued
r/SuperStraight r/SuperStraightPhobic SuperStraight was an anti-trans subreddit, SuperStraightPhobic antagonized its userbase and content
r/ProtectAndServe r/Bad_Cop_No_Donut ProtectAndServe is a subreddit of verified law-enforcement officers, while Bad_Cop_No_Donut documents law enforcement abuse of power and misconduct
r/LatterDaySaints r/ExMormon LatterDaySaints is an unofficial subreddit for Mormon practitioners, while ExMormon hosts typically critical discussion about experiences with the church
r/vegan r/antivegan Vegan discusses cooking tips, environmental impact, animal cruelty, and other vegan topics. AntiVegan is mostly satirical, making fun of “vegan activists”

Some of these subreddit pairs are directly related: r/TheBluePill is explicitly about r/TheRedPill. Other subreddit pairs are only conceptually connected: r/Bad_Cop_No_Donut is about law enforcement, but it’s not specifically about discussing r/ProtectAndServe. This variety should help illustrate under what conditions we can clearly distinguish in-groups from onlookers.

For each subreddit pair, we downloaded all comments made in each subreddit during the last year in which they were both active. In other words, if one or both subreddits have been banned, we grab the year of comments leading up to the first ban. If both subreddits are still active, we grab the comments from the last 365 days to present.

We discarded comments from bots, and comments from users with an in-subreddit average karma below one. This is to limit the effect of users from an onlooking subreddit “raiding” the in-group subreddit (or vice versa), and therefore muddying our understanding of how each subreddit typically writes.

What’s in the Data

Next, we want to identify the words used far more in the in-group than the onlooking group, or vice versa. There are a variety of ways of measuring changes in word-usage, including rank turbulence divergence (which words have changed the most in terms of their order of occurrence between one dataset and another) and Jensen-Shannon divergence (the difference in word frequency between each subreddit and a combination of the two subreddits).

For example, here’s a plot illustrating which words appear more prominently in r/NoNewNormal or r/CovIdiots, based on the words “rank”, where rank 1 is the most used word, and rank 10,000 is the 10,000th most-used word:

An allotaxonograph comparing r/NoNewNormal and r/CovIdiots

While we know both subreddits feature terms like “vaccine”, “mask”, and “covid”, this plot tells us that terms like “doomer”, “trump”, and “lockdown” are used disproportionately in our in-group, while disparaging terms like “idiot”, “stupid”, and “moron” are far more common in the onlooker group.

We can already see one limitation of this study: the most distinguishing term between our two subreddits is “covidiot”, a term developed on r/CovIdiots. We’re not just capturing some context around the in-group’s use of terminology, we’re identifying keywords specific to this community of onlookers, too.

Building a Classifier

Now that we’ve had a peek at the data, and have confirmed that there are terms that strongly distinguish one community from its onlookers, we want to build a classifier around these distinguishing terms. Specifically, for every user we want to get a big text string consisting of all of their comments, the classifier should take this comment string as input, and return whether the user is in the in-group or the onlooker group.

Since we know whether each user participates mostly in the in-group subreddit, or the onlooking subreddit, we’ll treat that as ground-truth to measure how well our classifier performs.

We built two classifiers: a very simple linear-regression approach that’s easy to reverse-engineer and examine, and a “Longformer” transformer deep-learning model that’s much closer to state-of-the-art, but more challenging to interrogate. This is a common approach that allows us to examine and debug our results using our simple method, while showing the performance we can achieve with modern techniques.

We trained the linear regression model on term frequency-inverse document frequency; basically looking for words common in one subreddit and uncommon in another, just like in the plot above. We configured the Longformer model as a sequence classifier; effectively “given this sequence of words, classify which subreddit they came from, based on a sparse memory of prior comments from each subreddit.”

Results

Here’s our performance on a scale from -1 (labeled every user incorrectly) to 0 (did no better than proportional random guessing) to 1 (labeled every user correctly):

In-Group Onlooker Logistic Regression Performance Longformer Performance
r/NoNewNormal r/CovIdiots 0.41 0.48
r/TheRedPill r/TheBluePill 0.55 0.65
r/BigMouth r/BanBigMouth 0.64 0.80
r/SuperStraight r/SuperStraightPhobic 0.35 0.43
r/ProtectAndServe r/Bad_Cop_No_Donut 0.50 0.55
r/LatterDaySaints r/ExMormon 0.65 0.72
r/vegan r/antivegan 0.49 0.56

Or, visually:

Barplot of above table

Much better than guessing in all cases, and for some subreddits (BigMouth, LatterDaySaints, and TheRedPill) quite well!

If a user has barely commented, or their comments all consist of responses like “lol”, classification will be near-impossible. Therefore, we can re-run our analysis, this time only considering users who have made at least ten comments, with at least one hundred unique words.

In-Group Onlooker Logistic Regression Performance Longformer Performance
r/NoNewNormal r/CovIdiots 0.57 0.60
r/ProtectAndServe r/Bad_Cop_No_Donut 0.65 0.76
r/LatterDaySaints r/ExMormon 0.80 0.83
r/vegan r/antivegan 0.65 0.72

And visually again:

Barplot of above table

For a few subreddit pairs, the onlooking subreddit has too few comments left over after filtering for analysis to be meaningful. For the four pairs that remain, performance improves significantly when we ignore low-engagement users.

Similarly, we can examine what kinds of users the classifier labels correctly most-often:

Plot comparing labeling correctness to subreddit comments, total subreddit karma, and mean subreddit karma

The classifier performs better on users with more comments (and therefore more text to draw from), and more karma in the subreddit (which typically correlates with number of comments unless the user is immensely unpopular), but does not significantly differ with mean subreddit karma. In other words, popular users who receive lots of karma on many of their comments, and therefore might be more representative of the subreddit’s views, are not easier to classify.

Conclusions, Limitations, Next Steps

For a first attempt at solving a new problem, we have some promising results. We can consistently distinguish users from an in-group and users from a specific onlooking group, based on the language of users’ posts. Our study focuses on subreddits, which provide a best-case scenario for classification: comments are neatly partitioned into the in-group and onlooker subreddits. If we studied Twitter users, for example, we’d have no baseline to determine whether our classifier was guessing correctly, or even a good way to feed it training data, without human annotators labeling thousands of Twitter accounts by hand.

It’s also unclear how well this classifier would function in a cross-platform environment. For example, could we train the classifier on a subreddit, and then classify Twitter or Discord users based on their comments? Theoretically, the same community will discuss the same topics on multiple platforms, likely with similar keywords. However, the design of each platform (such as the short character limits on Tweets) may constrain authors enough to make classification harder.

Finally, it’s unclear how well this classification will hold up over time. Would a classifier trained on last year’s comments still perform well on users from this year? Or will the discussion topics of a community have drifted too far for those old word frequencies to be useful? This could be especially important when communities migrate between platforms, when we may for example have old Reddit data and new Discord data.

Lots more to do, but I’m excited about these first steps!


Git and Code Versioning for Scientists

Posted 5/10/2022

I recently gave a talk to the Joint Lab on using git to write code collaboratively with other scientists. I’m going to recycle that talk in blog form, in case anyone else would like a quick crash-course. This post assumes that you’ve used git a few times before, and are trying to get a better handle on it.

My examples use the command line and diagrams to explain the concepts behind git operations. You may find graphical git software, like GitHub Desktop, or the git integration in your text editor, more approachable. That’s perfectly fine; I just find the command line makes git operations more explicit, and so is clearer for teaching.

What is Git?

Git is a version-control system: it tracks changes to files in a folder, typically source code, allowing developers to undo their changes, examine the history of changes, and merge changes that they and others have made.

A group of changes to one or more files is called a commit. Each commit includes a reference to the previous commit, creating a timeline of every change since the start of the project.

The current state of the repository can be described as the sum of the most recent commit, and every ancestor commit that came before it. The current state of the repository is referred to as HEAD.

A git branch is a series of commits. The default branch is typically called main or master. More on branches later.

Remote Servers

Git can technically be used offline: you can create a local repository on your computer, track changes to files in a project, and never share the commits with anyone. However, we most frequently use git as a collaboration tool, typically by creating a repository on a central site like GitHub or GitLab and giving multiple developers access.

When you clone a GitHub repository, your computer assigns a name to the remote server, origin by default. Your computer then downloads the repository contents, and creates two branches: an origin/main branch, representing the commits GitHub is aware of, and a main branch, representing your own work.

Pull

When you pull new changes down from GitHub, git first downloads the commits to the origin/main branch, then fast-forwards your own main branch to match:

Push

Similarly, when you push local changes to GitHub, you’re sending changes from your own main branch to GitHub, so the origin/main branch catches up to your own work:

Conflicts

If you are the only developer on a repository, and you only develop from one computer, then you can push and pull to your heart’s content. However, if someone else has pushed changes to GitHub since you have, then you have a conflict, where GitHub has commits that you don’t, and you have commits that GitHub doesn’t:

In order to sync your local commits with GitHub, you need to resolve the conflict with one of two strategies.

Rebasing

If the commits on GitHub (C in this diagram) edit different files than the local commits (D and E), then the “conflict” is purely bureaucratic. The file changes aren’t incompatible, you just need to re-order the commits to create a coherent timeline. The rebase command will rewrite your local commits so that they come after C:

The full commands to accomplish this are:

git fetch origin main
git rebase origin/main

The first downloads changes from GitHub (origin) to the origin/main branch, but does not attempt to combine them with the local main branch. The second rebases the local commits to occur after the origin/main commits.

Merging

If commit C does change the same files as commits D or E, then the two histories may be incompatible. In order to combine them, we need to add a new commit, F, that incorporates the changes from all three commits, possibly changing files further to make the changes mesh.

Similarly, the commands for this are:

git fetch origin main
git merge origin/main

If you’ve never run these commands before, that’s because git pull is shorthand for the above two commands!

Branches

So far we’ve only considered repositories with a single branch: main. However, using multiple branches is key to successful collaboration. You can create side branches to work on a new feature, or otherwise make large breaking changes to a codebase. Branches let you work in a corner, contain your mess, and only merge those changes back to the main branch when you’re done and ready to share.

Creating Local Branches

Creating a local branch is as easy as running:

git checkout -b new_feature

This creates a new branch called new_feature, and switches HEAD to track the new branch. When you add new commits, they’ll now be added to the new_feature branch instead of main:

Switching Branches

To switch back to the main branch, run git checkout main:

This will un-apply the changes from commits D and E, reverting the codebase to the state of the main branch. If you’ve made changes that you haven’t committed, git will not let you change branches. In other words, switching branches will never destroy your work.

When you’re ready to combine your changes from your side branch to the main branch, simply checkout the main branch, and run git merge new_feature or git rebase new_feature as appropriate.

Creating Remote Branches

By default, newly created branches only exist in your local repository, and are not pushed up to GitHub. This is so that you can do as much internal bookkeeping and branching as you want, without cluttering what other developers see. Make side branches to your heart’s content!

However, if you want to share your branch with other developers, to collaborate on a side branch without merging everything to main, then you need to create a “remote” branch.

If you already have a local branch checked out, you can create a corresponding remote branch (and then push to send those changes to GitHub) with:

git branch --set-upstream-to origin/new_feature
git push

If you are creating a new branch for the first time, and already know you want to share it with others, you can create the branch like:

git checkout -b new_feature origin/new_feature

In either case, your commit tree will now look like:

From now on, you can git push and git pull in the new_feature branch to sync it to GitHub’s origin/new_feature branch.

Common Pitfalls

Editing on the Wrong Branch

If you’ve made some changes to the main branch that you wanted to put on new_feature and you haven’t committed yet, then you can move those changes over by stashing them in a corner:

git stash
git checkout new_feature
git stash pop

Stashing uncommitted changes bypasses git’s rule about not permitting you to switch branches before committing, because your changes will not be overwritten this way.

Committing to the Wrong Branch

If you’ve already committed your changes to main instead of new_feature and you haven’t pushed yet then you can fix the mishap with:

git checkout new_feature
git merge main
git checkout main
git reset --hard HEAD~3

This will move the changes to the new_feature branch (by merging them in from main), then on the main branch, undoes the last three commits, rewriting history.

Since the commits are still referenced by the new_feature branch, no work is lost, but they no longer appear in the main branch’s history.

Reversing a Catastrophic Commit

Usually when you make a mistake, like committing a typo, the solution is to add a new commit that corrects the mistake. There is rarely any need to rewrite history and un-commit something.

One glaring exception is when you commit security-sensitive information. For example, if you accidentally commit your Amazon EC2 API keys to a public GitHub repository, then adding a new commit to erase them is insufficient. Anyone could look in the commit history of the project and fetch the keys back out.

In this rare circumstance, you can roll back back the most recent commits on the local main branch with:

git reset --hard HEAD~3 # Replace '3' with the number of commits to undo

And then push those changes up to GitHub, being very explicit about forcing GitHub to update their origin/main to match your local main, even though you aren’t adding new commits:

git push origin main --force

Note that if anyone else has pulled the mistaken commit down from GitHub, they’ll still be able to view the withdrawn credentials, and they’ll now get an error when they try to pull new changes down. This is because their own git history is now incompatible with the one on GitHub. To fix this, they need to reset their main branch to overwrite their local history with GitHub’s:

git fetch origin main
git reset --hard origin/main

Needless to say, this is extremely messy, and should be avoided whenever possible.

Using .gitignore to avoid mistakes

Git has a special configuration file, .gitignore, that tells it to ignore specific files. For example, you might include a .gitignore file at the top level of the repository containing:

*.png
database_credentials.json
logs/

This will tell git to ignore all PNG files, the database credentials file, and the entire logs folder. You will never be prompted to add or commit those files, so there is no risk of accidentally pushing them to a public repository.

Rename and move files within git!

Git has poor support for moving and renaming files. When you rename a file without telling git, it compares the contents of the “new” file and the “missing” file, and if the files are identical or extremely similar, it assumes the file has simply been moved. However, if you rename a file and then make some changes to it, chances are git will no longer recognize it as the same file, and will prompt you to commit deleting the old file and creating a new file. This is inconvenient, since it destroys the history associated with the file, making it difficult to track changes. Fortunately, the solution is simple - move your files using git:

git mv oldfile newfile

This guarantees that git knows the file is renamed, and maintains history.

Conclusion

Git is a powerful tool for collaborating with others on code. It has an extraordinary breadth of functionality, because it was created by the Linux kernel developers to help them manage what is likely the single most complicated open source project in human history. However, this post covers most of the basics, and as much as I’ve needed for almost any project. I picked up git on my own, as I think many scientists do, and for several years felt like I half-understood what I was doing and stumbled my way through getting git to do what I wanted. Hopefully this can help someone in a similar place!


What is Distributed Denial of Secrets?

Posted 4/10/2022

Distributed Denial of Secrets (DDoSecrets) is a transparency collective. We’ve seen some questions and confusion recently as to what that means, so I’d like to elaborate. See also our about page, our Wikipedia page, and articles here and here.

When whistleblowers within an organization, or hackers that have gained access to files, want to get the word out, DDoSecrets can:

  • Announce the release to a wide audience

  • Check the files for personally identifying information

  • Package a release for public consumption, via torrents or web-search interfaces

  • Contact journalists at a wide range of publications

  • Collaborate with academic researchers

  • Share more sensitive documents with journalists and researchers, without making all data public

That’s our primary role: connectors and proliferators of data.

Our Sources

For most releases, a source contacts us anonymously. They describe what data they have, we have a conversation to establish veracity and the scope of documents, and negotiate a safe way to move the data to us. We typically do not know who our sources are, and do not have contact with them after they submit documents. Occasionally a source will ask to be credited publicly, in which case we verify their identity and attribute the leak to them.

We also mirror releases published by other groups. For example, ransomware groups sometimes publish their victims’ data to criminal forums when their (typically corporate) victims refuse to pay. Other criminals can then peruse the release looking for usernames and passwords, personal information, and anything else they can profit off of. By making a copy of these releases for journalists, we hope that some social good can come out of what would otherwise only benefit other bad actors.

We’ve also published datasets on behalf of other public organizations, notably including the .Win Network archive originally scraped by the Social Media Analysis Toolkit (SMAT) team, and the Patriot Front files, sourced by Unicorn Riot.

Our Publishing Process

When considering a new release, we have four main outcomes: discard the data, publish it publicly in full, publish it only to journalists and researchers, or publish a redacted subset of the documents publicly.

When we publish a release publicly, we typically release the documents via torrent, with private backups to reseed the torrents if needed. When possible, we also release documents through websites like DDoS Search, which can make data easier to browse, explore, and analyze.

When we release data with limited distribution, journalists and researchers contact us (or vice versa), and we negotiate a way to share files with them. If it’s feasible to filter out personal information, then we’ll publish the subset of low-risk documents publicly. However, given our limited staff, and the large size of many of these datasets (millions of emails and documents), we frequently mark datasets for limited distribution to vetted reserchers.

In either case, we add a detailed summary of the release to our website and usually announce publication via our newsletter, our Telegram channel, Twitter, and whatever other platforms are useful.

Working with Analysts

Usually our role ends as a publisher, but when resources permit, we also work with some journalists and academics to analyze datasets. Sometimes this aid is short-term; when our releases include content like database dumps that not all newsrooms are equipped to investigate, we’ve been able to help some news groups run a SQL query to find relevant information and export it in a simpler format like a spreadsheet.

In other cases, our collaborations have been more involved. DDoSecrets has worked on months-long investigations with:

Our role in these collaborations has ranged from coordinating who’s investigating which sections of a release and collating results, to providing infrastructure (wikis for coordination, interactive sites for document search, and virtual machines for analyzing datasets), to writing our own document parsers and aiding in network analysis.

Spreading the Word

Finally, DDoSecrets aggregates investigations into our releases, summarizing and linking to articles on our newsletter and our wiki entries. Some of us interview with journalists, speak on podcasts or on television, and present at conferences about our work, and what our releases and investigations have revealed.

Support Us

If you’re a fan of what we do, please consider donating, volunteering, and helping spread information about our work. We’re a small team with limited funding, and every bit helps!


Predicting Friendships from SMS Metadata

Posted 3/30/2022

We know that metadata is incredibly revealing; given information about who you talk to, we can construct a social graph that shows which social circles you’re in and adjacent to, we can predict your politics, age, and a host of other attributes.

But how does social graph prediction work? How do you filter out noise from “real” social connections? How accurate is it, and in what cases does it make mistakes? This post introduces one approach based on the expectation-maximization algorithm, to start a discussion.

The Setup

We have texting logs from 20 individuals, showing how many times they texted every other participant over the course of a week. We have no additional data, like the timestamps of messages, length, or contents. We also won’t consider the directionality of who sent the texts, just “how many texts were sent between person 1 and 2?” Given these texting logs, we want to find the most probable friendship graph between these 20 people. We will also assume that friendships are always reciprocated, no one-way friends.

We can represent this input as a matrix, where the row and column indicate who is speaking, and the value represents number of texts:

  Person 1 Person 2 Person 3
Person 1 0 6 1  
Person 2 6 0 11  
Person 3 1 11 0  
       

It may be tempting to apply a cutoff here. For example, if person 1 and 2 text more than X times we’ll assume they’re friends. However, this doesn’t easily let us represent uncertainty: If the number of texts is close to X, how do we represent how sure we are that they might be friends? Even for values much greater or lower than X, how do we represent our confidence that we haven’t found two non-friends who text a surprising amount, or two friends who text surprisingly infrequently? Instead, we’ll use a slightly more sophisticated approach that lends itself to probability.

We will assume that friends text one another at an unknown rate, and that non-friends text one another at a lower unknown rate. This is a big assumption, and we’ll revisit it later, but for now take it as a given.

We can represent the two texting rates using Poisson distributions. This allows us to ask “what’s the probability of seeing k events (texts), given an underlying rate at which events occur?” The math for this looks like:

We can use this building block to ask a more useful question: Given that we did see k texts, is it more likely that these texts came from a distribution of friends texting, or a distribution of non-friends texting?

This is equivalent to asking “what is the probability that person A and B are friends, given the number of texts sent between them?” So, all we need to do now is run through every pair of people, and calculate the probability that they’re friends!

There’s just one problem: We have no idea what the two texting rates are. To estimate them, we’ll need to add a new level of complexity.

The Model

When determining our friends texting rate and non-friends texting rate, it would be very helpful if we knew the probability that two people are friends. For example, if 80% of all possible friendships exist, then we know most logs of texts represent the logs of friends texting, and only about the lowest 20% of text counts are likely to represent non-friends.

This sounds like it’s making things worse: now we have a third unknown variable, the likelihood of friendship, which we also don’t know the value of! In reality, it will make the problem much easier to solve.

Let’s make a second huge starting assumption: There is an equal likelihood that any two randomly chosen people in the group will be friends. This is generally not true in social graphs - highly charismatic, popular people usually have far more friends, so the probability of friendship is not at all equal - but it will make the math simpler, and it’s not a terrible assumption with only 20 people in our social graph.

We can represent this probability as follows:

To re-iterate the second line, the probability of any given friendship network F is equal to the probability of each friendship in the network existing, times the probability of each non-friendship not existing. In other words, if our friendship probability is 0.8, then about 80% of all possible friendships should exist, and if we propose a friendship network with only five friendships then the above math will tell us that the scenario is highly unlikely.

It’s important to note that this network model represents our prior assumption about the underlying friendship network, but doesn’t lock us in: given enough evidence (text messages) we will override this prior assumption, and add friendship edges even if they are unlikely under a random network.

Next, let’s approach the original problem backwards: Given a friendship network F, what’s the probability that we’d get the text logs we’ve received?

That is, for each friendship that does exist, get the probability of seeing our text observations from the friends texting distribution, and for each friendship that does not exist, get the probability of our text observations from the non-friends texting distribution. Multiply all those probabilities together, and you have the probability of seeing our full set of logs.

The Optimization

We can combine the above pieces and solve in terms of the most likely values for our friends texting rate, our non-friends texting rate, and our friendship probability. I will not include the details in this post, because it’s about five pages of calculus and partial derivatives. The high level idea is that we can take the probability of a friendship network given observed texts and parameters, and take the partial derivative with respect towards one of those parameters. We multiply this across the distribution of all possible friendship networks, weighted by the probability of each network occurring. We set the entire mess equal to zero, and solve for our parameter of interest. When the derivative of a function is zero it’s at either a local minimum or maximum, and for out-of-scope reasons we know that in this context it yields the global maximum. Ultimately, this gives us the most likely value of a parameter, given the probability that each pair of people are friends:

Where n is our number of participants, in this case 20.

But wait! Didn’t the probability that two people are friends depend on the texting rates? How can we solve for the most likely texting rate, based off of the texting rates? We’ll do it recursively:

  1. Start with arbitrary guesses as to the values of our two texting rates, and the friendship probability

  2. Calculate the probability that each pair of people are friends, based on our three parameters

  3. Calculate the most likely values of our three parameters, given the above friendship probabilities

  4. Loop between steps 2 and 3 until our three parameters converge

One quick python script to parse the log files and run our four math equations, and:

Plot showing convergence of all three variables over time

We’ve got an answer! An average of 10 texts per week among friends, 2 among non-friends, and a 20% chance that any two people will be friends. The parameters converge after only 25 steps or so, making this a quick computational optimization.

The Analysis

With our three parameters we can calculate the likelihood that any two individuals are friends based on observed texts, and plot those likelihoods graphically:

Plot showing probability of each possible friendship

This is our “answer”, but it’s not easy to understand in current form. We’d prefer to render this as a friendship network, where nodes represent people, with an edge between every two people who are friends. How do we translate from this probability matrix to a network? Here, it’s a little more appropriate to apply cutoffs: We can plot all friendships we’re at least 70% confident exist, 90%, and 98%:

Plot showing possible friendship networks based on confidence

Confirmation Test: U.S. Senators

Unconvinced by the “detecting friendships from text messages” example? Let’s apply the exact same code to a similar problem with better defined ground truth: predicting the political party of senators.

We can take data on the voting records for session 1 of the 2021 U.S. Senate. For every pair of senators, we can count the number of times they voted the same way on bills (both voted “yea”, or both voted “nay”). We will assume that senators in the same political party vote the same way at one rate, and senators in different political parties vote together at a lower rate. The signal will be noisy because some senators are absent for some bills, or vote against party lines because of local state politics, occasional flickers of morality, etc.

As in the texting problem, we’ll place an edge between two senators if we believe there is a high chance they are in the same party. We can also anticipate the value of rho: Since the senate is roughly split between Democrats and Republicans, there should be close to a 50% chance that two randomly chosen senators will be in the same party. (This is an even better environment for our “random network” assumption than in the texting friendship network, since all senators will have close to the same degree)

First, the optimization:

Plot showing convergence of all three variables over time

As expected, the probability of any two senators being in the same party is 0.494 - close to 50%. Members of the same party agree on bills roughly 442 times per senate session, while members of opposing parties agree roughly 111 times. And now, the resulting network:

Most likely political alignment for all senators

Nodes are colored according to their declared party (red for Republicans, blue for Democrats, green for independents) to confirm the network clusters we’ve found. I’ve also removed the Vice President (who only votes to break ties), and a senator that held a partial term after their predecessor resigned. Since both the Vice President and the short-term senator voted in far fewer bills than their peers, there were no edges between them and other senators.

The remaining results are accurate! Not counting the people I removed because of insufficient voting history, the algorithm struggled to classify two senators, who are shown in the center. These two senators, Collins (R-ME) and Murkowski (R-AK), are considered some of the most moderate Republicans in the senate, and are swing votes. All other senators are clearly placed in the correct cluster with their peers.

Conclusions

We’ve created a model for detecting social relationships from event data. We assume that events occur at a higher rate between people with a social relationship, and a lower rate between people without a relationship. This is general enough to describe a wide variety of scenarios.

But our model also assumes consistency: what if our assumption about two friend texting and non-friend texting rates didn’t hold? For example, in a larger social graph some people may not text much. They still text friends more often than non-friends, but both rates are much lower than their peers. Our current model would mark these less-active users as friendless, and create no edges to them. We could extend our model by switching from two global texting rates to two individual rates, but then we’d have 2N+1 variables to optimize instead of 3, and will need much more training data to optimize.

We also assumed that the underlying social network was a simple random graph: Every two members have an equal chance of being connected to one another. That assumption is appropriate for our senate network, where we’re trying to ascertain group membership. It works relatively well in the texting network because the population is very small. In many scenarios, however, we expect member degree to follow a power law, where a small number of participants are far more connected than others. We could switch our network model from random to many varieties of exponential or scale-free networks, but this will complicate the math and likely add more parameters to tune.

My main takeaway from this is the need to understand assumptions made by the model, which dictate where it can be meaningfully applied, and where it will produce deeply flawed results.


The Distributed Denial of Service at Distributed Denial of Secrets

Posted 10/9/2021

A few days ago, the Distributed Denial of Secrets website went down under a distributed denial of service attack. We set up Cloudflare and brought the site back up, but since none of us are professional sysadmins and we’re all volunteers with other time commitments, it took us a couple days to work through the steps. I thought it would be fun to walk through what happened for a less-technical audience.

The Attack

A DDoS attack consists of sending a server messages over and over again, requiring all of its resources so that it’s unable to respond to legitimate requests. Often this consists of several computers sending small HTTP requests several times a second, each of which requires that the server evaluate code and respond with a webpage. Small effort from attacker, large effort from webserver, good ratio for the attack. If a webserver is poorly configured or has limited resources, this overwhelms it and the webpage becomes unreachable until the attack is over. DDoS attacks are technically simple, and just require that you get enough participants to jam the server with a digital sit-in.

Since DDoSecrets operates on a shoestring budget (something you can help change here), the web server hosting our wiki was puny, and easily overwhelmed.

The Response

There are a few strategies for defending against DDoS attacks. You can set up intrusion detection software like fail2ban that reads your webserver logs and automatically blocks IP addresses that are sending too many requests - but if there are enough participants in the attack, the server will still get overwhelmed even as it blocks attackers left and right.

The more thorough solution is to set up a content distribution network, or CDN. Without a CDN, web requests typically work like this:

The client computer makes a DNS request for ddosecrets.com, receives back an IP address, then connects to that IP address and sends a web request.

With the addition of a CDN, the process looks more like this:

Instead of one web server, there are now many webservers in the content distribution network. When the client makes a DNS request for ddosecrets.com it receives the IP address of one of the CDN servers. Often CDN servers are spread out geographically, and the client will receive the IP of a CDN server that’s relatively close to them to improve performance. The DNS server may also alternate between returning several different IP addresses to help balance how many clients are using each CDN server.

If the client is behaving normally, then it connects to the CDN server, and sends its HTTP request there. The first time a CDN server receives an HTTP request, it connects to the real web server and forwards the request along. However, the CDN remembers what the webserver response was, and on subsequent requests for the same page, the CDN can respond with the cached answer instead of forwarding the request. This massively cuts down on how many requests the webserver receives, making the website faster even with exponentially more users, and even allows the website to remain available while the webserver is offline, so long as all the webpages can be cached. The CDN usually applies their own rate-limiting to clients, and may look for malicious patterns in requests, further limiting the possibility that malicious traffic makes it to the webserver.

So, we paid a CDN provider, set the servers up, clients should now be directed to the CDN instead of our webserver, and… nothing. The webserver continued to receive a flood of requests and remained stubbornly unresponsive. What happened?

If the attackers know the IP address of a webserver, then they don’t need to perform a DNS lookup - they can connect directly to the webserver’s IP address, and send their request. This bypasses the CDN, and all of its caching and protections. It may not even require any clever skill on the attackers’ part; since the attack began before we set up the CDN, the attackers’ computers may have simply had the IP address cached already and never needed to run a DNS lookup.

The solution to this problem is trivial: No one should be connecting to the webserver except through the CDN, so we can simply block all IP addresses at the firewall level except for IP ranges used by the CDN. Simple, but sometimes overlooked since the CDN will appear to work without this step.

DDoSecrets is back online.


View older posts