Academic Funding on Fire

Posted 5/17/2025

I’m now a postdoc involved in applying for research funding, and have more insight into the process than I did as a graduate student. Given horrifying cuts to funding throughout academia, I wanted to share my perspective with non-academic readers.

Where Does Research Funding Come From?

In corporate research and development, funding comes from the business’ revenue or investors and is driven by product development. In academia, things are a little more varied. Academic funding typically comes from government grants, philanthropists, and corporate partnerships.

Government grants come from agencies like the National Science Foundation (NSF), the National Institutes of Health (NIH), the Department of Energy (DoE), the Department of Defense (DoD), and so on. Each agency has a broad mandate, issues calls for projects on topics within that mandate, and awards money to the project proposals it finds promising.

Philanthropists include the Gates Foundation, the Chan-Zuckerberg Initiative, the Sloan Foundation, and other charitable groups. These groups often have much narrower mandates than their federal counterparts - for example, the National Science Foundation funds fundamental research and education that they hope will lay the groundwork for future applied studies, while philanthropic charities are likely to fund more immediately applied work related to eradicating disease or other topics of interest.

Corporate partnerships are often narrower and more directly applied than even philanthropic funding. For example, Google funded some of my research into the longevity of open source software. Many Google products are built upon open source software, and they want to be confident that those building blocks are being actively maintained and aren’t likely to be abandoned due to lack of funding or founder burnout. Clear utility to them, of interest to us, win-win.

While the prominence of funding sources varies between fields, in most disciplines the vast majority of funding comes from government grants.

What do Grants Look Like?

Scientists apply for grants by proposing a project or series of projects under an advertised call for proposals. Along with the description of the research and an argument for why it’s valuable, each grant has a budget including:

  • Salaries for professors, postdocs, graduate students, and other research staff working on the project

  • Equipment (computers, software, fancy microscopes, you name it)

  • Travel and publishing costs (for fieldwork, attending conferences, open-access publication fees, and so on)

Once research staff have their salaries covered by a grant, they may also work on additional projects without dedicated funding. This is entirely normal: it lets students explore their own interests, and it helps secure future grants. Scientists often apply for grants to support projects they started unpaid, using their preliminary results to justify funding and allocate more resources to the work. In this way scientists are always a little out of sync, using existing grant funding to support both the grant’s own work and the groundwork for the next one.

Indirect Costs

Before the scientists apply for a grant, the university amends the budget to add indirect costs. This overhead covers expenses that support all research but don’t fit neatly into an individual project’s budget: building maintenance, human resources, accountants, janitorial staff, electricity, and so on. Indirect cost rates vary between universities, but often exceed 50% of direct costs - meaning that for every dollar spent on the research itself, another fifty cents or more goes to the university rather than the specific project.
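
As a rough illustration - the numbers and the 55% rate below are made up, and real budget rules vary by agency and university - the arithmetic looks something like this:

    direct_costs = 200_000        # salaries, equipment, and travel requested by the scientists
    indirect_rate = 0.55          # a hypothetical negotiated overhead rate
    indirect_costs = direct_costs * indirect_rate   # 110,000 added for the university's operations
    total_award = direct_costs + indirect_costs     # 310,000 requested from the funding agency
    overhead_share = indirect_costs / total_award   # roughly 0.35 of the total award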

Philanthropists often put strict limits on indirect costs, frequently 15%. If your mission is to eradicate Ebola, you don’t want to report back to Bill and Melinda Gates that half your budget went to university administrators and libraries. This puts universities in an awkward position - they need those indirect costs to cover operations, but don’t want to turn down free money from a private foundation. The typical compromise is to use indirect costs from federal grants to cover the university’s overhead needs, then take what they can get from the philanthropists.

It’s all Burning Down

The Trump administration has frozen (then unfrozen, then refrozen, then-) grant awards from the National Institutes of Health and the National Science Foundation. They’ve cut countless existing grants, retracting already-awarded funding mid-project.

Academics are pivoting from federal funding to philanthropy and corporate partnerships, but this presents two problems: first, there isn’t nearly enough private funding to replace federal shortfalls, and second, these grants often have the aforementioned 15% indirect cost cap, and so provide much less money to universities.

More recently, Republicans in Congress proposed capping indirect costs on federal grants at 15%, under the argument that universities clearly get by on low indirect costs from charities, so why can’t they do the same with government grants? This of course misses that overhead from government grants has been subsidizing the low rates charities pay, so capping federal indirect costs would be devastating.

In any case, the academic funding situation is dire. Even if these policies were all reversed tomorrow, there’s a long lead time on starting new studies and hiring scientists, and some research (especially medical studies) can’t simply be paused and restarted as funding fluctuates. Add to that all the scientists leaving the United States to seek more reliable work elsewhere, and American academia is committed to a downward spiral for quite a while.


Dungeon Generation with Binary Space Partitioning

Posted 5/11/2025

It’s a lazy Sunday afternoon, and I’ve been thinking about procedural dungeon generation. This is a challenge where we build algorithms to create maps suitable for video- or tabletop-games, such that the maps are random enough to provide some surprise and variety, and constrained enough to produce playable maps with coherent structure. There are many, many approaches to this problem, and today I’m going to write about one of the simplest.

The Setting

Rather than generating forests, caves, or towns, we’re going to generate a traditional dungeon in the style of Rogue: a series of underground rooms, connected by passageways. Here’s one such example from Nethack rendered as traditional ASCII-art, featuring eight rooms, staircases up and down (< and >), our hero (@), and our trusty cat (f):

          -----         
          |...|         -------                           -----
          |...|        #.......########################  #.....##
          |...|        #|.....|                       #  #|...| #
          |....#####   #|.....-###     ------------   ## #|...| #
          |...|    #####-<....|  ###   |...........   ####|...| #
          ---.-      #  |.....|    ### ...........|    ###----- ###-----------
             #      ##  -------     ## |..........-###   ######   #-.........|
             #      #             #####...........|  #      #      |.........|
             #      #      #     ##    ------------  ##     #      |.........|
             #      #  #######   #                   #####  #      |.........|
           ###      #  #---.---- #                    #-|-.-#      |.........|
         --.------  #  #|>...f.|##                    #.....#      |.........|
         |.......|###  #|...@..| #                     |...|       -----------
         |........######|.......##                     |...|
         |.......|#     ------.-                       |...|
         |........#           #                        -----
         ---------

So, we need a series of non-overlapping rectangles of a variety of sizes, and paths between the rectangles so as to guarantee any room can be reached from any other. We’d also like some locality: rooms should typically link to other nearby rooms, only rarely producing long corridors that stretch across the map. Other details, like giving corridors twists and dead-ends, we’ll leave off for now in the name of simplicity.

The Algorithm

Today we’ll be using Binary Space Partitioning, a general-purpose technique often used in computer graphics, collision-prediction in robotics, and other scenarios where dividing up a physical space is useful. Here’s the approach, described in terms of our problem:

  1. Begin with a rectangle the size of the entire map

  2. Choose whether to cut the rectangle vertically or horizontally

  3. Choose a random cut-position along the x- or y-axis (according to step 2) such that the two child-rectangles are both above a minimum size - if no such cut is possible, stop

  4. Repeat from step 2 on each child rectangle
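
Here’s a minimal sketch of those four steps in Python. It isn’t the code behind the figures below, just an illustration; the function name split, the MIN_SIZE constant, and the coin-flip orientation choice are my own assumptions:

    import random

    MIN_SIZE = 10   # smallest allowed side length for a child rectangle

    def split(x, y, w, h):
        """Recursively cut the rectangle (x, y, w, h); return the resulting leaf rectangles."""
        if random.random() < 0.5:
            # Vertical cut: both children must keep a width of at least MIN_SIZE.
            cuts = range(x + MIN_SIZE, x + w - MIN_SIZE + 1)
            if not cuts:
                return [(x, y, w, h)]                     # no legal cut, so this rectangle is a leaf
            c = random.choice(cuts)
            return split(x, y, c - x, h) + split(c, y, x + w - c, h)
        else:
            # Horizontal cut: both children must keep a height of at least MIN_SIZE.
            cuts = range(y + MIN_SIZE, y + h - MIN_SIZE + 1)
            if not cuts:
                return [(x, y, w, h)]
            c = random.choice(cuts)
            return split(x, y, w, c - y) + split(x, c, w, y + h - c)

    leaves = split(0, 0, 100, 100)   # the 100x100 example described below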

This recursive algorithm breaks one big rectangle into several smaller adjoining rectangles. As an example, here’s a 100x100 rectangle, recursively cut into pieces that are each at least 10x10:

We can visualize this rectangle cutting algorithm using a tree, which traces our recursive function calls:

Converting the Graph to a Dungeon

We have adjoining and non-overlapping rectangles, but these don’t look like rooms yet. To generate rooms we want to shrink our rectangles and add some buffer space between neighbors. For each leaf of the tree, r1 through r12, let’s generate a rectangle with random dimensions between a minimum width and height of 10, and a maximum width and height of its parent rectangle minus two (for the borders). Then we’ll place the new rectangle randomly within the confines of its parent:
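
In code, the shrink-and-place step might look like the following sketch, assuming each leaf is an (x, y, width, height) tuple like the ones produced by the earlier split() sketch; the minimum room size here is arbitrary:

    import random

    MIN_ROOM = 6   # assumed minimum room side length; the post's figures use their own values

    def place_room(leaf):
        """Shrink a leaf rectangle into a randomly sized room placed randomly inside it."""
        x, y, w, h = leaf
        room_w = random.randint(MIN_ROOM, w - 2)          # leave at least a one-tile border
        room_h = random.randint(MIN_ROOM, h - 2)
        room_x = x + random.randint(1, w - room_w - 1)
        room_y = y + random.randint(1, h - room_h - 1)
        return (room_x, room_y, room_w, room_h)

    room = place_room((0, 0, 25, 18))   # one example leaf rectangle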

Okay, we have rooms! How about corridors? Here we’ll make use of our tree structure: once we’ve reached the bottom of the tree and placed a room in each leaf rectangle, we’ll start returning up the recursive call tree. At each parent node we’ll randomly select a room from each child branch and add an edge between them (a code sketch of this pass follows the example below).

Here’s an example process for the dungeon we’ve been illustrating:

  • r3 and r4 are sibling rooms at the bottom of a branch, so at the cut node above them we will add an edge between r3 and r4

  • At the parent cut node we’ll select a random room from the left branch (r3 or r4) and a random room from the right branch (only r2 is available) and add an edge between them - we’ve connected r2 and r3

  • At the parent cut node we’ll select a random room from the left (r2, r3, or r4) and a random room from the right (r1) and add an edge between them - we’ve connected r1 and r3

  • At the parent cut node we’ll select a random room from the left (r1 through r4) and a random room from the right (r5 through r12) and add an edge between them - we’ve connected r1 and r7
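
Here’s a sketch of that connection pass, assuming we kept the BSP tree around as nodes with left and right children and rooms at the leaves; the Node class and the room labels are illustrative, not the post’s actual code:

    import random

    class Node:
        """A BSP tree node: either a leaf holding a room, or a cut with two children."""
        def __init__(self, room=None, left=None, right=None):
            self.room, self.left, self.right = room, left, right

    def connect(node, corridors):
        """Return every room under this node, adding one corridor per cut node to 'corridors'."""
        if node.room is not None:
            return [node.room]
        left_rooms = connect(node.left, corridors)
        right_rooms = connect(node.right, corridors)
        # One corridor between a randomly chosen room from each child branch.
        corridors.append((random.choice(left_rooms), random.choice(right_rooms)))
        return left_rooms + right_rooms

    # A tiny hand-built tree shaped like ((r3, r4), r2) from the example above.
    tree = Node(left=Node(left=Node(room="r3"), right=Node(room="r4")),
                right=Node(room="r2"))
    corridors = []
    connect(tree, corridors)
    print(corridors)   # e.g. [('r3', 'r4'), ('r4', 'r2')]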

The result looks like:

Note that a few paths overlap: r6 and r7 are connected, but the path is obscured by r1-r7 and r6-r10. The path from r1 to r7 intersects both r3 and r5. These collisions are desirable in our application, because they increase the diversity of room configurations - r3 has five doors, despite only being explicitly connected to three neighboring rooms.

Adding some Frills

We have a minimum-viable dungeon! Rooms of variable size, in a fully connected graph, mostly linking nearby rooms to one another. Pretty bare-bones, but not bad for under 70 lines of Python. There’s lots more we could do from here: we could add some turbulence to paths, so that they sometimes zig-zag or move more diagonally between rooms. We could delete a room or two, along with their edges, to introduce dead-ends - though we’d need to delete only rooms with a degree of one to avoid disconnecting parts of the dungeon. We could introduce some shapes besides rectangles - perhaps one room represents a theater or Colosseum and should be elliptical. So far we have only considered the structure of the map, but we can also add constraints to assign a purpose and contents to each room. Some example rules (sketched in code after the list) might include:

  • Rooms of a large size have a chance to be merchant halls or city centers

  • Rooms of moderate size, a skewed width to height ratio, and a maximum of two exits may be barracks

  • Small rooms adjoining a barracks may be an armory

  • Small rooms with only one doorway may be storehouses

  • Moderately sized single-exit rooms may be shops

  • Distant rooms may be goblin dens, and have tunnels rather than corridors
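
A few of these rules could be implemented as a simple classifier over each room’s dimensions and degree (its number of corridors); the thresholds below are made up for illustration:

    def assign_purpose(room, degree):
        """Pick a purpose from a room's size, aspect ratio, and number of corridors."""
        x, y, w, h = room
        area = w * h
        if area > 400:
            return "merchant hall"
        if area < 100 and degree == 1:
            return "storehouse"
        if max(w, h) > 2 * min(w, h) and degree <= 2:
            return "barracks"
        return "plain chamber"

    print(assign_purpose((0, 0, 24, 8), 2))   # a long, skinny, two-exit room: "barracks"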

Coloring rooms by their assigned purpose, swapping out r5 for an elliptical town center, and indicating that the path from r3 to r4 should be a roughly hewn tunnel yields a map with slightly more life in it:

Conclusion

Not much in the way of conclusion! This was fun to play with, and my main code is here if anyone wants to take a look.


What is NP? Why is NP?

Posted 12/11/2024

This post is about theoretical computer science. I’ve written it without getting too far into academic vocabulary, but this is a disclaimer that if thinking about Turing machines sounds dry as sand, this will not be the post for you.

In computer science we are often interested in how difficult a problem is, defined by how the number of steps required to solve it scales as the size of the problem increases. We also use this kind of asymptotic analysis to discuss how well an algorithm scales, something I have written about before. I ended that post by discussing problem complexity classes, and especially three groups of interest:

  • P - the set of decision problems that can be solved in polynomial time or better, meaning they are relatively approachable

  • NP - the set of decision problems that can have their solutions verified in polynomial time or better, but may be much slower to solve

  • NP-Hard - the set of decision problems at least as hard as the hardest NP problems (which we refer to as NP-Complete), meaning this category also includes problems that cannot be verified in polynomial time

However, there is a second way to define NP: the set of all problems that can be solved in polynomial time by a Non-Deterministic Turing Machine. This is in fact where the name comes from, “Nondeterministic, Polynomial time.” In my undergraduate foundations class we glossed over this equivalent definition as a footnote and focused on “if you can verify in polynomial-time it’s in NP.” I never understood how this non-deterministic definition worked or why the two are equivalent, so many years later I’m digging in.

What in the world is a ‘Non-Deterministic Turing Machine?’

First, a quick refresher: A Turing Machine is a simple model of computation. It’s a theoretical machine that has an “input” (idealized as a tape of symbols drawn from a finite alphabet), and a current state from a finite list of states. The machine can do things like move the tape forward and backward so the tape head points at a different position, read symbols, write symbols, and change the current state.

A typical deterministic Turing Machine can only take one particular action given a particular state and input. Since its action is wholly determined by the state and input, it’s deterministic. Therefore, a non-deterministic Turing Machine (NTM hereafter) is one that can take multiple actions given a particular state and input.

So how do we evaluate an NTM if it can take more than one action at any given step? We usually think of an NTM as a branching process, where it executes all possible actions concurrently, perhaps in parallel universes. Then, once one path of the NTM leads to a result, we return that one resulting path and discard the other branches of the evaluation tree. Another way of thinking about this is that the NTM always guesses perfectly which of its available actions to take to yield a result in as few steps as possible.

As an example, imagine a breadth-first search on a square grid where you can move up, down, left, and right. We can represent the first two steps of such a search in an evaluation tree, as follows:

A deterministic Turing Machine evaluates each node in the evaluation tree one by one; that is, it evaluates “left, right, down, up,” then “left left, left right, left down, left up,” and so on. Therefore, the runtime of the breadth-first search scales with the size of the evaluation tree, which grows exponentially. However, a non-deterministic Turing machine evaluates each of its possible paths concurrently (or alternatively, always guesses which step to take correctly). It evaluates the first four moves in one parallel step, then all sixteen second steps in a second parallel step. Therefore, the number of steps a deterministic TM needs scales with the total number of nodes in the tree, while the number of steps an NTM needs scales only with the depth of the evaluation tree.
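
To put rough numbers on that difference, here’s the node-count arithmetic for a toy evaluation tree with four moves per step; nothing below is specific to Turing machines, it’s just the exponential-versus-linear comparison:

    branching, depth = 4, 20                                                # four moves per step, 20 steps deep
    deterministic_steps = sum(branching**i for i in range(1, depth + 1))    # must visit every branch, level by level
    ntm_steps = depth                                                       # one perfectly guessed action per level
    print(deterministic_steps, ntm_steps)                                   # roughly 1.5 trillion versus 20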

Note that when the NTM returns its answer - a path from the start to end point on the graph, as highlighted above - a verifier walks through that single path step by step. The verifier doesn’t need to make any complex decisions or multiple branching actions per input, it just reads one step at a time in the path, confirms that they’re valid steps for the search, and that the start point and end points of the path are correct. Therefore, the verifier can always be a deterministic Turing machine.

So, if the depth of the evaluation tree scales polynomially with the input size then an NTM will be able to solve the problem in polynomial time - and a TM will be able to verify that same answer in polynomial time. That’s why the two definitions of NP are equivalent.

Why frame complexity classes this way?

Okay, so that’s why we can describe NP problems in terms of a rather impractical non-deterministic Turing Machine, but what’s the advantage of doing so? Remember that there are two ways of evaluating a non-deterministic Turing Machine: we can think of each possible action for an input executing “in parallel” and then discarding the false paths once a solution is found, or we can think about the Turing Machine always correctly “guessing” the action that will lead to an answer in the shortest number of steps. Using this second definition we can frame anything beyond NP as “even if you guessed the right action to take at every step and bee-lined towards a solution, the runtime would still increase exponentially with the input size.”

Now P is the set of problems we can solve in polynomial time with no guesswork, NP consists of problems we can solve in polynomial time with perfect guesswork at each step, and anything beyond NP can’t be solved in polynomial time even with perfect guesswork.


Open Academic Publication

Posted 10/28/2023

I’m currently at a workshop on open practices across disciplines, and one topic of discussion is how to change the academic publishing process to be more accessible to both authors and readers. I’ve also had a few friends outside of academia ask me how publishing research papers works, so it’s a good opportunity to write a post about the messy world of academic publishing.

The Traditional Publication Model

Academics conduct research, write an article about their findings, and submit their article to an appropriate journal for their subject. There it undergoes review by a committee of peer researchers qualified to assess the quality of the work, and upon acceptance, the article is included in the next issue of the journal. In a simple scenario, the process is illustrated by this flowchart:

Libraries and research labs typically pay journals a subscription fee to receive new issues. This fee traditionally covered publication expenses, including typesetting (tedious for papers full of equations, plots, and diagrams), printing, and mail distribution, along with the salaries of journal staff like editors, who are responsible for soliciting volunteer peer-reviews from other academics. These subscription fees were long considered a necessary evil: they limit access to published research for academics with little funding, such as scientists at universities in developing countries, to say nothing of the general public, but printing and distributing all those journal issues carried significant financial overhead.

In recent decades, all significant journals have switched to majority or exclusively digital distribution. Academics do most of the typesetting themselves with LaTeX or Microsoft Word templates provided by the journals, and with no printing and negligible distribution costs for hosting a PDF online, subscription fees now go largely to the profit margins of publishers. This has made academic publishing ludicrously profitable, with margins as high as 40% in a multi-billion dollar industry.

The Shift to Open Publishing

Academics complain bitterly that journal publishers are parasitic, charging exorbitant publication fees while providing almost no service. After all, research is conducted by academics and submitted to the publishers for free. Other academics review the research, also for free, as peer-review is considered expected community service within academia. Since academics are typically funded by government agencies (such as the National Science Foundation, Department of Energy, and Department of Defense in the United States), this is taxpayer-funded public research, whose distribution is being limited by publishers rather than facilitated by them.

As journal subscription costs grew, these complaints eventually evolved into threats by universities to cancel their journal subscriptions, and funding agencies like the NSF began to demand that work they fund be made publicly accessible. The publisher profit margins were endangered, and they needed to act quickly to suppress dissent.

Many publishers now offer or require an alternative publishing scheme: Open Access. Under Open Access, articles can be read for free, but academics must pay to have their work published in order to cover staff salaries and the burdensome cost of web-hosting PDFs. This not only protects the revenue stream of publishers, but can expand it dramatically when journals like Nature Neuroscience charge $11,690 per article.


While Open Access allows academics with fewer resources to read scholarly work from their peers, and allows the public to read academic papers, it also inhibits academics with less funding from publishing if they can’t afford the publication fees. Further, it provides an incentive for publishers to accept as many papers for publication as possible to maximize publication fees, even if these papers are of lower quality or do not pass rigorous peer-review. When journals are paid under a subscription model they make the same income whether a new issue has ten or a hundred articles in it, and so it is more profitable to be selective in order to maximize the ‘prestige’ of the journal and increase subscriptions.

What Can Be Done?

Academic research remains constrained by publishers, who charge a fortune either to publish or to read, while providing minimal utility to academia. These costs disproportionately impact researchers with less funding, often those outside North America and Europe. The most obvious solution to this problem might be “replace the journals with lower-cost alternatives,” but this is easier said than done. Even if we could find staff to organize and run a series of lower-cost journals, there’s a lot of political momentum behind the established publishers. Academics obtain grant funding, job offers, and tenure through publishing. Successful publishing means publishing many papers in prestigious journals and getting many citations on those papers. A new unproven journal won’t replace a big name like Nature or Science any time soon in the eyes of funding agencies and tenure committees, and will take time to gather a loyal readership before papers in it receive many reads or citations. While I hope for eventual reform of journals and academic institutional practices at large, a more immediate solution is needed.

Collective Bargaining

One option is to simply pressure existing journals into dropping fees. If enough universities threaten to cut their subscriptions to major journals, then publishers will have no choice but to lower subscription costs or Open Access publication fees and accept a lower profit margin. This strategy has seen some limited success - some universities are cutting their contracts with major publishers, perhaps most notably when the University of California system ended their subscription to all Elsevier journals in 2019. However, this strategy can only work if researchers have leverage. Elsevier is the worst offender, and so universities can cut ties and push their researchers to publish in competitor journals from Springer or SAGE, but the costs at those competitor publishers remain high.

Preprints

Physicists popularized the idea of a “preprint.” Originally this consisted of astrophysicists emailing rough drafts of their papers to one another. This had less to do with publication fees and more to do with quickly sharing breakthroughs without the delays that peer-review and publication incur. Over time, the practice shifted from mailing lists to centralized repositories, and grew to encompass physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. That preprint service is called arXiv. This effort has been replicated in other fields, including bioRxiv, ChemRxiv, medRxiv, and SocArXiv, although preprint usage is not common in all fields.


Papers submitted to preprint servers have not undergone peer-review, and often have little to no quality control - the moderators at arXiv will give a paper a quick glance to remove obvious spam submissions, but they have neither the resources nor the responsibility to confirm that research they host is of high quality or was conducted ethically. Preprint papers were always intended to be rough drafts before publication in real journals, not a substitute for publication. Nevertheless, it is common practice for scholars to bypass journal paywalls by looking for a preprint of the same research posted before peer-review, so in practice preprint servers already serve as an alternative to journal subscriptions.

Shadow Libraries

The direct action counter to journal subscription fees is to simply pirate the articles. Sci-Hub and Library Genesis (URLs subject to frequent change) acquire research papers and books, respectively, and host them as PDFs for free, ignoring copyright. Both shadow libraries have been sued for copyright infringement in several jurisdictions, but have rotated operations between countries and have so far avoided law enforcement.


Use of Sci-Hub is ubiquitous in STEM-academia, and is often the only way that researchers can access articles if they have limited funding or operate out of sanctioned locations, such as Russia during the Russia-Ukraine war. Sci-Hub’s founder, Alexandra Elbakyan, considers the site’s operations to be a moral imperative under the Universal Declaration of Human Rights, which guarantees all human beings the right to freely share in scientific advancements and their benefits. Whether or not you agree with Elbakyan’s stance, it seems clear that a combination of shadow libraries and preprint services have undermined the business models of traditional academic publishers and made them more amenable to alternatives like Open Access, and more susceptible to threats by universities to end subscriptions.

What Comes Next?

Academic publishing is approaching a crisis point. Research funding in most disciplines is scarce, and journal subscription and publication fees are steadily increasing. The number of graduate and postgraduate researchers is growing, guaranteeing an accelerating rate of papers that further strains both publication budgets and the peer-review system. Academics have tolerated the current system by using preprints and shadow libraries to share work without paying journals, but these are stopgaps with a range of shortcomings. If academic research is to flourish, something must change to lower publication costs and perhaps relieve the strain on peer reviewers, but what that change will look like, and how soon it will come, remains open to debate.


When is a network “decentralized enough?”

Posted 08/08/2023

I’ve submitted a new paper! Here’s the not-peer-reviewed pre-print. This post will discuss my work for non-network-scientist audiences.

There is broad disillusionment regarding the influence major tech companies have over our online interactions. Social media is largely governed by Meta (Facebook, Instagram, WhatsApp), Google (YouTube), and Twitter. In specific sub-communities, like open source software development, a single company like GitHub (owned by Microsoft) may have near-monopolistic control over online human collaboration. These companies define both the technology we use to communicate - and thereby the actions we can take to interact with one another - and the administrative policies regarding what actions and content are permissible on each platform.

In addition to debates over civic responsibility and regulation of online platforms, pushback to the centralized influence of these companies has taken two practical forms:

  1. Alt-Tech. Communities that are excluded from mainstream platforms, often right-wing hate and conspiracy groups, have built an ecosystem of alternative platforms that mirrors their mainstream counterparts, but with administrations more supportive of their political objectives. These include Voat and the .Win-network (now defunct Reddit-clones), BitChute and Parler (YouTube-clones), Gab (Twitter-clone), and many others.

  2. The Decentralized Web. Developers concerned about centralized control of content have built a number of decentralized platforms that aim to limit the control a single entity can have over human communication. These efforts include Mastodon, a Twitter alternative consisting of federated Twitter-like subcommunities, and ad-hoc communities like a loose network of self-hosted git servers. The decentralized web also encompasses much older decentralized networks like Usenet and email, and bears similarity to the motivations behind some Web3 technologies.

It is this second category, of ostensibly self-governed online communities, that interests me. Building a community-run platform is a laudable goal, but does the implementation of Mastodon and similar platforms fall short of those aspirations? How do we measure how ‘decentralized’ a platform is, or inversely, how much influence an oligarchy has over a platform?

The Community-Size Argument

One common approach to measuring social influence is to examine population size. The largest three Mastodon instances host over half of the entire Mastodon population. Therefore, the administrators of those three instances have disproportionate influence over permitted speech and users on Mastodon as a whole. Users who disagree with their decisions are free to make their own Mastodon instances, but if the operators of the big three instances refuse to connect to yours then half the Mastodon population will never see your posts.

A size disparity in community population is inevitable without intervention. Social networks follow rich-get-richer dynamics: new users are likely to join an existing vibrant community rather than a fledgling one, increasing its population and making it more appealing to future users. This is fundamentally a social pressure, but it is even further amplified by search engines, which are more likely to return links to larger and more active sites, funneling potential users towards the largest communities.

But is size disparity necessarily a failure of decentralization? Proponents of Mastodon have emphasized the importance of small communities that fit the needs of their members, and the Mastodon developers have stated that most Mastodon instances are small, topic-specific communities, with their mastodon.social as a notable exception. If smaller communities operate comfortably under the shadow of larger ones, perhaps this is a healthy example of decentralized governance.

Before exploring alternative methods for measuring social centralization, let’s compare a few of these decentralized and alt-tech platforms using the lens of community size. Below is a plot of sub-community population sizes for five platforms.

The y-axis represents the population of each community as a fraction of the largest community’s size. In other words, the largest community on each platform has a size of “1”, while a community with a tenth as many users has a size of “0.1”. The x-axis is what fraction of communities have at least that large a population. This allows us to quickly show that about 2% of Mastodon instances are at least 1% the size of the largest instance, or alternatively, 98% of Mastodon instances have fewer than 1% as many users as the largest instance.
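
Concretely, the coordinates of one platform’s curve can be computed from its list of community sizes like so (the sizes here are invented for illustration, not data from the paper):

    sizes = sorted([12000, 800, 450, 90, 60, 25, 10], reverse=True)   # made-up community populations
    largest = sizes[0]
    # Each point: (fraction of communities at least this big, size relative to the largest).
    points = [((rank + 1) / len(sizes), size / largest) for rank, size in enumerate(sizes)]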

This puts Mastodon in similar territory as two centralized platforms, BitChute and Voat. Specifically, the number of commenters on BitChute channels follows a similar distribution to Mastodon instance sizes, while the distribution of populations across Voat “subverses” (analogous to “subreddits”) is even more skewed.

By contrast, the number of users on self-hosted Git servers (the Penumbra of Open-Source), and unique authors on Polish Usenet newsgroups, is far more equitable: around a third of git servers have at least 1% as many users as the largest, while the majority of newsgroups are within 1% of the largest.

Inter-Community Influence

If smaller communities exist largely independently of larger ones, then the actions of administrators on those large communities do not matter to the small communities, and even in the face of a large population disparity a platform can be effectively decentralized. How can we measure this notion of “independence” in a platform-agnostic way such that we can compare across platforms?

Each of the five platforms examined above has some notion of cross-community activity. On Mastodon, users can follow other users on both their own instance and external instances. On the other four platforms, users can directly participate in multiple communities, by contributing to open source projects on multiple servers (Penumbra), or commenting on multiple channels (BitChute), subverses (Voat), or newsgroups (Usenet).

In network science terminology, we can create a bipartite graph, or a graph with two types of vertices: one for communities, and one for users. Edges between users and communities indicate that a user interacts with that community. For example, here’s a diagram of Mastodon relationships, where an edge of ‘3’ indicates that a user follows three accounts on a particular instance:

This allows us to simulate the disruption caused by removing an instance: if mastodon.social went offline tomorrow, how many follow relationships from users on kolektiva.social and scholar.social would be disrupted? More globally, what percentage of all follow relationships by remaining users have just been pruned? If the disruption percentage is high, then lots of information flowed from the larger community to the smaller communities. Conversely, if the disruption percentage is low, then users of the smaller communities are largely unaffected.
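
Here’s a toy version of that simulation on a small hand-written bipartite graph; the structure (users, communities, weighted edges) follows the description above, but the code is a simplified stand-in for the paper’s actual measurement:

    # graph[user][community] = number of relationships that user has with that community.
    graph = {
        "alice": {"mastodon.social": 3, "scholar.social": 1},
        "bob":   {"mastodon.social": 5},
        "carol": {"scholar.social": 2, "kolektiva.social": 4},
    }

    def disruption(graph, removed):
        """Fraction of the remaining users' relationships lost when 'removed' communities go offline."""
        lost = kept = 0
        for user, edges in graph.items():
            if set(edges) <= removed:
                continue          # every community this user touched is gone; they leave the graph too
            for community, weight in edges.items():
                if community in removed:
                    lost += weight
                else:
                    kept += weight
        return lost / (lost + kept)

    print(disruption(graph, {"mastodon.social"}))   # 0.3: of the 10 relationships held by remaining users, 3 are pruned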

Here is just such a plot, simulating removing the largest community from each platform, then the two largest, three largest, etcetera:

From this perspective on inter-community relationships, each platform looks a little different. Removing the largest three Mastodon instances has a severe effect on the remaining population, but removing further communities has a rapidly diminished effect. Removing Usenet newsgroups and BitChute channels has a similar pattern, but less pronounced.

Voat and the Penumbra require additional explanation. Voat, like Reddit, allowed users to subscribe to “subverses” to see posts from those communities on the front page of the site. New users were subscribed to a set of 27 subverses by default. While the two largest subverses by population (QRV and 8chan) were topic-specific and non-default, the third largest subverse, news, was a default subverse with broad appeal and high overlap with all other communities. Therefore, removing the largest two communities would have had little impact on users uninvolved in QAnon discussions, but removing news would impact almost every user on the site and cut nearly 10% of interactions site-wide.

The Penumbra consists of independently operated git servers, only implicitly affiliated in that some developers contributed to projects hosted on multiple servers. Since servers are largely insular, most developers only contribute to projects on one, and so those developers are removed entirely along with the git server. If a user contributed to projects hosted on two servers then disruption will increase when the first server is removed, but will decrease when the second server is removed, and the developer along with it. This is shown as spiky oscillations, where one popular git server is removed and drives up disruption, before another overlapping git server is removed and severs the other side of those collaborations.

Sometimes you may be uninterested in the impact of removing the largest 2, 3, or 10 instances, and want a simple summary statistic for whether one platform is “more centralized” than another. One way to approximate this is to calculate the area under the curve for each of the above curves:

This scores Mastodon as the most centralized, because removing its largest instances has such a large effect on its peers. By contrast, while the Voat curve is visually striking, it’s such a sharp increase because removing the largest two communities doesn’t have a large impact on the population.
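
For the curious, the area itself is just a trapezoidal sum over the disruption values from the removal sweep; here’s a sketch, though the normalization is my own assumption rather than necessarily the paper’s:

    def area_under_curve(disruptions):
        """Trapezoidal area under a disruption curve sampled at 1, 2, 3, ... communities removed,
        normalized so platforms with different numbers of communities are comparable."""
        area = sum((a + b) / 2 for a, b in zip(disruptions, disruptions[1:]))
        return area / (len(disruptions) - 1)

    print(area_under_curve([0.30, 0.45, 0.48, 0.50]))   # a toy Mastodon-like curve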

Situating Within Network Science

“Centralization” is an ill-defined term, and network scientists have a range of ways of measuring centralization for different scenarios. These metrics fall into three broad categories:

Scale   | Description                                                     | Examples
Vertex  | Measures how central a role a single node plays in the network | Betweenness centrality, Eigenvector centrality
Cluster | Measures aspects of a particular group of vertices             | Assortativity / Homophily, Modularity, Insularity / Border index
Graph   | A summary attribute of an entire graph                         | Diameter, Density, Cheeger number

These metrics can capture aspects of centrality like “this vertex is an important bridge connecting two regions of a graph” or “this vertex is an important hub because many shortest paths between vertices pass through it.” They can measure how tight a bottleneck a graph contains (or, phrased another way, how well a graph can be partitioned in two), they can measure how much more likely similar vertices are to connect with one another, or how skewed the degree distribution of a graph is.

However, these metrics are mostly intended for fully connected unipartite graphs, and do not always have clear parallels in disconnected or bipartite graphs. Consider the following examples:

Many would intuitively agree that the left-most graph is central: one community in the center is larger than the rest, and serves as a bridge connecting several other communities together. By contrast, the middle graph is decentralized, because while the communities aren’t all the same size, none are dramatically larger than one another, and none serve a critical structural role as a hub or bridge.

The graph on the right is harder to describe. One community is much larger than its peers, but the remaining graph is identical to the decentralized example. By degree distribution, the graph would appear to be centralized. If we add a single edge connecting the giant community to any user in the main graph, then the giant community’s betweenness centrality score would skyrocket because of its prominent role in so many shortest-paths between users. However, it would still be inappropriate to say that the largest community plays a pivotal role in the activity of the users in the rest of the graph - it’s hardly connected at all!

My disruption metric is a cluster-level or mesoscale measurement for bipartite graphs that measures the influence of each community on its peers, although you can calculate the area under the disruption curve to make a graph-scale summary statistic. Using this approach, the centralized community is decidedly centralized, and the decentralized and ambiguous graphs are decidedly not.

Takeaways

Community size disparity is natural. Some communities will have broader appeal, and benefit more from rich-get-richer effects than their smaller, more focused peers. Therefore, even a thriving decentralized platform may have a highly skewed population distribution. To measure the influence of oligarchies on a platform, we need a more nuanced view of interconnection and information flow between communities.

I have introduced a ‘disruption’ metric that accounts for both the size of a community and its structural role in the rest of the graph, measuring its potential influence on its peers. While the disruption metric illustrates how population distributions can be deceptive, it is only a preliminary measurement. Follows across communities and co-participation in communities are a rough proxy for information flow, or a network of potential information flow. A more precise metric for observed information flow might measure the number of messages that are boosted (“retweeted”) from one Mastodon instance to another, or might measure how frequently a new discussion topic, term, or URL appears first in one community, and later appears in a “downstream” community.

Does population size correlate with these measurements of information flow and influence? Are some smaller communities more influential than their size would suggest? How much does the graph structure of potential information flow predict ‘social decentralization’ in practice? There are many more questions to explore in this domain - but this is a start!

