Model Collapse

Posted 5/29/2026

I’ve heard many people talk about model collapse recently, especially regarding Large Language Models. This is a phenomenon where training a machine learning model on the outputs of itself or other machine learning models can often degrade performance and magnify errors. In this post I want to talk a little about the phenomenon and its implications for the future of LLMs.

How does Model Collapse Work?

Let’s demonstrate the problem with one of the simplest machine learning models: linear regression. Here we have a “true function” (a cosine wave) which can only be observed noisily: given an X value, we observe the Y value of the cosine wave plus a small error term. We’ll gather thirty samples at random X points and fit a degree-3 polynomial, y = d + ax + bx^2 + cx^3, to the sampled points.

#!/usr/bin/env python3
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Wrap the function in a class that behaves like a sklearn predictor
class TrueFunc:
    def true_fun(self, X):
        return np.cos(1.5 * np.pi * X)
    def predict(self, X):
        X_relevant = X[:, 1] # Column 1 holds X itself; the other columns hold X^0, X^2, X^3
        return self.true_fun(X_relevant)

np.random.seed(0)
n_samples = 30
degree = 3
tf = TrueFunc()

X = np.sort(np.random.rand(n_samples))
X_rotated = X.reshape(-1, 1) # Reshape the length-N 1-D array into an Nx1 column
expanded_features = PolynomialFeatures(degree).fit_transform(X_rotated)
# Sample with noise
y = tf.predict(expanded_features) + np.random.randn(n_samples) * 0.1
# Fit a curve to the samples
reg = LinearRegression().fit(expanded_features, y)

# Plot the resulting curves across [0..1] and the sampled points
X_test = np.linspace(0, 1, 100)
X_test_rotated = PolynomialFeatures(degree).fit_transform(X_test.reshape(-1,1))
plt.figure(figsize=(6,4), dpi=120)
plt.plot(X_test, tf.true_fun(X_test), label="True Function")
plt.plot(X_test, reg.predict(X_test_rotated), label="Model")
plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
plt.legend(loc="best")
plt.show()

So far so good: we’ve fit a curve that approximates the original function, and it does pretty well! But what happens if we noisily sample from our fit curve and train another degree-3 linear regressor on those samples? And then train another model off of that one?

Let’s go ahead and fit five hundred linear regression curves, each to thirty points sampled noisily from its predecessor:

model_count = 500
models = [tf]  # Generation 0 is the true function
orig_X = []
orig_Y = []
for i in range(model_count):
    model = models[-1]  # Sample from the most recent model
    X = np.sort(np.random.rand(n_samples))
    X_rotated = X.reshape(-1, 1) # Reshape the length-N 1-D array into an Nx1 column
    expanded_features = PolynomialFeatures(degree).fit_transform(X_rotated)
    # Sample with noise
    y = model.predict(expanded_features) + np.random.randn(n_samples) * 0.1
    if i == 0:
        # Keep the first generation's samples, drawn from the true function
        orig_X = X
        orig_Y = y
    # Fit the next generation to its predecessor's noisy samples
    reg = LinearRegression().fit(expanded_features, y)
    models.append(reg)

plt.figure(figsize=(6,4), dpi=120)
X_test = np.linspace(0, 1, 100)
X_test_rotated = PolynomialFeatures(degree).fit_transform(X_test.reshape(-1,1))
for i in [0, 100, 200, 500]:
    model = models[i]
    if i == 0:
        plt.plot(X_test, model.true_fun(X_test), label="True Function")
    else:
        plt.plot(X_test, model.predict(X_test_rotated), label="Model %d" % i)
plt.legend(loc="best")
plt.show()

As we can see, the error terms accumulate, so we drift further and further from the original function. At five hundred curves, the shape of the original function is almost entirely lost. You can’t improve linear regression by generating synthetic data points from your fit curve and then re-fitting to those.
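If you'd rather see the drift as a number than eyeball it on a plot, here is a rough sketch that measures it. It is not the code from above: I've swapped scikit-learn for `np.polyfit` to keep it self-contained, and the function name `drift_per_generation` and its parameters are my own inventions for illustration. It re-runs the same chain and records the mean squared error between each generation's curve and the true cosine across [0, 1].

```python
import numpy as np

def true_fun(x):
    return np.cos(1.5 * np.pi * x)

def drift_per_generation(n_generations=500, n_samples=30, degree=3,
                         noise=0.1, seed=0):
    """Return the MSE between each generation's fit and the true function."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0, 1, 100)
    predict = true_fun  # generation 0 is the true function itself
    errors = []
    for _ in range(n_generations):
        # Sample noisily from the previous generation
        x = rng.random(n_samples)
        y = predict(x) + rng.normal(0.0, noise, n_samples)
        # Fit a fresh polynomial to those samples (highest power first)
        coeffs = np.polyfit(x, y, degree)
        predict = lambda x_new, c=coeffs: np.polyval(c, x_new)
        # Compare this generation against the ORIGINAL function
        gap = predict(x_grid) - true_fun(x_grid)
        errors.append(float(np.mean(gap ** 2)))
    return errors

errors = drift_per_generation()
for g in (1, 100, 200, 500):
    print(f"generation {g}: MSE vs. true function = {errors[g - 1]:.4f}")
```

Each new fit equals the previous curve plus a small random perturbation, so the coefficients behave like a random walk. The error won't necessarily grow at every single step, but over hundreds of generations it should trend well above where it started.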

All machine learning models, whether regressors, classifiers, or generative models, suffer from a similar challenge. Their outputs are imperfect: they predict a little too high or too low, they occasionally misclassify a point, they hallucinate¹ some outputs. Training on those mistakes will therefore magnify errors and degrade performance in nearly every domain.

There are some rare exceptions where training on synthetic data is warranted, such as imputing missing values, which I discussed in another machine learning post, or training a small and simple model to mimic the behavior of a larger and more complex model. However, even in these scenarios data scientists know that synthetic data degrades model performance, and they must carefully ensure they accomplish their goals without damaging the model to the point that it loses utility.

How Does this Apply to LLMs?

In the case of large language models our objective is to mimic human writing by ingesting an enormous volume of text and fitting to patterns in word co-occurrence. Therefore, we want to avoid training on text written by other LLMs, as this represents synthetic data points that are likely to amplify bad behaviors in our model.

Unfortunately, as LLMs have proliferated they are responsible for lots of text online, from social media comments to code commits. If you conduct large-scale web scraping you will inevitably include lots of LLM-written text, and there’s no easy way to filter it out to get a “clean” training set.

But why do we need new data? Don’t we have enough text to train LLMs? If we scrape every book in every library and all the Reddit comments, tweets, and skeets from before 2022², is that not enough?

Well, if your goal is simply to build a machine that can read and write English, then that’s fine. Train a model once, now you have an LLM you can run locally. However, we often expect models to answer questions about current events. In order to answer prompts with contemporary data, you have two options:

1. Train the model using recent text, so that recent history and public figures become part of the patterns the model has observed and fit to.
2. Combine the model with a secondary data source. For example, Google Search may find several recent web pages about the topic you search for, feed these to its Gemini LLM, and ask it to summarize the articles to answer your search query.

The second option can help in some circumstances (especially summarizing breaking news), but there’s a limit to how much context you can feed the LLM along with each prompt, so it’s a band-aid rather than a solution to the overall problem.
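To make the context limit concrete, here is a toy sketch of the "secondary data source" approach. Everything in it is hypothetical: `build_prompt`, the prompt wording, and the character budget are stand-ins I made up. Real systems count tokens rather than characters and rank documents before packing them. The point is only that once the budget runs out, the remaining sources never reach the model.

```python
def build_prompt(question, documents, max_chars=2000):
    """Pack retrieved documents into a prompt until the budget is spent.

    Returns the prompt text and the number of documents that fit.
    """
    header = "Answer the question using only the sources below.\n\n"
    footer = f"\nQuestion: {question}\n"
    budget = max_chars - len(header) - len(footer)
    included = []
    for i, doc in enumerate(documents, start=1):
        entry = f"[Source {i}] {doc}\n"
        if len(entry) > budget:
            break  # out of room: this and all later documents are dropped
        included.append(entry)
        budget -= len(entry)
    return header + "".join(included) + footer, len(included)

# Three 900-character "articles", but room for only two of them
docs = ["A" * 900, "B" * 900, "C" * 900]
prompt, used = build_prompt("What happened today?", docs, max_chars=2000)
print(f"{used} of {len(docs)} documents fit in the prompt")
```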

Case Study: Writing Code

For an illustrative example, consider Claude Code or GitHub Copilot, LLM-based tools built to help write software. The underlying models have been trained on an enormous amount of public code: GitHub commits, Stack Overflow questions, and whatever other source code the companies could get their hands on. The models can now write useful code, from aiding in debugging to writing a short function or SQL query, to “vibe coding” entire applications if you’re into that sort of thing. We could debate the exact quality and maintainability of LLM-written code, or the impact it will have on software engineering as an industry and on training junior engineers, but those aren’t the aspects I want to focus on today.

Two years from now there will be several new libraries or frameworks, and existing packages and websites will change their APIs. If we froze an LLM right now with its present training data, then in two years it will necessarily produce ‘stale’ code using outdated libraries and deprecated APIs. Over time, such an LLM becomes less and less useful. Eventually, it will produce code that no longer works because the surrounding environment has diverged too greatly from the training examples the model has seen.
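For a small example of what "stale" looks like in practice, take Python's standard library. For years the common way to get the current UTC time was `datetime.utcnow()`, and there is a great deal of older code that uses it. As of Python 3.12 that call is deprecated in favor of a timezone-aware alternative. The helper name below is my own invention; it just shows the replacement form that a model frozen before the change would have had little reason to prefer.

```python
from datetime import datetime, timezone

def current_utc_timestamp():
    """Return the current time as a timezone-aware UTC datetime."""
    # Older code typically wrote datetime.utcnow(), which returns a naive
    # datetime (no tzinfo) and is deprecated as of Python 3.12.
    return datetime.now(timezone.utc)

stamp = current_utc_timestamp()
print(stamp.isoformat())
```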

So to keep Claude Code relevant, it must be continually trained on new Git commits and new Stack Overflow questions. This teaches the model about new libraries and API changes, but given the proliferation of LLM-written code on GitHub, it also means you are necessarily training the model on the outputs of other models.

What might you pick up by training off of LLM outputs? Well, think about what mistakes LLMs make when writing code. They may replicate bugs that they observed in training commits. They may hallucinate methods or function arguments that don’t exist. They may replicate typos they’ve learned from, including misspelled package names. That last mistake feeds typo-squatting attacks, where malicious actors release malware under names close to those of real packages, hoping victims will install it by accident. When an LLM sees these mistakes in its training data, it is more likely to make the same mistakes in the future, magnifying the likelihood of introducing bad code, bugs, and security vulnerabilities.
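As a rough illustration of the typo-squatting risk, here is a sketch of a check you might run on a package name before installing it. The allowlist and the function name are hypothetical; a real allowlist would come from a lockfile or a vetted internal index. It uses `difflib.get_close_matches` from the standard library to flag names that are suspiciously close to, but not exactly, a package you already trust.

```python
import difflib

# Hypothetical allowlist of packages this project actually depends on
KNOWN_PACKAGES = ["requests", "numpy", "pandas", "flask"]

def likely_intended_package(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return the trusted package that `name` closely resembles, or None.

    None means the name is either an exact match or not close to
    anything on the allowlist.
    """
    if name in known:
        return None
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for candidate in ["requests", "reqeusts", "pandsa", "leftpad"]:
    intended = likely_intended_package(candidate)
    if intended:
        print(f"'{candidate}' looks like a typo of '{intended}'")
    else:
        print(f"'{candidate}': no near-miss on the allowlist")
```

This only catches near-misses of names you already know about. A name that matches nothing, like `leftpad` here, still needs some other form of vetting.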

In short, as LLMs proliferate they will ‘contaminate’ the training pool, and make it harder and harder to build future models. These challenges may not be insurmountable - one can imagine an army of human reviewers sifting out the “good” code from the bad, or an elaborate series of sandboxes and unit tests to try to improve automatic detection of bugs - but they will certainly prove difficult to overcome.
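Here is a very small sketch of the unit-test idea, under loud assumptions: the function name, the candidate snippets, and the tests are all made up for illustration, and a child Python process is not a real sandbox (it can still touch the filesystem and network). It keeps a candidate snippet only if it passes the tests in a separate interpreter within a time limit.

```python
import subprocess
import sys

def passes_tests(candidate_code, test_code, timeout_s=5):
    """Run candidate + tests in a fresh interpreter; True only on a clean exit."""
    program = candidate_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat hangs and infinite loops as failures
    return result.returncode == 0

# Two hypothetical candidates for the same function, one of them buggy
candidates = {
    "good": "def add(a, b):\n    return a + b\n",
    "buggy": "def add(a, b):\n    return a - b\n",
}
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

kept = [name for name, code in candidates.items() if passes_tests(code, tests)]
print(kept)
```

A filter like this is only as good as its tests. Code that passes can still be insecure or unmaintainable, which is part of why I expect the contamination problem to be hard.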

Footnotes

¹ I think “hallucinate” is a misnomer, and it would be more appropriate to say that generative models are always hallucinating and sometimes those hallucinations are incidentally accurate or useful. Generative models lack a view or understanding of the world sufficient to differentiate fact from fiction; they simply generate plausible-looking outputs based on training data. Sometimes those outputs “look right” to humans, and when they “look wrong” we call it a hallucination, but there’s no functional difference to the machine. However, that’s a tangent that’s not helpful here, so we’ll stick with the accepted term.
² I like comparing pre-LLM text to the low-background steel problem: atmospheric nuclear bomb tests contaminated the air, so steel produced afterward contains trace radioactive elements. When building particularly radiation-sensitive equipment, such as Geiger counters or particle detectors, we use metal that was produced before the atomic tests and has since been shielded from fallout. Often this means harvesting metal from pre-1940s shipwrecks!