Why Data Annotation Is Critical for Modern LLMs

Llama 3.1 was trained on 15 trillion tokens of text. That sounds incomprehensible until you put it next to the Epoch AI estimate: about 300 trillion tokens of high-quality, deduplicated human text exist on the public internet. In a single training run, Meta consumed roughly 5% of it.

That number is the starting point for understanding why the frontier of AI development has shifted so dramatically in the last two years.

The wall is real and the math is unambiguous

Pablo Villalobos and colleagues at Epoch AI published their updated projections in 2024. At current scaling rates, frontier labs will exhaust the high-quality public internet text stock between 2026 and 2032. The median projection is 2028. Under aggressive overtraining scenarios, where models are trained on far more tokens than is compute-optimal for their size, exhaustion could arrive as early as 2025.

The projections got a real-world stress test in 2025. SemiAnalysis reported that OpenAI hadn't completed a full-scale pretraining run since GPT-4o in May 2024. Whatever the cause, the signal is clear: the era of "scrape more internet, train bigger model" is functionally over at the frontier.

The "just scrape more internet" strategy, hitting its natural conclusion around 2028.

Synthetic data won't save you

The obvious answer is synthetic data. If human-generated text is running low, generate it with the models you already have and train the next model on that. It sounds clean. The problem is that researchers studied this directly, and the results landed in Nature.

Ilia Shumailov and colleagues described what they called "model collapse" in a Nature paper published in July 2024. When you train a model on outputs from another model, two things happen in sequence. First, the tails of the data distribution disappear: rare but important patterns get smoothed out, replaced by the statistical mean. Then, over successive generations of model-trains-on-model-output, the distribution collapses entirely. Outputs become repetitive and generic, bearing little resemblance to the original human-generated source. The mathematical mechanism is unforgiving: statistical approximation error compounds with each generation, and there's no known way to halt the collapse when the loop is closed.
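To get a feel for the compounding error, here is a toy sketch, not the paper's experiment. It assumes the "model" is just a Gaussian fitted to a small sample, and each generation trains only on samples from the previous fit. The function name, sample size, and generation count are illustrative choices.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting
    with the original distribution (mean 0, std 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stands in for human-written data
    history = [std]
    for _ in range(generations):
        # "Train" the next model on a finite sample of the previous model's output.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate of the spread
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: fitted std = {history[gen]:.6f}")
```

Each refit loses a little of the spread on average, and nothing in the closed loop ever puts it back. The tails go first, then the whole distribution narrows toward a point.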

Epoch AI's conservative position is to exclude synthetic data entirely from their projections. Their read: synthetic data has only reliably improved capabilities in narrow verifiable domains. Math problems with known answers. Code that can be unit-tested. It doesn't generalize. For anything where correctness can't be confirmed by an external oracle, synthetic training data degrades rather than improves model quality.

A model trained on synthetic data from a model trained on synthetic data from a model trained on synthetic data. Generation 4. It still thinks it's helpful.

Post-training is now the main arena

Frontier models are still improving, though. How? The answer is the post-training stack.

Post-training (fine-tuning a pretrained model for instruction-following, alignment, and specific capabilities) has become the competitively important layer. Nathan Lambert's 2025 analysis noted that labs with smaller pretraining budgets shipped models that competed credibly with frontier models because they invested heavily in post-training. For OpenAI's o1-class reasoning models, post-training compute accounts for 40% or more of total model compute. That percentage used to be a rounding error.

The core post-training method is Reinforcement Learning from Human Feedback (RLHF). The pipeline has three stages. First, Supervised Fine-Tuning (SFT): curate a dataset of prompt-and-ideal-response pairs written or verified by domain experts, then fine-tune the base model on them. Second, Reward Model training: collect pairwise human preference judgments (response A vs. response B, which is better?) and train a model to predict those preferences. Third, RL optimization: use the reward model's signal to push the policy toward outputs humans prefer.
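The reward-model stage is the easiest of the three to show in a few lines. The sketch below assumes the common Bradley-Terry formulation, where the loss on one preference pair is the negative log-sigmoid of the reward gap. That is an assumption about how such models are typically trained, not a description of any particular lab's pipeline, and the function names are made up for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood for one (chosen, rejected) pair.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    ordering wrong.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


if __name__ == "__main__":
    print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with the annotator
    print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees
```

Every training example here is one human judgment. If the annotator picked the wrong response, the loss pushes the reward model toward the wrong ordering just as confidently.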

A newer variant, Direct Preference Optimization (DPO), collapses stages two and three. Instead of training an explicit reward model and running PPO (Proximal Policy Optimization, the usual RL algorithm for stage three) on top, DPO optimizes a binary cross-entropy loss directly on (chosen, rejected) response pairs. No separate reward model. No value network. No sampling loop during fine-tuning. It trades some raw performance on the hardest tasks for dramatically simpler training infrastructure. Most academic fine-tuning uses DPO now. Most production frontier models still use PPO.
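For a single pair, the DPO loss can be sketched as below. It assumes you already have summed log-probabilities of each response under the policy being trained and under a frozen reference model. The argument names and the beta default of 0.1 are illustrative, not taken from any specific library.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a response under either
    the policy or the frozen reference model. beta controls how strongly
    the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # Binary cross-entropy with target "chosen wins": -log(sigmoid(logits)),
    # written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has moved probability toward the chosen response.
    print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The only inputs are log-probabilities and the human label of which response was chosen, which is why no reward model or sampling loop is needed.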

Why the bottleneck shifted to human expertise

Here's where the data wall and the expert annotation bottleneck connect. SFT and reward modeling are only as good as the humans doing the annotation. For early RLHF tasks ("Is this translation accurate?" "Is this response harmful?"), crowdworkers were sufficient. For graduate-level math, advanced code reasoning, medical diagnosis, or scientific claim verification, you need humans who can actually judge whether a model output is correct.

The cost evidence is stark. A dataset of 600 high-quality RLHF preference pairs can cost $60,000, about 167 times the compute cost of the training run itself. The binding constraint in post-training is no longer GPU hours. It's human hours from people qualified to evaluate the task.

$60,000 for 600 preference pairs. Or roughly $100 per "I prefer response A over response B" from someone with a PhD. Totally reasonable. Carry on.
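The arithmetic behind those figures is quick to check. The snippet uses only the three numbers quoted above; the implied compute cost is derived from them and is not an independently reported figure.

```python
dataset_cost_usd = 60_000      # quoted cost of the preference dataset
num_pairs = 600                # quoted number of preference pairs
cost_to_compute_ratio = 167    # quoted ratio of annotation cost to compute cost

cost_per_pair = dataset_cost_usd / num_pairs
implied_compute_cost = dataset_cost_usd / cost_to_compute_ratio

print(f"cost per preference pair: ${cost_per_pair:.2f}")
print(f"implied compute cost of the run: ${implied_compute_cost:.2f}")
```

In other words, the quoted ratio implies a training run that costs a few hundred dollars of compute sitting on top of a five-figure annotation bill.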

Platform rate cards make this concrete. Medical fellows: $250-$450 per hour. PhD STEM specialists: $50-$150 per hour. Generalist crowdworkers: $20-$50 per hour. Platforms like Outlier AI source math and code specialists at the higher end precisely because annotation value scales with annotator expertise. A $50/hour math expert identifying an incorrect proof step produces a training signal that a $20/hour generalist simply can't generate.

The proof case: PRM800K

The clearest demonstration of expert annotation's impact is OpenAI's PRM800K dataset. The researchers' problem: standard outcome supervision for math reasoning, where the reward model only sees whether the final answer is correct, was leaving performance on the table. Complex multi-step problems have long reasoning chains, and a wrong step early in the chain isn't distinguishable from a right one if you only evaluate the end.

OpenAI collected 800,000 step-level human labels. Annotators went through each intermediate reasoning step in GPT-4's math solutions and marked individual steps correct or incorrect. The resulting process-supervised reward model, used to pick the best of many sampled solutions, solved 78% of problems from a representative subset of the MATH test set, up from 72% under outcome supervision. Six percentage points on a benchmark where individual points are hard to move.
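The difference between the two kinds of supervision shows up when you rank candidate solutions. The sketch below assumes a process reward model that outputs a correctness probability per step and scores a whole solution as the product of those probabilities. The candidate solutions and their numbers are invented purely for illustration.

```python
import math


def process_score(step_probs):
    """Probability that every step is correct: product of per-step probabilities."""
    return math.prod(step_probs)


def best_of_n(candidates, scorer):
    """Return the candidate with the highest score under the given scorer."""
    return max(candidates, key=scorer)


# Hypothetical candidates. "A" looks fine from the outside but has one
# shaky step in the middle; "B" is solid all the way through.
candidates = [
    {"name": "A", "outcome_score": 0.90, "step_probs": [0.95, 0.20, 0.97]},
    {"name": "B", "outcome_score": 0.85, "step_probs": [0.96, 0.94, 0.95]},
]

outcome_pick = best_of_n(candidates, lambda c: c["outcome_score"])
process_pick = best_of_n(candidates, lambda c: process_score(c["step_probs"]))

print("outcome supervision picks:", outcome_pick["name"])
print("process supervision picks:", process_pick["name"])
```

One low step probability sinks the whole product, which is the point: a single invalid step is enough to distrust the solution, and only a step-level label can teach the model to notice it.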

This is not a task you can hand to a crowdworker. It requires a human who can evaluate whether a specific algebraic manipulation is valid, whether a proof gap exists, whether a case was missed. That's the work that PhD-level annotators do, and there's no substitute.

Reward model calibration: the quiet engineering problem

Reward models are trained on human preferences, but they're vulnerable to a known failure mode: reward hacking, where the policy model discovers outputs that score well on the reward model without being genuinely better. A 2024 paper on reward calibration found that uncalibrated reward models can be strongly biased by output length: longer responses score higher regardless of actual quality, because length reads as effort to the model the same way it reads as effort to a distracted human.
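A simple first check for length bias is to correlate reward scores with response length and compare that against how expert ratings correlate with length. The sketch below is a diagnostic under that assumption, not the method from the calibration paper, and the data is a made-up toy sample.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data, invented for illustration: five responses to similar prompts.
response_lengths = [50, 120, 200, 340, 500]   # tokens
reward_scores = [0.10, 0.30, 0.45, 0.70, 0.90]  # reward model output
expert_ratings = [3, 4, 2, 4, 3]              # 1-5 quality rating from a domain expert

length_vs_reward = pearson(response_lengths, reward_scores)
length_vs_expert = pearson(response_lengths, expert_ratings)

print(f"length vs reward model score: {length_vs_reward:.2f}")
print(f"length vs expert rating:      {length_vs_expert:.2f}")
```

If the first correlation is high and the second is near zero, the reward model is rewarding length rather than quality. Note that the check only works if someone qualified produced the expert ratings.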

Bayesian approaches to reward modeling, proposed at ICLR 2025, address this by signaling higher uncertainty for outputs far from the training distribution, which limits over-optimization. But calibration is ultimately a human verification problem. You need annotators who can confirm that the reward model's preference rankings match what correct human judgment would produce, which means they need to have correct human judgment in the first place.
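The idea of discounting rewards where the model is unsure can be sketched with a small ensemble. To be clear, this swaps in ensemble disagreement as a stand-in for Bayesian posterior uncertainty. It is not the ICLR 2025 method, and the function name, penalty weight, and scores are all hypothetical.

```python
import statistics


def conservative_reward(ensemble_scores, penalty=1.0):
    """Mean reward across ensemble members, minus a penalty for disagreement.

    Members tend to agree on outputs that resemble their training data and
    disagree on unfamiliar ones, so the spread acts as a rough uncertainty signal.
    """
    mean = statistics.fmean(ensemble_scores)
    spread = statistics.pstdev(ensemble_scores)
    return mean - penalty * spread


# Invented scores from four reward models for two candidate outputs.
familiar_output = [0.80, 0.82, 0.79, 0.81]   # members agree
unfamiliar_output = [1.60, 0.20, 1.10, 0.50]  # members disagree sharply

print("raw mean, familiar:        ", statistics.fmean(familiar_output))
print("raw mean, unfamiliar:      ", statistics.fmean(unfamiliar_output))
print("conservative, familiar:    ", conservative_reward(familiar_output))
print("conservative, unfamiliar:  ", conservative_reward(unfamiliar_output))
```

On raw mean reward the unfamiliar output wins, which is exactly the kind of output a policy learns to exploit. With the penalty applied, the ordering flips.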

Multi-turn expert interactions

Standard RLHF rates a single response in isolation. Real deployment failures are often conversational failures: context drift across turns, instruction forgetting, coherence loss over long exchanges. Capturing those failure modes requires annotators who can sustain a technically demanding back-and-forth for multiple exchanges and notice where the model went wrong. That's what "multi-turn expert interactions" means in annotation practice.
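One way to picture the difference in annotation work is as a data structure. The sketch below is a hypothetical record format for per-turn labels, not any platform's actual schema. All class names, field names, and failure-type strings are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnAnnotation:
    turn_index: int
    model_response: str
    acceptable: bool
    failure_type: Optional[str] = None  # e.g. "context_drift", "instruction_forgetting"
    note: str = ""


@dataclass
class ConversationAnnotation:
    conversation_id: str
    turns: List[TurnAnnotation] = field(default_factory=list)

    def first_failure(self) -> Optional[TurnAnnotation]:
        """Return the earliest turn the annotator marked as unacceptable, if any."""
        for turn in self.turns:
            if not turn.acceptable:
                return turn
        return None


if __name__ == "__main__":
    convo = ConversationAnnotation(
        conversation_id="example-001",
        turns=[
            TurnAnnotation(0, "Sets up the proof correctly.", True),
            TurnAnnotation(1, "Extends the argument.", True),
            TurnAnnotation(2, "Drops a constraint stated in turn 0.", False,
                           failure_type="instruction_forgetting",
                           note="Ignores the requirement given at the start."),
        ],
    )
    failure = convo.first_failure()
    print(failure.turn_index, failure.failure_type)
```

A single-response rating has no place to record that turn 2 broke something established in turn 0. Spotting that requires an annotator who followed the technical thread the whole way.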

A 2024 paper on imitation learning for multi-turn LLM agents noted that expert trajectories across full conversations, not just single-turn demonstrations, are the essential training signal for agentic tasks. The annotators who can produce those trajectories are the ones who understand the technical domain well enough to maintain coherent, substantive conversations about it across many turns.

The practical implication

The data wall isn't an abstract problem anymore. It's a structural shift in where the competitive work of AI development lives. Pretraining is becoming a commodity; post-training is where the quality differences between frontier models are actually determined.

That makes domain-expert annotation the critical path. Not because there's a shortage of GPUs, but because there's a shortage of people who can evaluate a multi-step mathematical proof, identify a subtle logical error in generated code, or judge whether a medical reasoning chain missed a diagnosis.

If you're an engineer or domain specialist wondering whether annotation work is worth taking seriously: the answer is that it's the highest-leverage point in the current AI training pipeline. The labs that are pulling ahead aren't doing it with more data. They're doing it with better feedback from people who actually know what correct looks like.

That's a narrower group than it sounds. Which is exactly why it matters.