Here's something that surprises most engineers when they first dig into LLM serving: your GPU runs at 90-95% utilization while processing the input prompt, then drops to 20-40% while generating output. Same hardware. Same model. Same request. Two completely different utilization profiles — because those two phases have fundamentally different bottlenecks.
This isn't a bug or a configuration problem. It's a direct consequence of how transformer inference works. Understanding it is the difference between guessing at LLM performance and having an actual mental model you can reason from.
Two phases, two bottlenecks
Every LLM call has two distinct stages.
Prefill is where your input prompt gets processed. Every token in the prompt is read in parallel. The attention mechanism computes relationships between all token pairs simultaneously, so the work grows quadratically with sequence length. For a 16K-token prompt, the attention computation touches roughly 256 million token pairs. This is pure matrix-matrix multiplication (GEMM), the operation GPU tensor cores were designed for. The GPU is fully saturated with arithmetic. Prefill is compute-bound.
Decode is where output gets generated. The model produces one token at a time, autoregressively. Token N+1 can't start until token N is complete — no parallelism across output tokens. Each step does relatively little arithmetic (one token's worth), but it requires loading the entire model's weights from GPU high-bandwidth memory (HBM) into the compute units for every single token. On a 70B parameter model at FP16, that's ~140 GB of data moved per output token. The GPU sits mostly idle waiting for memory transfers. Decode is memory-bandwidth-bound, not compute-bound.
This is why GPU utilization falls off a cliff the moment generation starts. The hardware isn't broken — it's starved for data.
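One way to make the compute-bound vs. memory-bound split concrete is a back-of-the-envelope roofline check. The sketch below is an approximation under stated assumptions, not a profiler: it assumes about 2 FLOPs per parameter per token, that weights are read from HBM once per forward pass, and it ignores attention and KV cache traffic entirely. The H100 figures (about 989 TFLOP/s dense BF16, 3.35 TB/s HBM) are assumed spec-sheet values, and the function names are made up for illustration.

```python
# Assumed H100 SXM spec-sheet figures (illustrative, not measured).
PEAK_FLOPS = 989e12          # dense BF16 FLOP/s
HBM_BYTES_PER_SEC = 3.35e12  # HBM bandwidth in bytes/s

# FLOPs per byte at which compute and memory bandwidth limit equally.
RIDGE_POINT = PEAK_FLOPS / HBM_BYTES_PER_SEC


def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (multiply + add) per parameter per token and that
    each weight is loaded from HBM once per pass.
    """
    return 2 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass):
    """Classify a forward pass as compute- or memory-bandwidth-bound."""
    if arithmetic_intensity(tokens_per_pass) >= RIDGE_POINT:
        return "compute"
    return "memory-bandwidth"


print(f"ridge point: {RIDGE_POINT:.0f} FLOPs/byte")
print("decode, 1 token per pass:", bottleneck(1))
print("prefill, 16,000 tokens per pass:", bottleneck(16_000))
```

Under these assumptions a single decode stream sits far below the ridge point, which is the "starved for data" picture in numbers.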

Your $30,000 H100, running at 30% utilization during decode, waiting on the memory bus like a senior engineer waiting on a PR review.
What the numbers actually look like
Take an H100 SXM GPU: 3.35 TB/s of memory bandwidth. A Llama 3.1 70B model at BF16 is 140 GB of weights. Each decode step requires loading those weights. So the theoretical ceiling is:
3.35 TB/s ÷ 140 GB ≈ 24 tokens/sec per decode stream
That's the hard ceiling imposed by memory bandwidth, before you've even thought about batching or efficiency overhead. An A100, with 2 TB/s bandwidth, tops out around 14 tokens/sec on the same model. The H200's primary advantage over the H100 is its memory system: 141 GB of HBM3e at 4.8 TB/s, which delivers roughly 1.9x the H100's decode throughput on Llama 70B. Not faster tensor cores, but more memory capacity and bandwidth.
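The ceiling arithmetic is simple enough to script. This is a minimal sketch that divides bandwidth by weight size and nothing more; it ignores KV cache reads, kernel overhead, and the fact that 140 GB of weights would be sharded across several GPUs in practice. The H200 bandwidth of 4.8 TB/s is an assumed spec-sheet value, and `decode_ceiling_tokens_per_sec` is a hypothetical helper name.

```python
def decode_ceiling_tokens_per_sec(bandwidth_tb_s, weights_gb):
    """Upper bound on single-stream decode speed.

    Every output token requires streaming all weights from HBM once,
    so the ceiling is bandwidth divided by model size.
    """
    return bandwidth_tb_s * 1000 / weights_gb


WEIGHTS_GB = 140  # 70B parameters at 2 bytes each (BF16)

for name, bandwidth in [("A100", 2.0), ("H100", 3.35), ("H200", 4.8)]:
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, WEIGHTS_GB)
    print(f"{name}: ~{ceiling:.0f} tokens/sec per decode stream")
```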
For prefill, the GPU is compute-bound instead. Attention cost is quadratic, so doubling prompt length roughly quadruples the attention share of prefill time (the linear layers only double), and at long context attention dominates. This is why a 32K-token prompt pushes Time-to-First-Token (TTFT) into 1-3 seconds even on top hardware, while a 1K-token prompt might prefill in under 100ms.
KV cache: what it is and why it controls everything
During prefill, attention computes Key and Value vectors for every input token. Without caching, every decode step would recompute those vectors for all prior tokens, a catastrophic waste. The KV cache stores these vectors so they don't get recomputed. It's what makes decode tractable at all.
The memory cost of the KV cache is real and significant. For Grouped-Query Attention (used in Llama 3.1, Mistral, and most modern open models):
KV Cache Bytes = 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes_per_element
For Llama 3.1 70B in BF16, that's roughly 0.31 MB per token per sequence. At 128K context length, a single conversation holds about 40 GB of KV cache — and the model weights themselves consume 140 GB. Put those two numbers together and you immediately understand why fitting long-context workloads on GPU is so constrained.
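Plugging numbers into the formula makes the scale easier to see. The sketch below assumes the published Llama 3.1 70B shape of 80 layers, 8 KV heads, and a head dimension of 128; those three values are not stated above, so treat them as assumptions. Sizes are printed in binary units (MiB, GiB), which is how the 0.31 and 40 figures work out.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 covers Keys and Values."""
    return (2 * layers * kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Assumed Llama 3.1 70B shape (GQA), BF16 = 2 bytes per element.
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **LLAMA_70B)
full_context = kv_cache_bytes(seq_len=128 * 1024, **LLAMA_70B)

print(f"per token: {per_token / 2**20:.4f} MiB")
print(f"128K context, one sequence: {full_context / 2**30:.0f} GiB")
```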

You, re-reading the KV cache memory formula for the third time and wondering why nobody warned you about this.
The KV cache directly limits batch size. More concurrent sequences means more KV cache in VRAM. When VRAM fills up, you have to reduce batch size or start evicting cache entries, which forces recomputation. This tension between KV cache size and throughput is one of the central engineering problems in LLM serving.
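To see how quickly the KV cache eats into batch size, here is a rough capacity estimate. The 640 GB VRAM budget (think 8 GPUs at 80 GB each) is a hypothetical figure, the per-token cost assumes the Llama 3.1 70B shape of 80 layers, 8 KV heads, and head dimension 128 in BF16, and the estimate ignores activations and allocator overhead.

```python
# Assumed per-token KV cost: 2 (K and V) x 80 layers x 8 KV heads
# x 128 head_dim x 2 bytes (BF16).
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def max_concurrent_sequences(vram_bytes, weight_bytes, seq_len):
    """How many sequences of seq_len tokens fit after weights are loaded.

    Ignores activations, fragmentation, and framework overhead, so real
    numbers will be lower.
    """
    free_bytes = vram_bytes - weight_bytes
    return free_bytes // (KV_BYTES_PER_TOKEN * seq_len)


VRAM = 640e9     # hypothetical 8 x 80 GB node
WEIGHTS = 140e9  # 70B parameters in BF16

for seq_len in (8 * 1024, 32 * 1024, 128 * 1024):
    fits = int(max_concurrent_sequences(VRAM, WEIGHTS, seq_len))
    print(f"{seq_len:>7} tokens per sequence: {fits} sequences fit")
```

The same hardware that holds well over a hundred 8K-token conversations holds only about a dozen at 128K, which is the batch-size squeeze described above.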
Speculative decoding: buying multiple tokens for the cost of one
The decode phase is where speculative decoding comes in, and the mechanism is clever.
A small draft model (say, a 7B parameter version) generates a batch of K candidate tokens cheaply — because 7B is fast and cheap on memory bandwidth. The large target model (70B) then verifies all K tokens in a single parallel forward pass. This verification looks like a prefill step: all K tokens processed simultaneously, compute-bound, GPU fully utilized.
If the draft tokens match what the target model would have predicted, they're all accepted and output at once. If a draft token diverges, everything after it is discarded, and the target model's corrected token is used instead. You keep every correctly-predicted draft token, so even partial acceptance wins.
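The accept-or-discard rule can be sketched in a few lines. This version assumes greedy decoding, where "matches" means exact token equality; sampling-based variants use a probabilistic acceptance test instead. It also assumes the target's verification pass yields one prediction per draft position plus one extra token after the last position, a common detail not spelled out above. `verify_draft` and both argument names are hypothetical.

```python
def verify_draft(draft_tokens, target_tokens):
    """Return the tokens to emit after one verification pass.

    draft_tokens:  K tokens proposed by the draft model.
    target_tokens: K + 1 tokens from the target model, where entry i is
                   the target's greedy choice given the prefix plus
                   draft_tokens[:i].
    """
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First divergence: keep the target's token, drop the rest.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)
    # Every draft token matched, so the target's extra token is free.
    emitted.append(target_tokens[len(draft_tokens)])
    return emitted


print(verify_draft([5, 7, 9], [5, 7, 2, 4]))  # diverges at the third token
print(verify_draft([5, 7, 9], [5, 7, 9, 4]))  # full acceptance
```

Even in the worst case, where the very first draft token is wrong, the pass still emits one correct token, so the output is never worse than plain decoding in token terms.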
The economics: you pay roughly the same wall-clock time as generating 1-2 tokens from the target model alone, because the verification pass loads the weights once no matter how many draft tokens it checks, but you get 5-8 tokens in the common case when the draft model is accurate. That's 2-3x throughput improvement in practice, and in benchmark conditions NVIDIA has demonstrated 3.6x on H100s.
The catch: acceptance rate depends on how well the draft model's distribution matches the target model's. On common prose and code, acceptance rates are high. On novel prompts, uncommon languages, or highly constrained outputs, the draft diverges more often and the speedup shrinks. Most major inference providers — Anthropic for Claude, OpenAI, Together AI — now ship speculative decoding by default.
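How much the speedup shrinks with acceptance rate can be estimated with a standard idealization: assume every draft token is accepted independently with the same probability. Real acceptance is not independent across positions, so this is a rough guide only. `acceptance_rate` and `draft_len` are illustrative parameter names, and the count includes the one token the target model always contributes.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model verification pass.

    Geometric-series result under the independence assumption:
    (1 - a^(K+1)) / (1 - a) for acceptance rate a and draft length K.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.7, 0.8, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=5)
    print(f"acceptance {rate:.0%}: ~{tokens:.2f} tokens per target pass")
```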
vLLM and PagedAttention: making batching actually work
Traditional inference servers pre-allocate a contiguous chunk of VRAM per request equal to the maximum possible context length. Most requests use a fraction of that. The result: 60-80% of reserved KV cache memory is empty but unavailable — reserved, not used. At high concurrency, you run out of VRAM long before the compute is saturated.
PagedAttention (the core innovation in vLLM) borrows the operating system's virtual memory design. KV cache is split into fixed-size "pages." A block table maps each sequence's logical KV addresses to non-contiguous physical GPU memory pages, allocated on demand as tokens are actually generated. Near-zero fragmentation. Allocate exactly what you need, when you need it.
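The block-table idea is easier to see in miniature. The toy allocator below is a sketch of the concept only, not vLLM's actual implementation or API: `PagedKVAllocator` and its methods are hypothetical names, and it tracks page ids without storing any real Key or Value tensors.

```python
class PagedKVAllocator:
    """Toy page allocator mapping sequences to physical KV pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (page, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.page_size == 0:
            # Last page is full (or none exists yet): grab a new one.
            if not self.free_pages:
                raise MemoryError("out of KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=2)
for seq in ["a", "b", "a", "a"]:  # two sequences growing interleaved
    alloc.append_token(seq)
print(alloc.block_tables)  # sequence "a" ends up on non-adjacent pages
```

Memory is claimed one page at a time as tokens arrive, so the most a sequence can waste is the unused tail of its last page.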
The downstream effect is a dramatic increase in the number of concurrent sequences that fit in the same VRAM — which directly translates to higher throughput.
But PagedAttention alone isn't the whole story. Scheduling matters too, and static batching wastes much of the capacity that paging frees up: the server assembles a batch of N requests, runs them all to completion, then assembles the next batch. A single long sequence in the batch stalls every other request in it. The GPU sits idle waiting for the slow outlier.
Continuous batching (iteration-level scheduling) fixes this. After every single decode step — every token — the scheduler checks whether any sequence finished. Done sequences are evicted immediately and replaced with waiting requests from the queue. No idle time between sequences. The GPU stays full.
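A tiny simulation shows where the idle time goes. This sketch counts decode steps only and assumes every step costs the same regardless of how many slots are occupied, which is a simplification. The request lengths are made-up numbers, and both function names are hypothetical.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Finished sequences are replaced from the queue after every step."""
    queue = list(output_lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        running = [left - 1 for left in running if left > 1]
    return steps


# Two long requests mixed in with short ones, four batch slots.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In this toy case the static scheduler pays for each long request in sequence, while the continuous scheduler overlaps them and backfills the short ones.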
How much throughput improves when you switch from static batching to continuous batching: Anyscale reported 23x in their benchmarks. Production workloads at realistic concurrency levels typically see 3-10x improvement. Those numbers are why virtually every serious inference stack (vLLM, TensorRT-LLM, SGLang) ships continuous batching by default.

23x throughput improvement from one scheduling change. Perfectly normal. No big deal. Just casually transforming your infrastructure economics.
FlashAttention: why long context is feasible at all
Standard attention materializes a full N×N matrix of attention scores in HBM. At 8K tokens, that's 64 million elements written to and read from memory — multiple times, in multiple passes. Attention at long context is dominated by this IO overhead, not by arithmetic.
FlashAttention tiles the Q, K, V matrices into blocks small enough to fit in on-chip SRAM, which is roughly an order of magnitude faster than HBM (the FlashAttention paper estimates ~19 TB/s for SRAM vs. 1.5-2.0 TB/s for HBM on an A100; an H100's HBM reaches ~3.35 TB/s). It computes softmax in a numerically stable streaming fashion within SRAM, fusing multiple operations, and only writes the final output back to HBM. Fewer HBM round-trips mean faster wall-clock time.
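The "streaming softmax" is the part that makes tiling possible, and it fits in a short pure-Python sketch. This illustrates only the running-max and rescaling trick for a single query vector; it is not the real fused GPU kernel, it has no notion of SRAM, and `block_size` here just stands in for a tile of keys. A naive version is included to check that both give the same answer.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then average."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def streaming_attention(q, keys, values, block_size=2):
    """Process keys block by block, never holding all scores at once."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_scores = _scores(q, keys[start:start + block_size])
        new_max = max(running_max, max(block_scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(block_scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The rescaling step is what lets each block be processed and forgotten, so the full score matrix never has to exist in slow memory.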
The benchmark numbers from the original FlashAttention paper, which were measured on training workloads: 3x speedup on GPT-2 at 1K sequence length; 15% end-to-end speedup on BERT at 512 tokens; 2.4x on tasks with sequences in the 1K-4K range. The gains grow with sequence length because IO costs scale quadratically and FlashAttention cuts those costs disproportionately.
Without FlashAttention, 128K-token context windows would be prohibitively slow in practice. It's the reason long-context models exist outside of research.
Multi-GPU: tensor parallelism vs. pipeline parallelism
When a model doesn't fit on a single GPU, you need to split it. The choice of how to split determines latency vs. throughput tradeoffs.
Tensor parallelism shards each weight matrix horizontally across GPUs. Every GPU works on every layer simultaneously, and results are all-reduced (synchronized) at each layer boundary. A single request gets the combined compute of all GPUs — low latency. But it requires high-bandwidth GPU interconnect (NVLink). On PCIe, the synchronization overhead consumes 40-50% of inference time at TP=4. NVLink is effectively mandatory for tensor parallelism beyond 2 GPUs.
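The shard-then-all-reduce pattern can be mimicked on plain Python lists. This is a minimal sketch of one sharding layout, in which each simulated device holds a slice of the weight rows and the matching slice of the input; real implementations combine several layouts and run the sum as a collective over NVLink. The function names are hypothetical, and the input length is assumed to divide evenly by the device count.

```python
def matvec(x, w):
    """y[j] = sum over i of x[i] * w[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, w))
            for j in range(len(w[0]))]


def tensor_parallel_matvec(x, w, num_devices):
    """Split the weight rows across devices, then all-reduce by summing."""
    shard = len(x) // num_devices  # assumes an even split
    partials = []
    for device in range(num_devices):
        rows = slice(device * shard, (device + 1) * shard)
        # Each device sees only its slice of the input and the weights.
        partials.append(matvec(x[rows], w[rows]))
    # All-reduce: every device ends up with the elementwise sum.
    return [sum(column) for column in zip(*partials)]


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, 3]]
print(matvec(x, w))
print(tensor_parallel_matvec(x, w, num_devices=2))
```

The final sum is the step that crosses the interconnect at every layer, which is why link bandwidth matters so much here.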
Pipeline parallelism splits the model layer-by-layer. GPU 0 runs layers 1-20, GPU 1 runs layers 21-40, and so on. At any given moment, only one GPU is active per request. At PP=4, each GPU sits idle 75% of the time for a single request. You compensate with micro-batching — filling the pipeline with multiple requests — which improves throughput at the cost of latency.
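The 75% figure and the effect of micro-batching both fall out of one expression. The sketch below uses the usual idealized pipeline model, in which every stage takes equal time per micro-batch and communication is free; real pipelines are messier. `pipeline_idle_fraction` is a hypothetical helper name.

```python
def pipeline_idle_fraction(num_stages, num_microbatches):
    """Fraction of time each stage spends idle.

    With P stages and M micro-batches, the whole run takes M + P - 1
    time slots and each stage is busy for M of them.
    """
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


for microbatches in (1, 4, 16, 64):
    idle = pipeline_idle_fraction(4, microbatches)
    print(f"PP=4, {microbatches:>2} micro-batches: {idle:.0%} idle")
```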
Production clusters typically combine both. Tensor parallelism within a node (where NVLink is available), pipeline parallelism across nodes (where only Ethernet or InfiniBand connects them). Llama 3.1 405B effectively requires TP=8 or higher for acceptable per-request latency.
TTFT vs. TPOT: which one actually matters for your use case
Time-to-First-Token (TTFT) is controlled by prefill speed plus queue wait plus network RTT. Decode speed doesn't affect it.
Time-per-Output-Token (TPOT), also called inter-token latency, is controlled by decode speed: to a first approximation, model size divided by memory bandwidth.
These metrics pull in opposite directions when you're optimizing. To minimize TTFT you want shorter prompts and more compute dedicated to prefill. To maximize decode throughput you want large batches and high memory bandwidth. Optimizing one often hurts the other.
What your use case actually needs:
- Chat and interactive applications: TTFT below 500ms for "feels responsive." A TPOT of 33-50ms (20-30 tokens/sec) is the comfortable reading-pace range.
- Code autocomplete (Copilot-style): TTFT must be under 100ms or it breaks the flow. TPOT matters less because outputs are short.
- Voice synthesis: Needs 150+ tokens/sec to keep up with audio. TPOT dominates completely. Aggressive quantization (INT4) is usually mandatory.
- Batch/async jobs (document summarization, eval runs, data pipelines): TTFT is irrelevant. Maximize tokens-per-second-per-dollar with large static batches.
- Long-form generation (blog posts, reports at 800+ tokens): Total latency = TTFT + (output_tokens × TPOT). At 800 tokens and 20 tok/sec, generation takes 40 seconds regardless of TTFT. TPOT dominates.
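The total-latency formula from the last bullet is worth having as a helper when comparing workloads. A minimal sketch follows: the 0.5-second TTFT and the short-reply example are assumed values for illustration, and the function name is hypothetical.

```python
def total_latency_s(ttft_s, output_tokens, tokens_per_sec):
    """Total latency = TTFT + output_tokens x TPOT."""
    tpot_s = 1.0 / tokens_per_sec  # seconds per output token
    return ttft_s + output_tokens * tpot_s


# Long-form generation: decode time dwarfs TTFT.
print(f"{total_latency_s(0.5, 800, 20):.1f} s for 800 tokens")
# Short reply: TTFT is a large share of what the user waits for.
print(f"{total_latency_s(0.5, 20, 20):.1f} s for 20 tokens")
```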
The implication for infrastructure: a single serving configuration won't be optimal for all workloads. Some teams run separate inference clusters: one optimized for low TTFT (fewer concurrent sequences, dedicated prefill compute), one optimized for throughput (large batches, cheaper memory-bandwidth-heavy GPUs). Taking the same idea down to the level of a single request gives you prefill-decode disaggregation, where the prefill and decode phases run on separate GPU pools and the KV cache is handed from one pool to the other. Research results from DistServe (OSDI 2024) show up to 2x goodput improvement from running prefill and decode on separate, purpose-built hardware pools.
The model that generalizes
The thing worth taking away from all of this is a single mental model: every constraint in LLM inference traces back to two physical facts. Prefill is compute-bound. Decode is memory-bandwidth-bound. Every optimization technique — speculative decoding, continuous batching, PagedAttention, FlashAttention, tensor parallelism, quantization — is an attempt to work around one or both of those facts.
When you see an odd latency curve in production, when a GPU upgrade doesn't deliver the throughput you expected, when a cost estimate comes in way off: the answer is almost always hiding in one of those two phases. Knowing which one to look at cuts the debugging surface in half.
