I've been building with LLMs for a while now, and here's the thing nobody warns you about. Most of what you learned in classic system design (request/response, stateless workers, throw more replicas at the problem) only gets you about halfway. The other half is its own world, with its own physics. Tokens behave differently from JSON. Inference behaves differently from a CRUD call. Latency and cost move in ways that surprise people who are otherwise excellent engineers.
So this is the post I wish I'd had a year ago. No selling, no course pitch. Just the mental model I keep coming back to whenever I'm designing one of these systems. If you're a backend or platform engineer trying to make sense of what's actually happening inside an LLM service, this one is for you.

Every backend engineer, upon discovering that "throw more replicas at it" doesn't work here either.
The unit of work is a token, not a request
The first shift is small but it changes everything downstream.
In a normal web service, you think in requests. One request comes in, one response goes out, and the cost is roughly fixed. With LLMs, the request is just a wrapper. The real work is measured in tokens, the small chunks of text the model actually reads and writes. A "token" sits somewhere between a character and a word. The English sentence you just read is roughly a dozen tokens, though the exact count depends on the tokenizer.
Why does this matter? Because two requests can look identical from the outside and cost wildly different amounts on the inside. A prompt with 500 tokens of input and 50 tokens of output is a cheap, fast thing. A prompt with 50,000 tokens of input and 4,000 tokens of output is a different animal. Different latency, different memory footprint, different price tag. Capacity planning, rate limiting, billing, SLOs: all of it has to be expressed in tokens, not requests, or the numbers stop meaning anything.
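To put numbers on that, here's a minimal sketch that prices the two requests above purely by token count. The prices are made up for illustration (real ones vary by provider and model), and `TokenPrice` and `request_cost` are names I invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical prices in USD per million tokens (not any real provider's)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, price: TokenPrice) -> float:
    """Cost of one request in USD, driven entirely by token counts."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


# Assumed prices: output tokens cost more than input tokens.
PRICE = TokenPrice(input_per_million=1.0, output_per_million=4.0)

small = request_cost(500, 50, PRICE)        # the cheap, fast request
large = request_cost(50_000, 4_000, PRICE)  # the "different animal"

print(f"small: ${small:.4f}, large: ${large:.4f}, ratio: {large / small:.0f}x")
```

Both calls are "one request" to a classic rate limiter. Under these assumed prices, one costs roughly two orders of magnitude more than the other.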
Prefill and decode are two different jobs
Here's the part that surprised me most when I first dug into it. A single LLM call is not one workload. It's two workloads stitched together, with very different characteristics.
Prefill is what happens to your prompt. The model reads every input token in parallel, builds up its internal state, and gets itself ready to generate. This phase is compute-heavy and GPU-friendly. It loves big batches. If you've ever wondered why a long prompt feels expensive upfront, this is why.
Decode is what happens after. The model generates output one token at a time. Each new token depends on every token that came before it, so it can't be parallelized the same way. This phase is memory-bandwidth-bound, not compute-bound. It's the part where the GPU is mostly waiting on memory, not arithmetic.
The practical consequence: prefill and decode want different things from the hardware, and a serious inference stack treats them as two distinct stages. Some setups even run them on different machines (this is called "disaggregated serving"). When you see weird latency curves in production, it's almost always one of these two phases misbehaving.
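A back-of-envelope way to see why decode is memory-bound: to produce one token, the GPU has to stream every model weight out of memory at least once. The sketch below turns that into a lower bound on per-token time. The model size and bandwidth figures are illustrative assumptions, and the estimate deliberately ignores KV cache reads and compute time.

```python
def decode_ms_per_token_floor(param_count: float,
                              bytes_per_param: float,
                              mem_bandwidth_gb_s: float) -> float:
    """Lower bound on one decode step: time to read all weights once."""
    weight_bytes = param_count * bytes_per_param
    seconds = weight_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1000


def decode_tokens_per_s_ceiling(step_ms: float, batch_size: int) -> float:
    """One weight read serves every sequence in the batch, so throughput scales with it."""
    return batch_size / (step_ms / 1000)


# Assumed: 7B parameters in 16-bit precision, 2,000 GB/s of memory bandwidth.
step_ms = decode_ms_per_token_floor(7e9, 2, 2000)
print(f"floor per decode step: {step_ms:.1f} ms")
print(f"ceiling, batch of 1:  {decode_tokens_per_s_ceiling(step_ms, 1):.0f} tokens/s")
print(f"ceiling, batch of 32: {decode_tokens_per_s_ceiling(step_ms, 32):.0f} tokens/s")
```

The second function hints at why decode gets cheaper per token when many sequences share a step: the weights are read once regardless of how many sequences are riding along.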
The KV cache is the thing you're actually paying for
Once you understand prefill vs decode, the KV cache makes sense.
During prefill, the model computes intermediate state for every input token: the "keys" and "values" that attention needs. Throwing that state away after each output token would mean redoing all of that work on every generation step, which would be absurd. So we keep it in GPU memory. That stash is the KV cache.
The KV cache grows linearly with sequence length. In practice, it's the single biggest reason your GPU runs out of memory. Long contexts aren't expensive because the model is "thinking harder." They're expensive because the KV cache eats your VRAM.
Almost every interesting optimization in modern inference (paged attention, prefix caching, prompt caching, chunked prefill) is, at heart, a clever way to manage this cache. If you remember one acronym from this post, make it KV.
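The linear growth is easy to compute yourself. For standard attention, the cache holds one key and one value vector per token, per layer, per KV head. The sketch below uses an illustrative 7B-class configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values); those numbers are assumptions, so plug in your own model's config.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size for standard attention, without quantization or sharing."""
    # The leading 2 is one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Assumed 7B-class config; check your model's actual numbers.
CONFIG = dict(num_layers=32, num_kv_heads=32, head_dim=128)

GIB = 1024 ** 3
for tokens in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(tokens, **CONFIG)
    print(f"{tokens:>7} tokens -> {size / GIB:.0f} GiB of KV cache")
```

Under these assumptions that's half a mebibyte per token, per sequence. Models that use fewer KV heads than query heads (grouped-query attention) shrink it proportionally.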

Your VRAM when someone "just wants to test" a 128K-token prompt in production.
Attention scales quadratically (and that's why context windows are hard)
Why is doubling the context window such a big deal? Because the attention mechanism, in its standard form, scales with the square of the sequence length. Twice the tokens means roughly four times the attention work. Ten times the tokens means a hundred times.
A lot of recent research is about softening that curve. Sliding-window attention, sparse patterns, linear approximations, all sorts of tricks. But the default mental model you should carry is this: the attention cost of processing a long prompt doesn't grow linearly. It grows quadratically. (The rest of the model, and each individual decode step once the KV cache is in place, still scale linearly.) Plan accordingly.
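Here's the arithmetic behind that claim as a small sketch. It counts only the two matrix multiplies inside standard attention during prefill (scores, then the weighted sum of values) and ignores causal masking, the projections, and the MLP layers. The layer and head counts are illustrative assumptions.

```python
def attention_flops(seq_len: int, head_dim: int, num_heads: int, num_layers: int) -> int:
    """Approximate FLOPs for the attention matmuls over a full prompt."""
    # Q @ K^T: seq_len * seq_len scores, each a dot product of length head_dim.
    scores = 2 * seq_len * seq_len * head_dim
    # scores @ V: same shape of work again.
    weighted_sum = 2 * seq_len * seq_len * head_dim
    return (scores + weighted_sum) * num_heads * num_layers


# Assumed model shape, for illustration only.
SHAPE = dict(head_dim=128, num_heads=32, num_layers=32)

base = attention_flops(1_000, **SHAPE)
print(attention_flops(2_000, **SHAPE) // base)   # twice the tokens
print(attention_flops(10_000, **SHAPE) // base)  # ten times the tokens
```

Because `seq_len` appears squared and everything else is a constant factor, the ratios don't depend on the model shape at all.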
Batching is where throughput comes from
GPUs are happiest when they're doing a lot of similar work at once. A single inference request barely tickles them. The job of the serving layer is to gather requests from many users and process them together. That's batching, and it's the difference between an inference setup that costs a fortune and one that doesn't.
Naive batching has an obvious problem, though. Requests finish at different times. If one user wants 50 output tokens and another wants 500, the short request has to wait for the long one. The fix is continuous batching (sometimes called in-flight batching). The scheduler works one decode step at a time: as soon as a request in the batch emits its final token, it drops out and a waiting request takes its slot. The batch is fluid, not fixed.
This is the single biggest throughput unlock in modern LLM serving. If you're picking an inference engine, "does it do continuous batching" is roughly the first question to ask.
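A toy simulation makes the difference concrete. This is a sketch under heavy simplifying assumptions: every decode step costs the same, prefill is ignored, and each request is described only by how many output tokens it wants. Real schedulers juggle much more, and the function names are mine.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[i:i + batch_size])
    return total


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Fluid batches: a finished request frees its slot before the next step."""
    waiting = deque(output_lengths)
    active: list[int] = []  # tokens still to generate, per running request
    steps = 0
    while waiting or active:
        # Fill any free slots from the queue.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step: every active request emits one token
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


# One long request followed by five short ones, two slots on the GPU.
lengths = [500, 50, 50, 50, 50, 50]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In the static case, four of the short requests sit in the queue for 500 steps before they even start. In the continuous case, they cycle through the second slot while the long request is still running.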
Parallelism: when one GPU isn't enough
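Many modern models are too big to fit on one accelerator, and even the ones that fit often need more throughput than a single GPU can deliver. So we spread the work across several. There are a few flavors of how, and they're worth telling apart: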
- Data parallelism. Same model on every GPU, different requests on each. Easy, and what you reach for first when you just need more throughput.
- Tensor parallelism. A single layer of the model is sliced across multiple GPUs, all working on the same request in lockstep. This is what makes huge models fit at all, but it demands very fast interconnect between the GPUs.
- Pipeline parallelism. Different layers on different GPUs, like an assembly line. A request flows through stages.
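- Expert parallelism. Relevant for mixture-of-experts models, where only a subset of the network (a few "experts") activates per token. The experts are spread across GPUs, and each token is routed to whichever GPUs hold the experts it needs.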
You almost never pick just one. Production deployments mix these. Tensor parallelism within a node, pipeline parallelism across nodes, data parallelism across the whole fleet. The combination matters more than any single choice.
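Tensor parallelism is the least intuitive of the four, so here's a toy sketch of one common variant: splitting a weight matrix by columns. Each "GPU" is just a list slice here, and the gather is plain list concatenation. Real systems do this with collective operations over a fast interconnect. All names are invented for the example.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: x is [rows][k], w is [k][cols]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, num_shards: int) -> list[Matrix]:
    """Give each shard an equal slice of the weight matrix's columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("columns must divide evenly across shards")
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, num_shards: int) -> Matrix:
    """Each shard multiplies the same input by its own columns, then results are gathered."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Stand-in for an all-gather: stitch each row's partial outputs back together.
    return [[value for part in partials for value in part[i]] for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, num_shards=2))
print(matmul(x, w))  # same result, computed on "one GPU"
```

Every shard needs the full input and the outputs have to be reassembled on every layer, which is why this flavor is so sensitive to interconnect speed.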
Embeddings: how text becomes something you can search
An embedding is a vector, a long list of numbers, that represents the meaning of a piece of text. Two sentences that mean similar things end up with vectors that point in similar directions, even if they share no words. That's the whole trick. It's what lets you build "semantic search" instead of keyword search.
Embeddings usually come from a dedicated embedding model (typically much smaller than a chat LLM), and they're cheap to compute compared to generation. You run your documents through the embedding model once, store the vectors in a database, and then at query time you embed the user's question and look for the nearest neighbors.
The "nearest neighbor" part is its own field. Exact search is too slow at scale, so we use approximate algorithms (HNSW is the popular one) that trade a little recall for orders of magnitude more speed. Vector stores like pgvector (a Postgres extension), Qdrant, Pinecone, and Weaviate essentially wrap these algorithms in the operational ergonomics of a real database.
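For intuition, this is the exact (brute-force) version of what those approximate indexes speed up: score every stored vector against the query by cosine similarity and keep the top k. The three-dimensional vectors are hand-made stand-ins. Real embeddings have hundreds or thousands of dimensions and come from a model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_neighbors(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Exact search: compare the query against every stored vector."""
    ranked = sorted(vectors.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hand-made toy "embeddings", purely for illustration.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "gpu pricing": [0.0, 0.1, 0.9],
}

query = [0.8, 0.2, 0.1]  # pretend this is the embedded user question
print(nearest_neighbors(query, DOCS, k=2))
```

This costs one comparison per stored vector per query, which is fine for thousands of documents and hopeless for billions. That gap is what HNSW and friends exist to close.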
RAG is just "look stuff up before you answer"
Retrieval-Augmented Generation has a fancy name but the idea is simple. Before you ask the LLM to generate, fetch the relevant context from somewhere and stuff it into the prompt.
The flow is almost always the same:
- Take the user's question and embed it.
- Search the vector store for chunks that are semantically close.
- Optionally rerank those chunks with a more expensive model to improve quality.
- Stitch the top chunks into the prompt as context.
- Let the LLM answer, grounded in what you just retrieved.
That's it. RAG is the default pattern for anything where the model needs to know about your specific data. Your docs, your codebase, your customer history. The hard parts aren't the algorithm. They're chunking strategy (how do you split documents?), retrieval quality (are you actually getting the right chunks?), and context management (what do you do when the relevant context doesn't fit?).
Done well, RAG is the difference between a chatbot that hallucinates and one that's actually useful.
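The five steps above fit in a few dozen lines if you fake the expensive parts. In this sketch, `embed` is a toy bag-of-words counter standing in for a real embedding model (so it matches keywords, not meaning), the rerank step is skipped, and the final LLM call is left out. It stops at the assembled prompt. Every name here is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    """Steps 1 and 2: embed the question, return the k closest chunks."""
    q = embed(question)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 4: stitch the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")


CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

question = "How many days do refunds take?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
print(prompt)  # step 5 would send this string to the LLM
```

Printing the assembled prompt is the most useful debugging habit in RAG. When answers are bad, the cause is usually visible right there in the retrieved chunks.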
Latency is not one number
In a regular service, you have one latency number per request. In an LLM service, you have at least two, and they mean different things to the user:
- Time to first token (TTFT). How long until the user sees anything. Dominated by prefill. This is what makes a UI feel responsive.
- Inter-token latency, or tokens per second. How fast the response streams after it starts. Dominated by decode. This is what makes a response feel snappy versus sluggish.
You can have great TTFT and terrible throughput, or the other way around. Most users care more about TTFT than they realize. Streaming output forgives a lot. If your TTFT is bad, no amount of clever UI will save the feel of the product.
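Measuring both numbers only requires the time the request was sent and the arrival time of each streamed token. Here's a minimal sketch; the function name and the returned keys are my own, and the timestamps are invented rather than measured.

```python
def streaming_latency(request_sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Split one streamed response into TTFT and decode-speed metrics (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent_at
    # Gaps between consecutive tokens reflect the decode phase only.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "mean_inter_token_s": mean_gap,
        "decode_tokens_per_s": 1 / mean_gap if mean_gap else float("inf"),
    }


# Invented timestamps: first token after 0.5 s, then one every 0.25 s.
metrics = streaming_latency(request_sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Tracking these separately matters because they point at different fixes: a bad `ttft_s` sends you to prefill and queueing, a bad `mean_inter_token_s` sends you to decode and batch sizes.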
Cost is mostly about three things
Once you've shipped a few of these systems, you start to see that cost concentrates in a handful of levers:
- Token volume. The cheapest token is the one you didn't generate. Tighter prompts, smaller outputs, better caching of common prefixes. This is the biggest knob, and the one most teams under-pull.
- Model size. Routing easy requests to a smaller model and only escalating to the big one when needed (often called a "model cascade") can cut bills dramatically without users noticing.
- Hardware utilization. A GPU that's idle is just as expensive as a GPU that's busy. Batching, autoscaling, and right-sizing your replicas matter more than the per-token price of any individual model.
If you're optimizing cost and you're not measuring these three together, you're flying blind.
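The third lever is the easiest to put into numbers. If you rent a GPU by the hour, your effective cost per token is the hourly price divided by the tokens you actually served. The price and throughput figures below are made up for illustration.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            peak_tokens_per_s: float,
                            utilization: float) -> float:
    """Effective USD per million tokens on a GPU billed by the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_s * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Assumed: a $2/hour GPU that can serve 1,000 tokens/s when fully busy.
for utilization in (1.0, 0.5, 0.25):
    cost = cost_per_million_tokens(2.0, 1000, utilization)
    print(f"{utilization:>4.0%} utilized -> ${cost:.2f} per million tokens")
```

The hourly bill is the same in all three rows. Only the denominator changes, which is why a half-idle fleet quietly doubles your per-token cost.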

Your GPU during decode, sitting at 30% utilization, spiritually at peace, waiting for the memory bus to catch up.
A rough mental picture of a production LLM stack
Stitched together, a serious LLM system tends to look something like this:
- An API gateway at the front that handles auth, rate limits (in tokens, not requests), and routing.
- A prompt assembly layer that takes the user's input, pulls relevant context from a vector store, applies any templates, and produces the final prompt.
- An inference layer. One or more model servers (vLLM, TGI, TensorRT-LLM, SGLang, or a hosted API) doing the actual generation, with continuous batching and KV cache management.
- A retrieval layer. Embedding model plus vector database plus an optional reranker.
- An observability layer that tracks token counts, TTFT, p95 latencies, cache hit rates, and per-feature cost. This one always gets built last and always should have been built first.
- A safety and evaluation layer. Input filtering, output checks, and (ideally) automated evals that run on every model or prompt change.
None of this is exotic. It's the same mental shape as any other distributed system. The pieces just have different names and different failure modes.
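As one concrete piece of that picture, here's a sketch of the gateway's rate limiter expressed in tokens rather than requests. It's an ordinary token bucket where the budget happens to be LLM tokens. The class name and the idea of passing in an estimated token count are my assumptions; a real gateway would also reconcile the estimate against actual usage after the response.

```python
class TokenBudgetLimiter:
    """Token-bucket limiter whose budget is LLM tokens per minute."""

    def __init__(self, tokens_per_minute: int, now: float) -> None:
        self.capacity = float(tokens_per_minute)
        self.refill_per_s = tokens_per_minute / 60
        self.available = float(tokens_per_minute)  # start with a full bucket
        self.last_seen = now

    def allow(self, estimated_tokens: int, now: float) -> bool:
        """Admit the request if its estimated tokens fit in the remaining budget."""
        elapsed = now - self.last_seen
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_s)
        self.last_seen = now
        if estimated_tokens > self.available:
            return False
        self.available -= estimated_tokens
        return True


# Timestamps are passed in explicitly so the example is deterministic.
limiter = TokenBudgetLimiter(tokens_per_minute=6000, now=0.0)
print(limiter.allow(5000, now=0.0))   # one big prompt uses most of the budget
print(limiter.allow(2000, now=0.0))   # a second request no longer fits
print(limiter.allow(2000, now=10.0))  # ten seconds of refill makes room again
```

A per-request limiter would have treated the first two calls identically. This one notices that the first call spent five-sixths of the minute's budget.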
Where I'd start if I were starting today
If I were learning this from scratch, I'd build something small and end-to-end before I read another article. Pick a tiny problem ("answer questions about a single PDF" is a classic) and wire up a chunker, an embedding model, a vector store, a retrieval step, and an LLM call. Run it. Watch it fail. Look at the prompts it actually sends. Look at the chunks it actually retrieves. Then start reading about the optimizations, because now they'll mean something.
That's most of what I know about LLM system design distilled into one post. None of it is secret, none of it is magic. It's just a different shape of system than the ones most of us were trained on, and once the shape clicks, the rest of the literature becomes much easier to navigate.
If any of this was useful, I'd love to hear what you're building. That's the whole reason I write these. Selfishly, I learn more from the conversations these posts start than from writing them.
