At some point, most AI platform teams hit the same wall. The direct-to-OpenAI calls that worked fine in prototype are now scattered across fifteen microservices, nobody knows which team spent $8,000 last month, one provider outage takes down everything, and adding observability requires touching every service individually.
This is the problem an LLM Gateway solves. Not an abstraction layer for its own sake — a centralized proxy that gives you fallback routing, rate limiting, semantic caching, cost attribution, and security enforcement as infrastructure-level concerns, solved once, applied everywhere.
Here's how to actually build one.
Why direct provider calls don't scale
The obvious failure is vendor lock-in. When every service hardcodes gpt-4o endpoint URLs and OpenAI auth patterns, switching providers — or even switching model versions — requires touching every callsite. A team that went all-in on GPT-4 in 2023 found that out the hard way when they wanted to trial Claude for cost reasons and discovered the migration was a multi-week effort.
The less obvious failures are operational. With no centralized layer:
- No cost attribution. You can't answer "which feature spent $12K last month" without bespoke per-service instrumentation.
- No fallbacks. A provider rate limit or outage propagates directly to end users.
- No policy enforcement. Content filtering and PII redaction have to be reimplemented in every service.
- No quota control. A runaway agent stuck in a retry loop at 30 RPM makes about 1.3 million requests in a month; at roughly $0.003 per request, that silently burns through about $3,600 before anyone notices.

Your platform team, receiving the cloud bill after three microservices independently decided to retry against the OpenAI API during an outage.
Enterprise LLM API spend crossed $12.5 billion in 2025. Most platform teams can't attribute that spend to specific products or teams until the bill arrives. A gateway fixes that.
The component architecture
Before getting into each piece, here's what the full system looks like:
Client Services
      │
      ▼
[LLM Gateway Layer]
  ├── Auth & Virtual Key Validation
  ├── Rate Limiter (Redis-backed, per-team/user TPM+RPM)
  ├── Semantic Cache (Embedding → Vector Search → Cache Hit?)
  ├── Model Router (complexity scoring / cost routing)
  ├── Guardrails (prompt injection, PII scan)
  ├── Provider Router (primary + ordered fallbacks)
  │     ├── Provider A (OpenAI)
  │     ├── Provider B (Anthropic)
  │     └── Provider C (Google Gemini)
  ├── Response Guardrails (output scan, DLP)
  └── Telemetry Writer (async → Prometheus + OTEL Collector)
Each component is independent. You can add semantic caching without changing how fallbacks work. You can swap rate limiting backends without touching routing logic.
Fallback routing: error classification is the hard part
The mechanics of fallback routing are straightforward — you try Provider A, and if it fails, you try Provider B. What's actually hard is classifying why it failed and deciding what to do about it.
Not all errors should route the same way:
429 Rate Limit (transient): The deployment hit a per-minute ceiling. Put it on cooldown for N seconds, route immediately to the next provider in the fallback chain. Don't wait for retries.
429 Quota Exhausted (account-level): Check the error body. If it mentions "quota" or "billing," the account itself is out of budget, so treat the error as fatal for this provider: a retry cannot succeed. Alert immediately and route to fallback.
500/502/503/504 Provider Errors: Retry on the same provider once with exponential backoff, then escalate to fallback. These are transient infrastructure failures.
Content Filter Block (OpenAI 400 with content_filter code): Don't retry the same provider. The content won't pass the filter regardless. Route to an alternative model, and optionally flag for human review.
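The four cases above can be sketched as a small classifier. This is a minimal illustration, not any gateway's actual implementation: `ProviderError`, `Action`, and `classify` are hypothetical names, and the substring check on the error body is an assumption about how a provider words its quota errors.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    COOLDOWN_AND_FALLBACK = "cooldown_and_fallback"  # transient 429
    ALERT_AND_FALLBACK = "alert_and_fallback"        # quota or billing exhausted
    RETRY_THEN_FALLBACK = "retry_then_fallback"      # 5xx provider errors
    FALLBACK_NO_RETRY = "fallback_no_retry"          # content filter block
    FAIL = "fail"                                    # anything else: surface to caller


@dataclass
class ProviderError:
    """Hypothetical normalized error; real providers need per-provider parsing."""
    status: int
    body: str = ""   # raw error message text
    code: str = ""   # provider error code, e.g. "content_filter"


def classify(err: ProviderError) -> Action:
    body = err.body.lower()
    if err.status == 429:
        # Account-level exhaustion will not clear on retry, so alert instead.
        if "quota" in body or "billing" in body:
            return Action.ALERT_AND_FALLBACK
        return Action.COOLDOWN_AND_FALLBACK
    if err.status in (500, 502, 503, 504):
        return Action.RETRY_THEN_FALLBACK
    if err.status == 400 and err.code == "content_filter":
        return Action.FALLBACK_NO_RETRY
    return Action.FAIL
```

Keeping classification separate from the retry loop makes it easy to unit-test each error class and to log the chosen action as the fallback reason.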
LiteLLM's router handles this with ordered deployments:
router_settings:
  routing_strategy: "usage-based-routing"
  num_retries: 2
  retry_after: 5
  cooldown_time: 60
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-openai-...
      order: 1
  - model_name: "claude-sonnet"
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: sk-ant-...
      order: 2
  - model_name: "gemini-flash"
    litellm_params:
      model: gemini/gemini-2.0-flash
      order: 3
When the order=1 deployment exhausts its retry budget, the router places it on cooldown and escalates to order=2 (Anthropic). If that also fails, it moves on to order=3. The cooldown keeps traffic away from a failed deployment for cooldown_time seconds, which avoids a thundering herd of retries the moment it recovers.
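To make the escalate-and-cool-down behavior concrete, here is a simplified sketch in plain Python. It is not LiteLLM's code: `FallbackRouter` is a hypothetical class, per-provider retries are left out for brevity, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

ProviderCall = Callable[[str], str]


class FallbackRouter:
    """Tries providers in priority order and skips any that are cooling down."""

    def __init__(self, providers: List[Tuple[str, ProviderCall]],
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.providers = providers            # already sorted by priority
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.cooling_until: Dict[str, float] = {}

    def complete(self, prompt: str) -> Tuple[str, str]:
        """Return (provider_name, response) from the first healthy provider."""
        last_error: Optional[Exception] = None
        for name, call in self.providers:
            if self.clock() < self.cooling_until.get(name, float("-inf")):
                continue  # still on cooldown, do not send traffic yet
            try:
                return name, call(prompt)
            except Exception as exc:
                # Mark the deployment unhealthy and move down the chain.
                self.cooling_until[name] = self.clock() + self.cooldown_s
                last_error = exc
        raise RuntimeError("all providers failed or are cooling down") from last_error
```

A production router would combine this with error classification, since only some failures should trigger a cooldown.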
Rate limiting: the hierarchy that matters
Rate limiting in a gateway serves two purposes: keeping your traffic under provider limits so you don't get throttled, and protecting your budget from runaway consumers.
The hierarchy in LiteLLM's model is:
Organization → Team → Virtual Key → User
Each level gets independent rpm_limit (requests per minute), tpm_limit (tokens per minute), max_budget, and budget_duration (reset cycle: 1d, 7d, 1h). A key inherits its team's limits unless overridden. Teams inherit their organization's limits.
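One way to picture the inheritance rule is a resolver that walks the hierarchy from broadest to most specific and lets the most specific explicit value win. This is an illustrative sketch only: `Limits` and `effective_limits` are hypothetical names, and it models inheritance alone, not the separate enforcement of each parent level's own ceiling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    rpm_limit: Optional[int] = None      # None means "not set at this level"
    tpm_limit: Optional[int] = None
    max_budget: Optional[float] = None


def effective_limits(*levels: Limits) -> Limits:
    """Resolve limits; pass levels from broadest to most specific (org, team, key)."""
    resolved = Limits()
    for level in levels:
        for name in ("rpm_limit", "tpm_limit", "max_budget"):
            value = getattr(level, name)
            if value is not None:
                setattr(resolved, name, value)  # a more specific level overrides
    return resolved


org = Limits(rpm_limit=1000, tpm_limit=500_000, max_budget=5000.0)
team = Limits(rpm_limit=200)          # tighter RPM, inherits the rest
key = Limits(max_budget=50.0)         # tighter budget, inherits the rest
print(effective_limits(org, team, key))
```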
One practical consideration: token count is unknown until after the LLM responds. The implementation pattern is estimate-then-deduct. Check TPM headroom before the request using an estimated token count, send the request, then deduct actual tokens after the response. At high concurrency, brief over-budget windows are possible — typically a few seconds. That's acceptable for most use cases and unavoidable without a round-trip to get a token count before every request.
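A minimal, single-process sketch of that pattern is below. The names `TpmLimiter` and `estimate_tokens` are hypothetical, the fixed one-minute window is a simplification, and the four-characters-per-token heuristic is a rough assumption rather than a real tokenizer.

```python
import time
from typing import Callable


def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Rough pre-request estimate: ~4 characters per token plus the output cap."""
    return len(prompt) // 4 + max_output_tokens


class TpmLimiter:
    """Fixed one-minute window: check with an estimate, deduct the actual usage."""

    def __init__(self, tpm_limit: int, clock: Callable[[], float] = time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window = int(clock() // 60)
        self.used = 0

    def _roll(self) -> None:
        current = int(self.clock() // 60)
        if current != self.window:       # a new minute started: reset the counter
            self.window, self.used = current, 0

    def has_headroom(self, estimated_tokens: int) -> bool:
        """Pre-request check; nothing is deducted yet."""
        self._roll()
        return self.used + estimated_tokens <= self.tpm_limit

    def record(self, actual_tokens: int) -> None:
        """Post-response deduction using the provider-reported token count."""
        self._roll()
        self.used += actual_tokens
```

Because the check and the deduction are separate steps, concurrent requests can all pass `has_headroom` before any of them calls `record`, which is exactly the brief over-budget window described above.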
For horizontally scaled gateway deployments, rate limit counters live in Redis. LiteLLM syncs its in-memory counter to Redis every 10ms to avoid per-request Redis calls, which means counters can drift by at most about 10 requests at 100 RPS across 3 instances. That is a documented tradeoff and acceptable in practice.
Priority-based allocation is worth considering at scale. Reserve TPM/RPM capacity by key priority level: production keys get a guaranteed floor, development keys are throttled first when capacity is tight. LiteLLM added this in v1.77.3.
Semantic caching: where you actually save money
Semantic caching is the highest-leverage cost optimization in an LLM gateway, and it's underused because teams conflate it with exact-match caching and assume it won't work.
The difference: exact-match caching stores responses keyed by prompt hash. Two requests need to be byte-for-byte identical to get a cache hit. Semantic caching converts each incoming prompt into an embedding vector, searches a vector store for cached prompts above a similarity threshold, and returns the cached response if found. "What is your cancellation policy?" and "How do I cancel my subscription?" hit the same cache entry.
The implementation stack is simpler than it sounds:
- Embed the incoming prompt with a fast, cheap model (text-embedding-3-small costs ~$0.00002 per query vs. $0.01+ for a GPT-4 response, roughly 500x cheaper).
- Query Redis Stack (with RediSearch) for the nearest cached prompt by cosine similarity. Sub-50ms query times are normal.
- Return the cached response if similarity exceeds your threshold (0.85-0.95 depending on required precision).
- Call the LLM and cache the result on a miss.
Layer this as: exact hash check first (zero cost) → semantic similarity search → LLM call. Most requests never reach the LLM.
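The layering can be sketched in a few dozen lines. Everything here is a stand-in chosen so the example runs with no external services: `toy_embed` is a bag-of-words substitute for a real embedding model, the linear scan replaces a vector index such as Redis, and `LayeredCache` is a hypothetical name.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: lowercase word counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LayeredCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.90):
        self.embed = embed
        self.threshold = threshold
        self.exact: Dict[str, str] = {}                # prompt hash -> response
        self.semantic: List[Tuple[Vector, str]] = []   # (embedding, response)

    def get_or_call(self, prompt: str, call_llm: Callable[[str], str]) -> Tuple[str, str]:
        """Return (response, source) where source is 'exact', 'semantic', or 'miss'."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.exact:                          # layer 1: exact hash, no embedding cost
            return self.exact[key], "exact"
        vec = self.embed(prompt)                       # layer 2: nearest cached prompt
        best_score, best_response = 0.0, None
        for cached_vec, cached_response in self.semantic:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, cached_response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic"
        response = call_llm(prompt)                    # layer 3: real LLM call, then cache
        self.exact[key] = response
        self.semantic.append((vec, response))
        return response, "miss"
```

Returning the source alongside the response makes it straightforward to record the cache outcome per request and measure hit rates by layer.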
Cache hit rates by workload type:
| Workload | Cache Hit Rate |
|---|---|
| Customer support / FAQ | 70-90% |
| RAG pipelines (varied queries) | 40-60% |
| General chat / creative | 20-40% |
| Personalized / real-time | Not applicable |
The 70-90% hit rate on support workloads is real. One documented customer support platform at 50,000 daily queries, even at a conservative 65% hit rate, avoids 32,500 LLM calls per day. At an average of $0.005 per call, that's about $162/day, or roughly $4,900/month, saved purely from caching.

The finance team after you tell them semantic caching saved $4,900/month. They will never understand what a vector store is. They don't need to.
Threshold tuning matters. A 0.95 threshold is high precision (fewer false hits) but lower recall. 0.85 is more aggressive and serves cached answers to more queries — but risks returning a subtly wrong cached response to a similar-but-different question. For factual domains where answer correctness is critical, start at 0.90 and tune down only if cache hit rates are unsatisfactory. For conversational domains with more variation tolerance, 0.85 is fine.
When to disable caching: Personalized queries that include user-specific context, time-sensitive queries, queries requiring fresh data. Add metadata tags to requests that flag these and skip the cache lookup entirely.
Model routing: 40-85% of LLM spend is wasted on the wrong model
Organizations that use a single model for all tasks overpay by 40-85% compared to intelligent routing. The economics are stark:
- Claude 3.5 Haiku: ~$0.80/M input tokens
- Claude Sonnet 4.5: ~$3/M input tokens
- Claude Opus 4: ~$15/M input tokens
- GPT-4o-mini: ~$0.15/M input tokens
That's a nearly 19x cost difference between Haiku and Opus ($0.80 vs. $15 per million input tokens). For the majority of queries (simple question-answering, classification, formatting, short summarization), the cheap model performs identically. Sending those to Opus is wasted spend.
One documented case: a customer support platform reduced monthly LLM spend from $42,000 to $18,000 by routing simple queries to Claude 3.5 Haiku and complex escalations to Claude 3.5 Sonnet, with no measurable degradation in customer satisfaction scores.
The routing signals that actually work:
Prompt length and complexity: Short prompts with simple vocabulary → cheap model. Long prompts with code, multi-step reasoning, or structured output requirements → capable model. This is easy to implement and captures the most variance.
Task type tags: Your application layer knows what kind of task it's generating a prompt for. Tag it: classification, summarization, qa → cheap. code-generation, multi-step-reasoning, tool-use → capable.
User tier: Free users get Haiku. Paid users get Sonnet. Enterprise users get Opus. Simple, clean, and aligns cost structure with revenue.
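Here is one way the three signals could be combined. This is a sketch under assumptions: the policy table, the 2,000-character cutoff, and the code-detection markers are all hypothetical values chosen for illustration, and letting an explicit task tag take precedence over the length heuristic is a design choice, not something any particular gateway prescribes.

```python
from typing import Optional

SIMPLE_TASKS = {"classification", "summarization", "qa"}
COMPLEX_TASKS = {"code-generation", "multi-step-reasoning", "tool-use"}

# Hypothetical policy table: which model each user tier gets per complexity class.
TIER_MODELS = {
    "free": {"simple": "haiku", "complex": "haiku"},
    "paid": {"simple": "haiku", "complex": "sonnet"},
    "enterprise": {"simple": "sonnet", "complex": "opus"},
}

CODE_MARKERS = ("def ", "class ", "SELECT ")  # crude signal that the prompt contains code


def complexity(prompt: str, task_type: Optional[str], long_prompt_chars: int = 2000) -> str:
    # An explicit tag from the application layer is the strongest signal.
    if task_type in COMPLEX_TASKS:
        return "complex"
    if task_type in SIMPLE_TASKS:
        return "simple"
    # Untagged request: fall back to length and code heuristics.
    looks_like_code = any(marker in prompt for marker in CODE_MARKERS)
    return "complex" if len(prompt) > long_prompt_chars or looks_like_code else "simple"


def route(prompt: str, task_type: Optional[str] = None, user_tier: str = "free") -> str:
    models = TIER_MODELS.get(user_tier, TIER_MODELS["free"])  # unknown tiers get the cheapest policy
    return models[complexity(prompt, task_type)]


print(route("Is this email spam?", task_type="classification", user_tier="paid"))
print(route("Refactor this module", task_type="code-generation", user_tier="paid"))
```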
LiteLLM's router supports cost-based-routing (pick the cheapest deployment), latency-based-routing (pick the fastest deployment), and usage-based-routing (balance across deployments by current token load).
Virtual keys: the security architecture that makes all of this work
Everything above relies on a single security primitive: virtual keys. Internal services authenticate to the gateway with internal virtual keys (sk-internal-...). Real provider credentials — the actual OpenAI and Anthropic API keys — live only in the gateway's config or a secret store. They never appear in application code, CI pipelines, or service deployments.
Virtual keys can be scoped:
- Which models can this key access?
- What's the spending cap?
- What's the RPM limit?
- When does it expire?
A marketing automation service gets access to cheap models only. Core product agents get premium models. The security boundary is enforced at the infrastructure layer, not in application code.
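As a rough illustration of what that enforcement looks like at request time, the sketch below checks a request against a key's scope. `VirtualKey` and `authorize` are hypothetical names, not the data model of LiteLLM or Portkey, and the per-minute request count is assumed to come from the rate limiter.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class VirtualKey:
    key_id: str
    allowed_models: Set[str]
    max_budget_usd: float
    rpm_limit: int
    expires_at: Optional[datetime] = None   # None means the key never expires
    spent_usd: float = 0.0


def authorize(key: VirtualKey, model: str, requests_this_minute: int,
              now: Optional[datetime] = None) -> Optional[str]:
    """Return None if the request is allowed, otherwise a denial reason."""
    now = now or datetime.now(timezone.utc)
    if key.expires_at is not None and now >= key.expires_at:
        return "key_expired"
    if model not in key.allowed_models:
        return "model_not_allowed"
    if key.spent_usd >= key.max_budget_usd:
        return "budget_exceeded"
    if requests_this_minute >= key.rpm_limit:
        return "rpm_exceeded"
    return None


marketing_key = VirtualKey(
    key_id="sk-internal-marketing",
    allowed_models={"gpt-4o-mini"},          # cheap models only
    max_budget_usd=100.0,
    rpm_limit=60,
    expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
```

Returning a reason string rather than a boolean lets the gateway map each denial to a distinct HTTP status and log it for auditing.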

Your real provider credentials, finally not hardcoded in seventeen microservice repos committed by an intern in 2023.
Key rotation with grace periods is supported in both LiteLLM and Portkey. Integrate with Vault or AWS Secrets Manager for the gateway's own secrets.
Observability: what to log and what not to
Every request through the gateway should produce a structured log record with:
{
  "request_id": "...",
  "trace_id": "...",
  "model_requested": "gpt-4o",
  "model_used": "claude-sonnet",
  "provider": "anthropic",
  "prompt_tokens": 342,
  "completion_tokens": 187,
  "cost_usd": 0.00157,
  "latency_ms": 1840,
  "ttft_ms": 420,
  "http_status": 200,
  "cache_hit": false,
  "fallback_triggered": true,
  "fallback_reason": "429_rate_limit",
  "team_id": "platform-team",
  "user_id": "user_abc123",
  "environment": "production"
}
Key aggregations to surface in Grafana:
- P95 latency per model and per provider. P99 for tail latency pathology.
- Error rate by provider, segmented by error class (rate limits vs. content filters vs. 5xx).
- Cache hit rate by workload type. A falling cache hit rate on a stable workload is a signal something changed upstream.
- Cost per team and cost per feature over rolling 7-day and 30-day windows.
- Fallback rate. A rising fallback rate is an early signal of provider degradation, often hours before the provider publishes a status page incident.
- TPM utilization vs. quota ceilings. Set alerts before you hit the ceiling, not after.
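In practice these aggregations are computed by your metrics backend, but it helps to see what they reduce to. The sketch below computes a few of them directly from log records shaped like the example above; `p95` and `summarize` are hypothetical helpers, and the nearest-rank percentile is one of several common definitions.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    latencies: Dict[str, List[float]] = defaultdict(list)
    cost_by_team: Dict[str, float] = defaultdict(float)
    total = fallbacks = cache_hits = 0
    for record in records:
        total += 1
        latencies[record["model_used"]].append(record["latency_ms"])
        cost_by_team[record["team_id"]] += record["cost_usd"]
        fallbacks += bool(record["fallback_triggered"])
        cache_hits += bool(record["cache_hit"])
    return {
        "p95_latency_ms": {model: p95(vals) for model, vals in latencies.items()},
        "cost_usd_by_team": dict(cost_by_team),
        "fallback_rate": fallbacks / total if total else 0.0,
        "cache_hit_rate": cache_hits / total if total else 0.0,
    }
```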
Two important notes on logging: write telemetry asynchronously — never block the request path on observability calls. And by default, don't log raw prompt and response bodies. Log token counts, model, latency, and trace IDs. Raw content logging must be opt-in, RBAC-gated, and explicitly reviewed for compliance (SOC 2, GDPR, HIPAA).
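Both notes can be illustrated with a small background writer. This is a sketch, not a production exporter: `TelemetryWriter` and `SAFE_FIELDS` are hypothetical, the sink is any callable you supply (for example, a function that forwards to your collector), and dropping records when the queue is full is one possible policy among several.

```python
import queue
import threading
from typing import Any, Callable, Dict

# Allowlist of metadata fields; raw prompt and response bodies are never included.
SAFE_FIELDS = {
    "request_id", "trace_id", "model_requested", "model_used", "provider",
    "prompt_tokens", "completion_tokens", "cost_usd", "latency_ms", "ttft_ms",
    "http_status", "cache_hit", "fallback_triggered", "fallback_reason",
    "team_id", "user_id", "environment",
}


class TelemetryWriter:
    def __init__(self, sink: Callable[[Dict[str, Any]], None], max_queue: int = 10_000):
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_queue)
        self._sink = sink
        self.dropped = 0
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, record: Dict[str, Any]) -> None:
        """Called on the request path: filters fields and never blocks."""
        safe = {k: v for k, v in record.items() if k in SAFE_FIELDS}
        try:
            self._queue.put_nowait(safe)
        except queue.Full:
            self.dropped += 1            # shed telemetry rather than slow requests

    def _run(self) -> None:
        while True:
            record = self._queue.get()
            try:
                self._sink(record)
            except Exception:
                pass                     # a failing sink must not kill the worker
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued record has been handed to the sink."""
        self._queue.join()
```

An allowlist is safer than a blocklist here: a new field carrying sensitive content stays out of the logs until someone deliberately adds it.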
Which gateway to use
The field has consolidated around a few solid options:
LiteLLM Proxy is the default choice for most engineering teams. Open-source, self-hosted, supports 100+ providers through an OpenAI-compatible endpoint. YAML-configured. Virtual key management via admin UI. Redis-backed rate limiting. Prometheus metrics built in. Connects to Langfuse, Helicone, and Datadog for observability.
Portkey is a hosted option (with a self-hosted path) that ships guardrails out of the box — 60+ checks including prompt injection detection and PII redaction across 9 languages. Good fit for teams that need compliance features without building their own guardrail layer.
Cloudflare AI Gateway is the right choice if you're already on Cloudflare infrastructure. Edge-native, sub-millisecond overhead, DLP built in, good observability dashboard.
Kong AI Gateway is a reasonable choice if your organization already runs Kong for non-AI API management. Version 3.10 added token-and-cost-based load balancing and semantic caching as first-class plugins.
OpenRouter works well for small teams under $2K/month in LLM spend who want multi-provider access without running their own infrastructure. 300+ models, automatic failover. Adds ~25ms overhead and a 5% markup on all requests.
Start with the problems you have
The most common mistake is trying to build all of this at once. You don't need semantic caching on day one. You don't need multi-tier budget enforcement before you have multiple teams.
Start with the two things that provide immediate value at any scale: virtual key management (so real provider credentials never touch application code) and basic fallback routing (so a single provider outage doesn't take down your product). Those two changes alone substantially reduce operational risk.
Add semantic caching when you have enough query volume to measure hit rates. Add multi-tier rate limiting when you have multiple teams with different consumption patterns. Add model routing when LLM costs become a meaningful line item.
The gateway pattern scales with your needs. The point is to have the layer in place before you need every feature — because adding it after the fact, across fifteen services, is the wall most teams hit.
