Cutting LLM Cost and Latency in Production: Caching, Routing, and Streaming
LLM bills scale with traffic and latency kills UX. The caching, model-routing, and streaming tactics that cut both, with the tradeoffs spelled out.
The first month an LLM feature ships, the bill is a rounding error and nobody cares. By month four it is a line item the CFO has questions about, and the p95 latency is bad enough that the support inbox has a recurring "is the AI broken?" thread. I have been on both sides of that conversation. The good news is that LLM cost and latency are the same problem wearing two hats, and most of the wins come from the same handful of tactics: stop paying for tokens you already paid for, stop sending the most expensive model work a cheap one could do, and stop making the user stare at a spinner while the model thinks.
This is the playbook I actually run in production. Numbers are from real workloads; your mileage will vary, but the shape of the wins is consistent.
Understand the token economics first
You cannot optimize what you do not price correctly. Every provider bills input and output tokens separately, and output is the expensive one — usually 3x to 5x the input rate. As of mid-2026, Anthropic's Claude Sonnet sits around $3 per million input tokens and $15 per million output; OpenAI's GPT-class mid-tier models are in the same ballpark. That ratio is the single most important fact for cost work.
The implication: a request with a 4,000-token prompt and a 200-token answer is dominated by input cost, while a request that streams back a 1,500-token essay is dominated by output. They need different fixes. Bloated RAG context and giant few-shot blocks bleed input cost; verbose, unbounded generations bleed output cost. Before touching anything, log the split.
type UsageLog = {
  requestId: string;
  model: string;
  inputTokens: number;
  cachedInputTokens: number; // billed at a discount
  outputTokens: number;
  costUsd: number;
  latencyMs: number;
  cacheHit: "exact" | "semantic" | "prefix" | "miss";
};
const PRICING: Record<string, { in: number; cachedIn: number; out: number }> = {
  // USD per 1M tokens; verify against the provider's current price sheet
  "claude-sonnet-4-5": { in: 3.0, cachedIn: 0.3, out: 15.0 },
  "claude-haiku-4-5": { in: 1.0, cachedIn: 0.1, out: 5.0 },
};
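function costUsd(model: string, inTok: number, cachedTok: number, outTok: number) {
  const p = PRICING[model];
  if (!p) throw new Error(`No pricing configured for model: ${model}`);
  // inTok is total input tokens, cached reads included; some APIs report
  // cached reads separately, so add them in before calling this.
  const billedIn = inTok - cachedTok;
  return (billedIn * p.in + cachedTok * p.cachedIn + outTok * p.out) / 1_000_000;
}
Ship this logging on day one. When someone asks "where is the money going," you want a query, not a guess.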
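As a sketch of the kind of query this logging enables, the function below rolls logged calls up into input versus output spend per model, plus a p95 latency. It is illustrative only: LoggedCall is a hypothetical variant of the log record that stores the two cost components separately, and spendByModel and percentile are names I made up, not library functions.

```typescript
// Hypothetical log record: input and output cost are stored separately
// so the split can be queried without re-pricing tokens.
type LoggedCall = {
  model: string;
  inputCostUsd: number;
  outputCostUsd: number;
  latencyMs: number;
};

type ModelSpend = {
  calls: number;
  inputCostUsd: number;
  outputCostUsd: number;
  outputShare: number; // fraction of spend that went to output tokens, 0..1
  p95LatencyMs: number;
};

// Nearest-rank percentile over an unsorted sample; returns 0 for an empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function spendByModel(logs: LoggedCall[]): Record<string, ModelSpend> {
  // Group calls by model name.
  const grouped: Record<string, LoggedCall[]> = {};
  for (const log of logs) {
    if (!grouped[log.model]) grouped[log.model] = [];
    grouped[log.model].push(log);
  }

  // Summarize each group.
  const summary: Record<string, ModelSpend> = {};
  for (const model of Object.keys(grouped)) {
    const calls = grouped[model];
    const inputCostUsd = calls.reduce((sum, c) => sum + c.inputCostUsd, 0);
    const outputCostUsd = calls.reduce((sum, c) => sum + c.outputCostUsd, 0);
    const total = inputCostUsd + outputCostUsd;
    summary[model] = {
      calls: calls.length,
      inputCostUsd,
      outputCostUsd,
      outputShare: total === 0 ? 0 : outputCostUsd / total,
      p95LatencyMs: percentile(calls.map((c) => c.latencyMs), 95),
    };
  }
  return summary;
}
```

A model whose outputShare is high is a candidate for tighter output caps; one where it is low is a candidate for caching and prompt trimming.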
Caching: three layers, three different jobs
Caching is the highest-leverage lever, but "cache the LLM" is three distinct techniques that people conflate.
1. Provider-side prompt caching (cached prefixes)
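Both major providers let you cache the static prefix of a prompt (your system instructions, tool definitions, the big RAG document everyone shares), so you pay full price for it once, then a fraction (often ~10% of the input rate) on subsequent calls within the cache window. The details differ: Anthropic uses explicit cache_control markers and bills the initial cache write at a small premium over the normal input rate, while OpenAI applies caching automatically to sufficiently long prompts. This is the closest thing to free money in this whole post. See Anthropic's prompt caching and OpenAI's prompt caching docs for the exact rules.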
The mechanics matter: caching is prefix-based, so put everything stable at the front and everything variable (the user's actual message) at the end. If you interpolate a timestamp or a request ID into the system prompt, you have just invalidated the cache on every call.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const res = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LARGE_STABLE_INSTRUCTIONS, // tools, policy, schema: never changes
      cache_control: { type: "ephemeral" }, // mark the prefix as cacheable
    },
  ],
  messages: [{ role: "user", content: userMessage }], // the only variable part
});
On a chat product where every request carries a 6,000-token system prompt, turning this on cut our input spend by roughly 80% on cached turns. The only tradeoff is a short cache TTL (about five minutes by default), so it helps bursty, conversational traffic far more than one-off requests scattered across the day.
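To make the prefix rule concrete, here is a minimal sketch contrasting a prompt builder that breaks the cache with one that preserves it, plus a fingerprint you can log to spot accidental prefix drift. Everything here is hypothetical and for illustration: STABLE_PREFIX, buildBad, buildGood, and prefixFingerprint are my names, not SDK features, and the hash is a simple non-cryptographic one.

```typescript
// Hypothetical stable prefix: instructions, policy, schema. Nothing per-request belongs here.
const STABLE_PREFIX =
  "You are a support assistant for Acme. Answer only from the policy below.";

type PromptParts = {
  system: string; // cacheable prefix, must be byte-identical on every call
  user: string; // variable suffix, free to change per request
};

// FNV-1a style 32-bit hash over UTF-16 code units. Not cryptographic; it only
// needs to change when the prefix changes, so it can be logged next to cache hits.
function prefixFingerprint(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Anti-pattern: a timestamp inside the prefix makes every request a cache miss.
function buildBad(userMessage: string, now: Date): PromptParts {
  return {
    system: `${STABLE_PREFIX}\nCurrent time: ${now.toISOString()}`,
    user: userMessage,
  };
}

// Better: volatile values travel with the user turn, after the cached prefix.
function buildGood(userMessage: string, now: Date): PromptParts {
  return {
    system: STABLE_PREFIX,
    user: `Current time: ${now.toISOString()}\n${userMessage}`,
  };
}
```

Logging the fingerprint alongside the cache outcome gives you a quick check: if the fingerprint changes between requests that should share a prefix, something volatile has leaked into it.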
2. Exact-match caching
Identical input, identical output. A plain hash of the normalized request keyed into Redis. It is unglamorous and it works: FAQ-style assistants, autocomplete, "summarize this fixed document" all see real repeat traffic. Normalize aggressively (trim whitespace, lowercase where safe, sort any unordered params) before hashing or your hit rate collapses.
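A minimal sketch of that idea follows, under a few assumptions: the request shape (CacheableRequest) is hypothetical, an in-memory Map with a TTL stands in for Redis, and lowercasing is applied unconditionally, which is only safe when case carries no meaning for your task.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape; adapt the fields to whatever actually affects the output.
type CacheableRequest = {
  model: string;
  prompt: string;
  params?: Record<string, string | number | boolean>;
};

// Trim, collapse runs of whitespace, and lowercase.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}

// Canonicalize the request (sorted params, normalized prompt), then hash it.
function cacheKey(req: CacheableRequest): string {
  const params = req.params ?? {};
  const sortedParams = Object.keys(params)
    .sort()
    .map((k) => [k, params[k]]);
  const canonical = JSON.stringify([req.model, normalizePrompt(req.prompt), sortedParams]);
  return "llm:exact:" + createHash("sha256").update(canonical, "utf8").digest("hex");
}

// In-memory stand-in for Redis GET / SET with an expiry.
class ExactCache {
  private store = new Map<string, { value: string; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(req: CacheableRequest): string | undefined {
    const key = cacheKey(req);
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.store.delete(key); // expired entry: evict and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(req: CacheableRequest, value: string): void {
    this.store.set(cacheKey(req), { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

The model name and sampling parameters are part of the key on purpose: the same prompt sent to a different model or at a different temperature is not the same request.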
3. Semantic caching
The interesting one. Users phrase the same question fifty ways — "how do I cancel," "cancel my plan," "where's the cancellation button." Exact-match catches none of these. Semantic caching embeds the query, finds the nearest stored query by vector similarity, and returns its cached answer if the distance is under a threshold.
import { OpenAI } from "openai";
import { Pool } from "pg"; // Postgres + pgvector
const openai = new OpenAI();
const pool = new Pool();
async function embed(text: string): Promise<number[]> {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0].embedding;
}
// Cosine distance via pgvector's <=> operator. Lower = more similar.
async function semanticLookup(v: number[], threshold = 0.12) {
  const { rows } = await pool.query(
    `SELECT answer, embedding <=> $1::vector AS distance
     FROM llm_cache
     ORDER BY embedding <=> $1::vector
     LIMIT 1`,
    [JSON.stringify(v)],
  );
  if (rows[0] && rows[0].distance < threshold) {
    return rows[0].answer as string; // cache hit
  }
  return null;
}
async function answer(query: string, generate: () => Promise<string>) {
  const v = await embed(query); // embed once; reuse for both lookup and insert
  const cached = await semanticLookup(v);
  if (cached !== null) return cached;
  const fresh = await generate();
  await pool.query(
    `INSERT INTO llm_cache (query, answer, embedding) VALUES ($1, $2, $3::vector)`,
    [query, fresh, JSON.stringify(v)],
  );
  return fresh;
}
The threshold is where the engineering lives. Too loose and you serve the cancellation answer to someone asking about upgrading. Too tight and you may as well use exact-match. Tune it against a labeled set of real query pairs, start conservative (a low distance), and never semantically cache anything personalized or stateful: account balances, order status, anything tied to a user. Semantic caching is for knowledge, not for state. Treat the embedding store like any other index; pgvector with an HNSW index keeps lookups in single-digit milliseconds well into the millions of rows.
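One way to tune the threshold against labeled pairs can be sketched as follows. It assumes you have already computed the cosine distance for each pair and had a human mark whether one cached answer is correct for both queries. LabeledPair, evaluate, and pickThreshold are hypothetical names, and the 0.99 precision target is an arbitrary placeholder, not a recommendation.

```typescript
// One labeled example: the cosine distance between two real queries, and whether
// a reviewer judged that the same cached answer is correct for both.
type LabeledPair = { distance: number; sameAnswer: boolean };

type ThresholdStats = {
  threshold: number;
  hits: number; // pairs that would be served from cache
  falseHits: number; // served from cache but the answer would be wrong
  precision: number; // correct hits / all hits (1 when there are no hits)
  recall: number; // correct hits / all pairs that could have been hits
};

function evaluate(pairs: LabeledPair[], threshold: number): ThresholdStats {
  const served = pairs.filter((p) => p.distance < threshold);
  const falseHits = served.filter((p) => !p.sameAnswer).length;
  const correctHits = served.length - falseHits;
  const positives = pairs.filter((p) => p.sameAnswer).length;
  return {
    threshold,
    hits: served.length,
    falseHits,
    precision: served.length === 0 ? 1 : correctHits / served.length,
    recall: positives === 0 ? 0 : correctHits / positives,
  };
}

// Pick the loosest candidate threshold whose precision still meets the target.
// Returns null when no candidate qualifies.
function pickThreshold(
  pairs: LabeledPair[],
  candidates: number[],
  minPrecision = 0.99,
): ThresholdStats | null {
  let best: ThresholdStats | null = null;
  for (const t of [...candidates].sort((a, b) => a - b)) {
    const stats = evaluate(pairs, t);
    if (stats.precision >= minPrecision) best = stats;
  }
  return best;
}
```

Optimizing for precision first reflects the asymmetry in the failure modes: a missed hit costs one extra model call, while a false hit serves a wrong answer.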
Model routing and cascades: stop overpaying for easy work
The reflex of pointing every request at your most capable model is the second-biggest source of waste. Most production traffic is not hard. Classification, short rewrites, extraction, routing, and "is this in scope" checks are comfortably handled by a small model at a fifth of the price and half the latency.
Two patterns, often combined:
Right-sizing — match the model to the task statically. The intent classifier gets Haiku; the final user-facing synthesis gets Sonnet. You know the difficulty in advance, so you do not need to discover it at runtime.
Cascading — try the cheap model first, escalate only when it is not confident or fails validation. This shines when difficulty is unknown per request.
async function cascade(prompt: string) {
  // 1. Cheap model attempt
  const cheap = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });
  const text = cheap.content[0].type === "text" ? cheap.content[0].text : "";
  // 2. Escalate only if the cheap answer fails our quality gate
  if (isConfident(text)) {
    return { text, model: "haiku", escalated: false };
  }
  const strong = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  const out = strong.content[0].type === "text" ? strong.content[0].text : "";
  return { text: out, model: "sonnet", escalated: true };
}
isConfident is the crux. Cheap signals work well: a self-reported confidence field in a structured response, a refusal/hedging detector, or schema validation failing. On a support-triage workload where ~70% of tickets were routine, a cascade kept those on the cheap model and only escalated the genuinely ambiguous 30%, dropping blended cost per request by about 55% with no measurable quality regression. The honest tradeoff: escalated requests pay both models and incur two round trips, so a cascade is a loss if your escalation rate climbs past ~50%. Measure the rate; if it is high, your "cheap" model is wrong for the task and you should right-size instead.
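Here is one possible shape for that gate, combining the three signals, along with a small cost model for reasoning about escalation rates. All of it is a sketch under assumptions: the cheap model is assumed to have been prompted to reply with JSON like { "answer": ..., "confidence": ... }, the hedging patterns and the 0.8 cutoff are placeholders to tune, and cascadeCostRatio considers token cost only, not latency.

```typescript
// Hypothetical structured reply the cheap model is asked to produce.
type CheapReply = { answer: string; confidence: number }; // confidence self-reported in [0, 1]

// Placeholder hedging/refusal patterns; extend from real transcripts.
const HEDGE_PATTERNS: RegExp[] = [
  /\bi(?:'m| am) not sure\b/i,
  /\bi (?:can't|cannot) (?:help|answer)\b/i,
  /\bit depends\b/i,
];

// Quality gate: the reply must parse, match the expected shape, report enough
// confidence, and not read like a hedge or refusal.
function isConfident(raw: string, minConfidence = 0.8): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // malformed output counts as "not confident"
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const reply = parsed as Partial<CheapReply>;
  const answerText = reply.answer;
  const confidence = reply.confidence;
  if (typeof answerText !== "string" || answerText.trim() === "") return false;
  if (typeof confidence !== "number" || !(confidence >= minConfidence)) return false;
  return !HEDGE_PATTERNS.some((re) => re.test(answerText));
}

// Expected cascade cost relative to always calling the strong model.
// Every request pays the cheap model; a fraction (escalationRate) also pays the strong one.
function cascadeCostRatio(cheapCost: number, strongCost: number, escalationRate: number): number {
  return (cheapCost + escalationRate * strongCost) / strongCost;
}

// Escalation rate at which the cascade costs the same as always-strong.
function breakEvenEscalationRate(cheapCost: number, strongCost: number): number {
  return 1 - cheapCost / strongCost;
}
```

Under this simple model, a cheap model at one fifth of the strong model's per-request cost and a 30% escalation rate gives a ratio of 0.5, and the cost-only break-even sits at 80% escalation. A practical ceiling closer to 50% is more conservative, which seems reasonable once the second round trip and gate mistakes are counted.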
Latency: streaming, smaller prompts, and parallelism
Cost and latency diverge here, so treat them separately.
Stream everything user-facing. Streaming does not make the model faster, it makes the wait disappear. Time-to-first-token on a small model is often 200-400ms; total generation of a long answer can be 6-8 seconds. Streaming turns an 8-second blank screen into a 300ms first word, which is the difference between "broken" and "fast" in the only benchmark that matters — the user's perception.
// app/api/chat/route.ts (Next.js 16 App Router route handler)
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
export async function POST(req: Request) {
  const { message } = await req.json();
  const stream = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    stream: true,
    messages: [{ role: "user", content: message }],
  });
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      try {
        // Forward only text deltas; ignore the other event types.
        for await (const event of stream) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            controller.enqueue(encoder.encode(event.delta.text));
          }
        }
        controller.close();
      } catch (err) {
        // Surface upstream failures instead of leaving the response hanging.
        controller.error(err);
      }
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
Shrink the prompt. Input tokens are latency too: the model reads before it writes. The cheapest token is the one you never send. Trim retrieved RAG context to the top 3-5 chunks instead of dumping 20; a reranker pays for itself by letting you send less. Cut few-shot examples down once the model clearly has the pattern; two good examples usually beat eight mediocre ones. For long conversation histories, summarize older turns instead of replaying them verbatim.
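A rough sketch of the context-trimming step: keep the best reranked chunks, bounded by both a chunk count and a token budget. The Chunk shape, the default limits, and the 4-characters-per-token estimate are assumptions for illustration; use your provider's tokenizer if you need real counts.

```typescript
// Hypothetical retrieved chunk; score is the reranker's relevance, higher is better.
type Chunk = { id: string; text: string; score: number };

// Crude token estimate (roughly 4 characters per token for English text).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the highest-scoring chunks that fit within a count cap and a token budget.
function selectContext(chunks: Chunk[], maxChunks = 5, tokenBudget = 1500): Chunk[] {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const kept: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    if (kept.length >= maxChunks) break;
    const cost = estimateTokens(chunk.text);
    // Skip a chunk that would overflow the budget; a smaller, lower-ranked one may still fit.
    if (used + cost > tokenBudget) continue;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Skipping oversized chunks rather than stopping at the first overflow is a design choice: it fills the budget more fully, at the price of occasionally preferring a lower-ranked chunk over a higher-ranked one that did not fit.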
Set max_tokens deliberately. An unbounded generation is an unbounded bill and an unbounded wait. If the answer should be a sentence, cap it. This single line stops the model from rambling into a 900-token essay when you wanted "yes, with a reason."
Cut retries with structured outputs. Every time the model returns malformed JSON you parse-fail and retry, paying twice and doubling latency. Constrain the output to a schema so the first response is far more likely to parse, and validate it before use so a rare mismatch fails loudly instead of silently.
const res = await client.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 256,
  tools: [
    {
      name: "classify_ticket",
      description: "Classify a support ticket",
      input_schema: {
        type: "object",
        properties: {
          category: { type: "string", enum: ["billing", "bug", "howto"] },
          priority: { type: "string", enum: ["low", "med", "high"] },
        },
        required: ["category", "priority"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "classify_ticket" },
  messages: [{ role: "user", content: ticket }],
});
Parallelize independent calls and batch where you can. If a request needs three independent LLM calls (extract entities, classify sentiment, detect language), fire them with Promise.all: three sequential 800ms calls become one 800ms wall-clock wait. And for offline, non-interactive work (nightly classification, backfills, evals), use the providers' batch APIs, which run asynchronously at roughly half price in exchange for a slower turnaround you do not care about for a cron job.
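The Promise.all pattern can be sketched like this. The three helpers are hypothetical stand-ins that simulate model calls with a timer and toy logic; in real code each would wrap a client call. The point is only the shape: start all three, then await them together.

```typescript
// Resolve with a value after a delay, simulating network plus model latency.
function simulateCall<T>(result: T, delayMs: number): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(result), delayMs));
}

// Hypothetical stand-ins for three independent LLM calls.
function extractEntities(text: string): Promise<string[]> {
  const capitalized = text.split(/\s+/).filter((w) => /^[A-Z]/.test(w));
  return simulateCall(capitalized, 30);
}

function classifySentiment(text: string): Promise<string> {
  return simulateCall(/\b(late|broken|refund)\b/i.test(text) ? "negative" : "neutral", 30);
}

function detectLanguage(text: string): Promise<string> {
  return simulateCall(text.length > 0 ? "en" : "unknown", 30);
}

type Analysis = { entities: string[]; sentiment: string; language: string };

async function analyze(text: string): Promise<Analysis> {
  // All three promises are created before any is awaited, so the calls overlap
  // and the total wait is roughly the slowest call, not the sum.
  const [entities, sentiment, language] = await Promise.all([
    extractEntities(text),
    classifySentiment(text),
    detectLanguage(text),
  ]);
  return { entities, sentiment, language };
}
```

Promise.all rejects as soon as any one call fails. If a partial result is still useful, Promise.allSettled lets you inspect each outcome individually.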
Putting numbers to it
Here is the rough payoff of each tactic on the conversational workloads I have measured. Stack them; they mostly compose.
| Tactic | Primary win | Typical savings | Main tradeoff |
|---|---|---|---|
| Provider prompt caching | Input cost | 50-90% on cached turns | ~5-min TTL; needs stable prefix |
| Exact-match cache | Cost + latency | 100% on hits | Only catches identical input |
| Semantic cache | Cost + latency | 30-60% on repetitive Q&A | Threshold tuning; not for state |
| Model right-sizing | Cost + latency | 40-80% per right-sized task | Requires per-task evals |
| Cheap-to-strong cascade | Cost | 40-60% blended | Pays twice when it escalates |
| Streaming | Perceived latency | TTFT 8s → 0.3s | No raw cost change |
| Prompt trimming / reranking | Cost + latency | 20-50% input | Risk of dropping needed context |
| max_tokens + structured output | Cost + latency | 10-30%, fewer retries | Truncation if cap set too low |
| Batch API (offline) | Cost | ~50% | Async, slow turnaround |
A decision order that works
Do not boil the ocean. Apply these in order and stop when the numbers are acceptable:
- Instrument first. Log per-request input/output tokens, cost, latency (p50 and p95), and cache outcome. No optimization without measurement.
- Turn on provider prompt caching. Cheapest possible win, near-zero risk, one config change. Restructure prompts so the stable part leads.
- Right-size your models. Audit every call site and ask if the strongest model is actually required. Move classification, extraction, and routing to a small model behind evals.
- Add exact-match, then semantic caching for read-heavy, non-personalized queries. Tune the threshold against real pairs.
- Stream every user-facing generation and set max_tokens on all of them.
- Then reach for cascades, rerankers, and batch APIs where the residual cost justifies the added complexity.
The trap is starting at step 6 because it is the most interesting. Steps 1 through 3 are boring and they capture the majority of the savings. One more honest caveat from production: caching introduces staleness and a security surface — never cache across user/tenant boundaries, and treat your cache key like an authorization decision, which lines up with the guidance in the OWASP Top 10 for LLM Applications. Get the cheap, safe wins first, prove them in your dashboards, and only spend complexity where the graph still hurts.
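Treating the cache key like an authorization decision can be as simple as making the sharing scope part of the key itself. The sketch below is one hypothetical way to do that; CacheScope, scopedKey, and the key layout are my own illustration, and promptHash is assumed to be a hash of the normalized request computed elsewhere.

```typescript
// Who is allowed to share a cached answer.
type CacheScope =
  | { kind: "public" } // shared knowledge, safe for every caller
  | { kind: "tenant"; tenantId: string } // shared within one tenant only
  | { kind: "user"; tenantId: string; userId: string }; // never shared between users

// Build a namespaced key so a lookup can only hit entries written under the same scope.
// IDs are URI-encoded so a ":" inside an ID cannot bleed into the next key segment.
function scopedKey(scope: CacheScope, promptHash: string): string {
  switch (scope.kind) {
    case "public":
      return `llm:public:${promptHash}`;
    case "tenant":
      return `llm:tenant:${encodeURIComponent(scope.tenantId)}:${promptHash}`;
    case "user":
      return `llm:user:${encodeURIComponent(scope.tenantId)}:${encodeURIComponent(scope.userId)}:${promptHash}`;
  }
}
```

Requiring callers to pass a scope makes the sharing decision explicit at every call site, so "public" has to be chosen deliberately rather than being the accidental default.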
Further reading
- Anthropic — Prompt caching documentation: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- OpenAI — Prompt caching guide: https://platform.openai.com/docs/guides/prompt-caching
- pgvector: https://github.com/pgvector/pgvector
- OWASP Top 10 for LLM Applications: https://owasp.org