AI Engineering · 12 min read

Building Production RAG: Retrieval, Chunking, and the Parts That Break

RAG demos are easy; production RAG is not. The full pipeline, the chunking and retrieval decisions that decide quality, and the failure modes nobody warns you about.


A RAG demo takes an afternoon. You embed a folder of Markdown, stuff the top three chunks into a prompt, and the model answers questions about your docs. It feels like magic. Then you put it in front of real users with real documents and the magic evaporates: it confidently cites a deprecated API, misses the one paragraph that actually answers the question, and contradicts itself between two runs of the same query.

I have shipped RAG into production three times now, and the gap between the demo and the system is almost entirely in the parts nobody puts in the tutorial: how you split documents, how you retrieve, and how you find out you're wrong. The model is the easy part. Let me walk through the whole pipeline and be specific about where it breaks.

The pipeline, end to end

Every production RAG system is the same nine stages, whether you wrote it yourself or bolted together a framework:

ingest → chunk → embed → store → retrieve → rerank → assemble context → generate → cite

The demo skips rerank, takes chunking for granted, and never cites. Those three omissions are responsible for most of the quality complaints you'll get. The interesting decisions live in the middle of that chain, so that's where I'll spend the words.

Chunking is where quality is won or lost

Here is the thing that took me too long to internalize: retrieval quality is bounded by chunk quality. If your chunks are garbage, no embedding model and no reranker will save you. You cannot retrieve a good answer that doesn't exist as a coherent unit in your index.

Naive fixed-size chunking — "split every 1000 characters" — is the single most common reason RAG underperforms. It slices sentences in half, separates a heading from the table it introduces, and orphans the pronoun "it" from whatever "it" referred to. You embed a fragment that means nothing on its own, and then you wonder why cosine similarity returns junk.

There are four strategies, in rough order of how much they respect document structure:

| Strategy | How it splits | Good for | Cost |
| --- | --- | --- | --- |
| Fixed-size | Every N characters/tokens | Throwaway demos | Destroys meaning |
| Recursive | On a hierarchy of separators (\n\n, \n, . ) | General prose, the sane default | Low |
| Structure-aware | On Markdown/HTML/AST boundaries | Docs, code, legal | Medium, parser per format |
| Semantic | On embedding-distance shifts between sentences | Dense unstructured text | High, an embed call per sentence |

My default is recursive splitting that respects structure, not semantic chunking. Semantic chunking sounds clever and benchmarks well in papers, but it costs an embedding call per sentence at ingest and the wins are marginal once you attach good metadata. Start recursive; reach for semantic only if evals tell you to.

Here is a recursive chunker I actually use. It tries to split on the largest natural boundary first and only falls back to smaller ones, with overlap to preserve context across the seam.

type Chunk = { text: string; index: number };

const SEPARATORS = ["\n\n", "\n", ". ", " ", ""];

// Split text into pieces of at most maxChars, trying the coarsest separator
// first. Returns raw strings without overlap, so recursing into an oversized
// piece never stitches overlap twice.
function splitRecursive(text: string, maxChars: number, separators: string[]): string[] {
  if (text.length <= maxChars) return [text];

  const [sep = "", ...rest] = separators;
  // The empty separator is the last resort: split into individual characters.
  const pieces = sep ? text.split(sep) : [...text];

  const chunks: string[] = [];
  let buf = "";
  for (const piece of pieces) {
    const candidate = buf ? buf + sep + piece : piece;
    if (candidate.length <= maxChars) {
      buf = candidate;
      continue;
    }
    if (buf) chunks.push(buf);
    // A single piece still too big? Recurse with finer separators.
    if (piece.length > maxChars && rest.length) {
      chunks.push(...splitRecursive(piece, maxChars, rest));
      buf = "";
    } else {
      buf = piece;
    }
  }
  if (buf) chunks.push(buf);
  return chunks;
}

export function recursiveChunk(
  text: string,
  maxTokens = 512,
  overlapTokens = 64,
  separators: string[] = SEPARATORS,
): Chunk[] {
  // ~4 chars per token is a rough heuristic for English; measure with a
  // real tokenizer (tiktoken/js-tiktoken) if you're near model limits.
  const maxChars = maxTokens * 4;
  const overlapChars = overlapTokens * 4;

  const chunks = splitRecursive(text, maxChars, separators)
    .map((c) => c.trim())
    .filter((c) => c.length > 0);

  // Stitch overlap so a fact split across a boundary survives. The overlap is
  // added on top of maxTokens, so a chunk can reach maxTokens + overlapTokens.
  return chunks.map((c, i) => {
    // Guard overlapChars > 0: slice(-0) would return the whole previous chunk.
    const prevTail =
      i > 0 && overlapChars > 0 ? chunks[i - 1].slice(-overlapChars).trimStart() : "";
    return { text: prevTail ? `${prevTail} ${c}` : c, index: i };
  });
}

On size and overlap: I land between 256 and 512 tokens per chunk for most corpora, with 10–15% overlap. Smaller chunks retrieve more precisely but lose surrounding context; larger chunks carry context but dilute the embedding so the relevant sentence gets averaged out with noise. Overlap is insurance against a fact landing exactly on a split. Too much overlap and you balloon your index and return near-duplicate chunks that crowd out diversity. There is no universal answer here — this is a knob you tune against evals, not vibes.

Metadata is not optional

The most underrated move in RAG is attaching metadata to every chunk: source document, section heading, URL, last-updated timestamp, and a doc_version. You need it for three things: filtering retrieval (only search this tenant's docs), citation (link back to the exact section), and freshness (evict stale chunks). Prepend the heading path to the embedded text itself so a chunk under "## Refunds → ### EU customers" embeds with that context, not as a naked paragraph.
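As a rough illustration of the heading-path idea, the sketch below splits Markdown on headings and builds the text that would be sent to the embedding model. The `Section` type and both function names are hypothetical, and it assumes ATX-style headings (`#`, `##`) with no `#`-prefixed lines inside code blocks; a real Markdown parser would be the safer choice for messy input.

```typescript
// Hypothetical shape for illustration; not taken from any library.
type Section = { headingPath: string[]; body: string };

// Split Markdown into sections, tracking the chain of headings above each one.
// Assumption: ATX headings only, and no "#"-prefixed lines inside code blocks.
export function splitMarkdownSections(markdown: string): Section[] {
  const sections: Section[] = [];
  const stack: { level: number; title: string }[] = [];
  let body: string[] = [];

  const flush = () => {
    const text = body.join("\n").trim();
    if (text) sections.push({ headingPath: stack.map((h) => h.title), body: text });
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!match) {
      body.push(line);
      continue;
    }
    flush(); // close the section that belonged to the previous heading
    const level = match[1].length;
    // Pop headings at the same depth or deeper before pushing the new one.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    stack.push({ level, title: match[2].trim() });
  }
  flush();
  return sections;
}

// Text to embed: heading path first, then the section body.
export function toEmbeddingInput(section: Section): string {
  const path = section.headingPath.join(" → ");
  return path ? `${path}\n\n${section.body}` : section.body;
}
```

Each section can then be passed through the chunker, with the same heading path attached to every chunk it produces.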

Embeddings: the boring choices that matter

Pick one embedding model and commit, because changing it means re-embedding your entire corpus. For most work I reach for OpenAI's text-embedding-3-large (3072 dims, truncatable) or text-embedding-3-small (1536) when cost matters. Anthropic does not ship a first-party embedding model; the Claude docs point you to Voyage AI, whose voyage-3 family is genuinely strong, especially for code and domain text.

Three things people get wrong:

  • Dimensions are a tradeoff, not "bigger is better." text-embedding-3-large supports Matryoshka truncation — you can ask for 1024 dims and keep most of the quality at a third of the storage and faster search. I default to 1024 unless evals show I need more.
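  • Normalize if your model isn't already. Cosine similarity itself ignores vector length, but dot-product (inner-product) search matches cosine only on unit vectors. OpenAI's are pre-normalized; if you mix providers or use a local model, L2-normalize before storing, or dot-product rankings will be skewed by vector magnitude.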
  • Batch, and respect rate limits. Embedding one chunk per request will get you throttled and cost you latency. Batch 96–256 inputs per call.
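The normalization and batching points can be sketched as two small helpers. Both names are hypothetical, and the default batch size of 128 is just a value inside the range mentioned above, not a provider limit.

```typescript
// L2-normalize a vector so that dot product and cosine similarity agree.
export function l2Normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  // A zero vector has no direction; return a copy rather than divide by zero.
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

// Split inputs into fixed-size batches, one batch per embedding request.
export function toBatches<T>(items: T[], batchSize = 128): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Rate-limit handling (retry with backoff on throttling responses) is deliberately left out here; it would wrap whatever call consumes each batch.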

Here is embed-and-upsert into Postgres with pgvector. pgvector 0.8 with an HNSW index handles millions of rows comfortably and means one less piece of infrastructure than standing up a dedicated vector database.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE chunks (
  id          bigserial PRIMARY KEY,
  doc_id      text NOT NULL,
  heading     text,
  content     text NOT NULL,
  embedding   vector(1024) NOT NULL,
  tsv         tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  doc_version int NOT NULL DEFAULT 1,
  updated_at  timestamptz NOT NULL DEFAULT now()
);
 
-- Dense ANN index. vector_cosine_ops because we search by cosine distance.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
 
-- Keyword index for the BM25-ish half of hybrid search.
CREATE INDEX ON chunks USING gin (tsv);

import OpenAI from "openai";
import { Pool } from "pg";
 
const openai = new OpenAI();
const pool = new Pool();
 
export async function embedAndUpsert(rows: { docId: string; heading: string; content: string }[]) {
  // Embed in one batched call. dimensions: 1024 truncates the Matryoshka vector.
  const res = await openai.embeddings.create({
    model: "text-embedding-3-large",
    dimensions: 1024,
    input: rows.map((r) => `${r.heading}\n\n${r.content}`),
  });
 
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
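    // Note: this is a plain INSERT, so re-running it for the same document
    // duplicates rows. For true upsert semantics, delete the document's old
    // chunks in this transaction, or add a unique key and use ON CONFLICT.
    for (let i = 0; i < rows.length; i++) {
      // pgvector accepts the text form "[0.1,0.2,...]" for a vector column.
      const vec = `[${res.data[i].embedding.join(",")}]`;
      await client.query(
        `INSERT INTO chunks (doc_id, heading, content, embedding)
         VALUES ($1, $2, $3, $4)`,
        [rows[i].docId, rows[i].heading, rows[i].content, vec],
      );
    }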
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();
  }
}

Retrieval: hybrid search and a reranker, or you're leaving quality on the table

Pure vector search has a known weakness: it's great at semantics and bad at exact matches. Ask for error code ERR_2043 or the function parseInvoice and dense embeddings will happily return semantically-similar-but-wrong chunks because the embedding doesn't care about the literal token. Keyword search (BM25, or Postgres ts_rank) nails exact matches and whiffs on paraphrase.

The fix is hybrid search: run both, then fuse the rankings. Reciprocal Rank Fusion (RRF) is the workhorse — it's parameter-light and doesn't require you to normalize incomparable score scales.
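For intuition, RRF scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, with ranks starting at 1. Here is a minimal in-memory sketch; the function name is hypothetical, and it assumes each ID appears at most once per list.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists (best first).
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
// A document missing from a list simply contributes nothing for that list.
export function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because only ranks enter the formula, cosine distances and keyword scores never need to be put on a common scale.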

Then add a cross-encoder reranker. Your first-stage retrieval optimizes for recall: cast a wide net, pull 30–50 candidates. A reranker (Cohere rerank-3.5, Voyage rerank-2.5, or a local BGE model) reads each query-chunk pair together and scores true relevance, far more accurately than the dot product of two independently-computed vectors ever could. You retrieve 40, rerank, keep the top 5. This single step moved answer quality more than any prompt tweak I tried.

import { CohereClientV2 } from "cohere-ai";
 
const cohere = new CohereClientV2({ token: process.env.COHERE_API_KEY! });
 
export async function hybridRetrieve(query: string, queryVec: number[], topK = 5) {
  const vec = `[${queryVec.join(",")}]`;
 
  // Dense + sparse in one round trip, fused with RRF (k=60 is the standard constant).
  const { rows } = await pool.query(
    `WITH dense AS (
       SELECT id, content, heading, doc_id,
              row_number() OVER (ORDER BY embedding <=> $1) AS rank
       FROM chunks ORDER BY embedding <=> $1 LIMIT 40
     ),
     sparse AS (
       SELECT id, content, heading, doc_id,
              row_number() OVER (
                ORDER BY ts_rank(tsv, plainto_tsquery('english', $2)) DESC
              ) AS rank
       FROM chunks WHERE tsv @@ plainto_tsquery('english', $2)
       -- Without ORDER BY, LIMIT keeps an arbitrary 40 matches, not the top 40.
       ORDER BY rank LIMIT 40
     )
     SELECT COALESCE(d.id, s.id) AS id,
            COALESCE(d.content, s.content) AS content,
            COALESCE(d.heading, s.heading) AS heading,
            COALESCE(d.doc_id, s.doc_id) AS doc_id,
            COALESCE(1.0 / (60 + d.rank), 0) + COALESCE(1.0 / (60 + s.rank), 0) AS score
     FROM dense d FULL OUTER JOIN sparse s USING (id)
     ORDER BY score DESC LIMIT 40`,
    [vec, query],
  );
 
  // Nothing matched in either index: there is nothing to rerank.
  if (rows.length === 0) return [];

  // Cross-encoder rerank: reads (query, chunk) pairs jointly. This is the win.
  const reranked = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: rows.map((r) => r.content),
    topN: topK,
  });
 
  return reranked.results.map((r) => rows[r.index]);
}

Lost in the middle

There is a documented failure mode — the "lost in the middle" problem from Liu et al. — where models reliably use information at the start and end of their context and quietly ignore what's buried in the middle. So even after you've retrieved the perfect chunks, the order you place them in matters. Put your highest-ranked chunk first, your second-highest last, and let the weaker ones occupy the middle where they'll do least harm. It's a one-line reorder and it measurably reduces "the answer was right there and it missed it" complaints.

Assembling context and citations

Generation is the part you control least, so constrain it. Two rules carry most of the weight: ground every claim in retrieved context, and make the model cite. Citations aren't just UX — forcing the model to attach a source ID to each sentence makes it far less willing to fabricate, because there's no source to point at for an invented fact.

export function buildPrompt(query: string, chunks: { docId: string; heading: string; content: string }[]) {
  // Reorder for lost-in-the-middle: best first, second-best last.
  const ordered = [...chunks];
  if (ordered.length > 2) {
    const [a, b, ...mid] = ordered;
    ordered.length = 0;
    ordered.push(a, ...mid, b);
  }
 
  const context = ordered
    .map((c, i) => `[${i + 1}] (${c.docId} — ${c.heading})\n${c.content}`)
    .join("\n\n---\n\n");
 
  return {
    system:
      "Answer ONLY from the numbered context. Cite every claim with [n]. " +
      "If the context does not contain the answer, say you don't know. Do not use outside knowledge.",
    user: `Context:\n\n${context}\n\nQuestion: ${query}`,
  };
}

That "if the context does not contain the answer, say you don't know" line is non-negotiable. Without it, an empty or irrelevant retrieval still produces a confident, wrong answer — which is worse than no answer, because users trust it.

The failure modes nobody warns you about

These are the five that have actually bitten me in production:

  • Bad chunking silently caps your ceiling. Symptom: relevant docs exist but never get retrieved. Fix: inspect your chunks by hand before blaming the model.
  • Stale index. Someone updates a doc; your index still serves last quarter's pricing. Fix: store updated_at and doc_version, and re-embed on change — RAG is not write-once.
  • Missing context. The answer spans two chunks that never co-retrieve. Fix: overlap, larger chunks for dense material, or sentence-window retrieval.
  • Hallucination on empty retrieval. Covered above: instruct the model to abstain, and check whether top reranker scores clear a threshold before you even call the LLM.
  • No evaluation. This is the big one. If you can't measure retrieval quality, every change is a guess. You are flying blind and shipping regressions you can't see.
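The threshold check mentioned for empty retrieval can be as small as the sketch below. The `Reranked` shape is an assumption modeled on a typical reranker response (an index plus a relevance score), and the 0.3 default is an arbitrary placeholder to tune against evals, not a recommended value.

```typescript
// Assumed result shape: position in the candidate list plus a relevance score.
type Reranked = { index: number; relevanceScore: number };

// Return true only if at least one reranked chunk clears the threshold.
// When it returns false, skip the LLM call and answer "I don't know".
export function passesRelevanceGate(results: Reranked[], minScore = 0.3): boolean {
  if (results.length === 0) return false;
  const best = Math.max(...results.map((r) => r.relevanceScore));
  return best >= minScore;
}
```

Reranker scores are not comparable across models, so the threshold needs to be recalibrated whenever the reranker changes.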

Treat security as a failure mode too: retrieved content is untrusted input, and the OWASP Top 10 for LLM Applications (LLM01: Prompt Injection) is explicit that a poisoned document can carry instructions into your prompt. Never let retrieved text silently override your system instructions.

A checklist before you ship

  1. Evals first. Build a set of 50+ real question/answer pairs. Measure retrieval (recall@k, MRR) separately from generation. You cannot improve what you don't measure.
  2. Inspect your chunks by eye. If a human can't answer from a chunk in isolation, neither can your model.
  3. Hybrid retrieval + a reranker. Dense-only is leaving quality on the table. This is the highest-leverage upgrade after chunking.
  4. Cite everything, instruct the model to abstain, and reorder context best-first/second-last.
  5. Plan for staleness from day one — versioned, re-embeddable chunks, not a frozen index.
  6. Treat retrieved text as untrusted. Defend against injection.
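For the retrieval metrics in item 1, a minimal sketch of recall@k and MRR might look like this. The `EvalCase` shape and function names are hypothetical; it assumes each eval question has a hand-labeled set of relevant chunk IDs and the ordered IDs your retriever returned.

```typescript
// Hypothetical eval record: labeled relevant IDs and the retriever's ranked output.
type EvalCase = { relevantIds: string[]; retrievedIds: string[] };

// Fraction of the relevant chunks that appear in the top k results.
export function recallAtK(c: EvalCase, k: number): number {
  if (c.relevantIds.length === 0) return 0;
  const topK = new Set(c.retrievedIds.slice(0, k));
  const hits = c.relevantIds.filter((id) => topK.has(id)).length;
  return hits / c.relevantIds.length;
}

// 1 / rank of the first relevant result, or 0 if none was retrieved.
export function reciprocalRank(c: EvalCase): number {
  const relevant = new Set(c.relevantIds);
  const pos = c.retrievedIds.findIndex((id) => relevant.has(id));
  return pos === -1 ? 0 : 1 / (pos + 1);
}

// Average a per-case metric over the eval set; MRR is mean(reciprocalRank).
export function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}
```

Running these on every retrieval change (chunk size, fusion, reranker) gives you a number to compare instead of a feeling.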

Get chunking, hybrid retrieval, and evaluation right and the rest is tuning. Skip them and you'll have a great demo and an unhappy production.

Further reading

  • pgvector — github.com/pgvector/pgvector
  • OWASP Top 10 for LLM Applications — owasp.org
  • OpenAI embeddings guide — platform.openai.com/docs
  • "Lost in the Middle: How Language Models Use Long Contexts" — Liu et al., 2023