Most RAG demos work beautifully. You chunk some documents, embed them, retrieve the top-k, feed them to a language model, and get a coherent answer. The demo passes. Leadership is impressed. Then legal gets involved.
We've audited RAG deployments across insurance, mining, health, and financial services. The same five failures appear in roughly 80% of systems that don't survive compliance review. Only the last of them is about retrieval quality.
1. No citation trail
The model produces a grounded answer, but there's no mechanism to trace which chunks informed which claims. A compliance officer needs to verify the system didn't hallucinate a policy clause. Without per-claim attribution back to source documents with page numbers, the system is a black box wearing a RAG costume.
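A minimal sketch of what a checkable citation trail could look like, assuming the model is prompted to end each sentence with inline markers such as `[c1]` that refer to retrieved chunks. The `Chunk` fields, the marker format, and the naive sentence splitter are all illustrative assumptions, not a prescribed schema.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str   # hypothetical identifier, e.g. "c1"
    document: str
    page: int
    text: str

MARKER = re.compile(r"\[(\w+)\]")

def citation_trail(answer: str, retrieved: list[Chunk]) -> list[dict]:
    """Map each sentence of the answer to the (document, page) pairs it cites."""
    by_id = {c.chunk_id: c for c in retrieved}
    trail = []
    # Naive split: assumes markers sit before the sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        sources = [by_id[i] for i in MARKER.findall(sentence) if i in by_id]
        trail.append({
            "claim": sentence,
            "sources": [(s.document, s.page) for s in sources],
            "unsupported": not sources,  # no resolvable citation: flag for review
        })
    return trail
```

Any sentence that comes back with `unsupported` set is something a compliance officer cannot trace, so it can be blocked or routed to a human before the answer is served.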
2. Stale index, no versioning
Policy documents get updated. Regulatory guidance changes quarterly. Most RAG pipelines have a single vector index with no concept of document versioning. When a user asks about flood coverage, the system might retrieve a chunk from a superseded policy. In insurance, that's a liability event. In health, it's patient safety.
The fix isn't complicated — version-stamped chunks with TTL enforcement and automatic re-indexing triggers — but it requires treating the index as a managed data product, not a static artifact.
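As a rough illustration of version stamps plus TTL enforcement, the sketch below filters retrieved chunks at query time and reports which documents are due for re-indexing. The `IndexedChunk` fields, the `latest_version` lookup, and the 90-day TTL in the usage note are assumptions for the example, not values from any particular deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    doc_version: int
    indexed_at: datetime
    text: str

def servable(chunks, latest_version, ttl, now):
    """Keep chunks from the current document version that are still inside their TTL."""
    return [
        c for c in chunks
        if c.doc_version == latest_version.get(c.doc_id)  # unknown docs are dropped
        and now - c.indexed_at <= ttl
    ]

def needs_reindex(chunks, ttl, now):
    """Document ids with at least one chunk past its TTL."""
    return {c.doc_id for c in chunks if now - c.indexed_at > ttl}
```

Calling `servable(chunks, {"flood": 2}, timedelta(days=90), datetime.now(timezone.utc))` would drop anything stamped with version 1 of the flood policy, so a superseded clause never reaches the model.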
3. PII in the retrieval layer
Source documents often contain customer names, policy numbers, claim details. These get embedded and stored in the vector database. When retrieval fetches chunks, PII flows through the entire pipeline — into logs, into the LLM context, potentially into responses served to different users.
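# The fix: redact before you embed
import hashlib

chunks = segment(document)
chunks = [redact_pii(c) for c in chunks]  # redact first, so PII never reaches the embedder
embeddings = embed(chunks)
index.upsert(embeddings, metadata={
    'doc_version': version,
    'redaction_applied': True,
    # hash of the unredacted source text, kept for audit traceability
    'source_hash': hashlib.sha256(original.encode('utf-8')).hexdigest(),
})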
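The snippet above leaves `redact_pii` undefined. A minimal regex-based sketch is shown here, assuming a made-up `POL-` policy number format and simple email and phone shapes. Regexes alone will not catch customer names, so a real pipeline would likely pair this with a named-entity recognizer.

```python
import re

# Hypothetical patterns for illustration; real coverage must be much broader.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POLICY_NO": re.compile(r"\bPOL-\d{6,}\b"),
    "PHONE": re.compile(r"\b\d{3}[ -]\d{3}[ -]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched span with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders keep the chunk readable for retrieval ("contact [EMAIL] about [POLICY_NO]") while keeping the raw values out of the index, the logs, and the LLM context.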
4. No eval gate before deployment
Teams build the pipeline, test it manually with a handful of questions, and deploy. No automated evaluation against a gold dataset. No faithfulness scoring. No regression suite. When the model provider updates their weights — which happens without warning — the pipeline's behaviour shifts. Without eval gates, you find out from users, not from tests.
5. Retrieval without reranking
Embedding similarity is a useful first pass. It is not, on its own, a reliable relevance signal. Cross-encoder reranking adds 50-200ms of latency and removes much of the most common category of hallucination: the model confabulating from a contextually adjacent but factually wrong chunk.
The pattern we deploy: chunk with semantic boundary detection, embed, retrieve top-20, rerank to top-5, inject with per-chunk citation markers, eval-gate for faithfulness above 0.85, log everything. It's not novel. It's disciplined. Discipline passes compliance review.
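The retrieve, rerank, and inject steps of that pattern can be sketched as follows. The `score` argument is where a cross-encoder would plug in. `overlap_score` is only a toy stand-in so the example runs without a model, and the `[c1]` marker format is an assumption.

```python
def rerank(query, candidates, score, top_k=5):
    """Re-score first-pass candidates with a stronger relevance function."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_k]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query terms found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)

def build_context(chunks):
    """Prefix each chunk with a citation marker the model can echo back."""
    return "\n".join(f"[c{i}] {text}" for i, text in enumerate(chunks, start=1))
```

In the pattern described above, `candidates` would be the top-20 from the vector index and `top_k` would stay at 5. The markers are what later make per-claim attribution possible.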
● ● ●
Case study · Jan 2026 · 5 min
Claims processing: 4 hours to 8 minutes
A mid-tier general insurer was losing assessors to manual document review. Each new claim required reading the policy document, cross-referencing claim details, checking exclusions, verifying coverage limits, and drafting an initial assessment. Average time: 4 hours per claim. Assessor turnover: 30% annually.
The problem wasn't AI. It was trust.
The client had tried two chatbot prototypes. Both produced fluent answers. Both were pulled within weeks because assessors couldn't verify the outputs. The system would state a coverage limit without citing the clause. It would miss exclusions obvious to a human reader. Confidence without accountability.
What we built
Document agent. Ingests the policy with semantic chunking — boundaries detected using topic shift signals rather than fixed token counts. Each chunk tagged with section, clause number, and effective date.
Assessment agent. Takes claim details, retrieves relevant chunks, produces a structured assessment: coverage determination, applicable limits, exclusion check, recommended next steps. Every assertion in the output cites a specific clause and page number. Constrained to only assert what retrieved evidence supports.
Audit layer. Every tool call, every retrieval, every generation step logged with latency, token counts, and full context. Assessor disagreements feed back into the eval dataset.
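One possible shape for such an audit record, sketched with a context manager that times a step and appends a JSON line to a log. The in-memory list stands in for whatever append-only store is actually used, and the field names are illustrative.

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def audited(log: list, step: str, **context):
    """Record one pipeline step with its latency and the context it ran with."""
    record = {"step": step, "context": context}
    start = time.perf_counter()
    try:
        yield record  # the step can attach outputs, e.g. chunk ids or token counts
    finally:
        # Written even if the step raises, so failures are audited too.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        log.append(json.dumps(record))
```

A retrieval step would then be wrapped as `with audited(log, "retrieval", query=q) as rec: rec["chunk_ids"] = ids`, giving one self-describing line per step.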
Results
4 hours → 8 minutes. Assessors review system output rather than building assessments from scratch. They spend expertise on edge cases — the 15% where the system flags uncertainty — not routine document lookup.
Faithfulness: 0.94. Assessor agreement: 89%. The remaining 11% are genuine edge cases requiring human judgment, which is exactly where you want a human. Assessor turnover in the six months after deployment: 8%, against 30% annually before. People stay when the tedious work disappears.
● ● ●
Technical · Dec 2025 · 9 min
Building eval frameworks that catch real failures
Evals are the difference between AI that demos well and AI that survives production. Most teams either skip them or measure the wrong things.
The three eval layers
Layer 1: Component evals. Test each component in isolation. Does the chunker produce coherent boundaries? Does the retriever surface the right documents? Fast, cheap, run on every commit.
Layer 2: Pipeline evals. End-to-end against a gold dataset. Given this question and this corpus, does the full pipeline produce a correct, grounded, non-toxic answer? Requires 200-500 curated QA pairs with human-verified gold answers.
Layer 3: Adversarial evals. Inputs designed to break the system. Jailbreak attempts, prompt injection through document contents, questions outside the corpus, deliberately ambiguous queries. These catch the failures that matter most — confident wrong output.
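As an example of a Layer 1 check, retriever quality can be scored with recall@k against a small gold set. This is a generic metric sketch. The document ids are made up, and the pass threshold would be a team decision.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the known-relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("gold set must list at least one relevant document")
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)
```

Because it needs no model call, a check like `assert recall_at_k(results, gold, k=5) >= 0.9` is cheap enough to run on every commit.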
The metrics to score:
Faithfulness: does the response only assert things supported by retrieved context? Below 0.85 means the system generates ungrounded claims. In regulated deployments, unacceptable.
Relevance: does it actually answer the question? A system can be perfectly faithful and still irrelevant because it retrieved the wrong chunks.
Completeness: did it cover all key points? Partial answers in regulated contexts are as dangerous as wrong answers. Missing an exclusion clause is a compliance failure.
Toxicity and bias: automated toxicity scoring catches obvious failures. Bias detection requires more nuanced evaluation: does the system treat equivalent queries differently based on demographic signals?
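The text does not say how faithfulness is computed. One common formulation, assumed here, is the fraction of extracted claims that a judge finds supported by the retrieved context. In the sketch, `is_supported` is where an NLI model or LLM judge would go, and `substring_judge` is a deliberately crude placeholder.

```python
def faithfulness(claims, context, is_supported):
    """Fraction of claims the judge finds supported by the retrieved context."""
    if not claims:
        return 1.0  # nothing asserted, so nothing ungrounded
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def substring_judge(claim, context):
    """Toy judge: a claim counts as supported only if it appears verbatim."""
    return claim.lower() in context.lower()
```

With two of three claims supported, the score is about 0.67, well under the 0.85 line mentioned above.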
The eval gate
Evals are wired into CI/CD as hard gates. If faithfulness drops below threshold on any model update, provider change, or pipeline modification, deployment blocks automatically. No manual review to block. Manual review to override.
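A hard gate of this kind can be as small as a script whose exit code the CI job respects. The sketch below assumes the eval run has already produced a `scores` dict. Only the 0.85 faithfulness threshold comes from the text, and the function names are illustrative.

```python
THRESHOLDS = {"faithfulness": 0.85}  # further metrics would be added here

def gate(scores, thresholds=THRESHOLDS):
    """Return a description of every metric that falls below its threshold."""
    return [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # a missing metric fails closed
    ]

def main(scores):
    failures = gate(scores)
    for line in failures:
        print("EVAL GATE FAILED", line)
    return 1 if failures else 0  # non-zero exit code blocks the deploy job
```

Ending the CI step with `sys.exit(main(scores))` blocks the deploy automatically, and an override becomes a deliberate, logged action rather than the default.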
Measured failure is manageable. Unmeasured failure is a lawsuit.
● ● ●
Research · Nov 2025 · 8 min
Attention is not memory: what cognitive science tells us about context windows
There's a persistent conflation between context window length and memory capacity. "Claude has a 200K context window" is treated as "Claude can remember 200K tokens." This is wrong in a way that has practical consequences for system design.
Cognitive science made this distinction decades ago. Working memory and long-term memory have fundamentally different properties, different capacity constraints, and different failure modes. Understanding them makes you a better AI engineer.
Working memory: the 4±1 problem
Miller's 1956 paper established that human working memory holds roughly 7±2 items. Cowan (2001) later narrowed this to 4±1 chunks. The key insight: chunking. Expert chess players don't remember individual pieces; they remember board configurations as single chunks. Capacity limits apply to chunks, not raw elements.
LLM attention faces an analogous constraint. The context window holds 200K tokens, but attention concentrates on a much smaller effective set. Models disproportionately attend to first tokens and most recent tokens, with a "lost in the middle" effect for everything between. The context window is not flat memory — it's a curved attention surface with peaks and valleys.
Implications for RAG
If you're stuffing 50 retrieved chunks into a 200K window because "there's room," you're fighting the attention curve. The model will underweight middle chunks. This is exactly the "lost in the middle" finding from Liu et al. (2023).
The fix mirrors biological cognition: compress and prioritize before loading into working memory.
# Don't do this:
context = "\n".join(all_50_chunks)  # attention curve kills you

# Do this:
ranked = rerank(chunks, query, top_k=5)
compressed = [gist(c) for c in ranked]
context = interleave(compressed, query) # position matters
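The snippet does not define `interleave`. One plausible reading of "position matters", offered here as an assumption rather than the original helper, is to place the strongest chunks at the start and end of the context and let the weakest fall in the middle, where attention is lowest.

```python
def edge_order(ranked):
    """Reorder best-first chunks so the strongest sit at the start and end."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill from the front; 2, 4... from the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked a to e, this yields a, c, e, d, b: the best chunk opens the context, the second best closes it, and the weakest sits in the valley.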
Gist memory: lossy compression that preserves meaning
In cognitive neuroscience, gist memory is the brain's ability to extract semantic essence while discarding surface details. You remember a meeting was contentious and resulted in a decision to delay, but not the exact words. The gist is preserved; the verbatim trace decays.
We use a two-stage retrieval pattern: search over gist-compressed representations to identify sources, then expand to full text of only the most relevant chunks. 4-5x compression of search space. Retrieval accuracy above 95% on benchmarks.
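The two-stage idea can be sketched with a corpus that stores a short gist next to each full chunk: search touches only the gists, and full text is loaded for the shortlist alone. Term overlap stands in for embedding search here, and the corpus layout is an assumption made for the example.

```python
def tokens(text):
    return set(text.lower().split())

def two_stage_retrieve(query, corpus, shortlist=2):
    """corpus maps chunk id -> (gist, full_text). Search gists, return full text."""
    q = tokens(query)
    # Stage 1: rank chunk ids by how well their gist matches the query.
    ranked = sorted(
        corpus,
        key=lambda cid: len(q & tokens(corpus[cid][0])),
        reverse=True,
    )
    # Stage 2: expand only the shortlisted ids to their full text.
    return [(cid, corpus[cid][1]) for cid in ranked[:shortlist]]
```

The compression comes from stage 1 never reading full text. Only the handful of chunks that survive it pay the cost of full context.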
It's about matching information density to attention capacity. Five relevant, well-positioned chunks outperform fifty loosely relevant chunks every time.
Event segmentation: finding natural boundaries
Event Segmentation Theory (Zacks & Swallow, 2007) describes how the brain segments continuous experience into discrete events at points of maximum change — spikes in prediction error where the statistical structure shifts.
We apply this to document chunking. Rather than fixed token counts or paragraph boundaries, we detect semantic boundaries using multi-signal fusion: topic shift, rhetorical shift, entity change, structural markers. Chunks align with natural information structure, so the model processes them more effectively.
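A single-signal simplification of that idea is sketched below: it opens a new chunk wherever adjacent sentences share too little vocabulary, using bag-of-words cosine similarity as a crude topic-shift proxy. The multi-signal fusion described above is not reproduced, and the 0.1 threshold is an arbitrary assumption.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment(sentences, threshold=0.1):
    """Start a new chunk wherever adjacent sentences share too little vocabulary."""
    if not sentences:
        return []
    vectors = [Counter(s.lower().split()) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:  # similarity drop marks a boundary
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```

Swapping the word counts for sentence embeddings and adding the other signals (entity change, structural markers) would move this closer to the approach in the text.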
Empirical result: semantic boundary detection improves downstream RAG faithfulness by 8-12% vs fixed-size chunking, and 4-6% vs paragraph-level splitting. Consistent across policy documents, clinical guidelines, regulatory filings, and technical specifications.
The context window is not a bucket. It's an attention surface. Design your systems to work with the curve, not against it.