col 1, row 1

Field notes

Lessons from shipping AI where the stakes are real. Production, not theory.

Architecture → Prototype → Eval → Ship · 2 weeks to working prototype · Eval-gated deployment · Zero production incidents · SOC2 + GDPR + APRA compliant · 6 regulated industries · 25M+ end users · Ship fast. Break nothing. · Architecture → Prototype → Eval → Ship · 2 weeks to working prototype · Eval-gated deployment · Zero production incidents · SOC2 + GDPR + APRA compliant · 6 regulated industries · 25M+ end users · Ship fast. Break nothing. ·
Contents
Why your RAG pipeline fails compliance reviewThought leadership
Claims processing: 4 hours to 8 minutesCase study
Building eval frameworks that catch real failuresTechnical
Attention is not memoryResearch

Why your RAG pipeline fails compliance review

Most RAG demos work beautifully. You chunk some documents, embed them, retrieve the top-k, feed them to a language model, and get a coherent answer. The demo passes. Leadership is impressed. Then legal gets involved.

We've audited RAG deployments across insurance, mining, health, and financial services. The same five failures appear in roughly 80% of systems that don't survive compliance review. None of them are about retrieval quality.

1. No citation trail

The model produces a grounded answer, but there's no mechanism to trace which chunks informed which claims. A compliance officer needs to verify the system didn't hallucinate a policy clause. Without per-claim attribution back to source documents with page numbers, the system is a black box wearing a RAG costume.

2. Stale index, no versioning

Policy documents get updated. Regulatory guidance changes quarterly. Most RAG pipelines have a single vector index with no concept of document versioning. When a user asks about flood coverage, the system might retrieve a chunk from a superseded policy. In insurance, that's a liability event. In health, it's patient safety.

The fix isn't complicated — version-stamped chunks with TTL enforcement and automatic re-indexing triggers — but it requires treating the index as a managed data product, not a static artifact.

3. PII in the retrieval layer

Source documents often contain customer names, policy numbers, claim details. These get embedded and stored in the vector database. When retrieval fetches chunks, PII flows through the entire pipeline — into logs, into the LLM context, potentially into responses served to different users.

# The fix: redact before you embed chunks = segment(document) chunks = [redact_pii(c) for c in chunks] # before embedding embeddings = embed(chunks) index.upsert(embeddings, metadata={ 'doc_version': version, 'redaction_applied': True, 'source_hash': sha256(original), })

4. No eval gate before deployment

Teams build the pipeline, test it manually with a handful of questions, and deploy. No automated evaluation against a gold dataset. No faithfulness scoring. No regression suite. When the model provider updates their weights — which happens without warning — the pipeline's behaviour shifts. Without eval gates, you find out from users, not from tests.

5. Retrieval without reranking

Embedding similarity is a useful first pass. It is not a relevance signal. Cross-encoder reranking adds 50-200ms of latency and eliminates the most common category of hallucination: the model confabulating from a contextually adjacent but factually wrong chunk.

The pattern we deploy: chunk with semantic boundary detection, embed, retrieve top-20, rerank to top-5, inject with per-chunk citation markers, eval-gate for faithfulness above 0.85, log everything. It's not novel. It's disciplined. Discipline passes compliance review.

● ● ●

Claims processing: 4 hours to 8 minutes

A mid-tier general insurer was losing assessors to manual document review. Each new claim required reading the policy document, cross-referencing claim details, checking exclusions, verifying coverage limits, and drafting an initial assessment. Average time: 4 hours per claim. Assessor turnover: 30% annually.

The problem wasn't AI. It was trust.

The client had tried two chatbot prototypes. Both produced fluent answers. Both were pulled within weeks because assessors couldn't verify the outputs. The system would state a coverage limit without citing the clause. It would miss exclusions obvious to a human reader. Confidence without accountability.

What we built

Document agent. Ingests the policy with semantic chunking — boundaries detected using topic shift signals rather than fixed token counts. Each chunk tagged with section, clause number, and effective date.

Assessment agent. Takes claim details, retrieves relevant chunks, produces a structured assessment: coverage determination, applicable limits, exclusion check, recommended next steps. Every claim in the output cites a specific clause and page number. Constrained to only assert what retrieved evidence supports.

Audit layer. Every tool call, every retrieval, every generation step logged with latency, token counts, and full context. Assessor disagreements feed back into the eval dataset.

Results

4 hours → 8 minutes. Assessors review system output rather than building assessments from scratch. They spend expertise on edge cases — the 15% where the system flags uncertainty — not routine document lookup.

Faithfulness: 0.94. Assessor agreement: 89%. The remaining 11% are genuine edge cases requiring human judgment — exactly where you want a human. Assessor turnover in six months post-deployment: 8%. People stay when the tedious work disappears.

● ● ●

Building eval frameworks that catch real failures

Evals are the difference between AI that demos well and AI that survives production. Most teams either skip them or measure the wrong things.

The three eval layers

Layer 1: Component evals. Test each component in isolation. Does the chunker produce coherent boundaries? Does the retriever surface the right documents? Fast, cheap, run on every commit.

Layer 2: Pipeline evals. End-to-end against a gold dataset. Given this question and this corpus, does the full pipeline produce a correct, grounded, non-toxic answer? Requires 200-500 curated QA pairs with human-verified gold answers.

Layer 3: Adversarial evals. Inputs designed to break the system. Jailbreak attempts, prompt injection through document contents, questions outside the corpus, deliberately ambiguous queries. These catch the failures that matter most — confident wrong output.

Building the gold dataset

# Generate → Verify → Calibrate pairs = generate_qa( corpus=policy_documents, model='claude-sonnet-4-6', difficulty=['simple', 'multi_hop', 'adversarial'], ) verified = human_review( pairs, reviewers=3, agreement_threshold=0.67, ) calibrate_judge( judge_model='claude-sonnet-4-6', human_scores=verified, target_correlation=0.85, )

Metrics that matter

Faithfulness — does the response only assert things supported by retrieved context? Below 0.85 means the system generates ungrounded claims. In regulated deployments, unacceptable.

Relevance — does it actually answer the question? A system can be perfectly faithful and still irrelevant because it retrieved wrong chunks.

Completeness — did it cover all key points? Partial answers in regulated contexts are as dangerous as wrong answers. Missing an exclusion clause is a compliance failure.

Toxicity and bias — automated toxicity catches obvious failures. Bias detection requires nuanced evaluation — does the system treat equivalent queries differently based on demographic signals?

The eval gate

Evals are wired into CI/CD as hard gates. If faithfulness drops below threshold on any model update, provider change, or pipeline modification, deployment blocks automatically. No manual review to block. Manual review to override.

Measured failure is manageable. Unmeasured failure is a lawsuit.
● ● ●

Attention is not memory: what cognitive science tells us about context windows

There's a persistent conflation between context window length and memory capacity. "Claude has a 200K context window" is treated as "Claude can remember 200K tokens." This is wrong in a way that has practical consequences for system design.

Cognitive science made this distinction decades ago. Working memory and long-term memory have fundamentally different properties, different capacity constraints, and different failure modes. Understanding them makes you a better AI engineer.

Working memory: the 4±1 problem

Miller's 1956 paper established human working memory holds roughly 7±2 items. Cowan narrowed this to 4±1 chunks. The key insight: chunking. Expert chess players don't remember individual pieces; they remember board configurations as single chunks. Capacity limits apply to chunks, not raw elements.

LLM attention faces an analogous constraint. The context window holds 200K tokens, but attention concentrates on a much smaller effective set. Models disproportionately attend to first tokens and most recent tokens, with a "lost in the middle" effect for everything between. The context window is not flat memory — it's a curved attention surface with peaks and valleys.

Implications for RAG

If you're stuffing 50 retrieved chunks into a 200K window because "there's room," you're fighting the attention curve. The model will underweight middle chunks. This is exactly the "lost in the middle" finding from Liu et al.

The fix mirrors biological cognition: compress and prioritize before loading into working memory.

# Don't do this: context = "\n".join(all_50_chunks) # attention curve kills you # Do this: ranked = rerank(chunks, query, top_k=5) compressed = [gist(c) for c in ranked] context = interleave(compressed, query) # position matters

Gist memory: lossy compression that preserves meaning

In cognitive neuroscience, gist memory is the brain's ability to extract semantic essence while discarding surface details. You remember a meeting was contentious and resulted in a decision to delay, but not the exact words. The gist is preserved; the verbatim trace decays.

We use a two-stage retrieval pattern: search over gist-compressed representations to identify sources, then expand to full text of only the most relevant chunks. 4-5x compression of search space. Retrieval accuracy above 95% on benchmarks.

It's about matching information density to attention capacity. Five relevant, well-positioned chunks outperform fifty loosely relevant chunks every time.

Event segmentation: finding natural boundaries

Event Segmentation Theory (Zacks & Swallow, 2007) describes how the brain segments continuous experience into discrete events at points of maximum change — spikes in prediction error where the statistical structure shifts.

We apply this to document chunking. Rather than fixed token counts or paragraph boundaries, we detect semantic boundaries using multi-signal fusion: topic shift, rhetorical shift, entity change, structural markers. Chunks align with natural information structure, so the model processes them more effectively.

Empirical result: semantic boundary detection improves downstream RAG faithfulness by 8-12% vs fixed-size chunking, and 4-6% vs paragraph-level splitting. Consistent across policy documents, clinical guidelines, regulatory filings, and technical specifications.

The context window is not a bucket. It's an attention surface. Design your systems to work with the curve, not against it.
© 2026 Hyperpriors