05/Case Study

Production RAG System

Enterprise Knowledge Retrieval & LLM Evaluation at Innovcentric

Role: AI Software Engineer

Timeline: Nov 2025 – Jun 2026

Stack: Python, LangChain, ChromaDB

Status: Shipped

Impact

Sub-2-second retrieval across 10,000+ pages, with a roughly 60% reduction in hallucinations achieved through systematic evaluation.

Overview

Context

At Innovcentric LLC, I built the end-to-end RAG pipeline powering customer support knowledge retrieval. The system ingests 10,000+ pages of support documentation, splits them into overlapping chunks, embeds them, and serves grounded answers to 100+ support agents in under two seconds. I also built the evaluation framework used to measure retrieval quality, which helped reduce factual errors by approximately 60%.

Challenge

The problem

The customer support documentation spanned 10,000+ pages across multiple formats and update cycles. Support agents needed fast, accurate answers grounded in current documentation. The challenge was building a retrieval system that was both fast and faithful, and proving it with measurable evaluation rather than anecdotal testing.

Approach

How I built it

Designed a document ingestion pipeline supporting PDF and DOCX formats with structured metadata extraction

Implemented recursive chunking with 512-token windows and 50-token overlap to preserve context across chunk boundaries

Used OpenAI embeddings with ChromaDB vector indexing and metadata filtering for precise retrieval

Built an LLM evaluation framework using RAGAS metrics (faithfulness, relevance, hallucination scoring) and LangSmith observability

Engineered 50+ prompt templates with safety guardrails, tone alignment, and escalation logic

Benchmarked across multiple chunking strategies and embedding models to optimize retrieval quality
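
To make the guardrail and escalation idea from the list above concrete, here is a minimal sketch of a single prompt template. It is not one of the production templates: the wording, the `MIN_SCORE` threshold, the `ESCALATION_MESSAGE` text, and the `build_prompt` helper are all hypothetical names chosen for illustration.

```python
from string import Template
from typing import List, Optional

# Hypothetical template: grounding instruction, tone slot, and a fixed fallback.
ANSWER_TEMPLATE = Template(
    "You are a support assistant. Answer in a $tone tone.\n"
    "Use only the context below. If the context does not contain the answer, "
    "reply exactly with: $fallback\n\n"
    "Context:\n$context\n\nQuestion: $question\nAnswer:"
)

ESCALATION_MESSAGE = "I could not find this in the documentation. Escalating to a specialist."
MIN_SCORE = 0.35  # assumed similarity threshold, not a value from the project


def build_prompt(
    question: str,
    contexts: List[str],
    scores: List[float],
    tone: str = "professional",
) -> Optional[str]:
    """Return a grounded prompt, or None when the query should be escalated."""
    # Escalate before calling the model if retrieval found nothing usable.
    if not contexts or max(scores) < MIN_SCORE:
        return None
    return ANSWER_TEMPLATE.substitute(
        tone=tone,
        fallback=ESCALATION_MESSAGE,
        context="\n---\n".join(contexts),
        question=question,
    )
```

Returning `None` keeps the escalation decision outside the model: a weak retrieval result never reaches generation, so the model is not asked to answer from thin context.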

Technical Decisions

Why these choices

Recursive chunking over fixed-size splits

Recursive chunking respects document structure (headings, paragraphs, lists), preserving semantic coherence within chunks. This improved retrieval relevance compared to naive fixed-window approaches.

RAGAS + LangSmith for evaluation

Subjective quality assessment doesn't scale. RAGAS provides quantitative metrics for faithfulness and relevance, while LangSmith enables trace-level debugging of retrieval and generation steps.
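
To show the shape of a faithfulness metric, the sketch below scores an answer as the fraction of its sentences that are supported by the retrieved context. It is a toy stand-in, not RAGAS: RAGAS relies on an LLM to extract and verify claims, whereas this version uses sentence splitting and token overlap. The `threshold` value and the function names are assumptions.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    # Lowercased alphanumeric tokens; a crude proxy for claim content.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim: str, contexts: List[str], threshold: float = 0.8) -> bool:
    """Treat a claim as supported if enough of its tokens appear in one context."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return True
    return any(
        len(claim_tokens & _tokens(ctx)) / len(claim_tokens) >= threshold
        for ctx in contexts
    )


def faithfulness(answer: str, contexts: List[str]) -> float:
    """Fraction of answer sentences supported by the retrieved contexts."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not claims:
        return 0.0
    supported = sum(is_supported(c, contexts) for c in claims)
    return supported / len(claims)
```

Even a crude score like this, tracked per release over a fixed question set, makes regressions visible in a way that spot checks do not.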

Metadata filtering alongside semantic search

Pure semantic search can surface contextually similar but categorically wrong results. Metadata filtering (document type, recency, department) ensures retrieval stays within appropriate boundaries.
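
A minimal sketch of filtering before ranking, in plain Python rather than ChromaDB, looks like this. The `Chunk` dataclass, the `retrieve` helper, and the metadata keys are hypothetical; ChromaDB exposes a comparable `where` argument on `collection.query`, but the code below does not call it.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Chunk:
    text: str
    embedding: Sequence[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_embedding: Sequence[float],
    chunks: List[Chunk],
    where: Dict[str, str],
    k: int = 3,
) -> List[Chunk]:
    """Keep chunks whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return ranked[:k]
```

Because the filter runs first, a chunk from the wrong department can never outrank a correct one, however close its embedding is to the query.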

Outcomes

What shipped

Sub-2-second retrieval latency across 10,000+ pages of documentation

~60% reduction in factual errors through systematic evaluation and prompt engineering

50+ prompt templates with safety guardrails and escalation logic

Benchmarked chunking strategies and embedding models for optimized retrieval quality

Production deployment as Dockerized FastAPI microservices with CI/CD

Takeaways

What I learned

Evaluation infrastructure should be built alongside the RAG pipeline, not retrofitted

Chunking strategy has an outsized impact on retrieval quality and deserves serious experimentation

Safety guardrails in prompts aren't optional for enterprise AI; they're a core product requirement