03/Case Study

Production RAG System

Enterprise Knowledge Retrieval & LLM Evaluation at T-Mobile

Role: AI Software Engineer
Timeline: Jan 2025 – Apr 2026
Stack: Python, LangChain, ChromaDB
Status: Shipped
Impact

Sub-2-second retrieval across 10,000+ pages, with an approximately 60% reduction in factual errors (hallucinations) through systematic evaluation.

Overview

Context

At T-Mobile (contracted through Innovcentric LLC), I built the end-to-end RAG pipeline powering customer support knowledge retrieval. The system ingests 10,000+ pages of support documentation, splits them into overlapping chunks, embeds them, and serves grounded answers to 100+ support agents in under two seconds. I also built the evaluation framework used to measure retrieval quality, which helped reduce factual errors by approximately 60%.

Challenge

The problem

T-Mobile's customer support documentation spanned 10,000+ pages across multiple formats and update cycles. Support agents needed fast, accurate answers grounded in current documentation. The challenge was building a retrieval system that was both fast and faithful — and proving it with measurable evaluation, not anecdotal testing.

Approach

How I built it

01

Designed a document ingestion pipeline supporting PDF and DOCX formats with structured metadata extraction

02

Implemented recursive chunking with 512-token windows and 50-token overlap to preserve context boundaries

03

Used OpenAI embeddings with ChromaDB vector indexing and metadata filtering for precise retrieval

04

Built an LLM evaluation framework using RAGAS metrics (faithfulness, relevance, hallucination scoring) and LangSmith observability

05

Engineered 50+ prompt templates with safety guardrails, tone alignment, and escalation logic

06

Benchmarked across multiple chunking strategies and embedding models to optimize retrieval quality
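
The benchmarking in step 06 can be illustrated with a small harness. This is a minimal sketch under assumptions, not the project's code: the metric (hit rate at k), the function names, the chunk IDs, and the two stand-in retrievers are all hypothetical, and the toy scores are not measured results.

```python
from typing import Callable, Dict, List

# A retriever maps a query string to a ranked list of chunk IDs.
Retriever = Callable[[str], List[str]]


def hit_rate_at_k(retriever: Retriever, gold: Dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose labeled chunk ID appears in the top-k results."""
    if not gold:
        return 0.0
    hits = sum(
        1 for query, chunk_id in gold.items() if chunk_id in retriever(query)[:k]
    )
    return hits / len(gold)


def compare(
    configs: Dict[str, Retriever], gold: Dict[str, str], k: int = 5
) -> Dict[str, float]:
    """Score every named configuration against the same labeled query set."""
    return {name: hit_rate_at_k(r, gold, k) for name, r in configs.items()}


# Hypothetical labeled queries: query -> ID of the chunk that answers it.
gold = {"reset voicemail pin": "doc-12#3", "esim activation": "doc-40#1"}


# Stand-ins for two chunking/embedding configurations (toy behavior only).
def config_a(query: str) -> List[str]:
    return ["doc-12#3", "doc-07#2"] if "voicemail" in query else ["doc-99#0"]


def config_b(query: str) -> List[str]:
    return ["doc-12#3"] if "voicemail" in query else ["doc-40#1", "doc-41#0"]


scores = compare({"config_a": config_a, "config_b": config_b}, gold, k=2)
```

Keeping the labeled query set fixed while swapping only the configuration is what makes the scores comparable across chunking strategies and embedding models.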

Technical Decisions

Why these choices

Recursive chunking over fixed-size splits

Recursive chunking respects document structure (headings, paragraphs, lists), preserving semantic coherence within chunks. This improved retrieval relevance compared to naive fixed-window approaches.
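
To make the idea concrete, here is a simplified sketch of recursive splitting with overlap. It is an illustration under assumptions rather than the production splitter: whitespace-separated words stand in for real tokens, the separator list is a guess at a reasonable default, and all function names are hypothetical.

```python
from typing import List, Sequence

# Separators ordered from coarse (paragraph break) to fine (single space).
SEPARATORS = ("\n\n", "\n", " ")


def n_tokens(text: str) -> int:
    """Whitespace word count, a rough stand-in for a real tokenizer."""
    return len(text.split())


def split_recursive(
    text: str, max_tokens: int, seps: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if n_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    if not seps:
        # No separators left: fall back to a hard split by words.
        words = text.split()
        return [
            " ".join(words[i : i + max_tokens])
            for i in range(0, len(words), max_tokens)
        ]
    pieces: List[str] = []
    for part in text.split(seps[0]):
        pieces.extend(split_recursive(part, max_tokens, seps[1:]))
    return pieces


def merge_with_overlap(pieces: List[str], max_tokens: int, overlap: int) -> List[str]:
    """Greedily pack pieces into chunks, carrying the last `overlap` words forward."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk(text: str, max_tokens: int = 512, overlap: int = 50) -> List[str]:
    """Recursive split, then merge, so no chunk exceeds `max_tokens` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    # Pieces are capped at max_tokens - overlap so carried words always fit.
    pieces = split_recursive(text, max_tokens - overlap)
    return merge_with_overlap(pieces, max_tokens, overlap)
```

Called as `chunk(document_text)`, the defaults mirror the 512-token window and 50-token overlap described above. A paragraph that already fits is kept whole, and finer separators are only used for text that is too long to keep together.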

RAGAS + LangSmith for evaluation

Subjective quality assessment doesn't scale. RAGAS provides quantitative metrics for faithfulness and relevance, while LangSmith enables trace-level debugging of retrieval and generation steps.
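
As a rough illustration of what a quantitative faithfulness check looks like, the sketch below scores how much of an answer is lexically supported by the retrieved context. It is deliberately not RAGAS: RAGAS relies on LLM-based judgments, whereas this is a crude word-overlap proxy. The function names, the 0.7 threshold, and the sample text are assumptions, and the numbers in the docstring are illustrative arithmetic, not measurements from this project.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(answer: str, contexts: List[str], threshold: float = 0.7) -> float:
    """Fraction of answer sentences whose words mostly appear in the context."""
    context_words = _words(" ".join(contexts))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in an error rate, e.g. 0.25 -> 0.10 gives 0.6 (60%)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Hypothetical sample: the second sentence has no support in the context.
contexts = ["The autopay discount applies after the second billing cycle."]
answer = (
    "The autopay discount applies after the second billing cycle. "
    "Refunds are issued within ten days."
)
score = support_score(answer, contexts)
```

Tracking a score like this per query, before and after a pipeline change, is what turns "it seems better" into a comparison that can be repeated.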

Metadata filtering alongside semantic search

Pure semantic search can surface contextually similar but categorically wrong results. Metadata filtering (document type, recency, department) ensures retrieval stays within appropriate boundaries.
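
The interaction between the two can be sketched in a few lines of plain Python. This is a toy in-memory version under assumptions, not the ChromaDB setup itself (ChromaDB exposes a comparable `where` filter on queries): the `Chunk` class, the `search` function, the 2-D embeddings, and the sample documents are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Chunk:
    text: str
    embedding: Sequence[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_embedding: Sequence[float],
    chunks: List[Chunk],
    where: Dict[str, str],
    k: int = 3,
) -> List[Chunk]:
    """Keep chunks matching every key in `where`, then rank by similarity."""
    candidates = [
        c
        for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return candidates[:k]


# Toy corpus: the closest vector overall belongs to the wrong department.
corpus = [
    Chunk("Prepaid refund policy", [0.9, 0.1], {"department": "prepaid"}),
    Chunk("Postpaid refund policy", [0.8, 0.2], {"department": "postpaid"}),
    Chunk("Postpaid device upgrade", [0.1, 0.9], {"department": "postpaid"}),
]
query = [1.0, 0.0]

unfiltered = search(query, corpus, where={}, k=1)
filtered = search(query, corpus, where={"department": "postpaid"}, k=2)
```

Without the filter, the top hit is the prepaid policy, which is semantically close but categorically wrong for a postpaid question. With the filter, only postpaid chunks are ranked.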

Outcomes

What shipped

Sub-2-second retrieval latency across 10,000+ pages of documentation
~60% reduction in factual errors through systematic evaluation and prompt engineering
50+ prompt templates with safety guardrails and escalation logic
Benchmarked chunking strategies and embedding models for optimized retrieval quality
Production deployment as Dockerized FastAPI microservices with CI/CD

Takeaways

What I learned

Evaluation infrastructure should be built alongside the RAG pipeline, not retrofitted
Chunking strategy has an outsized impact on retrieval quality and deserves serious experimentation
Safety guardrails in prompts aren't optional for enterprise AI — they're a core product requirement