Building Production RAG Systems: Lessons from Healthcare AI
TL;DR
RAG systems in production need three things: reliable retrieval with proper chunking, robust evaluation (retrieval@k, factuality checks), and safe guardrails for sensitive domains like healthcare.
When I built MILA, a neonatal LLM assistant for hospital communication, I learned that production RAG is fundamentally different from demo RAG. This guide shares those lessons.
The Production Reality Check
Most RAG tutorials show you how to embed documents and query them. That gets you 60% of the way there. The other 40% is what keeps the system reliable in production.
Key Insight
RAG systems fail silently. Unlike crashes, retrieval failures just produce plausible-sounding wrong answers. You need evaluation built in from day one.
Document Preparation
Chunking Strategy
Chunk size matters more than you think. For MILA's hospital policies:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
```

Common Mistake
Don't use fixed-size chunking for structured documents. Policy documents have sections that should stay together. Use semantic chunking when possible.
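To make the point concrete, here is a minimal sketch of section-aware chunking. It assumes policy documents use numbered headings like "1. Scope"; `chunk_by_section` and that heading pattern are illustrative, not part of MILA's actual pipeline:

```python
import re

def chunk_by_section(text: str, max_chars: int = 2000) -> list[str]:
    """Split on section headings first; only split within a section
    when it exceeds max_chars."""
    # Assumption: sections start on a new line with "1. ", "2. ", etc.
    sections = re.split(r"\n(?=\d+\.\s)", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)  # Keep the whole section together
        else:
            # Fallback: split oversized sections on paragraph breaks
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks
```

Each chunk now corresponds to a policy section, so a retrieved guideline arrives with its own context instead of being cut mid-rule.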
Metadata is Essential
Every chunk needs metadata for filtering and citation:
```json
{
  "source": "feeding_policy_v2.pdf",
  "section": "Breastfeeding Guidelines",
  "page": 12,
  "last_updated": "2024-01-10",
  "applicable_units": ["NICU", "PICU"]
}
```

Retrieval Pipeline
Hybrid Search
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Use both:
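The snippet below leans on two helpers that aren't shown. One way to sketch the boosting step, assuming each match carries a `score` and its chunk text in metadata (`extract_medical_terms` would draw on a curated term list, also not shown here):

```python
def boost_results_containing(results, terms, boost: float = 0.1):
    """Re-rank results: add a flat bonus per matched term to each score."""
    boosted = []
    for match in results:
        text = match["metadata"].get("text", "").lower()
        # Count how many of the query's medical terms appear verbatim
        bonus = sum(boost for term in terms if term.lower() in text)
        boosted.append({**match, "score": match["score"] + bonus})
    return sorted(boosted, key=lambda m: m["score"], reverse=True)
```

A flat per-term bonus is the simplest choice; in practice you'd tune the boost weight (or use a proper BM25 score fused with the vector score) against your evaluation set.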
```python
from pinecone import Pinecone

# Vector search
vector_results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True
)

# Keyword boost for medical terms
keyword_boost = boost_results_containing(
    results=vector_results,
    terms=extract_medical_terms(query)
)
```

Evaluation Framework
This is where most teams cut corners. Don't.
Retrieval Quality
```python
def evaluate_retrieval(queries: list[Query], k: int = 5):
    """Measure if relevant docs appear in top-k results."""
    results = []
    for query in queries:
        retrieved = retriever.get_relevant_docs(query.text, k=k)
        retrieved_ids = {doc.id for doc in retrieved}
        relevant_ids = set(query.relevant_doc_ids)
        recall_at_k = len(retrieved_ids & relevant_ids) / len(relevant_ids)
        results.append(recall_at_k)
    return sum(results) / len(results)
```

Answer Faithfulness
Check if answers are grounded in retrieved documents:
```python
def check_faithfulness(answer: str, sources: list[str]) -> float:
    """Use an LLM to verify claims are supported by sources."""
    prompt = f"""
    Answer: {answer}
    Sources: {sources}
    For each claim in the answer, is it supported by the sources?
    Return a score from 0-1.
    """
    return llm_evaluate(prompt)
```

Production Guardrails
Human-in-the-Loop
For MILA, no message goes to parents without clinician approval:
```python
class MessageWorkflow:
    async def generate_draft(self, context: dict) -> Draft:
        draft = await self.rag_chain.invoke(context)
        draft.status = "pending_review"
        return draft

    async def approve(self, draft_id: str, clinician_id: str):
        # Log approval for audit trail
        await self.audit_log.record(draft_id, clinician_id, "approved")
        return await self.send_to_family(draft_id)
```

Uncertainty Detection
When retrieval confidence is low, say so:
```python
if max(retrieval_scores) < CONFIDENCE_THRESHOLD:
    return {
        "response": "I don't have enough information to answer this accurately.",
        "suggested_action": "Please consult the policy database directly or ask a supervisor.",
        "retrieval_scores": retrieval_scores
    }
```

HIPAA Compliance for Healthcare RAG
Building AI systems that handle Protected Health Information (PHI) requires strict adherence to HIPAA regulations. This isn't optional. It's federal law.
The HIPAA Security Rule Essentials
Critical
Any RAG system processing PHI must implement administrative, physical, and technical safeguards. Violations can result in civil penalties of up to $1.5 million per violation category per year.
For MILA, we implemented these technical safeguards:
```python
class HIPAACompliantRAG:
    def __init__(self):
        self.encryption = AES256Encryption()
        self.audit_logger = HIPAAAuditLog()
        self.access_control = RoleBasedAccessControl()

    async def query(self, user: User, query: str) -> Response:
        # 1. Verify user authorization
        if not self.access_control.can_access_phi(user):
            self.audit_logger.log_unauthorized_attempt(user, query)
            raise UnauthorizedAccessError()

        # 2. Log all PHI access (required by HIPAA)
        access_id = self.audit_logger.log_phi_access(
            user_id=user.id,
            purpose="patient_communication",
            timestamp=datetime.utcnow()
        )

        # 3. Process with encryption in transit
        response = await self._process_query(query)

        # 4. Log response generation
        self.audit_logger.log_response_generated(access_id, response.id)
        return response
```

Data Handling Requirements
PHI in your vector database requires special handling:
- Encryption at rest - All embeddings and metadata must be encrypted
- Encryption in transit - TLS 1.2+ for all API calls
- Access logging - Every query touching PHI must be logged with user ID, timestamp, and purpose
- Minimum necessary - Only retrieve the minimum PHI needed for the task
```python
# BAD: Storing raw PHI in metadata
chunk_metadata = {
    "patient_name": "John Doe",  # Never do this
    "mrn": "12345678"
}

# GOOD: De-identified references with access controls
chunk_metadata = {
    "document_id": "encrypted_ref_abc123",
    "content_type": "care_protocol",
    "phi_level": "restricted",
    "requires_authorization": True
}
```

Business Associate Agreements
Legal Requirement
Every vendor in your RAG pipeline (LLM provider, vector database, cloud host) must sign a Business Associate Agreement (BAA) before processing PHI.
For MILA's infrastructure:
- OpenAI - Enterprise agreement with BAA
- Pinecone - HIPAA-eligible tier with BAA
- AWS - BAA covering all services used
Audit Trail Requirements
HIPAA requires you to track who accessed what PHI and when. This isn't just logging; it's legal documentation:
```python
class HIPAAAuditLog:
    def log_phi_access(
        self,
        user_id: str,
        purpose: str,
        timestamp: datetime,
        patient_ids: list[str] | None = None
    ) -> str:
        """
        Creates an immutable audit record for PHI access.
        Retention: minimum 6 years per HIPAA requirements.
        """
        record = AuditRecord(
            id=generate_uuid(),
            user_id=user_id,
            action="PHI_ACCESS",
            purpose=purpose,
            timestamp=timestamp,
            patient_ids=hash_patient_ids(patient_ids),  # Store hashed
            ip_address=get_client_ip(),
            user_agent=get_user_agent()
        )
        # Write to the immutable audit store
        self.audit_store.append(record)
        return record.id
```

Monitoring in Production
Track these metrics daily:
- Retrieval latency - p50 and p95
- Empty retrieval rate - queries with no relevant docs
- User feedback signals - edits, rejections, regenerations
- Cost per query - embedding + LLM tokens
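These metrics fall out of a daily pass over the query log. A minimal sketch, assuming each log entry records its retrieval latency and result count (the entry shape and function name here are illustrative):

```python
import statistics

def daily_retrieval_metrics(log: list[dict]) -> dict:
    """Summarize one day of query log entries.

    Each entry is assumed to look like:
    {"latency_ms": float, "num_docs": int}
    """
    latencies = sorted(e["latency_ms"] for e in log)
    empty = sum(1 for e in log if e["num_docs"] == 0)
    # quantiles(n=100) yields 99 cut points; index 49 is p50, 94 is p95
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "empty_retrieval_rate": empty / len(log),
    }
```

Alert on deltas, not absolutes: a jump in empty-retrieval rate after a document re-index is the kind of silent failure this catches.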
Conclusion
Production RAG requires more than good retrieval. It needs:
- Thoughtful document preparation with metadata
- Hybrid search for robustness
- Continuous evaluation with regression tests
- Guardrails appropriate to your domain
- Monitoring that catches silent failures
- HIPAA compliance for healthcare applications (encryption, audit trails, BAAs)
The difference between a demo and production is trust. Build systems that earn it.
Have questions about RAG systems? Get in touch or check out my MILA project for more details.
Osvaldo Restrepo
Senior Full Stack AI & Software Engineer. Building production AI systems that solve real problems.