Instrumenting RAG Pipelines with TruLens, OpenAI, and Chroma — A Practical Guide for AI for Business
TL;DR — Reduce hallucinations and make LLM answers auditable by instrumenting every step of a retrieval-augmented generation (RAG) pipeline, collecting structured traces, and scoring responses with automated feedback. This guide shows a reproducible notebook-scale demo using TruLens, OpenAI models, and an in-memory Chroma vector store, plus a clear path to production.
Why trust matters for RAG and AI agents
Generating an answer is only half the problem. Teams need to know why the model produced that answer: which documents it read, which chunks it used, how confident the generator was, and whether the claims are actually supported. Without those signals you face three predictable issues: undiagnosed hallucinations, slow iteration on prompts and retrieval, and compliance gaps when audits demand provenance.
Think of the pipeline like a factory line. Each station — chunking, embedding, retrieval, and generation — should log what it did. Trace spans are the factory logs that let you debug defects instead of just blaming “the model.”
Solution overview
Combine three components:
- TruLens for instrumentation, trace collection, and automated feedback functions.
- OpenAI models for embeddings (text-embedding-3-small) and for generation and evaluation (the examples use gpt-4o-mini).
- Chroma (chromadb) as a lightweight, local vector database for prototype retrieval.
High-level flow: normalize documents → split into overlapping chunks → embed chunks → store embeddings in Chroma → on query, retrieve top-k contexts (k=4) → generate answer with a prompt variant → record traces and run automated feedback to score groundedness and relevance.
What you’ll get
- A reproducible notebook pipeline you can run locally.
- Structured traces that link queries, retrieved chunks, and generated text.
- Automated metrics (groundedness, answer relevance, context relevance) to compare prompt and retrieval strategies.
- A leaderboard/dashboard to make A/B comparisons auditable and shareable with product and compliance teams.
Demo walkthrough — step by step
1) Chunking & embedding
Chunk documents into ~350-character slices with ~80-character overlap. This balances context fidelity and retrieval precision for short-to-medium documents.
# pseudo-code
chunks = split_text(doc_text, chunk_size=350, overlap=80)
embeddings = embed_model.embed_batch(chunks)  # text-embedding-3-small
chromadb_collection.upsert(chunks, embeddings)
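The `split_text` helper can be sketched in plain Python. The 350-character window and 80-character overlap come from the walkthrough; the sliding-window implementation itself is one reasonable interpretation, not a library function:

```python
def split_text(text: str, chunk_size: int = 350, overlap: int = 80) -> list[str]:
    """Split text into overlapping character windows.

    Each chunk is at most `chunk_size` characters; consecutive chunks
    share `overlap` characters so facts at chunk boundaries are not
    cut in half.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Character-based splitting is the simplest baseline; sentence- or token-aware splitters preserve meaning better for prose-heavy corpora.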
2) Retrieval
Query the Chroma collection and fetch top-k (k = 4) results. Record retrieval metadata (chunk ids, scores) on a trace span so downstream evaluation ties back to what was retrieved.
# pseudo-code
with trace.span("retrieval", attributes={"query": q}):
    results = chroma.search(q, top_k=4)
    trace.add_attribute("retrieved_ids", [r.id for r in results])
    trace.add_attribute("retrieval_scores", [r.score for r in results])
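The `trace.span` / `trace.add_attribute` calls above are TruLens-flavored pseudo-code. A minimal stand-in recorder (hypothetical names, not the actual TruLens API) illustrates what gets captured per span:

```python
import time
from contextlib import contextmanager


class Trace:
    """Minimal span recorder: named spans with attributes and timing."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.spans = []    # completed spans, in order
        self._stack = []   # currently open spans

    @contextmanager
    def span(self, name, attributes=None):
        record = {"name": name, "attributes": dict(attributes or {})}
        start = time.perf_counter()
        self._stack.append(record)
        try:
            yield record
        finally:
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
            self._stack.pop()
            self.spans.append(record)

    def add_attribute(self, key, value):
        # Attach to the innermost open span.
        self._stack[-1]["attributes"][key] = value
```

A recorder like this is enough to serialize spans into the JSON trace format shown later in this guide.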
3) Generation
Feed the retrieved contexts into the model with a prompt. Two prompt styles to compare:
- base: permissive synthesis using retrieved context as helpful background.
- strict_citations: require the model to answer only from provided context, signal uncertainty, and append short citation tags referencing chunk IDs.
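A sketch of the `build_prompt` helper used in the pseudo-code below. The helper and its templates are illustrative; the `strict_citations` wording paraphrases the sample prompts later in this guide, and each context object is assumed to expose `.id` and `.text`:

```python
# Hypothetical sketch of build_prompt; templates paraphrase the sample prompts.
TEMPLATES = {
    "base": (
        "Using the following context, answer the user's question. "
        "If unsure, state your uncertainty."
    ),
    "strict_citations": (
        "Use ONLY the provided context to answer. For each factual claim "
        "append [C#] where C# is the chunk ID that supports it. If the answer "
        "cannot be found in the context, respond with 'I don't know.' "
        "Do not invent facts."
    ),
}


def build_prompt(variant: str, contexts: list, query: str) -> str:
    # Prefix each context chunk with its ID so the model can cite it.
    ctx_block = "\n\n".join(f"[{c.id}] {c.text}" for c in contexts)
    return f"{TEMPLATES[variant]}\n\nContext:\n{ctx_block}\n\nQuestion: {query}"
```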
# pseudo-code
variant = "strict_citations"
prompt = build_prompt(variant=variant, contexts=results, query=q)
with trace.span("generation", attributes={"prompt_variant": variant}):
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=[...])
    trace.add_attribute("gen_text", response.choices[0].message.content)
    trace.add_attribute("gen_tokens", response.usage.total_tokens)
4) Instrumentation with TruLens
Wrap the whole run in a TruSession / TruApp so each variant is logged separately. Attach Feedback objects that call an evaluation model to score properties like groundedness and relevance.
# pseudo-code (TruLens-like)
session = TruSession()
app = TruApp(name="RAG_strict", session=session)
app.wrap(retrieval_function)
app.wrap(generation_function)
feedback = Feedback(
    name="groundedness",
    eval_function=lambda query, ctx, answer: eval_model.score_groundedness(...),
)
session.run(app, queries)
Defining the metrics
Automated feedback turns subjective judgments into consistent, comparable numbers. Define each metric clearly and keep the rubric simple so it’s reproducible.
- Groundedness — fraction of factual claims in the answer that are directly supported by at least one retrieved chunk. Range 0–1. (Evaluator inspects claims and maps to chunk IDs.)
- Answer relevance — how well the answer addresses the user’s intent and question. Range 0–1.
- Context relevance — relevance score for each retrieved chunk (per-chunk), aggregated via mean into a single context score for the query.
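Under these definitions, the per-trace arithmetic is straightforward; a sketch with illustrative helper names:

```python
def groundedness(support_map: list[dict]) -> float:
    """Fraction of claims whose supporting chunk_id is not 'none'."""
    if not support_map:
        return 0.0
    supported = sum(1 for entry in support_map if entry["chunk_id"] != "none")
    return supported / len(support_map)


def context_relevance_mean(per_chunk_scores: list[float]) -> float:
    """Aggregate per-chunk relevance into one score for the query."""
    if not per_chunk_scores:
        return 0.0
    return sum(per_chunk_scores) / len(per_chunk_scores)
```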
Example evaluator prompt (to gpt-4o-mini):
"Given the question, the retrieved context chunks (with IDs), and the assistant answer, rate groundedness 0-1:
1) For each factual claim in the assistant answer, state the supporting chunk ID or 'none'.
2) Return JSON: { "groundedness": 0.0-1.0, "support_map": [ { "claim": "...", "chunk_id": "C3" } ] }"
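Because the evaluator is asked to return JSON, parsing should be defensive: models occasionally emit malformed or out-of-range output. A sketch; the fallback-to-zero behavior is an assumption, not a TruLens convention:

```python
import json


def parse_groundedness(raw: str):
    """Parse the evaluator's JSON reply; fall back to (0.0, []) on bad output."""
    try:
        payload = json.loads(raw)
        score = float(payload["groundedness"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, []
    # Clamp in case the evaluator returns a value outside 0-1.
    return max(0.0, min(1.0, score)), payload.get("support_map", [])
```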
Sample prompts
Base prompt (simple):
"Using the following context, answer the user's question. If unsure, state your uncertainty."
Strict_citations prompt (enforces provenance):
"Use ONLY the provided context to answer. For each factual claim append [C#] where C# is the chunk ID that supports it. If the answer cannot be found in the context, respond with 'I don't know.' Do not invent facts."
Example trace (JSON)
{
"trace_id": "abc-123",
"query": "What is the product warranty period?",
"retrieval": {
"retrieved_ids": ["C12", "C07", "C03", "C08"],
"scores": [0.92, 0.87, 0.65, 0.34]
},
"generation": {
"variant": "strict_citations",
"gen_text": "The warranty is 12 months [C12].",
"gen_tokens": 78,
"latency_ms": 420
},
"feedback": {
"groundedness": 1.0,
"answer_relevance": 0.95,
"context_relevance_mean": 0.79
}
}
Leaderboard & dashboard
Run both RAG variants across the same query set and store all traces in a TruSession. Produce a leaderboard comparing:
- variant
- avg_groundedness
- avg_answer_relevance
- avg_latency_ms
- avg_eval_cost_per_query
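Given per-trace records shaped like the JSON example above, the leaderboard columns reduce to grouped averages. A stdlib sketch (field names follow the example trace; cost fields are omitted since the example trace does not carry them):

```python
from collections import defaultdict


def build_leaderboard(traces: list[dict]) -> list[dict]:
    """Group traces by prompt variant and average the key metrics."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[t["generation"]["variant"]].append(t)
    rows = []
    for variant, ts in grouped.items():
        n = len(ts)
        rows.append({
            "variant": variant,
            "avg_groundedness": sum(t["feedback"]["groundedness"] for t in ts) / n,
            "avg_answer_relevance": sum(t["feedback"]["answer_relevance"] for t in ts) / n,
            "avg_latency_ms": sum(t["generation"]["latency_ms"] for t in ts) / n,
        })
    # Best-grounded variant first.
    return sorted(rows, key=lambda r: r["avg_groundedness"], reverse=True)
```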
This leaderboard drives data-driven choices: pick the variant that hits your groundedness target while holding answer relevance and latency within your SLA.
Practical tradeoffs and rules of thumb
- Strict citation prompts typically reduce unsupported assertions but increase “I don’t know” rates. Expect a non-trivial rise in conservative responses; measure the business impact.
- Top-k = 4 is a good starting point for short documents. Increase k for long-form or multi-topic corpora, but watch latency and token budgets.
- Use mean aggregation for context relevance by default; switch to weighted mean (by retrieval score) if retrieval scores appear well-calibrated.
- Evaluator and generator from the same provider can share blind spots. Use ensemble evaluators or human spot checks for critical flows.
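The weighted-mean alternative mentioned above weights each chunk's relevance by its retrieval score, so highly ranked chunks dominate the aggregate; a sketch:

```python
def context_relevance_weighted(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of per-chunk relevance, weighted by retrieval score."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total
```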
Production checklist: what changes when you go beyond a notebook
- Vector DB: move from in-memory Chroma to a durable store (Pinecone, Qdrant, Weaviate, Milvus) with replication and indexing tuned for latency.
- Embedding strategy: batch embeddings, cache common embeddings, and version your embedding model.
- Evaluation budget: control eval model costs via sampling, stratified testing, or lower-cost evaluators in non-critical paths.
- Governance: redact PII from traces, add RBAC to dashboards, and apply retention policies to traces and contexts.
- Observability: integrate traces into your APM/SIEM, create alerts for sudden drops in groundedness or spikes in hallucination metrics.
- Security: read OPENAI_API_KEY from environment variables or secret managers—do not hardcode keys.
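The alerting item above ("sudden drops in groundedness") can start as a simple window comparison before you wire it into a full APM; a sketch, with window size and threshold as assumed tunables:

```python
def groundedness_alert(history: list[float], window: int = 50,
                       drop_threshold: float = 0.1) -> bool:
    """Flag a sudden drop: recent-window mean vs. the prior window's mean."""
    if len(history) < 2 * window:
        return False  # not enough data yet
    prior = sum(history[-2 * window:-window]) / window
    recent = sum(history[-window:]) / window
    return (prior - recent) > drop_threshold
```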
Why executives should care
Auditable AI reduces operational risk. Instrumented RAG pipelines provide measurable KPIs (percent grounded answers, avg latency, evaluator cost) that can be tied to business outcomes: fewer escalations, faster incident resolution, and defensible compliance posture. That’s tangible ROI for AI automation projects feeding customer support, sales enablement, or knowledge discovery.
Mitigating evaluator bias and blind spots
Automated feedback accelerates iteration, but it’s not a replacement for human judgment. Recommended mitigations:
- Run an ensemble of evaluators (different models or model versions).
- Calibrate automated scores against a labeled human sample periodically.
- Introduce adversarial tests to surface failure modes.
- Keep humans in the loop for policy-sensitive or high-risk outputs.
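The ensemble recommendation can be as simple as taking the median of several evaluators' scores, which is robust to one evaluator going wrong; a sketch:

```python
from statistics import median


def ensemble_score(scores_by_evaluator: dict[str, float]) -> float:
    """Combine scores from multiple evaluators; the median resists one outlier."""
    return median(scores_by_evaluator.values())
```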
Quick appendix
Sample span attributes to record
- query_text
- retrieved_ids, retrieval_scores
- prompt_variant
- gen_text, gen_tokens, gen_latency_ms
- feedback.{groundedness, answer_relevance, context_relevance}
Board-ready KPIs
- % Grounded answers (target: defined by domain)
- Average answer relevance
- Avg latency (ms)
- Eval cost per query
- Month-over-month change in hallucination rate
Actionable next steps
- Prototype: Run the notebook with in-memory Chroma and TruLens to gather initial traces.
- Validate: Calibrate automated feedback against 200 labeled questions from your domain.
- Pilot: Move to a managed vector DB and introduce RBAC and retention policies for traces.
- Productionize: Set SLOs for groundedness and latency; integrate trace alerts into incident workflows.
Frequently asked questions
How do you capture and inspect what an LLM used to produce an answer?
Record each retrieval and generation call as a trace span, storing inputs, retrieved chunk IDs and scores, token usage, and outputs. These structured traces let you inspect exact provenance for any response.
How do you convert “seems right” into measurable signals?
Attach automated feedback functions (groundedness, answer relevance, context relevance) implemented as evaluator calls. Store scores per-trace so you can compute per-variant averages and drill into failures.
Does a stricter prompt reduce hallucinations?
Usually. Forcing strict citations and limiting the model to provided context reduces unsupported assertions but increases the rate of conservative “I don’t know” replies. Measure the tradeoff against your business goals.
Can this scale to production?
Yes. The notebook pattern is prototyping-friendly. Production requires persistent vector stores, embedding and eval cost controls, governance for trace data, and integration into monitoring and incident workflows.
Instrument early, measure consistently, and make traceable choices. When teams can point to the context that produced an answer and a numeric score that describes how well it’s supported, LLMs move from opaque risk centers to auditable AI agents that businesses can rely on.