LLM Observability with Langfuse: Trace Prompts, Score Outputs, Run Repeatable Experiments

Trace, Score, Experiment: A Practical Langfuse Pipeline for LLM Observability

TL;DR

LLM observability turns guesswork into accountable decisions. Use Langfuse to trace prompts, version prompts, attach structured scores, and run dataset-driven experiments so you can measure model and prompt changes reliably, even without paid LLM access.
Start small: add an @observe decorator to one flow, move one prompt into the prompt registry, attach two scores (groundedness and user_feedback), and run a tiny dataset experiment.
Practical payoff: faster debugging, measurable regressions and improvements, and safer rollouts for customer-facing features.

Why LLM observability matters for business

As LLM-driven features move from prototypes into production, teams need the same instrumentation that traditional software has long relied on: traces, versioned artifacts, and repeatable experiments. Without that visibility, a model or prompt change becomes a blind gamble that can affect revenue, customer trust, and compliance.

Langfuse is an open-source LLM engineering platform focused on observability, prompt management, evaluation, datasets, and experiments. It gives you the primitives to treat prompts and model outputs like release artifacts: traceable, versioned, and measurable.

High-level pipeline

At a glance, the pattern looks like this:

Client app → Langfuse SDK (observe, propagate_attributes) → LLM (mock-llm-v1 or OpenAI gpt-4o-mini) → Langfuse backend → UI (Traces / Prompts / Scores / Datasets)

Key capabilities you’ll use:

Tracing via a decorator or manual spans to capture units of work (a span is a named step like “retrieve” or “prompt-managed-call”).
Prompt management for named, versioned prompts that compile with runtime variables (prompt = release artifact).
Evaluation and scoring with numeric, categorical, and boolean score types attached to traces.
Datasets & experiments to run reproducible baseline experiments and compare prompt/model changes.

Tracing: decorator vs manual spans

Tracing is the backbone of observability. Use a decorator for quick wins and manual spans for complex flows like RAG (retrieval-augmented generation).

Define terms up front:

LLM — large language model (e.g., gpt-4o-mini).
RAG — retrieval-augmented generation: combining a knowledge store with a model.
Span — a named unit of work or pipeline step (e.g., retrieve, prompt-managed-call).
Decorator — a simple wrapper (@observe) that automatically records a function’s span and model outputs.

Minimal pseudo-code to show the mechanics:

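from langfuse import get_client, observe, propagate_attributes

langfuse = get_client()  # reads LANGFUSE_* settings from the environment

@observe(name="story_pipeline")
def story_pipeline(user_id, question):
    # Attributes set here apply to every span opened inside the block.
    with propagate_attributes(user_id=user_id, session_id="s1", tags=["support"]):
        answer = llm.generate(question)  # llm: your model client or a mock
        # Score calls assume the v3 Python SDK; check names against your version.
        langfuse.score_current_span(name="groundedness", value=0.9)  # numeric
        langfuse.score_current_span(name="user_feedback", value="positive")  # categorical
    return answer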

Manual spans are the preferred choice when you need to group multiple steps and attach attributes across them:

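from langfuse import get_client, propagate_attributes

langfuse = get_client()

# Sketch against the v3 Python SDK; kb, llm, and compiled_prompt come from your app.
with propagate_attributes(user_id="u1", session_id="s1"):
    with langfuse.start_as_current_span(name="retrieve"):
        docs = kb.search(question)

    # Passing the prompt object links this generation to that prompt version.
    with langfuse.start_as_current_generation(
        name="prompt-managed-call", prompt=langfuse_prompt
    ) as gen:
        gen.update(output=llm.generate(compiled_prompt))
        gen.score(name="resolved", value=1, data_type="BOOLEAN")  # boolean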
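
To make the span mechanics concrete without any SDK installed, here is a minimal sketch of a toy tracer built only on the Python standard library. It is not how Langfuse is implemented; the names SPANS, propagate, start_span, and this local observe are hypothetical stand-ins that only illustrate how a decorator, nested spans, and propagated attributes fit together.

```python
import contextvars
import functools
from contextlib import contextmanager

# Hypothetical in-memory tracer for illustration; not the Langfuse SDK.
SPANS = []  # finished spans, in completion order
_attrs = contextvars.ContextVar("attrs", default={})
_stack = contextvars.ContextVar("stack", default=())


@contextmanager
def propagate(**attrs):
    """Make attrs visible to every span opened inside this block."""
    token = _attrs.set({**_attrs.get(), **attrs})
    try:
        yield
    finally:
        _attrs.reset(token)


@contextmanager
def start_span(name):
    """Open a named span that records its parent and inherited attributes."""
    parents = _stack.get()
    span = {
        "name": name,
        "parent": parents[-1] if parents else None,
        "attrs": dict(_attrs.get()),
        "scores": {},
    }
    token = _stack.set(parents + (name,))
    try:
        yield span
    finally:
        _stack.reset(token)
        SPANS.append(span)


def observe(name):
    """Decorator form: wrap the whole function call in one span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            with start_span(name):
                return fn(*args, **kwargs)
        return inner
    return wrap


@observe("story_pipeline")
def story_pipeline(user_id, question):
    with propagate(user_id=user_id, session_id="s1"):
        with start_span("retrieve"):
            docs = ["Paris is the capital of France."]
        with start_span("prompt-managed-call") as span:
            span["scores"]["groundedness"] = 0.9
            return docs[0]


answer = story_pipeline("u1", "What is the capital of France?")
```

After the call, SPANS holds three entries: the two child spans carry user_id and session_id, while the root span does not, because the attributes were set after it opened. That ordering detail is worth remembering when you decide where to set attributes in a real pipeline.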

Prompt management: prompts as versioned artifacts

Keep prompt text out of code and into a prompt registry. Compile named prompts with runtime variables (tone, company, question) and link an exact prompt version to a generation so you can trace outputs back to the exact wording used.

Example workflow:

Create a named prompt (support-agent).
Version and compile with runtime variables such as company=Acme and tone=polite.
Attach the prompt reference to generations using langfuse_prompt so each output maps to a prompt version.

Why version prompts? Because it answers the perennial question: was it the model or the wording? Treating prompts like release artifacts enables rollbacks and repeatable experiments.
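
The workflow above can be sketched with a small in-memory registry. This is an illustration only: PromptRegistry and PromptVersion are hypothetical classes, not Langfuse objects, and the double-brace placeholder syntax is an assumption made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str

    def compile(self, **variables):
        """Fill {{placeholders}}; fail loudly if a variable is missing."""
        def fill(match):
            key = match.group(1)
            if key not in variables:
                raise KeyError(f"missing prompt variable: {key}")
            return str(variables[key])
        return re.sub(r"\{\{\s*(\w+)\s*\}\}", fill, self.template)


class PromptRegistry:
    """Hypothetical in-memory registry; each create() adds a new version."""

    def __init__(self):
        self._versions = {}

    def create(self, name, template):
        versions = self._versions.setdefault(name, [])
        prompt = PromptVersion(name, len(versions) + 1, template)
        versions.append(prompt)
        return prompt

    def get(self, name, version=None):
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]


registry = PromptRegistry()
registry.create("support-agent", "You are a {{tone}} agent for {{company}}. {{question}}")
registry.create("support-agent", "As a {{tone}} {{company}} agent, answer: {{question}}")

prompt = registry.get("support-agent")  # latest version
text = prompt.compile(tone="polite", company="Acme", question="Where is my order?")

# Store the prompt reference next to the output so it can be traced back later.
generation = {"output": "...", "prompt_name": prompt.name, "prompt_version": prompt.version}
```

Because every generation records the prompt name and version, a rollback is just a matter of fetching version 1 again and comparing outputs on the same inputs.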

Scoring & evaluation: structured signals that actually mean something

Record structured evaluation signals directly on traces. Langfuse supports numeric, categorical, and boolean score types; each maps to business signals:

Numeric — groundedness (0–1 confidence that the response is supported by context).
Categorical — user_feedback (positive/neutral/negative).
Boolean — resolved (true/false whether the user’s issue was resolved).

Attach scores inline:

span.score(name="groundedness", value=0.8)                 # numeric
span.score(name="user_feedback", value="neutral")          # categorical
span.score(name="resolved", value=0, data_type="BOOLEAN")  # boolean: 0 or 1
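
Before scores reach any backend, it can help to validate them against a small schema so that a boolean never lands in a numeric column by accident. The sketch below is a hypothetical pre-check in plain Python; SCHEMA, score_type, and validate_score are illustrative names, and the 0 to 1 range and label set simply mirror the three example scores above.

```python
SCHEMA = {
    "groundedness": ("NUMERIC", (0.0, 1.0)),
    "user_feedback": ("CATEGORICAL", {"positive", "neutral", "negative"}),
    "resolved": ("BOOLEAN", None),
}


def score_type(value):
    """Classify a Python value as BOOLEAN, NUMERIC, or CATEGORICAL."""
    # bool must be checked first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, (int, float)):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"unsupported score value: {value!r}")


def validate_score(name, value):
    """Return a score record, or raise ValueError if it breaks the schema."""
    expected, constraint = SCHEMA[name]
    actual = score_type(value)
    if actual != expected:
        raise ValueError(f"{name}: expected {expected}, got {actual}")
    if expected == "NUMERIC" and not constraint[0] <= value <= constraint[1]:
        raise ValueError(f"{name}: {value} outside {constraint}")
    if expected == "CATEGORICAL" and value not in constraint:
        raise ValueError(f"{name}: unknown label {value!r}")
    return {"name": name, "value": value, "data_type": actual}


records = [
    validate_score("groundedness", 0.8),
    validate_score("user_feedback", "neutral"),
    validate_score("resolved", False),
]
```

The ordering inside score_type matters: without the bool check first, True would be classified as the number 1 and silently accepted as a groundedness value.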

Common evaluator examples:

Exact-match accuracy for factual datasets.
Semantic similarity (embedding cosine) for paraphrase-tolerant tasks.
Human-in-the-loop labels for safety and nuanced correctness.
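
The first two evaluator styles can be sketched in a few lines. This is a minimal illustration, assuming a bag-of-words vector as a stand-in for a real embedding model; exact_match, bag_of_words, and cosine_similarity are hypothetical helper names.

```python
import math
from collections import Counter


def exact_match(output, expected):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def bag_of_words(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


expected = "the order shipped today"
output = "today the order shipped"
vocab = sorted(set(expected.split()) | set(output.split()))

strict = exact_match(output, expected)
lenient = cosine_similarity(bag_of_words(output, vocab), bag_of_words(expected, vocab))
```

Here the reordered answer fails exact match but scores close to 1 on cosine similarity, which is why exact match suits factual lookups and similarity suits paraphrase-tolerant tasks. A bag-of-words vector ignores word order entirely, so treat it as a teaching device rather than a production metric.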

RAG example: constrain, retrieve, and verify

Constrain LLM outputs with a system prompt like:

“Answer the question using ONLY the provided context.”

That instruction forces the model to rely on retrieved context; then attach a groundedness score to measure whether the response used that context. A practical RAG trace has spans for retrieve → rank → prompt-managed-call, and you propagate user_id and session_id across them so related events are grouped in the UI.

Small in-memory KBs are ideal for demos and deterministic tests. In production, use your vector store or retrieval layer and capture which documents were used for each generation to support audits and post-hoc labeling.
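
A small in-memory KB of the kind described here might look like the following sketch. The documents, the word-overlap retriever, and the overlap-based groundedness score are all assumptions chosen for a deterministic demo; a production system would use a vector store and a stronger groundedness evaluator.

```python
import re

# Hypothetical demo knowledge base.
KB = {
    "doc-returns": "Items can be returned within 30 days of delivery.",
    "doc-shipping": "Standard shipping takes 3 to 5 business days.",
}


def tokens(text):
    """Lowercased alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, k=1):
    """Rank documents by word overlap with the question (toy retriever)."""
    q = tokens(question)
    ranked = sorted(KB, key=lambda doc_id: len(q & tokens(KB[doc_id])), reverse=True)
    return ranked[:k]


def groundedness(answer, doc_ids):
    """Fraction of answer words that also appear in the retrieved context."""
    context = set().union(*(tokens(KB[d]) for d in doc_ids))
    words = tokens(answer)
    return len(words & context) / len(words) if words else 0.0


question = "How long does shipping take?"
doc_ids = retrieve(question)
grounded = groundedness("Standard shipping takes 3 to 5 business days.", doc_ids)
ungrounded = groundedness("Shipping is free worldwide.", doc_ids)

# Record which documents backed the answer, for audits and later labeling.
trace_record = {"question": question, "doc_ids": doc_ids, "groundedness": grounded}
```

Keeping doc_ids on the trace record is the part that carries over to production: it lets a reviewer see exactly which context the model was given when a score looks wrong.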

Mock LLM: why it matters and its limitations

A deterministic mock LLM (mock-llm-v1) lets teams exercise the whole pipeline without an OpenAI key or model costs. The example workflow uses a mock that returns known capitals for a list of countries and a canned explanation.

When to use mock-LLMs:

Integration testing and CI for instrumentation.
Dry runs for dataset-driven experiments and scoring logic.
Onboarding engineers to prompts and tracing without incurring API costs.

What mock-LLMs miss:

Non-deterministic behaviors, temperature effects, and latency distributions.
Real hallucination patterns that emerge with large models and diverse prompts.
API rate limits and error modes from real providers.
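
A deterministic mock along the lines described in this section can be very small. The sketch below is an assumption about its shape, not the code from the example workflow: the MockLLM class, its country list, and the fallback sentence are hypothetical.

```python
class MockLLM:
    """Deterministic stand-in for a chat model; no network or API key needed."""

    model = "mock-llm-v1"
    CAPITALS = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}
    FALLBACK = "I don't know that one."

    def generate(self, prompt):
        """Return the capital of the first known country named in the prompt."""
        lowered = prompt.lower()
        for country, capital in self.CAPITALS.items():
            if country.lower() in lowered:
                return capital
        return self.FALLBACK


llm = MockLLM()
first = llm.generate("What is the capital of France?")
second = llm.generate("What is the capital of France?")
unknown = llm.generate("What is the capital of Atlantis?")
```

Because the same prompt always yields the same output, any change in a score between two runs must come from your prompt, retrieval, or evaluator code, which is exactly what you want when testing instrumentation in CI.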

Datasets & experiments: measurable baselines

Create deterministic datasets (e.g., capital-cities-tutorial with items like France → Paris) and define item-level evaluators (exact match, char_length) and aggregate evaluators (mean_accuracy). Then run a named experiment (capitals-baseline) with controlled concurrency so that dataset-run and experiment-run metrics stay comparable across prompt and model changes.

Useful experiment controls:

Concurrency and retries to simulate realistic load.
Seeded randomization for reproducibility.
Time limits and failure thresholds to ensure experiments complete.
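
To show how item-level evaluators, an aggregate metric, a seed, and a concurrency limit fit together, here is a minimal experiment runner using only the standard library. It is a sketch under stated assumptions: run_experiment, lookup_task, and the four-item dataset are hypothetical, and this is not the Langfuse experiment API.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

DATASET = [
    {"input": "France", "expected": "Paris"},
    {"input": "Japan", "expected": "Tokyo"},
    {"input": "Kenya", "expected": "Nairobi"},
    {"input": "Peru", "expected": "Lima"},
]
KNOWN = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}


def lookup_task(item):
    """Toy task under test: a lookup that does not know every country."""
    return KNOWN.get(item["input"], "unknown")


def exact_match(output, expected):
    return 1.0 if output == expected else 0.0


def char_length(output, expected):
    return float(len(output))


def run_experiment(name, items, task, evaluators, max_concurrency=1, seed=0):
    """Run task over items and attach item-level and aggregate scores."""
    order = list(items)
    random.Random(seed).shuffle(order)  # seeded, so the order is reproducible
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        outputs = list(pool.map(task, order))  # map keeps input order
    rows = []
    for item, output in zip(order, outputs):
        scores = {fn.__name__: fn(output, item["expected"]) for fn in evaluators}
        rows.append({"input": item["input"], "output": output, **scores})
    accuracy = mean([row["exact_match"] for row in rows])
    return {"name": name, "rows": rows, "mean_accuracy": accuracy}


baseline = run_experiment(
    "capitals-baseline", DATASET, lookup_task, [exact_match, char_length]
)
```

With three of four items answered correctly, the baseline mean_accuracy is 0.75. Rerunning with the same seed gives identical rows even at higher concurrency, which is the property that makes two runs comparable.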

LangChain integration

If you use LangChain, the Langfuse CallbackHandler instruments chains so you get end-to-end traces across composed steps. That’s useful when your stack already orchestrates calls via chains and agents: the callback captures chain events, generations, and any intermediate tools or retrievers used.

When OpenAI is available, the example defaults to gpt-4o-mini. Without an OpenAI key, the mock LLM still lets you exercise the full observability and evaluation workflow.

Operational questions & best practices

Common operational considerations and practical recommendations:

Scaling: sample and shard traces for high-volume workloads, instrument asynchronously, and batch writes to Langfuse to manage cost and throughput.
Governance & PII: redact or tokenize sensitive fields before sending traces. Maintain retention policies and role-based access to the Langfuse UI for compliance.
Evaluator reliability: calibrate automated evaluators with human labels periodically. Use human-in-the-loop for edge cases and safety checks.
Release process: gate prompt and model changes behind experiments. Use baseline experiment runs and adoption thresholds (e.g., +X% mean_accuracy) before rollout.
Integration: align Langfuse traces with existing APM logs by including trace IDs and key attributes (user_id, session_id, correlation_id).
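
The release-process recommendation above can be turned into a small gate function. This is a hedged sketch: should_promote, the failure_rate field, and the default thresholds are hypothetical and should be tuned to your own baseline data.

```python
def should_promote(baseline, candidate, min_gain=0.02, max_failure_rate=0.05):
    """Return (decision, reason) for a candidate experiment run.

    Both runs are dicts with mean_accuracy and failure_rate in [0, 1].
    The default thresholds are placeholders, not recommendations.
    """
    if candidate["failure_rate"] > max_failure_rate:
        return False, "too many failed items"
    gain = candidate["mean_accuracy"] - baseline["mean_accuracy"]
    if gain < min_gain:
        return False, f"gain {gain:+.3f} is below the required {min_gain:+.3f}"
    return True, f"gain {gain:+.3f}"


baseline_run = {"mean_accuracy": 0.75, "failure_rate": 0.0}
better_run = {"mean_accuracy": 0.80, "failure_rate": 0.0}
flaky_run = {"mean_accuracy": 0.90, "failure_rate": 0.20}

promote, reason = should_promote(baseline_run, better_run)
blocked, why = should_promote(baseline_run, flaky_run)
```

Checking the failure rate before the accuracy gain is a deliberate choice: a run that dropped a fifth of its items can show inflated accuracy on the items that survived.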

Quick start — 10 minutes to an observable flow

Install the Langfuse SDK and configure your API keys plus the host URL: the Langfuse Cloud endpoint for your region (EU or US) or your self-hosted instance.
Instrument a function with @observe and propagate user_id/session_id.
Create one named prompt (support-agent) in the prompt registry and compile it with runtime variables.
Attach two scores to spans: groundedness (numeric) and user_feedback (categorical).
Create a tiny dataset (10 items) and run an experiment (capitals-baseline) with concurrency=1.
Inspect Traces, Prompts, Scores, and Datasets in the Langfuse UI and iterate.

First 30 days checklist

Instrument a single customer-facing flow with tracing and prompt versioning.
Define and record two business-focused scores.
Run a baseline experiment and save the dataset-run results.
Set retention and redaction rules to protect sensitive data.
Draft a release playbook: experiment → review → staged rollout → monitor.

Common pitfalls

Not propagating user/session attributes, which makes traces hard to correlate.
Over-relying on small datasets or exact-match metrics for semantic tasks.
Sending raw PII into traces without redaction.
Skipping human audits for safety-critical or high-revenue flows.
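
For the PII pitfall in particular, a redaction pass before anything is sent to a tracing backend is cheap to add. The sketch below assumes simple regular expressions for emails and phone numbers plus a salted hash for user IDs; the patterns, the redact and pseudonymize helpers, and the salt value are illustrative and will not catch every form of PII.

```python
import hashlib
import re

# Deliberately simple patterns for illustration; real PII detection needs more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def pseudonymize(user_id, salt="rotate-me"):
    """Stable, non-reversible token so traces still group by user."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return f"user_{digest[:12]}"


raw = "Contact jane.doe@example.com or +1 (555) 123-4567 about order 42."
safe_input = redact(raw)
safe_user = pseudonymize("jane.doe@example.com")
```

Pseudonymizing instead of dropping the user ID keeps the first pitfall on this list in check as well: traces remain correlated per user without exposing who the user is.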

Guiding questions and short answers

Can you run a full observability and evaluation pipeline without paid LLM access?

Yes. The mock-llm-v1 deterministic model lets you exercise tracing, prompt management, scoring, datasets, and experiments without an OpenAI key.

How do you link a specific prompt version to model outputs?

Compile named prompts (e.g., support-agent) with runtime variables, then attach the prompt reference to generations via langfuse_prompt so each output maps to an exact prompt version.

What types of evaluation signals can you record?

Numeric, categorical, and boolean scores are supported. Use create_score and span.score to attach groundedness (numeric), user_feedback (categorical), or resolved (boolean) values to traces.

Can you measure changes with repeatable experiments?

Yes. Create datasets, define item-level evaluators and aggregate metrics (e.g., mean_accuracy), then run experiments (dataset-run / experiment-run) with concurrency and seeds for reliable comparisons.

“We created a practical end-to-end Langfuse workflow that covers the most important parts of LLM observability and evaluation.”

LLM observability is the control plane that turns ad-hoc model changes into measurable experiments. Start with a small, instrumented flow, version one prompt, attach a couple of scores, and run a tiny dataset experiment. Those first data points compound: they give you the confidence to gate rollouts, reduce rollback time, and make model changes auditable. When model-driven features touch revenue or customer experience, observability is the feature that pays back.