LLM-based Failure Detection and Root-Cause Analysis for AI Agents: From Alerts to Actionable Fixes

From “We Failed” to “Here’s What to Fix”: LLM-based Agent Failure Detection and Root‑Cause Analysis

TL;DR: When AI agents fail in production, dashboards only tell you that something went wrong. LLM-based detectors convert OpenTelemetry/CloudWatch traces into span‑level failure labels, causal chains, and targeted fix recommendations—shifting teams from long manual diagnosis to minutes of actionable remediation. Start with a MEDIUM confidence threshold, run diagnosis ON_FAILURE in CI, and prioritize PRIMARY fixes first.

Why span‑level diagnosis matters for AI agents

A dashboard alert—“goal success fell from 85% to 70%”—is only the start. AI agents coordinate many moving parts (LLM calls, tool invocations, retrievals, orchestration steps). A single failing tool parameter can trigger retries, partial outputs, hallucinations, and a cascade of downstream symptoms. Detecting that an agent failed is only step one — the real challenge is determining why it failed and what to change.

Strands Evals’ detectors use LLM‑based analysis to read traces (a “span” is one unit of work in a trace—an API call, a tool call, an LLM call), tag failures against a taxonomy, and then stitch those tags into causal chains that point to PRIMARY root causes and actionable fixes (system prompt vs. tool description vs. schema changes).

How it works: a two‑phase LLM pipeline

At a high level the workflow has two phases:

Phase 1 — Failure detection: Scan each span and assign one or more failure categories, a confidence score, and supporting evidence.
Phase 2 — Root‑cause analysis: Link detected failures into causal chains, label causality (PRIMARY/SECONDARY/TERTIARY), estimate propagation impact, and recommend fix types (SYSTEM_PROMPT_FIX, TOOL_DESCRIPTION_FIX, etc.).

The detectors expose a concise API you can wire into your evaluation pipeline or CI/CD: detect_failures(session, confidence_threshold), analyze_root_cause(session, failures=...), and diagnose_session(session, ...). Results are structured so you can automatically route fixes to owners, trigger canary rollouts, or create tickets with evidence attached.
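As a rough illustration of how a confidence threshold could gate detector output, the sketch below models a failure record with the fields named above (category, confidence, evidence) and filters a list by a named threshold. The Failure dataclass, the THRESHOLDS cutoffs, and filter_by_threshold are hypothetical stand-ins, not the Strands Evals API; the numeric cutoffs for LOW/MEDIUM/HIGH are assumed values chosen only for this example, and the sample records are made up.

```python
from dataclasses import dataclass

# Assumed numeric cutoffs for the named thresholds (illustrative only).
THRESHOLDS = {"LOW": 0.3, "MEDIUM": 0.6, "HIGH": 0.85}


@dataclass(frozen=True)
class Failure:
    """Hypothetical record for one detected failure on one span."""
    span_id: str
    category: str
    confidence: float
    evidence: str


def filter_by_threshold(failures, threshold="MEDIUM"):
    """Keep failures whose confidence meets the named threshold."""
    cutoff = THRESHOLDS[threshold]
    return [f for f in failures if f.confidence >= cutoff]


# Made-up sample records for the demonstration.
failures = [
    Failure("s_42", "execution_error", 0.90, "missing knowledgeBaseId"),
    Failure("s_45", "hallucination", 0.75, "fabricated citation"),
    Failure("s_51", "repetitive_behavior", 0.40, "two retries observed"),
]

kept = filter_by_threshold(failures, "MEDIUM")
print([f.span_id for f in kept])
```

A lower threshold surfaces more candidate failures at the price of more noise, which is why an exploratory audit and a production monitor might reasonably use different settings.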

Failure taxonomy (one‑line glossary)

Hallucination — fabricated facts in output. Example: asserting a non‑existent paper. Suggested fix: SYSTEM_PROMPT_FIX or retrieval tuning.
Incorrect actions — wrong tool chosen or bad parameters. Example: calling search instead of database query. Suggested fix: TOOL_DESCRIPTION_FIX.
Orchestration errors — steps run in the wrong order or missing synchronization. Example: parsing before retrieval completes. Suggested fix: system-level orchestration change.
Task instruction non‑compliance — agent ignores task constraints. Example: exceeding word limits. Suggested fix: SYSTEM_PROMPT_FIX.
Execution errors — runtime failures like missing parameters or timeouts. Example: missing knowledgeBaseId. Suggested fix: TOOL_DESCRIPTION_FIX or schema validation.
Context handling errors — forgot prior context or truncated memory. Example: missing passage in prompt. Suggested fix: prompt engineering or context window changes.
Repetitive behavior — loops or repeated retries. Example: continuous retrying without backoff. Suggested fix: orchestration/tool retry policy.
LLM output issues — malformed formatting or output that cannot be parsed. Example: malformed JSON. Suggested fix: output schema enforcement or model config change.
Configuration mismatch — the expected configuration differs from the runtime configuration. Example: prod tool path differs from staging. Suggested fix: environment/config sync.

Concrete walkthrough: missing tool parameter → hallucination cascade

Example scenario: a research assistant agent is asked to summarize AI energy requirements. A tool call to a knowledge base is missing a required parameter (knowledgeBaseId). The tool returns an error; the agent retries, receives partial results, attempts to synthesize, and eventually hallucinates unrelated content. Detecting the failure manually means scrolling through dozens of spans.

What detectors return (example output):

[
  {
    "span_id": "s_42",
    "category": "execution_error",
    "confidence": 0.90,
    "evidence": "Tool call /kb.query missing parameter knowledgeBaseId; API returned 400",
    "recommended_fix": "TOOL_DESCRIPTION_FIX",
    "causality": "PRIMARY"
  },
  {
    "span_id": "s_45",
    "category": "hallucination",
    "confidence": 0.75,
    "evidence": "Response contains fabricated paper title and citation not found in retrieval",
    "recommended_fix": "SYSTEM_PROMPT_FIX",
    "causality": "SECONDARY"
  },
  {
    "span_id": "s_60",
    "category": "orchestration_error",
    "confidence": 0.90,
    "evidence": "Agent attempted synthesis before retrieval complete; retry loop observed",
    "recommended_fix": "ORCHESTRATION_CHANGE",
    "causality": "SECONDARY"
  }
]

With causality labels, engineers can focus on fixing the PRIMARY execution_error first (for example, by adding parameter validation or a default knowledgeBaseId), then rerun diagnostics to confirm that the downstream hallucination and orchestration issues disappear.
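One way to turn the example output into a work queue is to sort it so root causes come first. The sketch below is a minimal illustration that parses a trimmed copy of the example above (evidence strings omitted); CAUSALITY_RANK and prioritize are hypothetical helper names, and ordering ties by descending confidence is an assumed policy, not something the detectors prescribe.

```python
import json

# Trimmed copy of the example detector output above (evidence omitted).
raw = """
[
  {"span_id": "s_42", "category": "execution_error", "confidence": 0.90,
   "recommended_fix": "TOOL_DESCRIPTION_FIX", "causality": "PRIMARY"},
  {"span_id": "s_45", "category": "hallucination", "confidence": 0.75,
   "recommended_fix": "SYSTEM_PROMPT_FIX", "causality": "SECONDARY"},
  {"span_id": "s_60", "category": "orchestration_error", "confidence": 0.90,
   "recommended_fix": "ORCHESTRATION_CHANGE", "causality": "SECONDARY"}
]
"""

# Lower rank sorts first: root causes before propagated symptoms.
CAUSALITY_RANK = {"PRIMARY": 0, "SECONDARY": 1, "TERTIARY": 2}


def prioritize(failures):
    """Sort root causes first, then by descending confidence."""
    return sorted(
        failures,
        key=lambda f: (CAUSALITY_RANK[f["causality"]], -f["confidence"]),
    )


ordered = prioritize(json.loads(raw))
for f in ordered:
    print(f["causality"], f["span_id"], f["recommended_fix"])
```

Working the list from the top means the PRIMARY execution_error is addressed before anyone spends time on symptoms that may vanish once it is fixed.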

Architecture sketch

Trace sources (OpenTelemetry / CloudWatch / Langfuse / OpenSearch)
   ↓
Provider adapters (CloudWatchProvider, LangfuseProvider, OpenSearchProvider)
   ↓
Session builder (converts spans → Session object)
   ↓
Detect phase: detect_failures(session)  → list of failures (category, confidence, evidence)
   ↓
Analyze phase: analyze_root_cause(session, failures) → causal chains & fix recommendations
   ↓
Outputs: tickets / CI hooks / dashboards / remediation playbooks

CI/CD integration and a sample DiagnosisConfig

Attach automated diagnosis to experiments so analysis runs only when useful. Two trigger modes are common:

ON_FAILURE — run detectors only when an experiment fails (cost‑efficient, fast feedback for regressions).
ALWAYS — run on every experiment or on a schedule for periodic audits (higher cost, broader coverage).

Sample pseudocode (conceptual):

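# Conceptual only: `experiment` and `session` stand in for your own objects.
# DiagnosisConfig(trigger="ON_FAILURE", confidence_threshold="MEDIUM")
experiment.run()
if experiment.failed():
    # Diagnose only the failing run to keep inference cost down.
    diagnose_session(session, confidence_threshold="MEDIUM")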

Cost, privacy, and operational guardrails

LLM‑based analysis runs on inference backends (examples use Amazon Bedrock) and incurs charges. CloudWatch log storage and query costs also apply. Practical mitigation patterns:

Sampling policy: for example, analyze 100% of failing sessions and 10% of successful ones, or run nightly audits at a low cadence.
Failure‑path pruning: for traces longer than 200 spans, keep only the detected failures plus their ancestors and descendants to reduce tokens and inference calls.
Cache & reuse: pass pre‑detected failures into analyze_root_cause to avoid re-running detection on the same session.
PII controls: redact or hash sensitive fields, filter PII before sending traces, or consider on‑prem inference for regulated data.
Cost estimate guidance: cost ≈ (#detected_sessions × avg_tokens_per_session × price_per_token) + log_storage. Tune sampling to your budget.
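The failure-path pruning idea can be sketched as a small tree walk. This is a minimal illustration under stated assumptions, not the library's implementation: it assumes each span records its parent's id, and prune_to_failure_paths, the parents mapping, and the max_spans parameter are hypothetical names. The tiny trace is made up, and max_spans is lowered in the demo so that pruning actually triggers.

```python
def prune_to_failure_paths(parents, failed_span_ids, max_spans=200):
    """Return the set of span ids to keep.

    parents: dict mapping span_id -> parent span_id (None for the root).
    failed_span_ids: ids of spans flagged by the detector.
    """
    if len(parents) <= max_spans:
        return set(parents)  # small trace: keep everything

    # Build a child index once so descendants can be walked cheaply.
    children = {}
    for span_id, parent_id in parents.items():
        children.setdefault(parent_id, []).append(span_id)

    keep = set()
    for failed in failed_span_ids:
        # Walk up to the root: the failed span and all of its ancestors.
        node = failed
        while node is not None:
            keep.add(node)
            node = parents.get(node)
        # Walk down: every descendant of the failed span.
        stack = list(children.get(failed, []))
        while stack:
            node = stack.pop()
            keep.add(node)
            stack.extend(children.get(node, []))
    return keep


# Made-up trace: root -> a -> b -> (c, x), and root -> d -> e.
parents = {"root": None, "a": "root", "b": "a", "c": "b",
           "x": "b", "d": "root", "e": "d"}

kept = prune_to_failure_paths(parents, ["b"], max_spans=3)
print(sorted(kept))
```

Here the unrelated branch under d is dropped, while everything on the path through the failing span b survives for analysis.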

Best practices — deploy detectors without chaos

Start with a MEDIUM confidence threshold for routine runs; set it to LOW for exploratory audits and HIGH for tight production monitoring.
Prefer ON_FAILURE in CI to control cost while ensuring regressions always get diagnoses.
Fix PRIMARY failures first — these are probable root causes rather than propagated symptoms.
Group remediation by fix type (system prompt, tool description/schema, orchestration) so teams can own rollouts and validation plans.
Validate fixes with small canaries or A/B tests before wide rollout to avoid introducing regressions elsewhere.
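Grouping remediation by fix type could look like the sketch below, which buckets failures by their recommended fix and attaches an owner to each bucket. The OWNERS mapping, the team names, and group_by_fix_type are hypothetical; the failure records reuse the span ids and fix types from the walkthrough example.

```python
from collections import defaultdict

# Hypothetical owner mapping; replace with your own teams or queues.
OWNERS = {
    "SYSTEM_PROMPT_FIX": "prompt-team",
    "TOOL_DESCRIPTION_FIX": "tools-team",
    "ORCHESTRATION_CHANGE": "platform-team",
}


def group_by_fix_type(failures):
    """Bucket failures by recommended fix so each owner gets one work item."""
    buckets = defaultdict(list)
    for f in failures:
        buckets[f["recommended_fix"]].append(f["span_id"])
    return {
        fix: {"owner": OWNERS.get(fix, "unassigned"), "spans": spans}
        for fix, spans in buckets.items()
    }


failures = [
    {"span_id": "s_42", "recommended_fix": "TOOL_DESCRIPTION_FIX"},
    {"span_id": "s_45", "recommended_fix": "SYSTEM_PROMPT_FIX"},
    {"span_id": "s_60", "recommended_fix": "ORCHESTRATION_CHANGE"},
]

tickets = group_by_fix_type(failures)
for fix, item in tickets.items():
    print(fix, item["owner"], item["spans"])
```

Falling back to "unassigned" for unknown fix types keeps new categories visible instead of silently dropping them.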

ROI and a short vignette

Anonymized case: Team X enabled detectors on a multi‑tool research assistant. Diagnosing a recurring orchestration bug, which previously took ~4 hours, dropped to ~20 minutes from detection to an actionable ticket. That reduced incident overhead, sped up rollbacks and patches, and allowed the team to ship feature changes faster. Even with LLM inference costs, the saved engineer hours and fewer rollbacks produced a net positive ROI for high‑value agent workflows.

Limitations, risks, and open questions

Calibration & false alarms: LLMs can produce false positives/negatives. Treat detector recommendations as high‑signal triage, not immutable truths.
Generalization: Accuracy can vary across agent architectures, domains, and LLM backends—validate detectors on representative traces.
Privacy & compliance: Sending production traces to external inference requires careful PII handling and contractual safeguards.
Validation of fixes: Automating remediation is powerful, but always validate with canaries and rollback plans.
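The PII handling mentioned above can be sketched as a pre-processing step applied to span attributes before a trace leaves your environment. This is a minimal example under stated assumptions: the SENSITIVE_KEYS field names and the redact_attributes helper are hypothetical, the email pattern is deliberately simple, and a plain SHA-256 digest is used only for illustration (a keyed hash may be more appropriate for low-entropy values).

```python
import hashlib
import re

# Simple email pattern for illustration; real PII detection needs more rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"user_id", "email", "phone"}  # assumed field names


def _hash(value):
    """Stable token so equal values still correlate across spans."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "h_" + digest[:12]


def redact_attributes(attributes):
    """Return a copy of span attributes with sensitive data masked."""
    clean = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            clean[key] = _hash(value)  # hash whole field
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value  # non-string values pass through
    return clean


# Made-up span attributes for the demonstration.
span = {"user_id": "u-123", "prompt": "mail alice@example.com please",
        "latency_ms": 42}
redacted = redact_attributes(span)
print(redacted["prompt"])
```

Hashing rather than deleting identifier fields preserves the ability to tell that two spans belong to the same user, which can matter when tracing a causal chain.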

Quick checklist for leaders

Inventory high‑value agent workflows and their trace availability (OTel / CloudWatch).
Enable detectors in staging with a MEDIUM threshold and an ON_FAILURE CI trigger.
Define owners for SYSTEM_PROMPT_FIX vs TOOL_DESCRIPTION_FIX remediations.
Implement PII redaction and a sampling policy to control cost.
Measure mean time to repair (MTTR) before & after detectors to quantify impact.

Key questions and short answers

What does detect_failures do?

It scans session spans with LLM-based analysis and, for each flagged span, assigns one or more failure categories, a confidence score, and supporting evidence.

How does analyze_root_cause separate root causes from symptoms?

It links detected failures into causal chains, labels each as PRIMARY/SECONDARY/TERTIARY, estimates propagation impact, and recommends fix types accordingly.

Can I diagnose production traces automatically?

Yes. Use CloudWatchProvider (or LangfuseProvider/OpenSearchProvider) to fetch OpenTelemetry or framework-exported traces; providers convert traces into Session objects for diagnosis.

How do I control inference costs?

Sample selectively (100% of failures, a fraction of successes), prune long traces down to their failure paths, reuse pre-detected failures instead of re-running detection, and run ALWAYS mode only at a low cadence.
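To make the cost levers concrete, here is a minimal sketch of a sampling decision and the cost formula from the guardrails section. The function names, the default 10% success rate, and every number in the usage lines are illustrative assumptions, not real prices or recommended settings.

```python
import random


def should_analyze(session_failed, success_rate=0.10, rng=random):
    """Analyze every failing session and a fraction of successful ones."""
    if session_failed:
        return True
    return rng.random() < success_rate


def estimate_cost(sessions, avg_tokens_per_session, price_per_token,
                  log_storage=0.0):
    """cost ≈ sessions × avg tokens per session × price per token + storage."""
    return sessions * avg_tokens_per_session * price_per_token + log_storage


# Illustrative numbers only; substitute your own volumes and pricing.
monthly = estimate_cost(
    sessions=1000,
    avg_tokens_per_session=2000,
    price_per_token=0.000001,
    log_storage=5.0,
)
print(f"estimated monthly cost: {monthly:.2f}")
```

Because the session count multiplies everything else, lowering the success-sampling rate is usually the first knob to turn when the estimate exceeds budget.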

Next steps

If you run AI agents in production—especially ones orchestrating tools, retrieval systems, or multi‑step workflows—adding LLM‑based failure detection and root cause analysis tightens feedback loops and reduces time to repair. Try a sample trace (for example, the flawed_session.json test traces in the Strands Evals repo) and run detect_failures with a MEDIUM threshold to see how recommendations map to your ownership model.

Explore the Strands Evals project and test traces here: Strands Evals on GitHub. For trace formats and instrumentation, see OpenTelemetry. If you plan to use Amazon Bedrock as the inference backend, review Amazon Bedrock for model and cost details.

Detectors aim to shorten diagnosis from hours to minutes by automatically extracting failures, causal links, and fix recommendations.