Google Auto‑Diagnose: LLM debugging for integration‑test failures at scale with ~90% accuracy

TL;DR: Google’s Auto‑Diagnose uses Gemini 2.5 Flash (no fine‑tuning) plus heavy prompt engineering and robust log plumbing to triage integration‑test failures automatically. It finds evidence‑backed root causes about 90% of the time in a manual eval, returns results fast (median 56s), and reduces debugging time while exposing observability gaps.

Why integration tests break and why diagnosis is expensive

An integration test fails and engineers drop what they’re doing to hunt logs. Often the trace is long, spread across a test driver and many services, and timestamped inconsistently. A quick survey inside Google showed most integration tests are functional (hermetic): they run an isolated scenario that depends on multiple components. Another survey of 116 engineers found 38.4% of integration‑test failures take more than an hour to diagnose and 8.9% take over a day. That tail is costly for any engineering organization.

Definitions for quick scanning:

  • SUT — System Under Test (the set of services/components exercised by the test).
  • Hermetic integration tests — tests that run in isolation with controlled dependencies, not relying on external services.
  • Critique — Google’s code‑review system where Auto‑Diagnose posts findings.
  • Timestamp stitching — joining logs from driver and components by timestamp to recreate an event timeline.
  • p90 — 90th percentile latency (how slow the slower 10% of runs are).
  • Token — a unit of text processed by an LLM (inputs + outputs consume tokens).

How Auto‑Diagnose works (architecture & prompt design)

Google runs Auto‑Diagnose in production. The core idea is simple: collect logs, stitch them into a coherent timeline, feed that context to a large language model (Gemini 2.5 Flash), and post a concise, evidence‑backed diagnosis to the code review. Rather than fine‑tuning the model, Google invested in three things: centralized log plumbing, precise prompt engineering, and hard refusal rules that prevent guessing.

High‑level flow:

  1. Failed test triggers Auto‑Diagnose via Pub/Sub.
  2. System gathers test‑driver logs and SUT component logs.
  3. Logs are timestamp‑stitched and enriched with component metadata (roles, owners, recent changes).
  4. Pre‑processing filters noise and optionally summarizes long sections to reduce token cost.
  5. Gemini 2.5 Flash receives the organized context with near‑deterministic prompt settings (temperature 0.1, top_p 0.8).
  6. Post‑processing formats the model’s evidence lines and recommendations, then posts into Critique.
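
The timestamp‑stitching step at the heart of this flow can be sketched as follows. This is a minimal illustration, not Google's implementation; the `LogLine` shape, field names, and truncation‑as‑summarization shortcut are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogLine:
    ts: datetime    # parsed timestamp (normalize time zones before this point)
    source: str     # "test-driver" or a SUT component name
    message: str

def stitch(driver_logs: list[LogLine], component_logs: list[LogLine]) -> list[LogLine]:
    """Timestamp stitching: merge driver and component logs into one timeline."""
    return sorted(driver_logs + component_logs, key=lambda line: line.ts)

def build_context(timeline: list[LogLine], max_lines: int = 2000) -> str:
    """Render the stitched timeline into the textual context handed to the model.
    The simple truncation here stands in for the real noise-filtering and
    summarization pass described in step 4."""
    rendered = [f"{l.ts.isoformat()} [{l.source}] {l.message}" for l in timeline[:max_lines]]
    return "\n".join(rendered)
```

The key property is that the model sees one interleaved event stream rather than per‑service log files, so causal ordering across components is visible in the prompt itself.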

Two guardrails are critical:

  • Refuse‑to‑guess — the prompt forces the model to return only well‑supported conclusions or explicitly request more information. That reduces hallucinations and surfaces missing logs.
  • Human feedback loop — reviewers and authors can mark diagnoses as “Please fix,” “Helpful,” or “Not helpful,” creating rapid supervision and iterative prompt improvements.
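
The refuse‑to‑guess guardrail is ultimately a prompt rule plus a post‑processing check. The wording below is hypothetical (Google's production prompt is not public), but it shows the shape of the mechanism:

```python
# Hypothetical prompt fragment encoding the refuse-to-guess rule; the exact
# wording and evidence thresholds are assumptions for illustration.
REFUSAL_RULES = """
Rules:
1. Cite at least two verbatim log lines as evidence for any root cause you state.
2. If the logs do not contain enough evidence, respond exactly with:
   MORE INFORMATION NEEDED: <which logs or signals are missing>
3. Never speculate beyond what the cited evidence supports.
"""

def classify_response(text: str) -> str:
    """Post-processing: route a model response as a diagnosis or a refusal.
    Refusals are valuable too -- each one flags an observability gap."""
    if text.strip().startswith("MORE INFORMATION NEEDED"):
        return "needs_more_info"
    return "diagnosis"
```

Routing refusals to a separate queue is what turns the guardrail into an observability audit: every "needs_more_info" result is a concrete logging gap to fix.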

“Auto‑Diagnose reads combined test‑driver and component logs, locates a plausible root cause, and posts a concise diagnosis into the code review.”

“The prompt forces the model to refuse to conclude when logs lack the necessary evidence, which reduces hallucinations and has even exposed logging infrastructure bugs.”

Measured results and developer feedback

  • Evaluation: Manual eval of 71 real failures from 39 teams → 90.14% root‑cause accuracy (compared to human‑labeled root causes). The sample is promising but modest; teams should replicate the evaluation on their own failure modes before wider rollout.
  • Production scale (since May 2025): processed 52,635 distinct failing tests across 224,782 executions, touching 91,130 code changes by 22,962 authors.
  • Latency: median (p50) = 56 seconds; p90 = 346 seconds — fast enough to appear while engineers are still working on the failure.
  • Token usage: average input ≈ 110,617 tokens; average output ≈ 5,962 tokens per run. Long logs drive token counts.
  • Developer feedback: 517 feedback reports from 437 developers; “Please fix” was the most common reviewer action (roughly 84.3%). Authors’ helpfulness ratio = 62.96% (Helpful / (Helpful + Not helpful)). “Not helpful” sits at 5.8%, below Google’s 10% retention threshold.
  • Side effects: ~20 “more information needed” prompts in production highlighted missing logs and revealed actual infrastructure bugs (e.g., crash paths without persisted logs).
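
The helpfulness ratio above follows directly from its definition. As a sanity‑check sketch (the raw Helpful/Not‑helpful counts below are hypothetical, chosen only to reproduce the reported 62.96%; the article does not publish the raw counts):

```python
def helpfulness_ratio(helpful: int, not_helpful: int) -> float:
    """Authors' helpfulness ratio as defined in the article; other feedback
    categories (e.g. "Please fix") are excluded from the denominator."""
    return helpful / (helpful + not_helpful)

# Hypothetical counts consistent with the reported 62.96%:
ratio = helpfulness_ratio(17, 10)
```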

Sample diagnosis (anonymized)

Timestamp evidence:

• 2025‑06‑04T11:03:21Z — Test driver reported: “agent failed to start: exit code 137.”

• 2025‑06‑04T11:03:22Z — svc‑auth container OOMKilled; no crash log persisted.

Conclusion:

• Likely root cause: svc‑auth process was OOMKilled during test startup (evidence lines above).

Recommendation:

• Reproduce locally with increased memory, add crash log persistence for OOM events, and add guardrails to capture container exit reasons.

Practical design lessons for AI automation and observability

Two lessons are immediately transferable to other organizations building LLM debugging tools:

  • Model choice vs. workflow investment: Fine‑tuning helps, but it’s not the only path. A general LLM + precise prompt engineering + reliable data plumbing can hit high accuracy quickly. The heavy lift is making telemetry consistent and accessible.
  • Fail safely: Constrain the model to refuse when evidence is missing. That both limits harmful hallucinations and becomes a continuous observability audit: when the model asks for more, fix the logging.

Cost, token economics, and ROI

Token usage matters. With average runs consuming ~116k tokens total, estimate per‑run cost as:

cost per run = (input tokens + output tokens) / 1,000 × model price per 1k tokens

Example (illustrative): at $0.03 per 1k tokens, one run ≈ 116k tokens → 116 × $0.03 ≈ $3.48. Compare that against developer time: if Auto‑Diagnose saves even 15 minutes of engineer time (at $100/hour fully loaded), that’s ~$25 saved versus a few dollars in model cost — positive ROI for high‑value suites. Check your vendor pricing and run strategies (summarization, selective context, retrieval augmentation) to reduce token load.
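
The arithmetic above is trivial, but worth encoding once so pilots compare like for like. This sketch uses the article's averages and the same illustrative flat price; real vendor pricing usually differs for input vs. output tokens, so treat `price_per_1k` as a placeholder.

```python
def cost_per_run(input_tokens: int, output_tokens: int, price_per_1k: float) -> float:
    """Per-run model cost from the formula above (flat per-1k-token price)."""
    return (input_tokens + output_tokens) / 1_000 * price_per_1k

# Article's average token counts with the illustrative $0.03 / 1k-token price:
run_cost = cost_per_run(110_617, 5_962, 0.03)   # ≈ $3.50 per run

# Value of engineer time saved: 15 minutes at $100/hour fully loaded
minutes_saved_value = 15 / 60 * 100             # $25.00
```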

Ways to cut token costs:

  • Pre‑summarize repetitive logs and include only delta or error windows.
  • Use embeddings + retrieval to supply only the most relevant log snippets rather than full streams.
  • Apply client‑side filters to remove low‑value noise before sending context to the model.
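
The first and third tactics can be combined in a simple client‑side pass: keep only a window of lines around each error and drop known noise patterns before anything reaches the model. The noise regex and window radius here are illustrative assumptions; tune both to your own log formats.

```python
import re

# Hypothetical noise patterns -- adjust to your logging conventions.
NOISE = re.compile(r"(health[- ]check|heartbeat|DEBUG)")

def error_window(lines: list[str], radius: int = 5) -> list[str]:
    """Keep only lines within `radius` of an ERROR/FATAL entry, minus noise --
    one way to send an 'error window' instead of the full stream."""
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if "ERROR" in line or "FATAL" in line:
            keep.update(range(max(0, i - radius), min(len(lines), i + radius + 1)))
    return [l for i, l in enumerate(lines) if i in keep and not NOISE.search(l)]
```

For very long suites, this pre‑filter composes naturally with the retrieval approach: filter first, then embed and retrieve from what remains.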

Limitations, risks, and operational considerations

  • Coverage: Hermetic integration tests with centralized logs are ideal. Less‑instrumented test suites, flaky tests, or production incidents with partial telemetry will reduce accuracy.
  • Evaluation caveats: The 90.14% accuracy figure came from a 71‑failure manual eval. That’s strong but not definitive across all architectures; replicate tests on your fleet before trusting automation blindly.
  • Security & compliance: Logs can contain PII or secrets. Options: run models in‑VPC/private endpoints, redact/transform sensitive fields prior to sending, or host models on‑prem. Implement encryption, access controls, and retention policies.
  • Maintenance: As log formats and systems evolve, prompts, refusal rules, and pre‑processing must be updated. Treat prompt rules as software that requires versioning and tests.
  • Skill shift: Teams may lean on automated triage. That’s often beneficial (more focus on higher‑value work), but maintain opportunities for engineers to exercise diagnostic skills through rotational duties or incident drills.

Implementation checklist for a pilot

  1. Pick a focused scope: start with one high‑value, well‑instrumented integration suite.
  2. Centralize logs and timestamps: ensure reliable timestamp stitching and component metadata (owners, role, binary versions).
  3. Define refusal rules: require evidence lines and an explicit “more information needed” path.
  4. Plan token strategy: measure average log length, try summarization, and budget cost per run.
  5. Instrument feedback: add reviewer/author feedback buttons and telemetry to iterate prompts and measure accuracy.
  6. Secure data paths: redact PII, use private endpoints, and enforce retention controls.
  7. Run a 6–8 week pilot: measure time‑to‑diagnose, helpfulness, false positives, and infra improvements surfaced.

Key questions and short answers

What problem does Auto‑Diagnose solve?

Automated root‑cause diagnosis for hermetic integration‑test failures by reading test‑driver and SUT logs, stitching timelines, and posting concise, evidence‑backed findings into code reviews.

How accurate is it?

About 90.14% root‑cause accuracy on a manual evaluation of 71 failures across 39 teams; production feedback and adoption also support practical usefulness, though teams should evaluate against their own failure modes.

Which model and approach were used?

Gemini 2.5 Flash, used without fine‑tuning. The system relies on careful prompt engineering, near‑deterministic settings (temperature = 0.1, top_p = 0.8), and strong pre/post processing.

Can this work outside Google?

Yes — but only if you invest in observability plumbing (centralized logs, reliable timestamps, metadata), prompt guardrails, token‑cost planning, and strong security controls.

Next steps for leaders

AI for software engineering pays off fastest when embedded into an end‑to‑end workflow. Auto‑Diagnose demonstrates three pragmatic truths: good telemetry beats custom fine‑tuning, refusal rules build trust, and automated triage surfaces infrastructure debt. For C‑suite and engineering leaders: prioritize observability investments, run a short pilot with clear success metrics, and treat prompts and pipelines as product features that need continuous improvement.

Start small, measure hard, and use automated diagnostics not to replace human judgment but to make human judgment faster and more focused.