How a Haystack Multi-Agent System Automates Incident Detection, Investigation, and Postmortems
TL;DR
A reproducible notebook demonstrates a Haystack-powered multi-agent system that detects incidents, investigates metrics and logs with evidence-backed SQL queries, proposes mitigations, and drafts production-grade postmortems automatically. The approach uses in-memory DuckDB for fast analytics, a rolling z‑score anomaly detector for explainable alerts, and three specialist AI agents (coordinator, profiler, writer) that are constrained to call programmatic tools instead of looking up external documents (no RAG — Retrieval-Augmented Generation). The result is structured, auditable output that teams can plug into runbooks and automation while retaining human oversight.
A simple real-world hook
At 02:14 a.m., the payments API shows a sudden p95 spike (95th percentile — a standard latency tail metric), a burst of errors, and a drop in RPS (requests per second). Who triages? How fast can the team form a testable hypothesis and push a fix? AI agents can accelerate that workflow while keeping the evidence auditable.
Why this matters for engineering leaders and SREs
Noise from metrics and logs is the daily grind for SREs. Human-led triage is slow and inconsistent. A tool-first, agentic (autonomous, multi-role AI) system can reduce time-to-detect and time-to-postmortem, codify runbooks, and produce machine-readable incident records. But to be useful in production, the system must avoid hallucinations, be traceable, and integrate with existing workflows. That’s where a Haystack multi-agent pattern combined with strict non-RAG grounding shines: AI agents are forced to call tools and run SQL against local data rather than inventing explanations.
“Prefer calling tools over guessing.”
Architecture & components
- Data layer: Synthetic or real metrics and logs are persisted as CSV, then loaded into an in-memory DuckDB for fast SQL analysis.
- Anomaly detector: A rolling z‑score (an explainable statistical method; the z‑score measures how many standard deviations a value is from the mean) flags contiguous anomaly spans and ranks candidate incident windows.
- Tools (callable functions): load_inputs, detect_incident_window, sql_investigate, log_pattern_scan, propose_mitigations, draft_postmortem. Each returns structured outputs consumed by the agents (a minimal tool-and-agent wiring sketch follows this list).
- Agents: Three LLM-driven roles orchestrated via Haystack:
  - Coordinator — runs the end-to-end flow and aggregates outputs.
  - Profiler — produces a JSON object with a falsifiable hypothesis, a named service/mechanism, symptoms, top contributors, and key facts.
  - Writer — formats a production-grade postmortem JSON and runbook snippets with owners and ETAs (in days).
- LLM: OpenAI model (gpt-4o-mini used in the demo) via OpenAIChatGenerator.
- State schema: inputs, incident_window, investigation_notes, hypothesis, key_facts, mitigation_plan, postmortem (structured JSON).
- Constraint: No RAG — agents must rely on local CSV/DuckDB queries and the provided tooling. This reduces hallucination risk and improves auditability.
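For orientation, here is a minimal sketch of how one of these tools and the coordinator agent might be wired together with haystack-ai 2.x. The sql_investigate body, the system prompt, and the exact Tool/Agent usage are assumptions about the demo rather than its actual code; only the class names (OpenAIChatGenerator, Tool, Agent, ChatMessage) come from the library itself.

import duckdb
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

con = duckdb.connect()  # in-memory DuckDB loaded from the synthetic CSV
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")

def sql_investigate(query: str) -> str:
    # Read-only SQL against the in-memory DuckDB; the agent must cite these rows as evidence.
    return con.execute(query).df().to_string(index=False)

sql_tool = Tool(
    name="sql_investigate",
    description="Run a SQL query against the metrics table loaded from CSV.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string", "description": "SQL to execute"}},
        "required": ["query"],
    },
    function=sql_investigate,
)

coordinator = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),  # reads OPENAI_API_KEY from the environment
    tools=[sql_tool],
    system_prompt="You are an incident coordinator. Prefer calling tools over guessing.",
)
# Some haystack-ai versions require coordinator.warm_up() before the first run().
result = coordinator.run(messages=[ChatMessage.from_user("Investigate the p95 spike in payments.")])
print(result["messages"][-1].text)

The same pattern extends to the profiler and writer agents: each gets its own system prompt and the subset of tools it is allowed to call, which is what keeps their outputs grounded in DuckDB results.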
Step-by-step pipeline (what the system actually does)
- Ingest: Metrics and logs are synthesized or streamed, saved to CSV, and loaded into DuckDB.
- Detect: The rolling z‑score detector scans metric time series to propose anomaly windows; candidate windows are ranked by anomaly score (a minimal detector sketch follows this list).
- Probe: The coordinator calls sql_investigate and log_pattern_scan tools to fetch correlated metrics, top endpoints, error kinds, and region/service breakdowns.
- Hypothesize: The profiler agent synthesizes evidence into a falsifiable hypothesis (must name a service/mechanism and include a testable claim). Example: “Payments-service DB connection pool exhausted causing p95 spike; a specific query shows connection saturation in db-proxy between t1–t2.”
- Validate: Additional SQL queries corroborate or refute the hypothesis. The system stores query outputs as key_facts to preserve traceability.
- Mitigate: The propose_mitigations tool suggests concrete fixes mapped to owners and ETA estimates (e.g., scale pool + increase timeout — owner: infra-team, ETA: 1 day).
- Document: The writer agent emits a structured postmortem JSON and runbook snippets ready for tickets or automation hooks.
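To make the Detect step concrete, here is a minimal rolling z‑score sketch in pandas. The column names (ts, p95_ms), the window size, and the threshold are illustrative assumptions, not the demo's exact parameters.

import pandas as pd

def flag_anomaly_windows(df: pd.DataFrame, value_col: str = "p95_ms",
                         window: int = 30, z_threshold: float = 3.0) -> pd.DataFrame:
    # Rolling z-score per point, then contiguous anomalous points grouped into candidate windows.
    roll = df[value_col].rolling(window, min_periods=window)
    df = df.copy()
    df["zscore"] = (df[value_col] - roll.mean()) / roll.std()
    df["is_anomaly"] = df["zscore"].abs() > z_threshold
    # A new run id starts whenever the anomaly flag flips, so each contiguous span gets one id.
    df["window_id"] = (df["is_anomaly"] != df["is_anomaly"].shift()).cumsum()
    spans = (df[df["is_anomaly"]]
             .groupby("window_id")
             .agg(start=("ts", "min"), end=("ts", "max"), score=("zscore", "max")))
    return spans.sort_values("score", ascending=False)  # ranked candidate incident windows

Because the detector is a plain statistical rule, every flagged window can be explained by pointing at the rolling mean, standard deviation, and threshold that produced it.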
Example incident timeline
An incident injected at roughly the 62% mark of a 24‑hour synthetic timeline produced the following flow:
- T+0: Detector flags a contiguous anomaly window driven by p95 latency and error-rate spikes across payments and db-proxy.
- T+2m: sql_investigate shows connection pool exhaustion on db-proxy and a drop in RPS for payments.
- T+5m: log_pattern_scan surfaces repeated “DBConnPoolExhausted” messages and an increase in upstream timeouts.
- T+7m: Profiler emits a hypothesis naming the db-proxy connection pool as the primary mechanism, with a falsifiable test (a query confirming max connections were reached during the window; see the sketch after this timeline).
- T+12m: Writer produces postmortem JSON with corrective actions (increase pool, add saturation alarm, add retry/backoff), owners, and ETAs.
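The falsifiable test from the timeline can be expressed as a single DuckDB query. The table and column names below (db_metrics, max_connections, configured_pool_size) are taken from the sample postmortem JSON in the next section and may differ from a real schema; the window bounds are the ones shown there.

import duckdb

con = duckdb.connect()  # in-memory DuckDB loaded from the synthetic CSV
con.execute("CREATE TABLE db_metrics AS SELECT * FROM read_csv_auto('db_metrics.csv')")

# Did connection usage on db-proxy hit the configured pool size inside the incident window?
peak, pool = con.execute("""
    SELECT max(max_connections) AS peak_connections,
           any_value(configured_pool_size) AS pool
    FROM db_metrics
    WHERE ts BETWEEN TIMESTAMP '2026-01-26 14:48:00' AND TIMESTAMP '2026-01-26 15:03:00'
""").fetchone()
print("hypothesis confirmed" if peak >= pool else "hypothesis refuted", peak, pool)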
Sample postmortem JSON output
{
  "incident_id": "inc-20260127-001",
  "incident_window": {"start": "2026-01-26T14:48:00Z", "end": "2026-01-26T15:03:00Z"},
  "symptoms": ["p95 latency spike payments", "error-rate increase", "RPS drop payments"],
  "top_contributors": [
    {"service": "db-proxy", "metric": "max_connections", "value": 500},
    {"service": "payments", "metric": "error_rate", "value": 0.12}
  ],
  "hypothesis": {
    "statement": "db-proxy connection pool exhaustion caused upstream timeouts in payments.",
    "service": "db-proxy",
    "mechanism": "connection_pool_exhaustion",
    "falsifiable_test": "Query peak_connections >= configured_pool_size during incident window"
  },
  "key_facts": [
    {"query": "SELECT max_connections FROM db_metrics WHERE ts BETWEEN ...", "result_summary": "peak=500, configured=250"}
  ],
  "mitigations": [
    {"action": "Increase db-proxy pool to 600", "owner": "infra-team", "eta_days": 1},
    {"action": "Add alert on connections > 80% of pool", "owner": "observability", "eta_days": 2}
  ],
  "runbook_snippets": ["Step 1: Verify pool settings", "Step 2: Increase pool and restart db-proxy"],
  "confidence": 0.87
}
Evaluation metrics and governance
Measure outcomes, not just activity. Recommended KPIs:
- Detection metrics: MTTD (mean time to detect), precision/recall or F1 for incident windows, and false positive rate (a worked scoring example follows this list).
- Investigation metrics: Hypothesis validation rate (percent of profiler hypotheses validated by SQL checks), time-to-hypothesis.
- Documentation metrics: Time-to-postmortem, percent of postmortems accepted without human rewrite, completeness score (presence of owners, ETAs, falsifiable test).
- Business impact: MTTR reduction, number of incidents with automated runbook steps executed, cost savings from fewer escalations.
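As a worked example of the detection KPIs, detected windows can be scored against labeled incident windows by time overlap. The overlap rule (at least 60 seconds) and the tuple format are illustrative choices, not a standard definition.

def window_metrics(detected, labeled, min_overlap_s=60):
    # Precision/recall/F1 for incident windows: a detected window counts as a true
    # positive if it overlaps some labeled window by at least min_overlap_s seconds.
    def overlaps(a, b):
        return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap_s
    tp = sum(any(overlaps(d, l) for l in labeled) for d in detected)
    fp = len(detected) - tp
    fn = sum(not any(overlaps(d, l) for d in detected) for l in labeled)
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if labeled else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with (start_epoch, end_epoch) windows:
print(window_metrics(detected=[(100, 400)], labeled=[(150, 500), (900, 1000)]))
# -> precision 1.0, recall 0.5, F1 of roughly 0.67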
Key questions and short answers
- How reproducible is this approach for an engineering team?
Very reproducible: the demo runs from a single notebook, installs dependencies at runtime, writes CSVs, and loads them into DuckDB so investigators can replay and audit every step.
- Can agents be trusted to avoid hallucination?
Hallucination risk is reduced but not eliminated: non‑RAG grounding and tool-first reasoning significantly lower it because agents must cite SQL results and tool outputs rather than invent facts.
- Will this scale to messy production telemetry?
The pattern scales, but production requires additional work: schema validation, synthetic canaries, monitoring for schema drift, and human-in-the-loop gates for high-risk actions.
- What controls are needed before automated mitigations run?
Confidence thresholds, manual approvals for destructive actions, rollback procedures, and a staged rollout (suggestions → confirmed fixes → automated remediation) are essential.
Failure modes and practical mitigations
- Schema drift: Timestamp or field-name changes break SQL. Mitigation: a schema validation layer and lightweight ETL that normalizes fields (see the sketch after this list).
- Partial instrumentation: Missing metrics for a new service. Mitigation: fall back to logs, add synthetic canaries, surface uncertainty in postmortem.
- Correlated noise: Shared infra causing false correlation. Mitigation: cross-check with region/service-level aggregation and include confidence scoring.
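A lightweight schema check before DuckDB ingest catches the drift and partial-instrumentation failure modes early. The required columns below are assumptions about the demo's metrics CSV; adjust them to your telemetry schema.

import pandas as pd

REQUIRED_COLUMNS = {"ts", "service", "metric", "value"}  # assumed column set

def validate_metrics_csv(path: str) -> pd.DataFrame:
    # Fail fast on missing columns or unparseable values before loading into DuckDB.
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"metrics CSV missing columns: {sorted(missing)}")
    df["ts"] = pd.to_datetime(df["ts"], errors="raise")      # surfaces timestamp format drift
    df["value"] = pd.to_numeric(df["value"], errors="raise")  # surfaces type drift
    return df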
Rollout plan and reproducibility checklist
Recommended phased rollout:
- Phase 0 — Read-only: Agents suggest hypotheses and mitigations; humans validate and act.
- Phase 1 — Assisted ops: Agents create tickets and populate runbooks; humans approve before execution.
- Phase 2 — Conditional automation: Permit automated fixes for low-risk, high-confidence scenarios with rollback playbooks.
Reproducibility checklist (minimal)
- Notebook repository (the demo includes a runnable notebook with dependency installation).
- Python packages: haystack-ai, openai, pandas, numpy, duckdb (install at runtime in the notebook).
- OpenAI API key stored securely (avoid hardcoding; use environment variables or a secrets manager; see the preamble sketch after this checklist).
- Expected runtime: minutes to run the demo dataset; production requires resource planning for log ingest and DuckDB memory.
- Costs: LLM calls (gpt-4o-mini) per agent step — budget and rate-limit accordingly.
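A minimal notebook preamble matching this checklist might look as follows; the package list mirrors the bullet above and is left unpinned for brevity.

# In a notebook cell, install dependencies at runtime:
# %pip install haystack-ai openai pandas numpy duckdb

import os
from getpass import getpass

# Never hardcode the key; read it from the environment (or a secrets manager),
# and prompt once per session only as a fallback.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")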
Security, privacy, and compliance
External LLMs create data exposure risks. Best practices include:
- Redact or tokenise sensitive data (PII, secrets) before sending it to an LLM (a minimal redaction sketch follows this list).
- Use private LLM endpoints, bring-your-own-model (BYOM) in a VPC, or vendor contracts that guarantee non-retention.
- Log all LLM inputs and outputs to enable audits; tie them to incident records in DuckDB or your observability store.
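A simple redaction pass applied to tool outputs before any LLM call keeps obvious secrets and PII out of prompts. The patterns below are illustrative and not a substitute for a proper DLP or tokenisation layer.

import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),           # email addresses
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "<CARD_NUMBER>"),     # likely card numbers
    (re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
]

def redact(text: str) -> str:
    # Replace obvious PII/secrets in tool output before it is sent to the LLM.
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text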
Next steps for teams
- Run the demo notebook to see the coordinator → profiler → writer loop and the resulting postmortem JSON.
- Start with a small pilot covering non-critical services and a read-only mode to measure false positive and hypothesis validity rates.
- Embed the structured postmortem outputs into ticketing or runbook systems so they feed automation and governance workflows.
Agentic AI for SRE—when built with explicit state, tool-first reasoning, and non-RAG grounding—can turn noisy telemetry into auditable, falsifiable incident narratives. The pattern is a practical way to scale investigation capacity while preserving human control over critical mitigations.