How a Haystack Multi-Agent System Automates Incident Detection, Investigation, and Postmortems

TL;DR

A reproducible notebook demonstrates a Haystack-powered multi-agent system that detects incidents, investigates metrics and logs with evidence-backed SQL queries, proposes mitigations, and drafts production-grade postmortems automatically. The approach uses in-memory DuckDB for fast analytics, a rolling z‑score anomaly detector for explainable alerts, and three specialist AI agents (coordinator, profiler, writer) that are constrained to call programmatic tools instead of looking up external documents (no RAG — Retrieval-Augmented Generation). The result is structured, auditable output that teams can plug into runbooks and automation while retaining human oversight.

A simple real-world hook

At 02:14 a.m., the payments API shows a sudden p95 spike (95th percentile — a standard latency tail metric), a burst of errors, and a drop in RPS (requests per second). Who triages? How fast can the team form a testable hypothesis and push a fix? AI agents can accelerate that workflow while keeping the evidence auditable.

Why this matters for engineering leaders and SREs

Noise from metrics and logs is the daily grind for SREs, and human-led triage is slow and inconsistent. A tool-first, agentic (autonomous, multi-role AI) system can reduce time-to-detect and time-to-postmortem, codify runbooks, and produce machine-readable incident records. But to be useful in production, the system must avoid hallucinations, be traceable, and integrate with existing workflows. That’s where a Haystack multi-agent pattern combined with strict non-RAG grounding shines: AI agents are forced to call tools and run SQL against local data rather than inventing explanations.

“Prefer calling tools over guessing.”

Architecture & components

  • Data layer: Synthetic or real metrics and logs are persisted as CSV, then loaded into an in-memory DuckDB for fast SQL analysis.
  • Anomaly detector: A rolling z-score (an explainable statistical method; the z-score measures how many standard deviations a value is from the mean) flags contiguous anomaly spans and ranks candidate incident windows (see the sketch after this list).
  • Tools (callable functions): load_inputs, detect_incident_window, sql_investigate, log_pattern_scan, propose_mitigations, draft_postmortem. Each returns structured outputs consumed by agents.
  • Agents: Three LLM-driven roles orchestrated via Haystack:
    • Coordinator — runs the end-to-end flow and aggregates outputs.
    • Profiler — produces a JSON object containing a falsifiable hypothesis, the implicated service and mechanism, symptoms, top contributors, and key facts.
    • Writer — formats a production-grade postmortem JSON and runbook snippets with owners and ETAs (days).
  • LLM: OpenAI model (gpt-4o-mini used in the demo) via OpenAIChatGenerator.
  • State schema: inputs, incident_window, investigation_notes, hypothesis, key_facts, mitigation_plan, postmortem (structured JSON).
  • Constraint: No RAG — agents must rely on local CSV/DuckDB queries and the provided tooling. This reduces hallucination risk and improves auditability.
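
To make the data layer and detector bullets concrete, the following is a minimal sketch of the pattern, not the demo's exact code: it assumes an illustrative metrics.csv with ts, service, and p95_ms columns, loads it into in-memory DuckDB, computes a rolling z-score, and groups consecutive anomalous points into ranked candidate windows.

import duckdb

# Load the synthetic metrics CSV into an in-memory DuckDB database.
con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")

# Pull one service's latency series, ordered by timestamp, as a pandas DataFrame.
df = con.execute("""
    SELECT ts, p95_ms
    FROM metrics
    WHERE service = 'payments'
    ORDER BY ts
""").df()

# Rolling z-score: how many standard deviations each point sits from the mean
# of its trailing window. Large |z| marks a candidate anomaly.
window = 30
rolling = df["p95_ms"].rolling(window, min_periods=window)
df["z"] = (df["p95_ms"] - rolling.mean()) / rolling.std()
df["anomaly"] = df["z"].abs() > 3.0

# Group consecutive anomalous points into contiguous spans and rank spans by
# peak z-score so the strongest candidate incident window is investigated first.
df["span_id"] = (df["anomaly"] != df["anomaly"].shift()).cumsum()
spans = (
    df[df["anomaly"]]
    .groupby("span_id")
    .agg(start=("ts", "min"), end=("ts", "max"), peak_z=("z", "max"))
    .sort_values("peak_z", ascending=False)
)
print(spans.head())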

Step-by-step pipeline (what the system actually does)

  1. Ingest: Metrics and logs are synthesized or streamed, saved to CSV, and loaded into DuckDB.
  2. Detect: The rolling z‑score detector scans metric time-series to propose anomaly windows. Candidate windows are ranked by anomaly score.
  3. Probe: The coordinator calls the sql_investigate and log_pattern_scan tools to fetch correlated metrics, top endpoints, error kinds, and region/service breakdowns (see the wiring sketch after this list).
  4. Hypothesize: The profiler agent synthesizes evidence into a falsifiable hypothesis (must name a service/mechanism and include a testable claim). Example: “Payments-service DB connection pool exhausted causing p95 spike; a specific query shows connection saturation in db-proxy between t1–t2.”
  5. Validate: Additional SQL queries corroborate or refute the hypothesis. The system stores query outputs as key_facts to preserve traceability.
  6. Mitigate: The propose_mitigations tool suggests concrete fixes mapped to owners and ETA estimates (e.g., scale pool + increase timeout — owner: infra-team, ETA: 1 day).
  7. Document: The writer agent emits a structured postmortem JSON and runbook snippets ready for tickets or automation hooks.
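
The coordinator loop in steps 3 and 4 can be wired with Haystack's tool-calling primitives. The sketch below is illustrative rather than the demo's exact code: it assumes a recent haystack-ai release (roughly 2.12 or later) that exposes the Agent component and Tool dataclass under these import paths, and the same illustrative metrics table as in the earlier sketch; constructor arguments may differ slightly between versions.

import duckdb
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

# Local, auditable data layer: the same in-memory DuckDB the tools query.
con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")

def sql_investigate(query: str) -> str:
    """Run a SQL query against the local DuckDB tables and return the rows as text."""
    return con.execute(query).df().to_string(index=False)

sql_tool = Tool(
    name="sql_investigate",
    description="Run DuckDB SQL against the local metrics table to gather evidence.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string", "description": "A DuckDB SQL query"}},
        "required": ["query"],
    },
    function=sql_investigate,
)

coordinator = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    tools=[sql_tool],
    system_prompt=(
        "You are an SRE incident coordinator. Prefer calling tools over guessing; "
        "cite query results as key facts and never invent numbers."
    ),
)
coordinator.warm_up()

result = coordinator.run(
    messages=[ChatMessage.from_user("Investigate the p95 latency spike in the payments service.")]
)
print(result["messages"][-1].text)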

Example incident timeline

An incident injected at roughly the 62% point of a 24-hour synthetic timeline produced the following flow:

  • T+0: Detector flags a contiguous anomaly window driven by p95 latency and error-rate spikes across payments and db-proxy.
  • T+2m: sql_investigate shows connection pool exhaustion on db-proxy and a drop in RPS for payments.
  • T+5m: log_pattern_scan surfaces repeated “DBConnPoolExhausted” messages and an increase in upstream timeouts.
  • T+7m: Profiler emits hypothesis naming db-proxy connection pool as the primary mechanism with a falsifiable test (query to confirm max connections reached during window).
  • T+12m: Writer produces postmortem JSON with corrective actions (increase pool, add saturation alarm, add retry/backoff), owners, and ETAs.

Sample postmortem JSON schema

{
  "incident_id": "inc-20260127-001",
  "incident_window": {"start": "2026-01-26T14:48:00Z", "end": "2026-01-26T15:03:00Z"},
  "symptoms": ["p95 latency spike payments", "error-rate increase", "RPS drop payments"],
  "top_contributors": [
    {"service": "db-proxy", "metric": "max_connections", "value": 500},
    {"service": "payments", "metric": "error_rate", "value": 0.12}
  ],
  "hypothesis": {
    "statement": "db-proxy connection pool exhaustion caused upstream timeouts in payments.",
    "service": "db-proxy",
    "mechanism": "connection_pool_exhaustion",
    "falsifiable_test": "Query peak_connections >= configured_pool_size during incident window"
  },
  "key_facts": [
    {"query": "SELECT max_connections FROM db_metrics WHERE ts BETWEEN ...", "result_summary": "peak=500, configured=250"}
  ],
  "mitigations": [
    {"action": "Increase db-proxy pool to 600", "owner": "infra-team", "eta_days": 1},
    {"action": "Add alert on connections > 80% of pool", "owner": "observability", "eta_days": 2}
  ],
  "runbook_snippets": ["Step 1: Verify pool settings", "Step 2: Increase pool and restart db-proxy"],
  "confidence": 0.87
}
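
Before a generated postmortem is filed, a lightweight completeness check can enforce this schema. The sketch below validates the field names used in the sample above; the required-field choices and the postmortem.json filename are assumptions for illustration, not part of the demo.

import json

# Field names taken from the sample postmortem above; adjust to your own template.
REQUIRED_TOP_LEVEL = [
    "incident_id", "incident_window", "symptoms", "top_contributors",
    "hypothesis", "key_facts", "mitigations", "runbook_snippets", "confidence",
]
REQUIRED_HYPOTHESIS = ["statement", "service", "mechanism", "falsifiable_test"]

def validate_postmortem(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the document passes."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP_LEVEL if k not in doc]
    hypothesis = doc.get("hypothesis", {})
    problems += [f"hypothesis missing: {k}" for k in REQUIRED_HYPOTHESIS if k not in hypothesis]
    for i, m in enumerate(doc.get("mitigations", [])):
        if "owner" not in m or "eta_days" not in m:
            problems.append(f"mitigation {i} lacks an owner or eta_days")
    if not 0.0 <= doc.get("confidence", -1.0) <= 1.0:
        problems.append("confidence must be between 0 and 1")
    return problems

with open("postmortem.json") as f:  # wherever the writer agent saved its output
    issues = validate_postmortem(json.load(f))
print("postmortem OK" if not issues else issues)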

Evaluation metrics and governance

Measure outcomes, not just activity. Recommended KPIs:

  • Detection metrics: MTTD (mean time to detect), precision/recall or F1 for incident windows, and false positive rate (see the scoring sketch after this list).
  • Investigation metrics: Hypothesis validation rate (percent of profiler hypotheses validated by SQL checks), time-to-hypothesis.
  • Documentation metrics: Time-to-postmortem, percent of postmortems accepted without human rewrite, completeness score (presence of owners, ETAs, falsifiable test).
  • Business impact: MTTR reduction, number of incidents with automated runbook steps executed, cost savings from fewer escalations.
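
For the detection KPIs, window-level precision/recall and MTTD can be computed against a hand-labelled set of true incident windows. The sketch below treats any overlap between a predicted and a true window as a match, which is one reasonable rule among several; the example values echo the sample incident above.

from datetime import datetime, timedelta

Window = tuple[datetime, datetime]  # (start, end) of an incident window

def overlaps(a: Window, b: Window) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def detection_kpis(predicted: list[Window], actual: list[Window]) -> dict:
    """Window-level precision/recall plus mean time to detect (MTTD), in seconds."""
    true_positives = [p for p in predicted if any(overlaps(p, a) for a in actual)]
    detected = [a for a in actual if any(overlaps(p, a) for p in predicted)]
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(detected) / len(actual) if actual else 0.0
    # MTTD: delay from the true start to the start of the earliest overlapping
    # predicted window; early detections count as zero delay.
    delays = []
    for a in detected:
        first_flag = min(p[0] for p in predicted if overlaps(p, a))
        delays.append(max(first_flag - a[0], timedelta(0)))
    mttd = sum(d.total_seconds() for d in delays) / len(delays) if delays else None
    return {"precision": precision, "recall": recall, "mttd_seconds": mttd}

actual = [(datetime(2026, 1, 26, 14, 48), datetime(2026, 1, 26, 15, 3))]
predicted = [(datetime(2026, 1, 26, 14, 50), datetime(2026, 1, 26, 15, 5))]
print(detection_kpis(predicted, actual))  # precision 1.0, recall 1.0, mttd 120s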

Key questions and short answers

  • How reproducible is this approach for an engineering team?

    Very reproducible: the demo runs from a single notebook, installs dependencies at runtime, writes CSVs, and loads them into DuckDB so investigators can replay and audit every step.

  • Can agents be trusted to avoid hallucination?

    The risk is reduced, not eliminated: non-RAG grounding and tool-first reasoning significantly lower the chance of hallucination because agents must cite SQL results and tool outputs rather than invent facts.

  • Will this scale to messy production telemetry?

    The pattern scales, but production requires additional work: schema validation, synthetic canaries, monitoring for schema drift, and human-in-the-loop gates for high-risk actions.

  • What controls are needed before automated mitigations run?

    Confidence thresholds, manual approvals for destructive actions, rollback procedures, and a staged rollout (suggestions → confirmed fixes → automated remediation) are essential.
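
One way to encode those controls is a small gate that maps hypothesis confidence and action risk onto the staged rollout; the sketch below uses thresholds chosen purely for illustration.

def mitigation_gate(confidence: float, destructive: bool, human_approved: bool) -> str:
    """Decide how far a proposed mitigation may proceed; thresholds are illustrative."""
    if destructive and not human_approved:
        return "suggest_only"            # destructive actions always wait for a human
    if confidence < 0.6:
        return "suggest_only"            # low confidence: surface the hypothesis only
    if confidence < 0.9:
        return "ticket_for_approval"     # medium: open a ticket and wait for sign-off
    return "automate_with_rollback"      # high confidence, non-destructive: run, rollback ready

print(mitigation_gate(confidence=0.87, destructive=False, human_approved=False))
# -> ticket_for_approval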

Failure modes and practical mitigations

  • Schema drift: Timestamp or field-name changes break SQL. Mitigation: a schema validation layer and lightweight ETL that normalizes fields (see the sketch after this list).
  • Partial instrumentation: Missing metrics for a new service. Mitigation: fall back to logs, add synthetic canaries, and surface the uncertainty in the postmortem.
  • Correlated noise: Shared infra causing false correlation. Mitigation: cross-check with region/service-level aggregation and include confidence scoring.
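
The schema-drift mitigation can start as a pre-flight check that compares the live tables against the columns the SQL tools expect and refuses to run if they diverge. The sketch below queries DuckDB's information_schema; the expected column set is an assumption about the metrics table, not the demo's actual schema.

import duckdb

# Columns the downstream SQL tools assume exist on the metrics table (assumed names).
EXPECTED_COLUMNS = {"ts", "service", "p95_ms", "error_rate", "rps"}

def check_metrics_schema(con: duckdb.DuckDBPyConnection) -> list[str]:
    """Compare the live table schema with what the tools expect and report drift."""
    rows = con.execute(
        "SELECT column_name FROM information_schema.columns WHERE table_name = 'metrics'"
    ).fetchall()
    actual = {r[0] for r in rows}
    problems = [f"missing column: {c}" for c in sorted(EXPECTED_COLUMNS - actual)]
    problems += [f"unexpected column: {c}" for c in sorted(actual - EXPECTED_COLUMNS)]
    return problems

con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")
drift = check_metrics_schema(con)
if drift:
    raise RuntimeError(f"Schema drift detected, refusing to run tools: {drift}")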

Rollout plan and reproducibility checklist

Recommended phased rollout:

  1. Phase 0 — Read-only: Agents suggest hypotheses and mitigations; humans validate and act.
  2. Phase 1 — Assisted ops: Agents create tickets and populate runbooks; humans approve before execution.
  3. Phase 2 — Conditional automation: Permit automated fixes for low-risk, high-confidence scenarios with rollback playbooks.

Reproducibility checklist (minimal)

  • Notebook repository (demo includes runnable notebook with dependency install).
  • Python packages: haystack-ai, openai, pandas, numpy, duckdb (install at runtime in the notebook).
  • OpenAI API key stored securely: avoid hardcoding; use environment variables or a secrets manager (see the sketch after this list).
  • Expected runtime: minutes to run the demo dataset; production requires resource planning for log ingest and DuckDB memory.
  • Costs: LLM calls (gpt-4o-mini) per agent step — budget and rate-limit accordingly.
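
For the dependency and API-key items, a minimal pattern is to install packages in the notebook's first cell and read the key from the environment, failing fast if it is missing; OpenAIChatGenerator picks the key up from OPENAI_API_KEY by default.

import os

# In the notebook's first cell: %pip install haystack-ai openai pandas numpy duckdb

# Fail fast if the key is absent instead of letting an agent call fail mid-run.
# OpenAIChatGenerator reads OPENAI_API_KEY from the environment, so nothing is hardcoded.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError(
        "OPENAI_API_KEY is not set. Export it or load it from your secrets manager."
    )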

Security, privacy, and compliance

External LLMs create data exposure risks. Best practices include:

  • Redact or tokenize sensitive data (PII, secrets) before sending anything to an LLM (see the sketch after this list).
  • Use private LLM endpoints, bring-your-own-model (BYOM) in a VPC, or vendor contracts that guarantee non-retention.
  • Log all LLM inputs and outputs to enable audits; tie them to incident records in DuckDB or your observability store.
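
A minimal redaction pass can run over every tool output before it is handed to the LLM. The patterns below are illustrative placeholders; replace them with your organization's own PII and secret detectors.

import re

# Illustrative patterns only; extend with your own PII and secret detectors.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),                      # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<card_number>"),                 # card-like digit runs
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*\S+"), r"\1=<redacted>"),  # inline secrets
]

def redact(text: str) -> str:
    """Replace likely PII and secrets in tool output before it reaches the LLM."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("user=alice@example.com api_key: sk-123456 card 4111 1111 1111 1111"))
# -> user=<email> api_key=<redacted> card <card_number>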

Next steps for teams

  • Run the demo notebook to see the coordinator → profiler → writer loop and the resulting postmortem JSON.
  • Start with a small pilot covering non-critical services and a read-only mode to measure false positive and hypothesis validity rates.
  • Embed the structured postmortem outputs into ticketing or runbook systems so they feed automation and governance workflows.

Agentic AI for SRE—when built with explicit state, tool-first reasoning, and non-RAG grounding—can turn noisy telemetry into auditable, falsifiable incident narratives. The pattern is a practical way to scale investigation capacity while preserving human control over critical mitigations.