How a Haystack Multi-Agent System Automates Incident Detection, Investigation, and Postmortems

TL;DR

A reproducible notebook demonstrates a Haystack-powered multi-agent system that detects incidents, investigates metrics and logs with evidence-backed SQL queries, proposes mitigations, and drafts production-grade postmortems automatically. The approach uses in-memory DuckDB for fast analytics, a rolling z‑score anomaly detector for explainable alerts, and three specialist AI agents (coordinator, profiler, writer) that are constrained to call programmatic tools instead of looking up external documents (no RAG — Retrieval-Augmented Generation). The result is structured, auditable output that teams can plug into runbooks and automation while retaining human oversight.

A simple real-world hook

At 02:14 a.m., the payments API shows a sudden p95 spike (95th percentile — a standard latency tail metric), a burst of errors, and a drop in RPS (requests per second). Who triages? How fast can the team form a testable hypothesis and push a fix? AI agents can accelerate that workflow while keeping the evidence auditable.

Why this matters for engineering leaders and SREs

Noise from metrics and logs is the daily grind for SREs, and human-led triage is slow and inconsistent. A tool-first, agentic (autonomous, multi-role AI) system can reduce time-to-detect and time-to-postmortem, codify runbooks, and produce machine-readable incident records. But to be useful in production, the system must avoid hallucinations, be traceable, and integrate with existing workflows. That’s where a Haystack multi-agent pattern combined with strict non-RAG grounding shines: AI agents are forced to call tools and run SQL against local data rather than inventing explanations.

“Prefer calling tools over guessing.”

Architecture & components

  • Data layer: Synthetic or real metrics and logs are persisted as CSV, then loaded into an in-memory DuckDB for fast SQL analysis.
  • Anomaly detector: A rolling z-score (an explainable statistical method; the z-score measures how many standard deviations a value is from the mean) flags contiguous anomaly spans and ranks candidate incident windows (see the sketch after this list).
  • Tools (callable functions): load_inputs, detect_incident_window, sql_investigate, log_pattern_scan, propose_mitigations, draft_postmortem. Each returns structured outputs consumed by agents.
  • Agents: Three LLM-driven roles orchestrated via Haystack:
    • Coordinator — runs the end-to-end flow and aggregates outputs.
    • Profiler — produces a JSON object containing a falsifiable hypothesis, the implicated service and mechanism, symptoms, top contributors, and key facts.
    • Writer — formats a production-grade postmortem JSON and runbook snippets with owners and ETAs (days).
  • LLM: OpenAI model (gpt-4o-mini used in the demo) via OpenAIChatGenerator.
  • State schema: inputs, incident_window, investigation_notes, hypothesis, key_facts, mitigation_plan, postmortem (structured JSON).
  • Constraint: No RAG — agents must rely on local CSV/DuckDB queries and the provided tooling. This reduces hallucination risk and improves auditability.
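
To make the data layer and detector bullets concrete, the following is a minimal sketch of the pattern, not the demo's exact code: it assumes an illustrative metrics.csv with ts, service, and p95_ms columns, loads it into in-memory DuckDB, computes a rolling z-score, and groups consecutive anomalous points into ranked candidate windows.

import duckdb

# Load the synthetic metrics CSV into an in-memory DuckDB database.
con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")

# Pull one service's latency series, ordered by timestamp, as a pandas DataFrame.
df = con.execute("""
    SELECT ts, p95_ms
    FROM metrics
    WHERE service = 'payments'
    ORDER BY ts
""").df()

# Rolling z-score: how many standard deviations each point sits from the mean
# of its trailing window. Large |z| marks a candidate anomaly.
window = 30
rolling = df["p95_ms"].rolling(window, min_periods=window)
df["z"] = (df["p95_ms"] - rolling.mean()) / rolling.std()
df["anomaly"] = df["z"].abs() > 3.0

# Group consecutive anomalous points into contiguous spans and rank spans by
# peak z-score so the strongest candidate incident window is investigated first.
df["span_id"] = (df["anomaly"] != df["anomaly"].shift()).cumsum()
spans = (
    df[df["anomaly"]]
    .groupby("span_id")
    .agg(start=("ts", "min"), end=("ts", "max"), peak_z=("z", "max"))
    .sort_values("peak_z", ascending=False)
)
print(spans.head())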

Step-by-step pipeline (what the system actually does)

  1. Ingest: Metrics and logs are synthesized or streamed, saved to CSV, and loaded into DuckDB.
  2. Detect: The rolling z‑score detector scans metric time-series to propose anomaly windows. Candidate windows are ranked by anomaly score.
  3. Probe: The coordinator calls the sql_investigate and log_pattern_scan tools to fetch correlated metrics, top endpoints, error kinds, and region/service breakdowns (see the wiring sketch after this list).
  4. Hypothesize: The profiler agent synthesizes evidence into a falsifiable hypothesis (must name a service/mechanism and include a testable claim). Example: “Payments-service DB connection pool exhausted causing p95 spike; a specific query shows connection saturation in db-proxy between t1–t2.”
  5. Validate: Additional SQL queries corroborate or refute the hypothesis. The system stores query outputs as key_facts to preserve traceability.
  6. Mitigate: The propose_mitigations tool suggests concrete fixes mapped to owners and ETA estimates (e.g., scale pool + increase timeout — owner: infra-team, ETA: 1 day).
  7. Document: The writer agent emits a structured postmortem JSON and runbook snippets ready for tickets or automation hooks.
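
The coordinator loop in steps 3 and 4 can be wired with Haystack's tool-calling primitives. The sketch below is illustrative rather than the demo's exact code: it assumes a recent haystack-ai release (roughly 2.12 or later) that exposes the Agent component and Tool dataclass under these import paths, and the same illustrative metrics table as in the earlier sketch; constructor arguments may differ slightly between versions.

import duckdb
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

# Local, auditable data layer: the same in-memory DuckDB the tools query.
con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")

def sql_investigate(query: str) -> str:
    """Run a SQL query against the local DuckDB tables and return the rows as text."""
    return con.execute(query).df().to_string(index=False)

sql_tool = Tool(
    name="sql_investigate",
    description="Run DuckDB SQL against the local metrics table to gather evidence.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string", "description": "A DuckDB SQL query"}},
        "required": ["query"],
    },
    function=sql_investigate,
)

coordinator = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    tools=[sql_tool],
    system_prompt=(
        "You are an SRE incident coordinator. Prefer calling tools over guessing; "
        "cite query results as key facts and never invent numbers."
    ),
)
coordinator.warm_up()

result = coordinator.run(
    messages=[ChatMessage.from_user("Investigate the p95 latency spike in the payments service.")]
)
print(result["messages"][-1].text)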

Example incident timeline

An incident injected at roughly the 62% point of a 24-hour synthetic timeline produced the following flow:

  • T+0: Detector flags a contiguous anomaly window driven by p95 latency and error-rate spikes across payments and db-proxy.
  • T+2m: sql_investigate shows connection pool exhaustion on db-proxy and a drop in RPS for payments.
  • T+5m: log_pattern_scan surfaces repeated “DBConnPoolExhausted” messages and an increase in upstream timeouts.
  • T+7m: Profiler emits hypothesis naming db-proxy connection pool as the primary mechanism with a falsifiable test (query to confirm max connections reached during window).
  • T+12m: Writer produces postmortem JSON with corrective actions (increase pool, add saturation alarm, add retry/backoff), owners, and ETAs.

Sample postmortem JSON schema

{
  "incident_id": "inc-20260127-001",
  "incident_window": {"start": "2026-01-26T14:48:00Z", "end": "2026-01-26T15:03:00Z"},
  "symptoms": ["p95 latency spike payments", "error-rate increase", "RPS drop payments"],
  "top_contributors": [
    {"service": "db-proxy", "metric": "max_connections", "value": 500},
    {"service": "payments", "metric": "error_rate", "value": 0.12}
  ],
  "hypothesis": {
    "statement": "db-proxy connection pool exhaustion caused upstream timeouts in payments.",
    "service": "db-proxy",
    "mechanism": "connection_pool_exhaustion",
    "falsifiable_test": "Query peak_connections >= configured_pool_size during incident window"
  },
  "key_facts": [
    {"query": "SELECT max_connections FROM db_metrics WHERE ts BETWEEN ...", "result_summary": "peak=500, configured=250"}
  ],
  "mitigations": [
    {"action": "Increase db-proxy pool to 600", "owner": "infra-team", "eta_days": 1},
    {"action": "Add alert on connections > 80% of pool", "owner": "observability", "eta_days": 2}
  ],
  "runbook_snippets": ["Step 1: Verify pool settings", "Step 2: Increase pool and restart db-proxy"],
  "confidence": 0.87
}
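
Before a generated postmortem is filed, a lightweight completeness check can enforce this schema. The sketch below validates the field names used in the sample above; the required-field choices and the postmortem.json filename are assumptions for illustration, not part of the demo.

import json

# Field names taken from the sample postmortem above; adjust to your own template.
REQUIRED_TOP_LEVEL = [
    "incident_id", "incident_window", "symptoms", "top_contributors",
    "hypothesis", "key_facts", "mitigations", "runbook_snippets", "confidence",
]
REQUIRED_HYPOTHESIS = ["statement", "service", "mechanism", "falsifiable_test"]

def validate_postmortem(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the document passes."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP_LEVEL if k not in doc]
    hypothesis = doc.get("hypothesis", {})
    problems += [f"hypothesis missing: {k}" for k in REQUIRED_HYPOTHESIS if k not in hypothesis]
    for i, m in enumerate(doc.get("mitigations", [])):
        if "owner" not in m or "eta_days" not in m:
            problems.append(f"mitigation {i} lacks an owner or eta_days")
    if not 0.0 <= doc.get("confidence", -1.0) <= 1.0:
        problems.append("confidence must be between 0 and 1")
    return problems

with open("postmortem.json") as f:  # wherever the writer agent saved its output
    issues = validate_postmortem(json.load(f))
print("postmortem OK" if not issues else issues)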

Evaluation metrics and governance

Measure outcomes, not just activity. Recommended KPIs:

  • Detection metrics: MTTD (mean time to detect), precision/recall or F1 for incident windows, and false positive rate (see the scoring sketch after this list).
  • Investigation metrics: Hypothesis validation rate (percent of profiler hypotheses validated by SQL checks), time-to-hypothesis.
  • Documentation metrics: Time-to-postmortem, percent of postmortems accepted without human rewrite, completeness score (presence of owners, ETAs, falsifiable test).
  • Business impact: MTTR reduction, number of incidents with automated runbook steps executed, cost savings from fewer escalations.
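
For the detection KPIs, window-level precision/recall and MTTD can be computed against a hand-labelled set of true incident windows. The sketch below treats any overlap between a predicted and a true window as a match, which is one reasonable rule among several; the example values echo the sample incident above.

from datetime import datetime, timedelta

Window = tuple[datetime, datetime]  # (start, end) of an incident window

def overlaps(a: Window, b: Window) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def detection_kpis(predicted: list[Window], actual: list[Window]) -> dict:
    """Window-level precision/recall plus mean time to detect (MTTD), in seconds."""
    true_positives = [p for p in predicted if any(overlaps(p, a) for a in actual)]
    detected = [a for a in actual if any(overlaps(p, a) for p in predicted)]
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(detected) / len(actual) if actual else 0.0
    # MTTD: delay from the true start to the start of the earliest overlapping
    # predicted window; early detections count as zero delay.
    delays = []
    for a in detected:
        first_flag = min(p[0] for p in predicted if overlaps(p, a))
        delays.append(max(first_flag - a[0], timedelta(0)))
    mttd = sum(d.total_seconds() for d in delays) / len(delays) if delays else None
    return {"precision": precision, "recall": recall, "mttd_seconds": mttd}

actual = [(datetime(2026, 1, 26, 14, 48), datetime(2026, 1, 26, 15, 3))]
predicted = [(datetime(2026, 1, 26, 14, 50), datetime(2026, 1, 26, 15, 5))]
print(detection_kpis(predicted, actual))  # precision 1.0, recall 1.0, mttd 120s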

Key questions and short answers

  • How reproducible is this approach for an engineering team?

    Very reproducible: the demo runs from a single notebook, installs dependencies at runtime, writes CSVs, and loads them into DuckDB so investigators can replay and audit every step.

  • Can agents be trusted to avoid hallucination?

    The risk is reduced, not eliminated: non-RAG grounding and tool-first reasoning significantly lower the chance of hallucination because agents must cite SQL results and tool outputs rather than invent facts.

  • Will this scale to messy production telemetry?

    The pattern scales, but production requires additional work: schema validation, synthetic canaries, monitoring for schema drift, and human-in-the-loop gates for high-risk actions.

  • What controls are needed before automated mitigations run?

    Confidence thresholds, manual approvals for destructive actions, rollback procedures, and a staged rollout (suggestions → confirmed fixes → automated remediation) are essential.
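
One way to encode those controls is a small gate that maps hypothesis confidence and action risk onto the staged rollout; the sketch below uses thresholds chosen purely for illustration.

def mitigation_gate(confidence: float, destructive: bool, human_approved: bool) -> str:
    """Decide how far a proposed mitigation may proceed; thresholds are illustrative."""
    if destructive and not human_approved:
        return "suggest_only"            # destructive actions always wait for a human
    if confidence < 0.6:
        return "suggest_only"            # low confidence: surface the hypothesis only
    if confidence < 0.9:
        return "ticket_for_approval"     # medium: open a ticket and wait for sign-off
    return "automate_with_rollback"      # high confidence, non-destructive: run, rollback ready

print(mitigation_gate(confidence=0.87, destructive=False, human_approved=False))
# -> ticket_for_approval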

Failure modes and practical mitigations

  • Schema drift: Timestamp or field-name changes break SQL. Mitigation: a schema validation layer and lightweight ETL that normalizes fields (see the sketch after this list).
  • Partial instrumentation: Missing metrics for a new service. Mitigation: fall back to logs, add synthetic canaries, and surface the uncertainty in the postmortem.
  • Correlated noise: Shared infra causing false correlation. Mitigation: cross-check with region/service-level aggregation and include confidence scoring.
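
The schema-drift mitigation can start as a pre-flight check that compares the live tables against the columns the SQL tools expect and refuses to run if they diverge. The sketch below queries DuckDB's information_schema; the expected column set is an assumption about the metrics table, not the demo's actual schema.

import duckdb

# Columns the downstream SQL tools assume exist on the metrics table (assumed names).
EXPECTED_COLUMNS = {"ts", "service", "p95_ms", "error_rate", "rps"}

def check_metrics_schema(con: duckdb.DuckDBPyConnection) -> list[str]:
    """Compare the live table schema with what the tools expect and report drift."""
    rows = con.execute(
        "SELECT column_name FROM information_schema.columns WHERE table_name = 'metrics'"
    ).fetchall()
    actual = {r[0] for r in rows}
    problems = [f"missing column: {c}" for c in sorted(EXPECTED_COLUMNS - actual)]
    problems += [f"unexpected column: {c}" for c in sorted(actual - EXPECTED_COLUMNS)]
    return problems

con = duckdb.connect(database=":memory:")
con.execute("CREATE TABLE metrics AS SELECT * FROM read_csv_auto('metrics.csv')")
drift = check_metrics_schema(con)
if drift:
    raise RuntimeError(f"Schema drift detected, refusing to run tools: {drift}")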

Rollout plan and reproducibility checklist

Recommended phased rollout:

  1. Phase 0 — Read-only: Agents suggest hypotheses and mitigations; humans validate and act.
  2. Phase 1 — Assisted ops: Agents create tickets and populate runbooks; humans approve before execution.
  3. Phase 2 — Conditional automation: Permit automated fixes for low-risk, high-confidence scenarios with rollback playbooks.

Reproducibility checklist (minimal)

  • Notebook repository (demo includes runnable notebook with dependency install).
  • Python packages: haystack-ai, openai, pandas, numpy, duckdb (install at runtime in the notebook).
  • OpenAI API key stored securely: avoid hardcoding; use environment variables or a secrets manager (see the sketch after this list).
  • Expected runtime: minutes to run the demo dataset; production requires resource planning for log ingest and DuckDB memory.
  • Costs: LLM calls (gpt-4o-mini) per agent step — budget and rate-limit accordingly.
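
For the dependency and API-key items, a minimal pattern is to install packages in the notebook's first cell and read the key from the environment, failing fast if it is missing; OpenAIChatGenerator picks the key up from OPENAI_API_KEY by default.

import os

# In the notebook's first cell: %pip install haystack-ai openai pandas numpy duckdb

# Fail fast if the key is absent instead of letting an agent call fail mid-run.
# OpenAIChatGenerator reads OPENAI_API_KEY from the environment, so nothing is hardcoded.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError(
        "OPENAI_API_KEY is not set. Export it or load it from your secrets manager."
    )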

Security, privacy, and compliance

External LLMs create data exposure risks. Best practices include:

  • Redact or tokenize sensitive data (PII, secrets) before sending anything to an LLM (see the sketch after this list).
  • Use private LLM endpoints, bring-your-own-model (BYOM) in a VPC, or vendor contracts that guarantee non-retention.
  • Log all LLM inputs and outputs to enable audits; tie them to incident records in DuckDB or your observability store.
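
A minimal redaction pass can run over every tool output before it is handed to the LLM. The patterns below are illustrative placeholders; replace them with your organization's own PII and secret detectors.

import re

# Illustrative patterns only; extend with your own PII and secret detectors.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),                      # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<card_number>"),                 # card-like digit runs
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*\S+"), r"\1=<redacted>"),  # inline secrets
]

def redact(text: str) -> str:
    """Replace likely PII and secrets in tool output before it reaches the LLM."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("user=alice@example.com api_key: sk-123456 card 4111 1111 1111 1111"))
# -> user=<email> api_key=<redacted> card <card_number>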

Next steps for teams

  • Run the demo notebook to see the coordinator → profiler → writer loop and the resulting postmortem JSON.
  • Start with a small pilot covering non-critical services and a read-only mode to measure false positive and hypothesis validity rates.
  • Embed the structured postmortem outputs into ticketing or runbook systems so they feed automation and governance workflows.

Agentic AI for SRE—when built with explicit state, tool-first reasoning, and non-RAG grounding—can turn noisy telemetry into auditable, falsifiable incident narratives. The pattern is a practical way to scale investigation capacity while preserving human control over critical mitigations.