Hierarchical Planner for AI Agents: A Practical Template for Multi‑Agent AI Automation
TL;DR: Use a three-role pattern—planner (produces a JSON plan), executor (runs model reasoning or sandboxed Python), and aggregator (synthesizes results)—to build auditable, modular AI agents. This Colab‑runnable template leverages an open‑source LLM (Qwen2.5-1.5B-Instruct) with optional 4‑bit quantization and includes utilities for JSON extraction, sandboxed code execution, and fallback planning. Use the template to prototype your first multi‑step workflow; productionizing it requires hardened sandboxing, observability, and test coverage.
Why a hierarchical planner matters for AI Automation
Think of a business process as a small team: one person designs the playbook, another executes each play, and a third writes the after‑action report. Translating that to software gives you a planner → executor → aggregator pipeline. This decomposition reduces the brittle “one giant prompt” approach and aligns with how enterprises already audit work: clear responsibilities, checkpoints, and outputs you can verify.
Open‑source LLMs (large language models) have matured enough that teams can run capable reasoning models locally or in private cloud VMs, avoiding black‑box APIs while enabling tighter integration with internal data and tooling. The template here shows a pragmatic way to assemble tool‑augmented AI agents that plan, act, and summarize—without pretending the model is infallible.
The pattern: planner → executor → aggregator
At a high level the system has three agents with distinct responsibilities:
- Planner: Breaks a high‑level goal into a structured plan (3–8 independent steps) and chooses a tool for each step (none, llm, or python).
- Executor: For steps marked “llm”, uses the model to produce a result; for steps marked “python”, returns only Python code which is executed in a sandbox. All outputs and errors are captured.
- Aggregator: Combines step outputs into a final, actionable response with assumptions, next steps, and a short rationale.
In prompt terms, each role reduces to a single instruction:
- Planner: split a goal into a compact JSON plan with assumptions and a short list of independent steps, choosing the tool for each step.
- Executor: given a single step and context, return the step result or, if the tool is python, runnable Python code only.
- Aggregator: synthesize the collected step outputs into a practical final answer and recommended next actions.
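The three roles can be wired together in a minimal orchestration loop. This is an illustrative sketch, not the notebook's exact code: `llm_chat` and `run_python` are assumed to be the wrappers described later, and the prompts are simplified placeholders.

```python
import json

def run_pipeline(goal, llm_chat, run_python):
    """Minimal planner -> executor -> aggregator loop (illustrative sketch)."""
    # Planner: ask the model for a machine-readable JSON plan.
    plan = json.loads(llm_chat(f"Produce a JSON plan for: {goal}"))

    results = []
    for step in plan["steps"]:
        if step["tool"] == "python":
            # Executor: the model returns code only; run it in the sandbox.
            code = llm_chat(f"Write Python code only for: {step['task']}")
            results.append(run_python(code))
        elif step["tool"] == "llm":
            results.append(llm_chat(f"Complete this step: {step['task']}"))
        else:  # tool == "none": carry the task description forward unchanged
            results.append(step["task"])

    # Aggregator: synthesize all step outputs into a final answer.
    return llm_chat("Synthesize a final answer from:\n" + json.dumps(results))
```

The real system adds validation, retries, and context truncation around each call, but the control flow is this simple at its core.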
Minimal planner JSON example
{
  "goal": "Launch internal logistics MVP",
  "assumptions": ["3-person team", "no external APIs"],
  "steps": [
    {"id": 1, "task": "Define delivery zones", "tool": "llm"},
    {"id": 2, "task": "Simulate route optimization", "tool": "python"},
    {"id": 3, "task": "Draft MVP checklist and risks", "tool": "llm"}
  ]
}
That schema keeps plans machine‑readable and constrained. Add optional fields like timeout, retry_policy, and expected_outputs to increase safety as you move to production.
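A small validator keeps malformed plans from ever reaching the executor. The sketch below checks only the fields shown in the minimal schema above (the field names and the 3–8 step bound come from the planner description; `validate_plan` itself is a hypothetical helper):

```python
# Minimal plan validation: fail fast before any step executes.
ALLOWED_TOOLS = {"none", "llm", "python"}

def validate_plan(plan: dict) -> list:
    """Return a list of human-readable problems; an empty list means usable."""
    problems = []
    if not isinstance(plan.get("goal"), str) or not plan["goal"].strip():
        problems.append("missing or empty 'goal'")
    steps = plan.get("steps")
    if not isinstance(steps, list) or not (3 <= len(steps) <= 8):
        problems.append("'steps' must be a list of 3-8 items")
        return problems
    for i, step in enumerate(steps):
        if step.get("tool") not in ALLOWED_TOOLS:
            problems.append(f"step {i}: tool must be one of {sorted(ALLOWED_TOOLS)}")
        if not isinstance(step.get("task"), str) or not step["task"].strip():
            problems.append(f"step {i}: missing 'task'")
    return problems
```

Returning a list of problems (rather than raising on the first one) lets you feed all of them back to the planner in a single correction prompt.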
Prototype stack and key utilities
The reference implementation uses Qwen2.5-1.5B-Instruct (an open‑source instruct LLM) loaded via Hugging Face transformers and PyTorch. To reduce GPU VRAM, it attempts 4‑bit quantized loading with bitsandbytes; if quantized loading fails it falls back to standard precision or CPU. That flexibility helps teams run locally in Colab or on a modest cloud VM.
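The quantized-with-fallback loading strategy can be sketched as follows. This is an assumption-laden outline, not the notebook's exact code: it assumes `transformers`, `accelerate`, and optionally `bitsandbytes` are installed, and it downloads the model on first run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

def load_model(model_id: str = MODEL_ID):
    """Try 4-bit quantized loading first; fall back to standard precision."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    try:
        from transformers import BitsAndBytesConfig
        quant = BitsAndBytesConfig(load_in_4bit=True,
                                   bnb_4bit_compute_dtype=torch.float16)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=quant, device_map="auto")
    except Exception:
        # bitsandbytes missing or no compatible GPU: standard precision,
        # placed on GPU if available, otherwise CPU.
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype="auto", device_map="auto")
    return tokenizer, model
```

Catching a broad `Exception` here is deliberate: quantized loading can fail for several unrelated reasons (missing library, unsupported GPU, driver mismatch), and the fallback path is safe in all of them.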
Important tooling in the notebook:
- llm_chat wrapper — standardized prompting, temperature, and decoding settings so the planner behavior is predictable.
- extract_json_block — a resilient parser that recovers JSON from fenced or inline text using bracket matching and retries if needed.
- run_python — a sandboxed executor that runs generated Python code in a constrained environment and returns success, stdout, and exception traces.
- Fallback & truncation logic — if the planner produces malformed JSON, the notebook prompts for correction and eventually uses a safe default plan; the executor truncates long contexts to avoid oversized prompts.
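Of these utilities, extract_json_block is the one teams most often underestimate. A bracket-matching reimplementation might look like this (an illustrative sketch, not the notebook's exact code):

```python
import json

def extract_json_block(text: str):
    """Recover the first balanced {...} object from model output, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but invalid; try the next candidate
        start = text.find("{", start + 1)
    return None
```

Bracket matching handles the common failure modes—fenced code blocks, leading chatter, trailing explanations—without regexes that break on nested objects.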
Demo vignette: a five‑step logistics MVP
Walkthrough (condensed): a product manager asks the agent to produce a launch checklist for a local drone delivery pilot. The planner returns a 5‑step JSON plan: define zones (llm), generate route simulation code (python), run a short simulation (python), summarize results (llm), and produce a risk register (llm). The executor runs the generated Python in a sandbox, captures stdout and exceptions, and returns structured outputs. The aggregator stitches those into a short launch plan with three prioritized next actions and explicit assumptions.
Why this matters: the planner enforces structure, the executor provides reproducible actions and logs, and the aggregator produces a single artifact you can hand to an operations team. If a step fails, the system surfaces the error and either retries or falls back to a safe plan—so execution remains auditable.
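The retry-then-fallback behavior described above can be sketched as a small loop. Function names here (`ask_planner`, `parse_plan`) are hypothetical stand-ins for the notebook's planner prompt and JSON extraction:

```python
# A conservative default plan used when the planner repeatedly fails.
FALLBACK_PLAN = {
    "goal": "unspecified",
    "assumptions": ["planner output was malformed"],
    "steps": [{"id": 1, "task": "Summarize the goal and ask for clarification",
               "tool": "llm"}],
}

def plan_with_retries(goal, ask_planner, parse_plan, max_retries=2):
    """Ask the planner, retry on malformed JSON, then fall back to a safe plan."""
    prompt = f"Produce a JSON plan for: {goal}"
    for _ in range(max_retries + 1):
        plan = parse_plan(ask_planner(prompt))
        if plan is not None:
            return plan, False          # (plan, used_fallback)
        prompt = f"Your last reply was not valid JSON. {prompt}"
    return FALLBACK_PLAN, True          # surface to a human reviewer
```

Returning the `used_fallback` flag alongside the plan is what keeps the failure auditable: downstream code can log it, and a reviewer can see that the system degraded rather than silently improvising.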
Operational checklist for moving beyond prototype
Prototype patterns are great for discovery. Production requires deliberate maturity steps. The following checklist pairs recommended actions with maturity targets.
- Sandboxing & runtime isolation
Prototype — containerized Python worker with resource limits; network blocked.
Production — hardware‑enforced isolation (e.g., gVisor/Firecracker), seccomp, read‑only mounts, egress filtering, and a least‑privilege execution user.
- JSON validation & schema evolution
Prototype — robust extract_json_block + retry prompts.
Production — strict schema validation, versioned plan schemas, contract tests, and automated rejection of non‑conformant plans.
- Observability & governance
Prototype — basic logging of step outputs and exceptions.
Production — structured logs, distributed tracing, immutable audit logs, and rationale capture for each planner decision.
- Testing & CI
Prototype — manual runs and sample inputs.
Production — unit tests for validation, E2E tests with mocked LLM responses, chaos tests for executor failures, and performance/regression benchmarks.
- Security review
Prototype — informal code review.
Production — threat modeling, data exfiltration tests, third‑party code audits, and periodic red‑team exercises.
- Model lifecycle & monitoring
Prototype — occasional model updates.
Production — model versioning, drift monitoring, automated rollbacks, and scheduled re‑evaluation of prompts and schemas.
Local open‑source vs cloud APIs: a practical decision guide
Both approaches are valid. Choose based on control, cost profile, and ops maturity.
- Choose local/open‑source when: data residency and privacy are priorities, you need deep integration with internal systems, or you plan to run high volumes where per‑call API costs are prohibitive. Quantized models (4‑bit via bitsandbytes) can reduce VRAM and make local GPU inference affordable.
- Choose cloud APIs when: you want to move fast without building inference ops, need access to the latest models without maintenance, or lack the team bandwidth to harden sandboxing and observability.
Practical tip: prototype with a Colab‑runnable open‑source stack to validate your workflows, then re‑evaluate whether to keep the model local or switch to an API for scale.
Limitations, risks, and mitigations
No architecture is magic. Key risks and suggested mitigations:
- Hallucinations & malformed plans — Validate output schemas, perform sanity checks on step outputs, and implement human‑in‑the‑loop gates for high‑risk actions.
- Unsafe code execution — Never run generated code on a host with sensitive credentials; enforce strict runtime isolation, set CPU/memory limits, and block network egress or make it explicit and reviewed.
- Operational complexity — Invest early in observability, contract tests, and model lifecycle management to avoid unpredictable behavior as models update.
- Scalability — Quantization and careful context truncation help, but production systems will need batching, concurrency control, and cost monitoring to be viable at scale.
Short implementation hints
Small, practical patterns that help immediately:
- Validate planner JSON on every run and fail fast with a helpful error message.
- When the executor runs Python, capture stdout, stderr, exit codes, and stack traces and attach them to the step record.
- Truncate long historical context using a relevance heuristic rather than blunt token limits—keep the most recent and most semantically relevant outputs.
- Log planner rationale (short bullet points) to help future debugging and audits.
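The stdout/stderr/exit-code capture hint can be sketched with a subprocess-based executor. This is a simplified stand-in for the notebook's run_python, and—as the sandboxing checklist above stresses—a plain subprocess is NOT a security boundary:

```python
import subprocess
import sys

def run_python(code: str, timeout: int = 10) -> dict:
    """Run generated code in a subprocess and return a structured step record.

    NOTE: a subprocess is NOT a sandbox; production needs real isolation
    (containers with seccomp, blocked egress, read-only mounts).
    """
    try:
        # -I runs Python in isolated mode (no user site-packages, no env vars).
        proc = subprocess.run([sys.executable, "-I", "-c", code],
                              capture_output=True, text=True, timeout=timeout)
        return {"success": proc.returncode == 0,
                "exit_code": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"success": False, "exit_code": None,
                "stdout": "", "stderr": f"timed out after {timeout}s"}
```

Attaching this dict to the step record gives the aggregator and any human reviewer the full execution evidence, not just a pass/fail bit.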
Business use‑cases that map well to this pattern
- Logistics planning — multi‑step route design, simulation, and risk assessment.
- Automated reporting — data extraction, transformation steps executed via Python, and human‑facing narrative synthesis.
- Internal dev tooling — LLM‑driven scaffold generation followed by safe code execution for unit test stubs or mock data creation.
- Sales enablement — create multi‑part outreach sequences, simulate responses, and aggregate a prioritized contact plan.
FAQ
- Can AI agents generate and run code safely?
Yes — but only with strict sandboxing (containers with seccomp, network controls, read‑only mounts), rigorous input validation, and monitoring. Prototype helpers show the mechanics; production requires hardened isolation and audit logs.
- How do you recover from malformed planner JSON?
First validate and request a corrected output from the model; if it still fails after N retries, fall back to a conservative default plan and surface the issue to a human reviewer.
- Is running Qwen2.5 with 4‑bit quantization realistic for businesses?
Yes for many teams: 4‑bit quantization (bitsandbytes) reduces VRAM requirements and can enable local GPU inference on smaller hardware. Always implement graceful fallbacks to standard precision or CPU when quantized loading fails.
- Local model or cloud API — which should we pick?
Local models offer control and potential cost savings at scale but require ops and security investment. Cloud APIs are faster to adopt but trade control—and potentially cost at high volume—for convenience.
Next steps
Run the Colab demo to see the planner → executor → aggregator flow in action. Use the prototype to validate a single business workflow, then apply the operational checklist to harden the system for production. If you want a ready‑made executive brief or a production architecture sketch (security, monitoring, scaling), say which one you’d prefer and it can be prepared as the next deliverable.