BankerToolBench: Why AI Agents Aren’t Client-Ready for Investment Banking

Executive summary: BankerToolBench — a new open benchmark built by Handshake AI and McGill University, with outputs graded by ~500 current and former bankers — tested leading AI agents on real banking deliverables: working Excel financial models, PowerPoint decks, and Word memos and reports that pull live market data. The headline: no model produced client‑ready work out of the box. Top performers like GPT‑5.4 can speed drafting and provide a usable starting point in some cases, but critical issues (formula bugs, broken business logic, aborted data pulls and fabricated numbers) mean human oversight, auditable provenance and strong governance remain non‑negotiable.

What BankerToolBench measured — and why it matters

Researchers asked AI agents to do what junior bankers actually do: build functioning Excel models with formulas and scenario links, produce decks and memos that cite market data and filings, and make outputs auditable and editable. The study logged more than 5,700 human hours. One hundred tasks were created by 172 bankers from firms such as Goldman Sachs, JPMorgan and Evercore. Tasks averaged five hours for a human to complete; some required as much as 21 hours.

Nine models were tested, including GPT‑5.2 and GPT‑5.4 (OpenAI), Claude Opus variants (Anthropic), Gemini releases (Google), Grok 4, Qwen‑3.5‑397B and GLM‑5. Deliverables were judged against banker‑designed rubrics averaging roughly 150 criteria spanning technical correctness, client readiness, compliance, auditability and consistency.

“None of the outputs were acceptable to send to a client as produced by the models.”

Top‑line findings

  • Best performer overall: GPT‑5.4. Still, only 16% of its outputs were seen as a useful starting point; requiring consistency across three runs dropped that to 13%.
  • No model produced an output judged ready to submit to a client without human rework. GPT‑5.4 cleared all critically weighted criteria in only 2% of tasks.
  • Models did better on slides than spreadsheets. Excel financial models — especially merger models, debt capital markets work and capital‑structure tables — were the hardest problems.
  • Reinforcement learning (methods like GRPO and DPO) improved some open models by 5–13×, but those gains came from a weak baseline and major gaps remained.
  • An automated verifier called Gandalf (based on Gemini 3 Flash Preview) agreed with human reviewers 88.2% of the time; two humans agreed with each other 84.6% of the time.

Where agents break — the failure modes you need to know

Four recurring failure patterns explain why outputs look polished but aren’t safe for clients:

  • Hardcoding and formula bugs (≈41%): Cells contain fixed numbers instead of formulas. Result: scenario analysis breaks and the model isn’t auditable. Example: a revenue waterfall that shows correct totals but ignores the growth input when a stress scenario is applied (a detection sketch follows this list).
  • Broken business logic (≈27%): Incorrect assumptions or mismatched timing across schedules. Example: debt amortization schedules that ignore covenant triggers or use inconsistent interest periods.
  • Aborted data queries (≈18%): The agent fails to fetch or reconcile market data, leaving placeholders or partial tables.
  • Fabrications (≈13%): Invented numbers or sources when data isn’t available. Outputs may cite “market comps” that don’t exist.
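
As an illustration of the first failure mode, hardcoded cells can be detected mechanically in a generated workbook. The sketch below is a minimal check using openpyxl; the workbook path and calculation range are hypothetical placeholders, not artifacts from the study.

```python
# Minimal hardcode detector: flag numeric literals where formulas are expected.
# The workbook path and range below are illustrative placeholders.
from openpyxl import load_workbook

def find_hardcoded_cells(path, sheet, cell_range):
    """Return (coordinate, value) for cells holding literal numbers, not formulas."""
    wb = load_workbook(path, data_only=False)  # keep formulas as "=..." strings
    ws = wb[sheet]
    flagged = []
    for row in ws[cell_range]:
        for cell in row:
            # Formula cells report data_type "f"; numeric literals report "n".
            if cell.data_type == "n" and cell.value is not None:
                flagged.append((cell.coordinate, cell.value))
    return flagged

if __name__ == "__main__":
    for coord, value in find_hardcoded_cells("merger_model.xlsx", "Model", "C10:H40"):
        print(f"{coord}: hardcoded value {value}")
```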

Those failure modes matter because investment‑banking work isn’t about visual polish; it’s about correctness under change, traceability and the ability to run alternatives quickly. A slide that looks right but hides hardcoded inputs is actively dangerous.

Why these tasks are uniquely difficult for AI agents

Two reasons make banking a harsh benchmark for AI automation:

  • Tooling and execution load: One task could involve up to 539 model calls; about 97% of those calls were tied to tool use or code execution rather than just prose generation. Agents must orchestrate code, data pulls and file transforms reliably.
  • Auditability and scenario flexibility: Deliverables must show provenance, support sensitivity testing and remain readable to regulators and auditors. That demands deterministic formula construction and traceable data lineage — not a typical strength of current chat‑style LLMs.
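
To make “traceable data lineage” concrete, here is a minimal sketch of a provenance record an agent pipeline could attach to every externally sourced figure. The schema is an assumption for illustration; BankerToolBench does not prescribe one.

```python
# A minimal provenance record for externally sourced numbers. The field names
# are illustrative assumptions, not a schema from the benchmark.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceTag:
    value: float       # the number that lands in the model
    source: str        # e.g. a filing URL or market-data endpoint
    query: str         # the exact request that produced the value
    retrieved_at: str  # UTC timestamp of the pull
    agent_step: int    # which of the agent's tool calls produced it

def tag(value: float, source: str, query: str, step: int) -> ProvenanceTag:
    return ProvenanceTag(value, source, query,
                         datetime.now(timezone.utc).isoformat(), step)
```

Every number fed through such a tag can be traced back to a specific tool call, which is exactly what auditors and regulators need.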

What the automated verifier means — and its limits

Gandalf’s high agreement with human graders is encouraging: automated grading can scale and surface obvious failures. But verifiers can inherit blind spots and create automation bias. Treat automated checks as an augmentation, not a replacement:

  • Use periodic blind human audits to catch verifier edge cases.
  • Run adversarial tests that deliberately inject malformed inputs to probe how robust verification is (a test sketch follows this list).
  • Log verifier decisions with explanations so humans can inspect why something passed.
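
A simple adversarial pattern is to corrupt a known‑good deliverable and assert that the verifier flags it. The sketch below assumes a verify(path) wrapper around whatever grading service is in use; that interface, and the finding format, are hypothetical.

```python
# Adversarial check: replace a formula in a known-good model with its computed
# value, then confirm the verifier flags the injected hardcode. `verify` is a
# hypothetical wrapper returning a list of finding dicts.
from openpyxl import load_workbook

def inject_hardcode(src, dst, sheet, coord):
    """Overwrite one formula cell with its cached computed value."""
    # data_only=True reads the values Excel cached on last save.
    computed = load_workbook(src, data_only=True)[sheet][coord].value
    wb = load_workbook(src)        # default load keeps formulas intact
    wb[sheet][coord] = computed    # formula silently becomes a literal
    wb.save(dst)

def test_verifier_catches_hardcode(verify):
    inject_hardcode("clean_model.xlsx", "corrupt_model.xlsx", "Model", "D15")
    findings = verify("corrupt_model.xlsx")
    assert any(f["type"] == "hardcoded_cell" for f in findings), \
        "verifier missed an injected hardcode"
```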

Practical implications for operations, compliance and MRM

These results change how banks should approach AI for finance. Agents are useful as supervised assistants — they can accelerate drafting and reduce rote work — but they are not yet safe to push directly to clients. Key operational actions:

  • Require human sign‑offs on any client delivery.
  • Enforce versioned audit logs that record every model call, data source and formula change (a tamper‑evident sketch follows this list).
  • Validate provenance for each externally sourced number before it enters a model.
  • Map agent outputs into existing model risk management (MRM) frameworks and treat them like black‑box models until verified.
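
One way to make such logs trustworthy is hash chaining, so any retroactive edit breaks the chain. This is a generic tamper‑evidence sketch, not a description of any vendor's implementation.

```python
# Append-only, tamper-evident audit log: each entry commits to its predecessor's
# hash, so retroactive edits are detectable. Generic sketch, not a vendor API.
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, event, detail):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "event": event,   # e.g. "model_call", "data_pull", "formula_change"
            "detail": detail,
            "prev": self._last_hash,
        }
        self._last_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["hash"] = self._last_hash
        self.entries.append(entry)
        return entry

log = AuditLog()
log.record("formula_change", {"cell": "D15", "old": "=C15*(1+g)", "new": "=C15*1.05"})
```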

Concrete 30/60/90 pilot plan with KPIs

Suggested pilot to move from experimentation to controlled production:

  • Days 0–30 — Sandbox & baseline: Run agents on internal, non‑client datasets. KPI: critical error rate (formula failures, fabrications). Target: <5% critical error rate for any deliverable to graduate (a gate‑check sketch follows this list).
  • Days 31–60 — Controlled use: Add a human‑in‑the‑loop gate and integrate provenance logging and data sandboxes. KPI: time saved on the first draft versus time spent fixing it. Target: a 25–40% reduction in drafting time while the critical error rate stays below threshold.
  • Days 61–90 — Expand and harden: Stress test with simulated client scenarios, run third‑party audits of logs and verifier outputs. KPI: inter‑rater agreement between automated verifier and human auditors >90% on critical checks; formal sign‑off process defined for client deliverables.
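
To keep the Day 0–30 gate objective, the critical error rate can be computed straight from graded findings. The deliverable and finding shapes below are assumptions for illustration.

```python
# Gate check for days 0-30: share of deliverables with at least one critical
# finding (formula failure, fabrication, ...). The data shape is illustrative.

GRADUATION_THRESHOLD = 0.05  # <5% critical error rate to graduate

def critical_error_rate(deliverables):
    failed = sum(
        1 for d in deliverables
        if any(f["severity"] == "critical" for f in d["findings"])
    )
    return failed / len(deliverables)

graded = [
    {"id": "memo-01",  "findings": []},
    {"id": "model-07", "findings": [{"severity": "critical", "type": "hardcoded_cell"}]},
    {"id": "deck-03",  "findings": [{"severity": "minor", "type": "formatting"}]},
]
rate = critical_error_rate(graded)
print(f"critical error rate: {rate:.0%} -> graduates: {rate < GRADUATION_THRESHOLD}")
```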

Vendor and procurement checklist for AI agents

When evaluating vendors, demand capabilities that address the study’s weaknesses:

  1. Private data integrations with sandboxed access to confidential deal data.
  2. Audit logs that record every model call, code execution and data source.
  3. Formula integrity checks and a way to export human‑readable formula maps (sketched after this list).
  4. Provenance tagging for every externally sourced number.
  5. Support for multi‑step agent orchestration (Excel ↔ PowerPoint ↔ database).
  6. SLAs that cover mis‑generation, data breaches, and remediation responsibilities.
  7. Ability to run adversarial and stress tests in a controlled environment.

Regulatory and legal roadmap — what to ask your legal team

  • Who signs client deliverables that include AI‑generated content? Define accountability in writing.
  • Record retention and eDiscovery: ensure logs and provenance are stored according to regulatory timelines and accessible for audits.
  • Vendor contracts must specify liability for fabricated data or compliance breaches traced to agent outputs.
  • Ensure data residency and confidentiality controls meet jurisdictional requirements for client and market data.

“Treat AI outputs as drafts, not deliverables: auditability, provenance and human sign‑offs are non‑negotiable.”

Quick checklist C‑suite can act on today

  1. Run pilots on internal, non‑client data first.
  2. Require auditable logs and versioning for every generated artifact.
  3. Enforce mandatory human sign‑off for client deliverables.
  4. Test for formula integrity and scenario flexibility before deployment.
  5. Validate provenance for all externally sourced numbers.
  6. Define KPIs and failure thresholds for automated outputs.
  7. Engage legal and compliance early on data access and retention.
  8. Choose vendors that provide sandboxed private data integrations.
  9. Build a model‑risk playbook for agent‑specific failure modes.
  10. Plan a 90‑day pilot with staged gating and rollback procedures.

Where progress is likely — and where surprises remain

Benchmarks like BankerToolBench do two things: they expose brittle behavior and provide training data to fix it. Reinforcement learning and better tool integrations already show measurable gains for some models. Vendors are also improving workflow features that stitch spreadsheets to slides and to data sources.

Still, private deal data, iterative team workflows and bespoke bank tooling could either mask or magnify current weaknesses. A productionized environment with tight data controls may improve accuracy for routine items — but it could also create new governance and liability concerns if systems start producing subtly incorrect analyses that humans trust too readily.

Bottom line

AI agents can be powerful productivity tools for bankers when used as supervised assistants. They are not yet autonomous producers of client‑ready work. The right governance—auditable provenance, human sign‑offs, formula integrity checks and a rigorous pilot program—turns agent capabilities into real business value while managing operational and regulatory risk.
