How Amazon AMET Payments Used Multi‑Agent AI to Cut Test Case Creation from a Week to Hours
TL;DR
- Problem: Manual test-case creation for payments features took 3–5 days per feature and consumed about 1.0 FTE annually.
- Solution: SAARAM — a human‑centric, multi‑agent system built with Strands Agents SDK and Amazon Bedrock (Claude Sonnet) that mirrors how expert testers work.
- Impact: Test-case generation dropped from roughly a week to a few hours, QA preparatory load fell to ~0.2 FTE, and the system surfaced about 40% more edge cases while enforcing consistent, schema-driven outputs.
Why payments QA needed a different approach
Payment features look simple on paper—charge here, authorize there—but validating every path across five countries (UAE, Saudi Arabia, Egypt, Türkiye, South Africa) and millions of customers is complex. Amazon.ae’s AMET Payments team releases about five payment-related features per month; each needs exhaustive test coverage across locale, currency, provider, and regulatory permutations.
The old process relied on human testers writing end‑to‑end test cases. That routinely consumed 3–5 days per feature, and the manual approach missed many edge cases. Early experiments that handed a single large language model (LLM — a large neural model that generates text) a monolithic prompt produced brittle outputs and hallucinations. The team shifted tactics: instead of forcing an LLM to “be a tester,” they modeled how expert testers think and split the workflow into specialized AI agents.
“Instead of asking how AI should think about testing, ask how experienced testers think—then mirror that process.”
— Jayashree, Quality Assurance Engineer, AMET Payments
What SAARAM is and why it works
SAARAM (QA Lifecycle App) is a multi‑agent architecture that automates test-case generation for payments QA. It combines:
- Strands Agents SDK for multi‑agent orchestration (coordinating many small AI agents),
- Amazon Bedrock (managed LLM service) calling Claude Sonnet for reasoning tasks,
- Pydantic schemas to enforce structured, machine‑checkable outputs and reduce hallucinations,
- Mermaid diagrams to visualize flows and states, and
- workflow orchestration patterns for parallelism, retries, and audit logging.
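For orientation, here is a minimal sketch of a single reasoning call to Claude Sonnet through the Bedrock runtime Converse API. The model ID, region, and prompts are placeholders rather than the team's actual configuration, and in SAARAM this call would sit behind the Strands orchestrator rather than being invoked directly.

```python
# Minimal sketch: one agent step asking Claude Sonnet (via Amazon Bedrock) to
# extract payment facts from a spec. Model ID, region, and prompts are illustrative.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_claude(system_prompt: str, user_text: str) -> str:
    """Send a single-turn request to Claude Sonnet and return the text reply."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Example: a Data Extractor-style prompt
facts = ask_claude(
    system_prompt="You extract payment flows, error codes, and locale rules as JSON.",
    user_text="<normalized feature spec goes here>",
)
```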
The approach addresses three recurring GenAI pain points: hallucinations, token limits (how much text a model can process), and brittle single‑shot prompts. Key moves were architectural (multi‑agent, modular phases), output‑level (schema enforcement), and context engineering (selective context pruning to keep only what’s needed between stages).
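The sketch below illustrates the output-level move under stated assumptions: the model's reply is parsed against a Pydantic schema and retried if validation fails, so malformed or hallucinated output never reaches the next stage. The PaymentFlowFacts fields and retry count are illustrative; call_model is any text-in/text-out function, such as the Bedrock sketch above.

```python
# Minimal sketch: enforce structured output with Pydantic and retry on failure,
# so malformed or hallucinated replies never reach the next agent.
from typing import Callable
from pydantic import BaseModel, ValidationError

class PaymentFlowFacts(BaseModel):      # illustrative extractor output
    flows: list[str]                    # e.g. "card -> 3DS -> capture"
    error_codes: list[str]
    locales: list[str]

def extract_validated(call_model: Callable[[str], str],
                      spec_text: str,
                      max_attempts: int = 3) -> PaymentFlowFacts:
    """call_model is any text-in/text-out LLM call (e.g. the Bedrock sketch above)."""
    prompt = f"Extract flows, error_codes and locales as JSON:\n{spec_text}"
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return PaymentFlowFacts.model_validate_json(raw)
        except ValidationError:
            continue  # each schema failure is logged as a "hallucination" event
    raise RuntimeError("No schema-valid output after retries")
```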
Immediate, measurable benefits
- Time to generate test cases reduced from roughly one week to a few hours per feature.
- QA preparatory staffing dropped from ~1.0 FTE to ~0.2 FTE for validation—freeing testers for exploratory and high‑risk work.
- Coverage improved: the system detected about 40% more edge cases than the prior manual process.
- Output consistency: agents adhered to test‑case standards and formats through schema validation.
How the multi‑agent pipeline works (step by step)
Think of agents as a QA assembly line: each station does one job and hands a validated output to the next. That makes the system easier to test, debug, and evolve.
- Intake / Gateway Agent — ingests heterogeneous inputs: product docs, Figma UX files, API specs, or code repos. Normalizes and tags artifacts.
- Data Extractor Agent — pulls essential facts: payment flows, event sequences, authorization paths, error codes, locale rules.
- Visualizer Agent — generates flow and state diagrams (Mermaid) so humans and agents share a common mental model.
- Condenser Agent — performs selective context pruning, which the team calls “context condensation”: it decides which facts and examples to carry forward so downstream agents don’t hit token limits (sketched below).
- Test Generator Agent — produces structured test cases according to the agreed schema (Pydantic). The generator can run in parallel over permutations and failure modes.
- Validator Agent — runs schema checks, basic consistency validations, and prepares human review queues for edge or ambiguous cases.
- Orchestrator (Strands Agents SDK) — manages task dependencies, parallel execution, automatic retries, state persistence, and audit logs.
Agents communicate via strict message contracts: each agent expects a well‑defined input and emits a validated output. That contract discipline is what makes the pipeline production‑grade.
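To make the contract idea concrete, here is a hedged sketch of one hand-off: a Data Extractor-style output typed with Pydantic, condensed into a Generator input that carries only facts tagged relevant to the feature under test. Field names, tags, and the pruning rule are illustrative assumptions, not the team's actual contract.

```python
# Sketch of a typed hand-off: Extractor output -> Condenser -> Generator input.
from pydantic import BaseModel

class ExtractedFact(BaseModel):
    text: str                  # e.g. "Issuer X returns code 05 on soft decline"
    tags: list[str]            # e.g. ["3ds", "decline", "locale:AE"]

class ExtractorOutput(BaseModel):
    feature: str
    facts: list[ExtractedFact]

class GeneratorInput(BaseModel):
    feature: str
    relevant_facts: list[str]  # pruned, plain-text context for the generator prompt

def condense(extracted: ExtractorOutput, focus_tags: set[str],
             max_facts: int = 40) -> GeneratorInput:
    """Keep only facts that mention a focus tag, capped to stay under token limits."""
    kept = [f.text for f in extracted.facts if focus_tags & set(f.tags)]
    return GeneratorInput(feature=extracted.feature, relevant_facts=kept[:max_facts])
```

Because each side of the hand-off is a typed model, a contract change is a code change that shows up in a diff and a review, which is part of what makes the pipeline auditable.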
Example of a simple test‑case object (pseudocode showing the shape enforced by Pydantic):
{
  "id": "TC-0001",
  "title": "Card decline during 3DS redirect",
  "preconditions": ["User has card saved", "3DS enabled"],
  "steps": [
    {"action": "Initiate payment", "input": "card X", "expected": "3DS redirect"},
    {"action": "3DS decline", "input": "auth fail", "expected": "Payment declined error shown"}
  ],
  "expected_result": "User sees decline message and order not placed",
  "tags": ["3ds", "decline", "edge-case"],
  "risk_level": "high"
}
Presenting outputs as validated JSON‑like objects lets downstream systems (test runners, bug trackers, CI pipelines) ingest test cases programmatically without fragile parsing.
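A minimal Pydantic model matching that shape might look like the sketch below. Field names mirror the pseudocode; the Literal constraint on risk_level is an added illustration of how the schema can also restrict allowed values.

```python
# Sketch of the schema behind the test-case object above.
from typing import Literal
from pydantic import BaseModel

class Step(BaseModel):
    action: str
    input: str
    expected: str

class TestCase(BaseModel):
    id: str
    title: str
    preconditions: list[str]
    steps: list[Step]
    expected_result: str
    tags: list[str]
    risk_level: Literal["low", "medium", "high"]

# Validating an agent's raw JSON reply; raises if any field is missing or malformed:
# tc = TestCase.model_validate_json(raw_llm_output)
```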
Concrete before/after vignette
Before: A senior tester spent hours enumerating permutations for 3DS flows across two issuers and three locales, often missing low‑probability network or provider error permutations.
After: SAARAM consumed the UX spec and API guide, produced a Mermaid flow for the 3DS handshake, condensed relevant context, and generated a structured set of test cases that included provider timeouts, duplicate‑session behavior, and invalid payload edge cases. Human validators reviewed the set in minutes and prioritized a handful for manual exploratory checks.
How the team measured the 40% edge‑case uplift
Measurement followed a head‑to‑head validation approach: during a rollout window, the team pooled the edge cases found manually and those surfaced by SAARAM for the same features. Senior QAs validated whether each suggested case was meaningful and distinct. The edge‑case uplift reflects the validated additional permutations the system suggested that the manual process had missed. For operational decisions, the team also tracked rework and production incidents to correlate coverage improvements with fewer post‑release bugs.
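One plausible way to express that arithmetic (not the team's actual tooling) treats the uplift as the validated cases only SAARAM surfaced, relative to the manual baseline:

```python
# Sketch: edge-case uplift as validated, SAARAM-only cases relative to the manual baseline.
def edge_case_uplift(manual: set[str], saaram: set[str]) -> float:
    """Both sets hold IDs of cases senior QAs confirmed as meaningful and distinct."""
    additional = saaram - manual           # cases only the system surfaced
    return len(additional) / len(manual)   # e.g. 28 extra vs. 70 manual -> 0.40

manual_cases = {f"m{i}" for i in range(70)}
saaram_cases = {f"m{i}" for i in range(50)} | {f"s{i}" for i in range(28)}
print(edge_case_uplift(manual_cases, saaram_cases))  # 0.4
```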
Limitations and failure modes
- Context gaps: if source artifacts lack detail, agents may miss domain assumptions—human input is still required for ambiguous specs.
- Prompt/schema drift: as payments logic changes, prompts and Pydantic schemas must be versioned and reviewed or outputs degrade.
- Cost/latency tradeoffs: generating exhaustive permutations can be computationally expensive; teams must balance depth versus run cost.
- Human-in-the-loop discipline: validators can become rubber stamps if sampling and randomized audits aren’t enforced.
Governance, security, and observability
Operationalization required controls and telemetry:
- Access control: limit agent access to sensitive repos and artifacts; use IAM roles and least privilege for Bedrock and Strands interactions.
- Data handling: encrypt inputs at rest and in transit; apply data redaction for PII during intake.
- Prompt and schema versioning: store prompts, templates, and Pydantic schemas in versioned repositories; require PR reviews for changes.
- Human‑in‑the‑loop policy: enforce sampling rates, randomized audits, and rotating validators to preserve review quality.
- Observability: track metrics such as agent latency, success/failure rates, hallucination rate (schema‑failures), test‑case acceptance rate, and estimated FTE hours saved. Use CloudWatch and OpenTelemetry for telemetry and dashboards.
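As one hedged example of the observability point, schema-validation failures can be published as a CloudWatch custom metric with boto3; the namespace and dimension names below are placeholders, not the team's actual telemetry schema.

```python
# Sketch: emit a schema-failure count per agent as a CloudWatch custom metric.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_schema_failure(agent_name: str, count: int = 1) -> None:
    cloudwatch.put_metric_data(
        Namespace="SAARAM/QA",  # placeholder namespace
        MetricData=[{
            "MetricName": "SchemaValidationFailures",
            "Dimensions": [{"Name": "Agent", "Value": agent_name}],
            "Value": count,
            "Unit": "Count",
        }],
    )
```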
Costs and ROI considerations
Running a multi‑agent pipeline adds compute and managed LLM costs (Bedrock usage, model selections like Claude Sonnet), orchestration overhead, and engineering time to build and maintain schemas and agents. The counterbalance is reduced tester hours, fewer production incidents, and faster release cycles. Track wall‑clock time saved per feature, reductions in post‑release bugs, and QA headcount redeployment to compute a realistic ROI and identify the break‑even point for your organization.
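A back-of-the-envelope sketch of that break-even calculation follows; every figure is a placeholder to be replaced with your own measurements.

```python
# Sketch: monthly break-even for an agent-based QA pipeline (all figures illustrative).
def monthly_savings(features_per_month: float, hours_saved_per_feature: float,
                    loaded_hourly_rate: float, incident_cost_avoided: float) -> float:
    """Tester-hours saved converted to cost, plus avoided post-release incident cost."""
    return features_per_month * hours_saved_per_feature * loaded_hourly_rate + incident_cost_avoided

def break_even_months(monthly_run_cost: float, one_time_build_cost: float,
                      savings_per_month: float) -> float:
    """Months until cumulative savings cover the build cost, given ongoing run cost."""
    net = savings_per_month - monthly_run_cost
    return float("inf") if net <= 0 else one_time_build_cost / net

# Example with placeholder numbers: 5 features/month, ~30 tester-hours saved each.
print(break_even_months(monthly_run_cost=2_000, one_time_build_cost=60_000,
                        savings_per_month=monthly_savings(5, 30, 80.0, 1_000)))
```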
Roadmap and portability
Next technical moves planned include integrating Amazon Bedrock Knowledge Bases to inject historical, validated test-case examples and adopting Bedrock AgentCore for lifecycle and runtime governance. The pattern generalizes: multi‑agent orchestration, schema‑driven outputs, and context engineering are applicable to support automation, compliance checks, and other QA domains. Alternate components (open LLMs, other agent frameworks) can be substituted for different cost or compliance constraints.
Key takeaways and checklist for teams planning agent‑based QA automation
- Design agents to reflect human workflows: map intake → extract → visualize → condense → generate → validate.
- Enforce structured outputs: use schemas (Pydantic or similar) so outputs are machine‑readable and auditable.
- Condense context: keep only what downstream agents need to avoid token limits and noisy prompts.
- Instrument everything: collect metrics on latency, hallucination/schema failures, acceptance rate, and FTE hours saved.
- Govern changes: version prompts and schemas, apply access controls, and require periodic human audits.
“Breaking tasks into specialized agents that reflect human testing phases delivered far better accuracy and reliability than a monolithic AI approach.”
— Harsha Pradha G, Senior QA Engineer, AMET Payments
AI agents won’t replace expert testers—but when designed to mirror expert cognition, they amplify coverage, consistency, and speed. For payments QA and other multi‑step domains, multi‑agent architecture with strict schemas and context engineering provides a practical, auditable path from prototype to production.