ToolSimulator: LLM-Powered Stateful Testing for AI Agents — Safe, Scalable, CI-Ready

A customer support agent books a flight in staging, the mock returns success, and two days later production breaks because the booking never persisted. Tests passed, users suffered. That failure mode is exactly why testing AI agents that call external services needs a different approach than traditional unit tests or brittle fixtures.

ToolSimulator, built into the Strands Evals SDK (installable via pip install strands-evals), provides LLM-powered simulations of external tools so teams can test AI agents and agentic workflows at scale without hitting live APIs. It sits between static mocks and risky integration tests: realistic, stateful simulations that respect contracts and preserve privacy.

Why testing agentic systems is hard

Three practical problems make naive approaches unreliable for agents that act:

  • External dependencies slow or flake CI — live APIs introduce network and rate-limit variability.
  • Live calls cause side effects and data exposure — accidental writes or PII leakage are real risks.
  • Static mocks lack statefulness — fixtures can’t mirror multi-turn flows where earlier actions change later results.

ToolSimulator enables thorough, safe testing of agents that rely on external tools—without making live API calls.

How it works — the core idea

ToolSimulator intercepts calls your agent would make to registered tools and generates realistic responses using an LLM backend. It uses the tool’s function signature and docstring as context, accepts a human-friendly initial state seed, and can validate outputs against Pydantic schemas so simulated responses match the API contracts your agent expects.

Quick definitions for readers who want context:

  • LLM (large language model): a text-generation model (GPT-style) used here to synthesize plausible API responses.
  • Pydantic: a Python library for defining data models and validating structured output. ToolSimulator can enforce an output_schema defined with Pydantic so agents see well-formed data.

Key capabilities

  • Adaptive response generation: Instead of fixed fixtures, the simulator uses context to produce variable, plausible outputs (e.g., different flight options and fares) so agents face realistic variations.
  • Stateful workflow support: A share_state_id links multiple simulated tools to the same backend so writes persist and subsequent reads reflect earlier changes (create booking → check booking).
  • Schema enforcement: Pass an output_schema (Pydantic model) to validate simulated responses and ensure agents handle expected shapes and types.
  • Parallel experiments: Multiple ToolSimulator instances can run concurrently; each keeps its own registry and state so tests run in parallel without cross-talk.
  • Telemetry integration: Strands Evals ties simulations into an evaluation pipeline with Case and Experiment objects and utilities like GoalSuccessRateEvaluator, StrandsEvalsTelemetry, and StrandsInMemorySessionMapper so runs are observable and traceable.
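The stateful-workflow idea behind share_state_id can be sketched with a plain in-memory stand-in. This is not the ToolSimulator API, just the concept: tools registered under the same state id read and write one backing store, so a write by one tool is visible to later reads by another.

```python
# Illustrative stand-in for shared simulator state: one store per share_state_id.
from collections import defaultdict

_stores: dict[str, dict] = defaultdict(dict)  # share_state_id -> state

def create_booking(share_state_id: str, booking_id: str, flight_id: str) -> dict:
    # Write persists into the shared store for this state id.
    _stores[share_state_id][booking_id] = {"flight_id": flight_id, "status": "confirmed"}
    return {"booking_id": booking_id, "status": "confirmed"}

def get_booking_status(share_state_id: str, booking_id: str) -> dict:
    # A later read against the same state id sees the earlier write;
    # a different state id sees nothing, which is what isolates parallel runs.
    booking = _stores[share_state_id].get(booking_id)
    return {"booking_id": booking_id,
            "status": booking["status"] if booking else "not_found"}
```

A static mock cannot reproduce this: it returns the same canned response whether or not the booking was ever created.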

Practical quickstart (pseudo-code)

# install
pip install strands-evals

# register a tool and set up a simulator
from strands_evals import ToolSimulator

sim = ToolSimulator(llm_backend="bedrock")  # or a local LLM backend

@sim.register_tool(name="search_flights")
def search_flights(origin: str, dest: str, date: str):
    """Search available flights from origin to dest on the given date."""
    ...

# seed state and attach a Pydantic schema
sim.set_initial_state(
    share_state_id="flight_db",
    initial_state_description="Two existing bookings for user A. Flights SEA->JFK available $180-$420.",
)
sim.register_output_schema("search_flights", FlightSearchSchema)  # Pydantic model

# run agent under test; calls intercepted and simulated by sim
result = run_agent_with_simulator(agent, sim)

The snippet above captures the core flow: register/decorate tools, optionally seed state, attach schemas, and run your agent with calls intercepted by the simulator.
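The FlightSearchSchema referenced above is left undefined in the quickstart. A hypothetical shape for it might look like the following; the field names (options, flight_id, price) are illustrative, not a contract the SDK prescribes.

```python
# Hypothetical Pydantic schema for simulated flight-search responses.
from pydantic import BaseModel, ValidationError

class FlightOption(BaseModel):
    flight_id: str
    price: float

class FlightSearchSchema(BaseModel):
    options: list[FlightOption]

# A well-formed simulated response validates cleanly...
ok = FlightSearchSchema(options=[{"flight_id": "F100", "price": 250}])

# ...while a malformed one (missing price) raises ValidationError,
# surfacing contract drift instead of letting the agent parse bad data.
try:
    FlightSearchSchema(options=[{"flight_id": "F100"}])
    malformed_accepted = True
except ValidationError:
    malformed_accepted = False
```

Validation failures like the one above are exactly what schema enforcement is meant to catch before the agent sees the data.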

Sample session trace

Here is a condensed, representative session showing how simulated responses help catch state issues that static mocks miss.

1) Agent -> search_flights(origin="SEA", dest="JFK", date="2026-06-30")
   Simulator -> [{flight_id: "F100", price: 250}, {flight_id: "F200", price: 380}]

2) Agent -> create_booking(flight_id="F100", user="userA")
   Simulator -> {booking_id: "B123", status: "confirmed"}

3) Agent -> get_booking_status(booking_id="B123")
   Simulator -> {booking_id: "B123", status: "confirmed", seats: 1}

With share_state_id linking the tools, the booking created at step 2 is visible at step 3. Static mocks that return canned responses could miss bugs like failing to persist IDs or mis-ordered side effects.
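A CI assertion over a trace like this should check state invariants rather than exact text. The trace format below is illustrative (a list of tool-call records), not a Strands Evals data structure:

```python
# Assert that the booking_id returned by create_booking is the one that
# get_booking_status later reports as confirmed.
trace = [
    {"tool": "search_flights", "response": [{"flight_id": "F100", "price": 250}]},
    {"tool": "create_booking", "response": {"booking_id": "B123", "status": "confirmed"}},
    {"tool": "get_booking_status", "response": {"booking_id": "B123", "status": "confirmed"}},
]

def booking_persisted(trace: list[dict]) -> bool:
    created = next(s["response"] for s in trace if s["tool"] == "create_booking")
    checked = next(s["response"] for s in trace if s["tool"] == "get_booking_status")
    return (created["booking_id"] == checked["booking_id"]
            and checked["status"] == "confirmed")
```

An agent that drops the booking ID between steps 2 and 3 fails this check even if every individual response looks plausible.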

Deployment patterns and CI/CD for agent testing

ToolSimulator supports different backends and deployment models depending on constraints:

  • Local development: Run a local LLM or lightweight backend for fast iteration. No cloud account required.
  • Managed LLMs: Use Amazon Bedrock, OpenAI, or other managed services for higher throughput in CI. Be mindful of cost and telemetry retention.
  • Serverless wrapper: Package the simulator or light orchestration in AWS Lambda for ephemeral CI jobs or isolated staging environments.
  • Parallel experiments: Instantiate multiple simulators in CI to run cases concurrently; each keeps its own state store via unique share_state_id values.
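One simple way to guarantee the "unique share_state_id values" above is to mint an id per case per worker. The helper below is a sketch, not an SDK utility:

```python
# Mint a unique share_state_id per test case so concurrent CI workers
# each get an isolated simulated state store.
import uuid

def make_state_id(case_name: str) -> str:
    return f"{case_name}-{uuid.uuid4().hex}"

# 100 concurrent runs of the same case -> 100 distinct, isolated stores.
ids = {make_state_id("booking_flow") for _ in range(100)}
```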

CI integration checklist:

  • Start with default simulator behavior to validate high-level flows.
  • Seed richer initial states for stateful tests where necessary.
  • Use output_schema to assert response contracts in assertions instead of fragile text matching.
  • Log seeds or randomness parameters in telemetry for reproducibility.
  • Keep a small set of targeted live integration tests for critical paths (payments, compliance flows).

Choosing an LLM backend and controlling nondeterminism

Backends fall into three buckets with trade-offs:

  • Local open models (Llama-family forks, Mistral): lower cost, better data control, may require beefy hardware.
  • Managed APIs (Amazon Bedrock, OpenAI): higher throughput, managed availability, simpler scaling, but cost and data-sharing considerations.
  • Deterministic / seeded backends: some backends support deterministic modes or seed controls. Use them for CI-critical tests that need repeatable outputs.

Best practices for reproducibility:

  • Design tests that assert state invariants and schema conformance rather than exact text matches.
  • Record random seeds and model parameters in telemetry so failing runs can be replayed.
  • Use deterministic model settings when you need byte-for-byte reproducibility.
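Recording seeds and parameters can be as simple as emitting one JSON line per run. The field names below are illustrative, not a Strands Evals telemetry schema:

```python
# Log the run configuration (model, temperature, seed) next to anything the
# test samples, so a failing run can be replayed bit-for-bit.
import json
import random

run_config = {"model": "local-llama", "temperature": 0.0, "seed": 1234}
rng = random.Random(run_config["seed"])  # seedable randomness used by the test

record = {"config": run_config, "sampled_fare": rng.randint(180, 420)}
line = json.dumps(record, sort_keys=True)  # attach this line to the run's telemetry

# Replaying with the recorded seed reproduces the same sampled value.
replay = random.Random(run_config["seed"]).randint(180, 420)
```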

Cost, fidelity, and privacy trade-offs

LLM-driven simulation reduces the operational risk of live calls, but it introduces compute and cost considerations. Costs depend on tokens per simulated call, concurrency, and how often simulations run in CI. Consider:

  • Run the bulk of regression tests with lower-cost or local models, and reserve managed LLM runs for periodic higher-fidelity sweeps.
  • Seed state with statistical summaries (for example, DataFrame.describe()) rather than raw PII to preserve privacy.
  • For regulated environments (HIPAA, GDPR), avoid sending raw sensitive data to third-party LLMs—use local models or sanitize and summarize before seeding.
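The summarize-before-seeding idea is the same one behind DataFrame.describe(); here is a stdlib-only sketch so no raw PII ever reaches the LLM backend. The data and field names are made up for illustration:

```python
# Reduce raw rows (which contain PII) to a statistical summary, and seed the
# simulator with the summary only.
import statistics

raw_bookings = [  # raw rows with emails -- never sent to the simulator
    {"email": "a@example.com", "fare": 250},
    {"email": "b@example.com", "fare": 380},
    {"email": "c@example.com", "fare": 410},
]

fares = [r["fare"] for r in raw_bookings]
seed_summary = {
    "n_bookings": len(fares),
    "fare_min": min(fares),
    "fare_max": max(fares),
    "fare_mean": round(statistics.mean(fares), 2),
}
# seed_summary (no emails, no names) becomes the basis of the
# initial_state_description passed to the simulator.
```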

Observability and evaluation

Strands Evals integrates ToolSimulator into an evaluation pipeline so simulated runs become first-class experiments. Useful primitives include:

  • Case / Experiment: Group and parameterize test cases to sweep scenarios.
  • GoalSuccessRateEvaluator: Measure agent-level outcomes (did the agent complete the user goal?).
  • StrandsEvalsTelemetry & StrandsInMemorySessionMapper: Convert raw traces into session trajectories so you can inspect step-by-step state transitions and diagnose why a goal failed.

Example telemetry observation: a trace shows create_booking returned a confirmation ID, but a subsequent get_booking_status returned a different schema due to a simulated schema drift. With session mapping you can locate the step and update your output_schema or agent parsing logic accordingly.
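The session-mapping idea can be sketched in a few lines: group raw telemetry events by session id into ordered trajectories so step-by-step transitions can be inspected. This is the concept, not the StrandsInMemorySessionMapper API:

```python
# Turn a flat event stream into per-session tool trajectories.
from collections import defaultdict

events = [
    {"session": "s1", "step": 1, "tool": "create_booking", "ok": True},
    {"session": "s2", "step": 1, "tool": "search_flights", "ok": True},
    {"session": "s1", "step": 2, "tool": "get_booking_status", "ok": False},
]

def to_trajectories(events: list[dict]) -> dict[str, list[str]]:
    sessions = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["session"], e["step"])):
        sessions[e["session"]].append(e["tool"])
    return dict(sessions)

# to_trajectories(events)["s1"] -> ["create_booking", "get_booking_status"]
```

With the trajectory in hand, the failing step (here s1's second call) can be located and diffed against the expected schema.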

Limitations and recommended hybrid strategy

ToolSimulator is powerful, but it isn’t a silver bullet. Known limitations include:

  • LLM-generated simulations may not perfectly reproduce complex, non-deterministic production behaviors or edge-case error codes.
  • Cost and latency of running LLMs in CI can add up if you simulate every test case at high concurrency.
  • Sending summarized state to hosted LLMs still carries privacy and compliance responsibilities.

Recommended hybrid testing strategy:

  • Use ToolSimulator for broad, repeatable coverage and to validate multi-turn state transitions and failure handling.
  • Run a small number of targeted live integration tests for critical endpoints (payments, external identity checks, etc.).
  • Use telemetry to compare simulated behavior against live runs periodically and update simulation seeds or schemas when you detect divergence.

FAQ

  • How much does LLM-based simulation cost?

    Costs depend on model choice, tokens per call, and concurrency. Use local or lower-cost models for most CI runs and reserve managed APIs for periodic, high-fidelity validation. Track token usage in telemetry to budget accurately.

  • Can I use ToolSimulator without an AWS account?

    Yes. You can run locally with compatible LLMs. Managed services like Amazon Bedrock are optional for scale and throughput.

  • How do I avoid leaking sensitive data to LLM backends?

    Sanitize and summarize data before seeding (e.g., use DataFrame.describe()), use local models for sensitive workflows, or redact PII before passing any context to hosted LLMs.

  • Are simulated responses validated?

    Yes. Provide a Pydantic output_schema to strictly validate responses. That prevents malformed simulated data from masking parsing and contract bugs.

  • What about nondeterministic failures?

    Log seeds and model parameters so you can replay runs. Prefer deterministic backends or seedable random states in CI-critical tests.

Key takeaways & common questions

  • What problem does ToolSimulator solve?

    It enables testing of tool-calling AI agents at scale by simulating external tools with adaptive, stateful, and schema-validated LLM responses—without making live API calls.

  • How does it model stateful workflows?

    By using share_state_id to tie tools to the same simulated backend and initial_state_description to seed human-friendly initial state so writes persist and reads reflect prior actions.

  • Can simulations enforce strict response formats?

    Yes—ToolSimulator accepts an output_schema (Pydantic model) to validate simulated outputs, ensuring responses match the API contracts your agent expects.

  • Is it expensive or cloud-dependent?

    Costs depend on the chosen LLM backend. You can run locally without an AWS account, or use managed services like Amazon Bedrock for scale—many teams use a hybrid approach to balance fidelity and cost.

  • How do I keep tests reproducible given LLM nondeterminism?

    Use deterministic model modes or seeds, record parameters in telemetry, and design assertions around state invariants and schemas rather than exact text.

ROI and next steps for teams

For product and engineering leaders, ToolSimulator reduces risky live tests, shortens feedback loops, and surfaces stateful bugs earlier—translating to fewer production incidents and faster shipping of agentic features. For engineers, it provides a practical path to test multi-turn interactions and validate parsing logic without juggling brittle fixtures.

Try a quick experiment: pick a single multi-step user flow (booking, payment, or lead creation), implement a simulator-backed test suite for it, and run it in CI. Compare flakiness, mean time to detect regressions, and the number of production rollbacks over the following sprints. Many teams see immediate signal improvements and reduced manual test overhead.

Want to get started? Install strands-evals, seed a small initial_state_description, add Pydantic output schemas for critical tools, and run a handful of simulated cases in CI. Use telemetry to track divergences and reserve a compact set of live integration tests for the highest-risk paths.