Evaluate Multi-Turn AI Agents with ActorSimulator: Persona-Driven Testing for Better Goal Success

TL;DR: Single-turn tests miss the dynamics of real conversations. ActorSimulator (part of the Strands Evaluations SDK) creates persona-driven, goal-tracking simulated users that maintain conversation history, emit structured reasoning, and integrate with telemetry so you can measure session-level metrics like goal success and helpfulness at scale. Start with a few high-value scenarios, mix auto-generated and custom personas, instrument every turn, and treat simulation outputs as a reliable signal in your CI/CD pipeline.

Why multi-turn evaluation matters for AI agents

Customers rarely stop after one reply. Real conversations are journeys — they branch, backtrack, ask follow-ups, and sometimes end in frustration. A single-turn snapshot is a photograph; multi-turn evaluation is the full movie. If your agent must resolve bookings, troubleshoot issues, or guide sales conversations, you need tests that exercise how the agent responds across turns.

“Production conversations rarely end after a single exchange.”

Where most teams fall short

  • Static datasets: Predefined input/output pairs are cheap but brittle. They can’t capture the choices a user makes after an unexpected assistant reply.
  • Scripted flows: Better than manual testing for repeatability, but brittle when the assistant behaves differently — they assume a fixed path.
  • Ad-hoc LLM prompts: Asking a model to “act like a user” often drifts between runs and produces inconsistent behavior that undermines CI trust.

How ActorSimulator simulates realistic users — the three pillars

  • Persona consistency — Each actor has a profile (communication style, expertise level, patience, context, and a concrete goal). Profiles can be auto-generated or injected to target specific segments (e.g., an impatient expert).
  • Goal tracking — The simulator tracks whether the actor’s goal has been met, whether the actor has given up, or whether the session has hit a max_turns threshold. These clear stop signals let evaluators compute session-level metrics.
  • Adaptive behavior + structured reasoning — Actors generate context-aware replies that react to the assistant’s prior responses. Each reply includes structured reasoning explaining why the actor responded that way, which makes failure triage far faster.

Put together, these capabilities let you run repeatable, realistic conversations and capture why an actor continued, changed course, or dropped out.
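As a sketch, the loop behind these pillars might look like the following, assuming hypothetical actor and assistant callables; the Turn type, run_session function, and status strings here are illustrative, not the actual ActorSimulator API:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str            # "actor" or "assistant"
    text: str
    reasoning: str = ""  # structured reasoning emitted with each actor reply

def run_session(actor, assistant, max_turns=8):
    """Drive one simulated conversation until the goal is met,
    the actor gives up, or the max_turns ceiling is reached."""
    history = []
    for _ in range(max_turns):
        # status is one of "continue", "met", "gave_up"
        reply, reasoning, status = actor(history)
        history.append(Turn("actor", reply, reasoning))
        if status in ("met", "gave_up"):
            return history, status
        history.append(Turn("assistant", assistant(history)))
    return history, "max_turns"
```

The three terminal statuses map directly onto the session-level metrics discussed later: goal success, abandonment, and ceiling hits.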

Simple multi-turn failure example

Scenario: Travel assistant helping a user book a flight.

  • User: “Find me a flight from Paris to Tokyo on May 12, economy.”
  • Assistant (incorrectly): “Here are nonstop flights on May 12 with airline X.”
  • User (follow-up): “I prefer flights arriving before 10pm. Can you filter for that?”
  • Assistant (ignores constraint, suggests later flights) — session stalls or the user abandons.

Single-turn tests would only validate the assistant’s first reply. ActorSimulator reproduces the follow-up and records structured reasoning from the actor (e.g., “I asked for arrival before 10pm because I have an evening meeting”), which makes it obvious whether the assistant ignored a critical constraint or misused a tool.

Integrating with observability: spans, session mapping, and evaluators

ActorSimulator emits per-turn traces that integrate with OpenTelemetry (spans for model calls, tool invocations, and timings). You can map those spans into coherent sessions using a session mapper (for example, StrandsInMemorySessionMapper) so downstream evaluators consume session-shaped data rather than isolated traces.
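Conceptually, the mapping step groups flat span records by a session identifier and orders them in time. A minimal sketch follows; the dict-shaped spans and the "session.id" attribute name are assumptions for illustration, not the StrandsInMemorySessionMapper contract:

```python
from collections import defaultdict

def map_spans_to_sessions(spans):
    """Group flat span records into session-shaped lists,
    ordered by start time within each session."""
    sessions = defaultdict(list)
    for span in spans:
        sessions[span["attributes"]["session.id"]].append(span)
    return {
        sid: sorted(group, key=lambda s: s["start_time"])
        for sid, group in sessions.items()
    }
```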

These session traces feed evaluators like HelpfulnessEvaluator and GoalSuccessRateEvaluator to produce business-facing metrics such as:

  • Goal Success Rate (percentage of sessions that completed the actor_goal)
  • Turn-to-resolution (median number of turns to success)
  • Abandonment Rate (sessions where actor gave up)
  • Average tool-call latency and tool-failure counts
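Given session records that carry an outcome and a turn count, the first three metrics reduce to simple aggregations; a sketch, with the record shape assumed for illustration:

```python
from statistics import median

def session_metrics(sessions):
    """Compute goal success rate, median turns-to-resolution,
    and abandonment rate from session outcome records."""
    total = len(sessions)
    successes = [s for s in sessions if s["outcome"] == "met"]
    abandoned = [s for s in sessions if s["outcome"] == "gave_up"]
    return {
        "goal_success_rate": len(successes) / total,
        "turns_to_resolution": median(s["turns"] for s in successes) if successes else None,
        "abandonment_rate": len(abandoned) / total,
    }
```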

Why structured reasoning matters

When a simulated user explains why they asked a follow-up, evaluators and engineers can classify failures quickly: was there a misunderstanding, missing data, invalid tool call, or an ambiguous assistant response? That reduces time-to-fix and improves the signal-to-noise ratio for CI alerts.

“Each simulated response includes structured reasoning so evaluators can inspect why an actor followed up, expressed confusion, or redirected.”

Quickstart (minimal)

Install and start a simple run. Example steps:

  • Install — pip install strands-agents-evals
  • Seed a case — provide an input prompt and optional task description (e.g., “Book a round-trip from London to Tokyo arriving before 10pm”).
  • Run ActorSimulator — launch simulated sessions, capture spans to OpenTelemetry, and map sessions for evaluators.

Minimal pseudocode (conceptual):

from strands_evals import ActorSimulator, StrandsInMemorySessionMapper, HelpfulnessEvaluator

# Conceptual sketch; exact class and method signatures may differ in the SDK.
sim = ActorSimulator(model="gpt-4-like")

# Seed case: the opening prompt plus a task description used as the actor's goal.
case = {"input": "Book a flight Paris → Tokyo on May 12, economy", "task": "arrive before 10pm"}
session_traces = sim.run(case, num_sessions=50, max_turns=8)

# Map raw per-turn traces into session-shaped data for evaluators.
mapper = StrandsInMemorySessionMapper()
sessions = mapper.map(session_traces)

evaluator = HelpfulnessEvaluator()
results = evaluator.evaluate(sessions)
print(results)

Sample ActorProfile (pseudocode)

{
  "name": "ImpatientExpert",
  "communication_style": "concise, direct",
  "expertise_level": "high",
  "patience_level": "low",
  "context": "traveler with strict meeting schedule",
  "actor_goal": "book a flight arriving before 10pm on May 12, minimal layovers"
}
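In Python, the same profile could be modeled as a small frozen dataclass so test suites can define personas declaratively; the field names mirror the sample above, though the actual SDK type may differ:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActorProfile:
    name: str
    communication_style: str
    expertise_level: str   # e.g. "low" | "medium" | "high"
    patience_level: str
    context: str
    actor_goal: str

impatient_expert = ActorProfile(
    name="ImpatientExpert",
    communication_style="concise, direct",
    expertise_level="high",
    patience_level="low",
    context="traveler with strict meeting schedule",
    actor_goal="book a flight arriving before 10pm on May 12, minimal layovers",
)
```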

Practical guidelines & turn limits

  • Pick turn limits that match the workflow: 3–5 turns for focused tasks; 8–10 for multi-step workflows. Raise the limit if many sessions stop at the ceiling without success.
  • Mix breadth and depth: Use auto-generated profiles for coverage and injected ActorProfile objects to stress known problem personas (novice, impatient, technical, non-native speaker).
  • Write clear task descriptions: Specific, measurable goals produce clearer stop signals and metrics (e.g., “book an economy ticket arriving < 10pm on May 12”).
  • Instrument every turn: Capture model calls, tool invocations, and latencies so you can correlate goal failures with system behavior.
  • Start small and iterate: Begin with a handful of high-value scenarios (customer support, booking, sales enablement) before expanding to edge cases.
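The first guideline implies a concrete check: measure how often sessions stop at the turn ceiling without success, and raise max_turns when that fraction is high. A sketch, assuming each session record carries an outcome field:

```python
def ceiling_hit_rate(sessions):
    """Fraction of sessions that stopped at the turn ceiling without
    reaching the actor's goal; a high value suggests raising max_turns."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if s["outcome"] == "max_turns")
    return hits / len(sessions)
```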

Metrics, dashboards, and CI integration

Turn telemetry into actionable alerts:

  • Set baseline thresholds for Goal Success Rate and Turn-to-resolution (e.g., a 5% drop in goal success after a model update triggers an investigation).
  • Track pattern-level signals over time rather than reacting to single-session failures.
  • Use dashboards that connect session traces to human-readable transcripts and structured reasoning so product owners can triage quickly.

Example CI rule: fail a rollout if Avg. Goal Success Rate across core scenarios drops by more than X% or Abandonment Rate increases by Y%.
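That rule can be expressed as a small gate function comparing current metrics against a stored baseline; the threshold arguments stand in for the X%/Y% placeholders above:

```python
def rollout_gate(baseline, current, max_success_drop, max_abandon_rise):
    """Return (passed, reasons). Metrics are fractions in [0, 1];
    thresholds are absolute deltas (e.g. 0.05 for five points)."""
    reasons = []
    if baseline["goal_success_rate"] - current["goal_success_rate"] > max_success_drop:
        reasons.append("goal success rate dropped beyond threshold")
    if current["abandonment_rate"] - baseline["abandonment_rate"] > max_abandon_rise:
        reasons.append("abandonment rate rose beyond threshold")
    return (not reasons, reasons)
```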

Validating fidelity and mitigating bias

LLM-based simulators can introduce biases and drift. Validate simulation fidelity against production samples:

  • Compare distributional KPIs (average turns, abandonment, common intents) between simulated sessions and a sampled set of real conversations.
  • Monitor for prompt skews and random-seed drift; add deterministic seeding for repeatability where possible.
  • Include human-in-the-loop spot checks for safety-critical or regulated domains.
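A first-pass fidelity check can compare KPI summaries between simulated sessions and sampled production conversations and flag large relative gaps; this is a sketch, and rigorous validation would add proper statistical tests:

```python
def kpi_gap(simulated, real, keys=("avg_turns", "abandonment_rate")):
    """Relative gap per KPI between simulated and production samples.
    A large gap on any key signals that simulation fidelity needs work."""
    gaps = {}
    for k in keys:
        denom = real[k] or 1e-9  # guard against division by zero
        gaps[k] = abs(simulated[k] - real[k]) / denom
    return gaps
```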

Simulation is a powerful signal, not a replacement for production monitoring or compliance audits.

Mini experiment: find regressions after a model update

Design:

  1. Select 10 high-value multi-turn scenarios (e.g., flight bookings, refund requests, sales qualification).
  2. Run 200 simulated sessions per scenario using a mix of auto-generated and custom profiles before the model change (baseline).
  3. Run the same simulations after the model update.
  4. Compare Goal Success Rate, Turn-to-resolution, and Abandonment Rate. Inspect structured reasoning for sessions that regressed most.
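Step 4 can be sketched as a per-scenario delta on Goal Success Rate that ranks the worst regressions first, so engineers know which sessions' structured reasoning to read:

```python
def rank_regressions(baseline, updated):
    """baseline/updated: {scenario: goal_success_rate}. Returns scenarios
    sorted by largest drop first; a scenario missing from the updated run
    is treated as a total regression."""
    deltas = {
        name: baseline[name] - updated.get(name, 0.0)
        for name in baseline
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
```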

Outcome you can expect: actionable RCA within hours. Structured reasoning points your engineers at whether regressions come from tool misuse, ambiguous responses, or new hallucination patterns.

Quick integration checklist for teams

  • Pick 5–10 core scenarios that reflect business impact.
  • Create or auto-generate a coverage mix of actor profiles.
  • Instrument model calls and tools with OpenTelemetry spans.
  • Map spans to sessions (StrandsInMemorySessionMapper or equivalent).
  • Evaluate sessions with Helpfulness, Goal Success Rate, and other evaluators.
  • Set CI alerts on pattern-level regressions, not single flukes.

What ActorSimulator won’t solve for you

  • Perfect fidelity to every nuance of real users — simulators are approximations and must be validated against production samples.
  • Regulatory sign-off for safety-critical decisions — human audits remain essential where lives or compliance are at stake.
  • Cost-free coverage — more scenarios and higher session counts improve signal but increase compute and evaluation costs.

Business value — what leaders should care about

  • Reduce manual testing overhead by catching multi-turn regressions automatically.
  • Expose user-facing failures earlier in the delivery pipeline, reducing customer-impacting incidents.
  • Improve developer efficiency: structured reasoning accelerates root-cause analysis.
  • Measure product-level health (goal success, abandonment) in business terms that stakeholders understand.

Next steps and resources

To experiment quickly, run pip install strands-agents-evals, then clone the Strands Agents samples and the Amazon Bedrock AgentCore samples to see ActorSimulator wired into production-like flows. Start with a small suite of scenarios, instrument every turn, and expand coverage where you see patterns of failure.

Authors and contributors: Ishan Singh, Jonathan Buck, Vinayak Arannil, Abhishek Kumar (AWS).