TL;DR
- Raw web pages and search snippets are noisy for LLMs. Combining Strands Agents SDK with Exa gives AI agents semantic, de‑noised search hits and full‑page extractions that are ready for model context windows.
- Two tools—exa_search and exa_get_contents—are exposed via the strands-agents-tools package. They accelerate grounded, multi‑step agent workflows while helping control token costs.
- Start with Exa’s Auto mode for balanced latency; use Deep or live crawl for freshness when you need it. Instrument runs with Amazon Bedrock AgentCore + OpenTelemetry for traceable, debuggable agent behavior.
Problem + promise: Raw HTML and search snippets are noisy and optimized for humans—not for LLMs. Pairing the Strands Agents SDK with Exa supplies agents with structured, agent-ready web content so models can reason reliably, reduce hallucinations, and spend fewer context tokens.
Exa returns cleaned, semantic results and full-page extracts formatted for direct inclusion into an LLM’s context window—no custom scrapers required.
How the pieces fit
Strands Agents SDK provides a model-driven loop: the LLM decides when to call tools based on tool signatures rather than following brittle, hard‑coded scripts. Exa provides two tools that plug into that loop via the strands-agents-tools package:
- exa_search — semantic search that returns categorized, ranked hits. Supports filters (news, research papers, GitHub, PDFs), date and domain constraints, and selectable search modes for latency vs. coverage.
- exa_get_contents — full-page extraction that returns cleaned article or page text. Features cached results, live-crawl fallback/force, configurable timeouts, and optional subpage crawling to follow references.
To the agent, these tools look like any others: call them, get structured output back, feed it to the model, and let the model decide the next step. That pattern is crucial for building multi-step, traceable workflows that reach depth without exploding token budgets.
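To see the shape of what these tools hand back, you can exercise the underlying API outside the agent loop. The sketch below uses Exa's public Python client (exa_py) to approximate what exa_search and exa_get_contents wrap; the parameters shown follow Exa's API, but treat the exact response fields as illustrative.

```python
# Sketch using Exa's Python client (pip install exa-py) to approximate
# the structured output that exa_search / exa_get_contents return.
import os

from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])

# Semantic search with category and date filters (mirrors exa_search).
results = exa.search(
    "recent advances in transformer efficiency",
    category="research paper",          # other options: news, github, pdf, ...
    start_published_date="2024-01-01",
    num_results=5,
)

# Full-page extraction for the top hit (mirrors exa_get_contents),
# with a character cap to keep the context window lean.
contents = exa.get_contents(
    [results.results[0].url],
    text={"max_characters": 2000},
)
print(contents.results[0].title, contents.results[0].text[:300])
```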
Search modes: latency vs. coverage
Choose a mode by trading speed for completeness:
- Instant (~200 ms) — for UI-facing, real‑time experiences where latency is critical.
- Fast (~450 ms) — useful when running many sequential queries.
- Auto (~1 s) — recommended default: a balanced mix of speed and recall.
- Deep (~3–6 s) — highest coverage and freshness; use when completeness matters or you need live crawled content.
Rule of thumb: start with Auto and escalate to Deep for targeted follow-ups. Deep and live crawls increase latency and API cost, so gate them with the agent’s decision logic.
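That gating can live in a thin wrapper around the search tool rather than in the prompt alone. A hypothetical sketch, where is_promising and the score threshold are placeholders for your own decision logic, and the Deep-mode escalation is represented as a narrower follow-up query:

```python
# Hypothetical escalation gate: broad discovery in Auto mode, deeper
# follow-ups only for leads that clear a relevance bar.
def is_promising(result, threshold: float) -> bool:
    # Placeholder scoring: substitute model judgment or your own heuristics.
    return (getattr(result, "score", None) or 0) >= threshold

def search_with_escalation(exa, query: str, threshold: float = 0.8):
    first_pass = exa.search(query, type="auto", num_results=10)  # balanced pass
    shortlist = [r for r in first_pass.results if is_promising(r, threshold)]

    followups = []
    for lead in shortlist[:3]:  # cap escalations: deep/live calls cost more
        followups.extend(
            exa.search(f"{query} {lead.title}", num_results=3).results
        )
    return first_pass.results + followups
```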
Content shaping and token control
Exa gives several levers to keep context windows efficient and answers grounded:
- Category and date filters — limit results to news, PDFs, research, or GitHub to reduce noise.
- Highlights and maxCharacters — request only the most relevant snippets or cap returned text to avoid token bloat.
- JSON schema‑guided extraction — instruct Exa to return structured fields (title, authors, abstract, conclusion) so agents receive parsed data instead of raw paragraphs.
- Cached extraction with live‑crawl fallback — prefer cached content for speed and cost; use live crawl when freshness or missing content is essential.
Example impact: swapping a full HTML extract for a schema-guided, three-field extract often reduces context tokens per document by 60–80%, depending on the source. That's a direct saving on model inference costs and a faster reasoning loop.
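As a sketch of that schema lever: Exa's contents endpoint supports structured summaries guided by a JSON schema, so the agent receives just the fields it asked for. The schema and options below are illustrative; check Exa's current docs for the exact summary parameters.

```python
# Sketch: request three structured fields instead of the full page text;
# trimming to schema fields is where the 60-80% token saving comes from.
import os

from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])

paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "abstract": {"type": "string"},
        "conclusion": {"type": "string"},
    },
    "required": ["title"],
}

contents = exa.get_contents(
    ["https://arxiv.org/abs/2205.14135"],  # example paper URL
    summary={"query": "key findings", "schema": paper_schema},
)
# The agent now consumes parsed fields rather than raw paragraphs.
print(contents.results[0].summary)
```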
A concrete multi‑step pattern: deep research assistant
Practical workflows illustrate value better than theory. Here’s a six‑step pipeline you can deploy inside a Strands agent. Example research topic: “recent advances in transformer efficiency.”
- Overview — exa_search (Auto): fetch high‑level summaries and category tabs (news, papers, GitHub). Agent asks: “What are the current themes?”
- News scan — exa_search (Fast): get the latest coverage and citations to companies or product launches.
- Academic sweep — exa_search (Deep) + category=research: identify preprints and peer‑reviewed papers.
- Code discovery — exa_search with category=github: locate repositories implementing promising techniques.
- Deep extraction — exa_get_contents: pull structured fields (methods, metrics, datasets) via JSON schema and follow subpages for supplemental tables or appendices.
- Synthesis — model synthesizes a 1–2 page, source‑attributed brief with a ranked list of the most actionable leads (papers + repos), and a recommended next experiment.
Each step is a distinct agent decision: the model escalates from overview to deep extraction only for the most promising sources, conserving tokens while preserving depth where it matters. The payoff is grounded answers, less context waste, and an agent that autonomously digs where value is highest.
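In Strands you don't hand-code those six steps; you state the policy and let the model-driven loop carry it out. A minimal sketch, assuming the import paths match your installed strands-agents-tools version:

```python
from strands import Agent
from strands_tools.exa import exa_search, exa_get_contents  # path may vary by version

research_agent = Agent(
    tools=[exa_search, exa_get_contents],
    system_prompt=(
        "You are a deep research assistant. For each topic: "
        "(1) broad Auto-mode search for themes; (2) fast news scan; "
        "(3) deep sweep of research papers; (4) GitHub repo discovery; "
        "(5) schema-guided extraction from only the most promising sources; "
        "(6) a 1-2 page source-attributed brief with ranked leads. "
        "Escalate to slower, deeper searches only when a lead looks valuable."
    ),
)

research_agent("Recent advances in transformer efficiency")
```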
Micro‑story: a product manager’s day
A product manager needs a competitive brief on “company X’s recent AI features.” Previously, assembling one meant hours of stitching together links and citations. With a Strands + Exa agent, they get a 2‑page, source‑attributed brief in under 90 seconds: the agent used Auto mode for the overview, switched to Deep for two key whitepapers, and extracted only the methods and results fields via JSON schema, saving tokens and producing clear, auditable citations for the deck.
Observability and governance
Agentic workflows are non‑deterministic. Instrumentation is no longer optional—it’s essential. Amazon Bedrock AgentCore Observability uses OpenTelemetry to capture spans for both LLM calls and tool calls. Traces appear in the CloudWatch GenAI Observability Dashboard with per‑span metadata (parameters, latency, token counts).
What to instrument and watch:
- Tool calls (exa_search / exa_get_contents): mode, filters, latency, result count.
- Model prompts and responses: token usage, response time, and returned schema validity.
- Alerts: spikes in Deep mode calls, sudden increases in token consumption, or repeated live crawls for the same URLs.
Observability lets you answer questions like “Which queries triggered live crawls?” and “Which sources account for most tokens?”—turning opaque failures into inspectable traces.
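AgentCore emits model and tool spans automatically; if you want extra attributes such as search mode or result counts, you can add them with the standard OpenTelemetry SDK. A sketch (the attribute names here are our own convention, not an AgentCore schema):

```python
# Sketch: wrap a search call in a custom OpenTelemetry span so mode,
# query, and result count land next to AgentCore's automatic spans.
from opentelemetry import trace

tracer = trace.get_tracer("deep-research-agent")  # illustrative tracer name

def traced_search(exa, query: str, mode: str = "auto"):
    with tracer.start_as_current_span("exa_search") as span:
        span.set_attribute("exa.mode", mode)
        span.set_attribute("exa.query", query)
        response = exa.search(query, type=mode, num_results=10)
        span.set_attribute("exa.result_count", len(response.results))
        return response
```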
Enterprise tradeoffs and mitigation
Key risks and practical mitigations:
- Freshness vs. cost/latency: Deep and live crawls increase both. Mitigate by caching, gating live crawls to high‑value queries (see the sketch after this list), and using Auto for initial discovery.
- Coverage gaps: Paywalled or niche sources may be missing. Plan for hybrid ingestion (internal data lakes, partner APIs) when exhaustiveness is required.
- Copyright & licensing: Extracting and summarizing third‑party content can create legal exposure. Include legal in design reviews and implement content usage policies and attribution rules.
- PII & privacy: Strip privacy‑sensitive fields, apply access controls, and set retention TTLs for extracted content logged to your observability stack.
- Non‑determinism: Use schema outputs, seed prompts, and replayable traces to improve reproducibility for regulated workflows.
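A concrete way to implement that freshness gate is a per-session crawl budget, as sketched below. The CrawlBudget class and high_value flag are hypothetical; the livecrawl options follow Exa's contents API, but verify the current values.

```python
# Hypothetical guardrail: cap live crawls per session and fall back to
# cached extraction once the budget is spent.
class CrawlBudget:
    def __init__(self, max_live_crawls: int = 5):
        self.remaining = max_live_crawls

    def fetch(self, exa, url: str, high_value: bool = False):
        if high_value and self.remaining > 0:
            self.remaining -= 1
            # Force a fresh fetch only for freshness-critical URLs.
            return exa.get_contents([url], livecrawl="always")
        # Default path: cached content, crawling live only on a cache miss.
        return exa.get_contents([url], livecrawl="fallback")
```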
Checklist before production
- Legal review for content licensing and attribution policies.
- Rate‑limit and cost controls for Deep/live crawl modes.
- OpenTelemetry spans for tool calls and prompts; dashboards and alerts in CloudWatch.
- PII detection/stripping and retention policies for extracted text (see the sketch below).
- Load testing to understand latency and token cost under realistic traffic.
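For the PII item above, a minimal scrub pass can run before extracted text reaches logs. A sketch with illustrative regexes; a managed PII-detection service should do the real work, with this as a last-line filter:

```python
# Minimal sketch: redact obvious PII patterns from extracted text before
# logging. Illustrative regexes only; not a substitute for real detection.
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),     # phone-like numbers
]

def scrub_pii(text: str, placeholder: str = "[REDACTED]") -> str:
    for pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```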
Alternatives and when to choose this combo
Options include building custom scrapers, using raw search APIs (Google/Bing/SerpAPI), or relying on browser‑based ChatGPT plugins. Each has tradeoffs:
- Custom scrapers: complete control, but heavy engineering and brittle maintenance.
- Raw search APIs: return HTML and human‑focused snippets; you still need parsing and deduplication layers.
- Browser plugins: limited automation, often inconsistent formats, and less observability.
Choose Strands + Exa when you want a low‑engineering path to structured web evidence, with the flexibility of a model‑driven orchestration layer and built‑in observability for production agents.
Quickstart: get an experiment running (5 steps)
- Install: pip install strands-agents strands-agents-tools (Python 3.10+).
- Set EXA_API_KEY=your_key in the environment and configure Bedrock credentials if using Bedrock models.
- Add exa_search and exa_get_contents to your Strands agent’s tools list.
- Run the GitHub sample repo’s deep research assistant and observe traces in CloudWatch.
- Compare results between Auto and Deep modes on 10 representative queries to measure latency and token impact.
Minimal Python example: register the tools
```python
# Minimal sketch; import paths and model ID are illustrative and may
# differ across strands-agents / strands-agents-tools versions.
from strands import Agent
from strands_tools.exa import exa_search, exa_get_contents

# Tools are passed at construction; the model decides when to call them.
agent = Agent(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    tools=[exa_search, exa_get_contents],
)

# The agent is invoked directly; tools run inside the model-driven loop.
agent("Research the latest transformer efficiency techniques and summarize sources.")
```
Practical experiment to measure impact
Suggested A/B test to quantify benefits:
- Define 50 research queries relevant to your domain.
- Run baseline agent using raw search + custom parsing (or direct web scraping).
- Run Strands + Exa agent with Auto + JSON schema extracts.
- Compare: time to assemble brief, number of tokens consumed, and an accuracy score from human raters for factual correctness and citation quality.
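A lightweight harness keeps the comparison honest. The sketch below assumes each variant is a callable returning its brief and a token count; run_baseline and run_strands_exa are placeholders for your two pipelines, and accuracy scoring stays with human raters.

```python
# Hypothetical A/B harness: time each variant and record token usage
# for the same query set; write rows for later human-rated accuracy.
import csv
import time

def benchmark(queries, run_baseline, run_strands_exa, out_path="ab_results.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "variant", "seconds", "tokens"])
        for query in queries:
            for name, fn in (("baseline", run_baseline),
                             ("strands_exa", run_strands_exa)):
                start = time.perf_counter()
                brief, tokens = fn(query)  # each variant returns (text, token_count)
                writer.writerow([query, name,
                                 round(time.perf_counter() - start, 2), tokens])
```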
Frequently asked questions
How do agents get current, structured web knowledge?
By calling Exa’s exa_search for semantic hits and exa_get_contents for full‑page, schema-guided extractions. The strands-agents-tools package exposes both to the Strands agent loop so the LLM can orchestrate discovery and extraction.
Which search mode should I start with?
Begin with Auto (~1 s latency) to balance speed and recall. Use Deep (~3–6 s) or force a live crawl for high‑value sources requiring freshness.
How can I control token costs while keeping grounded answers?
Use JSON schema extraction, highlights/maxCharacters, cached extracts with live fallback, and a multi-step pattern that only deep‑extracts the most promising sources.
How do I debug non‑deterministic agent runs?
Instrument LLM and tool spans via Amazon Bedrock AgentCore (OpenTelemetry) and inspect span-level traces in the CloudWatch GenAI Observability Dashboard to trace which calls produced which outputs.
What to do next (30‑minute experiment)
- Fork the sample GitHub repo for the deep research assistant.
- Set EXA_API_KEY and install strands-agents and strands-agents-tools.
- Run the demo with Auto mode and capture a trace in CloudWatch.
- Rerun the same query with Deep + JSON schema extraction and compare latency, token usage, and the quality of the final brief.
- Share results with legal and security teams before expanding to production.
Combining Strands’ model-driven orchestration with Exa’s semantic, agent-optimized search and extraction turns web knowledge from noisy input into structured evidence. For AI teams and business leaders building research, competitive intelligence, or tech‑support agents, the payoff compounds: faster prototypes, lower token bills, and answers you can trace back to sources.