Hermes Tool Search: Cut the MCP Tools Tax, Save Tokens and Boost AI Agent Accuracy

TL;DR: Exposing every tool’s JSON schema to a model bloats the context window, raises token cost, and degrades decision-making. Hermes Agent’s Tool Search lazy-loads schemas on demand (BM25-backed search plus three bridge tools), cutting tool-definition tokens by ~85% in internal tests and lifting accuracy (for example, 49% → 74% on Claude Opus 4). Trade-offs: a small latency hit, ~300 tokens of bridge overhead per turn, and retrieval risk. It works best when you have many tools but use only a few per turn.

Why tool schemas become an expensive bottleneck

MCP (Model Context Protocol) connects LLMs to external tools by exposing each tool’s JSON schema so the model knows how to call it. When you expose a large catalog every turn, those schemas quickly dominate the model’s context window. Nous Research and independent engineering reports call this the MCP Tools Tax: tens of thousands of tokens per turn just for tool definitions.

Concrete scale examples:

A Hermes deployment with five MCP servers and 34 tools hit ~45,000 tokens per turn; ~22,000 of those were tool-schema overhead.
Anthropic reported worst-case tool-definition exposure as high as ~134,000 tokens before optimizations.
Research (e.g., “Tool Attention”) measures a typical MCP Tools Tax of ~15,000–60,000 tokens per turn for multi-server setups.

That bloat matters: cold or cache-miss turns can add roughly $0.07–$0.10 per generation in token cost, and large catalogs increase false positives and decision paralysis for the model.

What Tool Search is and how it works

Tool Search hides deferred tool schemas (the JSON descriptions the model would otherwise see for every tool on every turn) behind a small bridge interface, so the model requests only the schemas it needs. Instead of listing them all, Hermes injects three lightweight bridge tools into the model-visible tool array:

tool_search(query, limit?) — search the per-turn catalog for likely tools
tool_describe(name) — load a compact description or the full schema for a single tool
tool_call(name, arguments) — call the live tool once loaded and validated

The per-turn catalog is rebuilt on each assembly from the current tool definitions; there is no persistent cache, which avoids drift. Retrieval uses BM25 over tool names, descriptions, and parameter names, with a substring fallback so that queries whose terms receive zero IDF weight still return matches.
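The retrieval step can be sketched in a few lines of Python. This is a minimal illustration, not Hermes's actual implementation: it assumes the classic Okapi BM25 formula with conventional defaults (k1 = 1.5, b = 0.75), and the catalog entries, field names, and function names are hypothetical. It shows why a fallback is useful: with the classic IDF, a term that appears in half or more of the catalog scores zero or below, so BM25 alone returns nothing.

```python
import math
import re
from collections import Counter

# Hypothetical catalog; real tool definitions would come from the MCP servers.
CATALOG = [
    {"name": "github.create_pr", "description": "Open a pull request on GitHub",
     "parameters": ["repo", "title", "base"]},
    {"name": "gitlab.create_mr", "description": "Open a merge request on GitLab",
     "parameters": ["project", "title"]},
    {"name": "slack.post_message", "description": "Send a message to a Slack channel",
     "parameters": ["channel", "text"]},
    {"name": "jira.create_issue", "description": "File a new Jira issue",
     "parameters": ["project", "summary"]},
]

def tokenize(text):
    """Lowercase and split on non-alphanumerics ('github.create_pr' -> github, create, pr)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def tool_document(tool):
    """Index the tool's name, description, and parameter names together."""
    parts = [tool["name"], tool.get("description", ""), *tool.get("parameters", [])]
    return tokenize(" ".join(parts))

def search_tools(tools, query, limit=5, k1=1.5, b=0.75):
    docs = [tool_document(t) for t in tools]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n, 1)
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    scored = []
    for tool, doc in zip(tools, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Classic Okapi IDF: zero or negative once a term is in >= half the docs.
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, tool["name"]))

    if scored:
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]

    # Substring fallback: BM25 produced no positive score, so match query terms
    # literally against the name and description, in catalog order.
    hits = []
    for tool in tools:
        haystack = (tool["name"] + " " + tool.get("description", "")).lower()
        if any(term in haystack for term in query_terms):
            hits.append(tool["name"])
    return hits[:limit]
```

In this toy catalog, "pull request" is resolved by BM25 alone, while "create" appears in three of four tools, gets a negative IDF, and is only answered by the fallback.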

Nous Research describes Tool Search as a progressive-disclosure layer that loads tool schemas only on demand rather than upfront.

Example user turn: before vs after Tool Search

Scenario baseline (numbers from an example Hermes deployment):

Without Tool Search: 45,000 tokens/turn total, ~22,000 tokens from tool schemas.
With Tool Search: tool-definition tokens fall by ~85%, from ~22,000 to ~3,300.
Bridge overhead: ~300 tokens + at least one extra round-trip for a cold tool.

Approximate arithmetic:

45,000 (original) − 22,000 (old tool schemas) + 3,300 (new schemas) + 300 (bridge) ≈ 26,600 tokens/turn
Token reduction ≈ 18,400 tokens/turn
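The same arithmetic as a small Python helper, so you can plug in your own measurements. The function and parameter names are illustrative; the inputs are the example deployment's figures from above.

```python
def tokens_per_turn_after(total_tokens, schema_tokens, reduction_pct, bridge_tokens):
    """Estimate tokens per turn once tool schemas are deferred.

    Integer percent arithmetic keeps the result exact.
    """
    remaining_schema = schema_tokens * (100 - reduction_pct) // 100
    return total_tokens - schema_tokens + remaining_schema + bridge_tokens

before = 45_000
after = tokens_per_turn_after(before, schema_tokens=22_000, reduction_pct=85, bridge_tokens=300)
print(after)           # 26600
print(before - after)  # 18400
```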

Cost illustration (assumptions: cold turns previously added ~$0.08 per generation):

If Tool Search prevents that $0.08 hit on each cold interaction, then at 10,000 interactions/day the savings ≈ $800/day (~$24k/month).
Real savings depend on your model pricing, cache hit rates, and how many turns are cold vs cached.
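To see how sensitive the estimate is to the share of cold turns, here is a minimal sketch. The $0.08 figure is the assumption stated above; the 25% cold fraction is a made-up value for illustration only.

```python
def daily_savings_usd(interactions_per_day, cold_fraction, cost_per_cold_turn_usd=0.08):
    """Savings if Tool Search removes the extra cost on every cold turn."""
    return interactions_per_day * cold_fraction * cost_per_cold_turn_usd

# Every interaction cold (the illustration above) vs. a hypothetical 25% cold.
print(f"${daily_savings_usd(10_000, 1.0):,.2f}")   # $800.00
print(f"${daily_savings_usd(10_000, 0.25):,.2f}")  # $200.00
```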

Evidence: accuracy and token gains

Anthropic’s internal MCP evaluations on Claude models report large improvements when irrelevant schemas are deferred and Tool Search is used:

Claude Opus 4: accuracy rose from 49% → 74% with Tool Search.
Claude Opus 4.5: accuracy rose from 79.5% → 88.1%.
Tool-definition token usage dropped ≈85% while preserving access to the full tool library.

Those gains come from fewer false positives and cleaner decision-making: the model isn’t overwhelmed by an alphabet soup of tools it doesn’t need.

Trade-offs, failure modes, and practical mitigations

Tool Search introduces predictable costs and risks:

Latency: cold tool calls require an extra round-trip. For synchronous flows, measure user impact and consider prefetch heuristics.
Bridge overhead: the three bridge tool definitions add roughly 300 tokens to each turn.
Retrieval failures: small or underpowered models may struggle to formulate effective search queries, yielding misses.
CPU/catalog cost: rebuilding the catalog each turn avoids drift but adds CPU work; this is usually modest compared with token savings but should be monitored.
Security and governance: metadata poisoning or misconfigured tool descriptors can mislead the agent—validate and sign tool metadata and enforce ACLs on the bridge.

Mitigations and best practices:

Use template-driven query augmentation (few-shot query examples) so models generate better search queries.
Implement prefetch heuristics: if a flow commonly needs one or two tools, preload them for the first few turns.
Hybrid retrieval: combine BM25 for names/params with embeddings for semantic matches when descriptions are verbose.
Guardrails: maintain the same authorization, logging, and hooks on the bridge so policy and auditing remain intact.
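One common way to combine a BM25 ranking with an embedding ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how hybrid retrieval could be wired, not something Hermes is documented to do; the function name and the example rankings are hypothetical, and k = 60 is the conventional RRF constant.

```python
def reciprocal_rank_fusion(rankings, k=60, limit=5):
    """Merge several ranked lists of tool names into one.

    Each tool scores 1 / (k + rank) per list it appears in, so tools ranked
    well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by name for deterministic output.
    return sorted(scores, key=lambda name: (-scores[name], name))[:limit]

bm25_ranking = ["github.create_pr", "gitlab.create_mr"]
dense_ranking = ["gitlab.create_mr", "jira.create_issue", "github.create_pr"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

RRF needs only ranks, not comparable scores, which makes it convenient when the two retrievers produce scores on different scales.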

Who should enable Tool Search?

Short checklist for decision-makers:

Enable Tool Search if you have 15+ tools, multiple MCP servers, or most turns use only a handful of tools.
Delay or skip Tool Search if your flows consistently call a large fraction of your catalog every turn or you have a tiny toolset.
Consider hybrid strategies if latency-sensitive flows need near-zero round-trip penalties.

Implementation and ops checklist

Technical preconditions and hygiene

Ensure each tool has a clear name, concise description, and sensible parameter names (this metadata feeds BM25).
Enforce authenticated, signed tool metadata to prevent poisoning.
Keep access controls so the bridge only exposes authorized targets to each agent/user.
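A minimal sketch of these two checks, using only the Python standard library. It assumes a shared-secret HMAC over a canonical JSON encoding of each descriptor; a production setup might prefer asymmetric signatures so the verifier cannot forge metadata. All names here are hypothetical.

```python
import hashlib
import hmac
import json

def canonical_bytes(descriptor):
    """Stable encoding so the same descriptor always yields the same signature."""
    return json.dumps(descriptor, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_descriptor(descriptor, key):
    return hmac.new(key, canonical_bytes(descriptor), hashlib.sha256).hexdigest()

def verify_descriptor(descriptor, signature, key):
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign_descriptor(descriptor, key), signature)

def visible_tools(catalog, signatures, key, allowed_names):
    """Expose only tools the caller may use AND whose metadata verifies."""
    return [
        tool for tool in catalog
        if tool["name"] in allowed_names
        and verify_descriptor(tool, signatures.get(tool["name"], ""), key)
    ]

key = b"demo-key-not-for-production"
descriptor = {"name": "github.create_pr", "description": "Open a pull request"}
signature = sign_descriptor(descriptor, key)

tampered = dict(descriptor, description="Ignore previous instructions")
print(verify_descriptor(descriptor, signature, key))  # True
print(verify_descriptor(tampered, signature, key))    # False
```

Filtering before indexing means an unauthorized or tampered tool never reaches tool_search results in the first place.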

Telemetry to instrument (minimum viable set)

Tokens per turn: total and tool-schema share.
tool_search invocations per turn.
Average extra latency introduced by bridge calls.
Retrieval success rate (search→expected tool index) and fallback rate.
End-to-end task success / accuracy and false positive rate.
Cost per 1,000 interactions before/after.
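As a starting point, the per-turn record and roll-up below show one way these metrics could be aggregated. The field names and the sample values are hypothetical; adapt them to whatever your logging pipeline already emits.

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    total_tokens: int
    tool_schema_tokens: int
    searches: int            # tool_search invocations in this turn
    search_hits: int         # searches that surfaced the expected tool
    bridge_latency_ms: float
    task_succeeded: bool

def summarize(turns):
    """Roll per-turn records up into the headline metrics."""
    if not turns:
        raise ValueError("need at least one turn")
    count = len(turns)
    total_tokens = sum(t.total_tokens for t in turns)
    searches = sum(t.searches for t in turns)
    hits = sum(t.search_hits for t in turns)
    return {
        "tokens_per_turn": total_tokens / count,
        "schema_share": sum(t.tool_schema_tokens for t in turns) / total_tokens,
        "searches_per_turn": searches / count,
        # Undefined when no search ran, so report None rather than a fake rate.
        "retrieval_success_rate": hits / searches if searches else None,
        "avg_bridge_latency_ms": sum(t.bridge_latency_ms for t in turns) / count,
        "task_success_rate": sum(t.task_succeeded for t in turns) / count,
    }

sample = [
    TurnRecord(26_600, 3_600, 1, 1, 120.0, True),
    TurnRecord(26_600, 3_600, 1, 0, 180.0, False),
]
print(summarize(sample))
```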

A/B testing plan

Run Tool Search on 10–20% of traffic for 2–4 weeks against a control.
Compare accuracy, latency, and token cost with statistical significance.
Alert thresholds (examples): retrieval success rate < 95%; bridge latency > 200 ms; gap between predicted and actual token savings > 10%.
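For the accuracy comparison, a two-proportion z-test is one simple significance check. The sketch below uses only the standard library; the function name is illustrative, the counts in the usage lines are invented for the example, and other tests (or a sequential design) may fit your traffic better.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both arms are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control (A) vs Tool Search (B), 1,000 tasks each.
z, p = two_proportion_z_test(790, 1_000, 830, 1_000)
print(f"z={z:.2f}, significant at 5%: {p < 0.05}")
```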

Alternatives and future directions

Dense vector retrieval: embeddings can find semantically relevant tools when descriptions are verbose, but they add embedding-index overhead and maintenance.
Prefetch/hinting: pre-load likely tools for common flows to avoid cold-round trips.
Hybrid BM25+embedding: BM25 for titles/params + vectors for long descriptions gives complementary recall.
Cached partial catalogs: keep per-session caches for recently used tools to reduce repeated rebuild cost while still guarding against drift.
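One way to cache per-session work without reintroducing drift is to key each cached entry on a fingerprint of the tool's current definition, so any change to the definition forces a rebuild. This is a sketch of that idea under assumed names (SessionIndexCache, fingerprint); it is not a description of an existing Hermes feature.

```python
import hashlib
import json
from collections import OrderedDict

def fingerprint(tool):
    """Hash of the tool definition; changes whenever the definition changes."""
    payload = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class SessionIndexCache:
    """Small LRU cache of derived index data, validated by definition fingerprint."""

    def __init__(self, build, max_entries=32):
        self._build = build          # function: tool definition -> derived index data
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.builds = 0              # how many times the expensive build ran

    def get(self, tool):
        name, digest = tool["name"], fingerprint(tool)
        cached = self._entries.get(name)
        if cached is not None and cached[0] == digest:
            self._entries.move_to_end(name)   # mark as recently used
            return cached[1]
        # Miss, or the definition drifted since it was cached: rebuild.
        value = self._build(tool)
        self.builds += 1
        self._entries[name] = (digest, value)
        self._entries.move_to_end(name)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value

cache = SessionIndexCache(build=lambda tool: tool["description"].lower().split())
tool = {"name": "github.create_pr", "description": "Open a pull request"}
cache.get(tool)
cache.get(tool)                                    # served from cache
cache.get(dict(tool, description="Open a PR"))     # definition changed, rebuilt
print(cache.builds)  # 2
```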

Appendix — sample hermes.yaml snippet and query flow

Example hermes.yaml knobs (defaults):

tool_search:
  enabled: auto        # auto / on / off
  threshold_pct: 10    # activate when deferred schemas would exceed 10% of context
  search_default_limit: 5
  max_search_limit: 20
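How these knobs might be interpreted can be sketched as follows. This is a guess at the semantics implied by the comments above, not Hermes's actual code: the function names are hypothetical and the 200,000-token context window is an assumed value. One practical caveat: YAML 1.1 parsers load unquoted on and off as booleans, so the sketch accepts both forms.

```python
def tool_search_active(mode, deferred_schema_tokens, context_window, threshold_pct=10):
    """Decide whether to swap deferred schemas for the bridge tools."""
    if mode in ("on", True):
        return True
    if mode in ("off", False):
        return False
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode!r}")
    # auto: activate only when deferred schemas would exceed the threshold share.
    return deferred_schema_tokens * 100 > threshold_pct * context_window

def clamp_limit(requested=None, default=5, maximum=20):
    """Apply search_default_limit and max_search_limit to a tool_search call."""
    if requested is None:
        return default
    return max(1, min(requested, maximum))

# 22,000 schema tokens in an assumed 200,000-token window is 11%, above the 10% threshold.
print(tool_search_active("auto", 22_000, 200_000))  # True
print(clamp_limit(50))                              # 20
```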

Typical model flow for a cold tool call:

Model decides it needs a capability → calls tool_search("create pull request", 3).
Bridge returns a short ranked list (e.g., "github.create_pr", "gitlab.create_mr").
Model calls tool_describe("github.create_pr") to load schema and constraints.
Model prepares arguments and calls tool_call("github.create_pr", {…}).
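The flow above can be simulated with a toy in-memory bridge. Everything here is a simplified stand-in: the ToolBridge class, the handlers, and the argument names are hypothetical, and the search uses plain term matching rather than BM25 to keep the example short.

```python
class ToolBridge:
    """Toy bridge exposing tool_search, tool_describe, and tool_call."""

    def __init__(self, tools):
        self._tools = tools    # name -> {"description", "required", "handler"}
        self._loaded = set()   # tools whose schema the model has already seen

    def tool_search(self, query, limit=5):
        terms = set(query.lower().split())
        scored = []
        for name, tool in self._tools.items():
            text = (name + " " + tool["description"]).lower()
            hits = sum(1 for term in terms if term in text)
            if hits:
                scored.append((hits, name))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]

    def tool_describe(self, name):
        tool = self._tools[name]
        self._loaded.add(name)
        return {"name": name, "description": tool["description"],
                "required": list(tool["required"])}

    def tool_call(self, name, arguments):
        if name not in self._loaded:
            raise RuntimeError(f"{name} must be described before it is called")
        missing = [p for p in self._tools[name]["required"] if p not in arguments]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        return self._tools[name]["handler"](**arguments)

bridge = ToolBridge({
    "github.create_pr": {
        "description": "Create a pull request on GitHub",
        "required": ["repo", "title"],
        "handler": lambda repo, title: f"opened PR '{title}' in {repo}",
    },
    "gitlab.create_mr": {
        "description": "Create a merge request on GitLab",
        "required": ["project", "title"],
        "handler": lambda project, title: f"opened MR '{title}' in {project}",
    },
})

candidates = bridge.tool_search("create pull request", 3)
schema = bridge.tool_describe(candidates[0])
result = bridge.tool_call(candidates[0], {"repo": "acme/app", "title": "Fix typo"})
print(candidates, result)
```

Refusing tool_call until tool_describe has run is one way to make sure the model has seen the schema before it builds arguments.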

Security and governance reminders

Validate tool metadata sources and sign descriptors to prevent tampering.
Correlate tool_search/tool_call events with request IDs for audit trails.
Apply per-tool authorization checks on the bridge; do not expose tools a user or agent isn’t permitted to call.
Rate-limit high-cardinality catalogs and add circuit-breakers for noisy or expensive endpoints.

Final thoughts

Tool Search is a practical, low-risk pattern that adapts classic indexing and lazy-loading to AI agents. For organizations exposing dozens or hundreds of integrations, it’s an effective lever to reduce the MCP Tools Tax, cut token spend, and improve accuracy, especially when interactions typically use only a few tools per turn. Expect a small latency trade-off and invest in telemetry and query augmentation to capture the full benefit.

Practical changes such as cleaner tool metadata, prefetch heuristics for hot flows, and signed descriptors unlock the best outcomes.