Cost‑aware LLM routing with NadirClaw: route cheap when possible, escalate when needed
Executive summary: If your team is calling a high‑capability LLM for every prompt, you’re likely overpaying and adding latency. The practical pattern is local prompt classification + model switching: use a tiny local encoder to decide which requests are “simple” and send those to a cheaper model, while escalating the tricky stuff to a high‑capability model. NadirClaw is a lightweight proxy that implements this pattern, giving observable routing, cost estimates, and a safe escalation path.
The problem: expensive, slow, one‑size‑fits‑all LLM stacks
Many production systems default to “always use the biggest model” because it reduces product risk. That works—but it’s costly. For use cases like customer support, sales assistants, or internal chatbots, a large share of requests are low complexity (intent classification, simple summarization, standard responses). Sending those to a top‑tier model wastes tokens and increases latency.
Cost-aware routing addresses three KPIs at once: lower token spend, reduced median latency, and retained access to high‑capability models when it matters.
The pattern: local classification + model switching
At its core the pattern is simple:
- Embed incoming prompts locally.
- Compare the prompt embedding to precomputed centroids for “simple” and “complex”.
- If the prompt is close to the simple centroid and confidence is high, route to a cheap model; otherwise escalate to a stronger model.
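For concreteness, here is a minimal sketch of that decision step in Python, assuming sentence-transformers is installed and the two centroids are already computed. Function and variable names are illustrative, not NadirClaw's internals:

```python
# Minimal centroid-routing sketch. The 0.06 threshold mirrors the demo default;
# everything else is an illustrative reimplementation of the pattern.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(prompt, simple_centroid, complex_centroid, threshold=0.06):
    emb = encoder.encode(prompt)
    sim_simple = cosine(emb, simple_centroid)
    sim_complex = cosine(emb, complex_centroid)
    margin = sim_simple - sim_complex  # routing confidence
    # Keep the cheap model only when the prompt is clearly on the simple side.
    return ("cheap", margin) if margin >= threshold else ("high-capability", margin)
```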
Terminology (plain language)
- Embeddings — numeric summaries of text that let systems measure semantic similarity.
- Centroids — the “average” embedding for a group of sample prompts (for example, the average of all known simple prompts).
- Cosine similarity — a score that measures how close two embeddings are (higher = more similar).
- Routing confidence — a margin or difference between similarity scores; if it’s below a configured threshold the request is escalated.
We use centroid similarity to show, visually and numerically, why a request goes to the cheap model or the expensive one—this makes routing interpretable and auditable for product and compliance teams.
“Use a tiny local classifier to decide whether a prompt is simple or complex before calling a heavier model.”
How NadirClaw implements the pattern (demo highlights)
NadirClaw runs as an OpenAI‑compatible local proxy (default port 8856) and ships a CLI for local classification and a live proxy mode. The demo uses a small encoder (sentence-transformers/all-MiniLM-L6-v2) to compute embeddings and precompute two centroids—simple and complex. Cosine similarity determines which centroid a prompt is closest to; a routing confidence threshold controls when to escalate.
Key demo features:
- Local classification CLI (no provider API key required for classification).
- Decision‑boundary visualization showing similarity to simple vs. complex centroids.
- OpenAI‑compatible proxy that forwards escalated requests to chosen models (example: a “cheap/safe” model and a “high‑capability” model).
- JSONL request logs and a built‑in report command for auditing routing decisions and estimating cost.
Example model pair used in the demo (abstracted for executives): “cheap/safe” (flash) and “high‑capability” (pro). The specific demo used Gemini models—gemini‑2.5‑flash and gemini‑2.5‑pro—but the proxy is provider‑agnostic.
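Because the proxy is OpenAI‑compatible, pointing an existing client at it is typically just a base‑URL change. A sketch using the standard openai Python client; the /v1 path and the model alias are assumptions for illustration, not confirmed NadirClaw defaults:

```python
# Sketch: send a request through the local proxy on port 8856.
# The /v1 path and the "auto" model alias are assumptions, not confirmed defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8856/v1", api_key="local-proxy")

resp = client.chat.completions.create(
    model="auto",  # hypothetical alias: let the proxy choose flash vs. pro
    messages=[{"role": "user", "content": "Summarize this support ticket in one line."}],
)
print(resp.choices[0].message.content)
```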
“Centroid-based cosine similarity is used to explain and visualize why a prompt routes to simple vs. complex.”
How routing confidence works (and the 0.06 example)
Routing confidence is typically computed as a margin between similarity‑to‑simple and similarity‑to‑complex. A configured threshold (the demo uses 0.06 by default) is the minimum margin required to keep the request on the cheap model; if the margin falls below it, the proxy escalates to the stronger model. For example, a prompt scoring 0.71 against the simple centroid and 0.67 against the complex one has a margin of 0.04, below 0.06, so it escalates.
Practical frame: raising the threshold forces more escalations (stricter acceptance for cheap routing). Lowering the threshold makes the proxy more permissive about using the cheap model. Tune this knob to trade cost savings for safety.
Real numbers: demo pricing and an illustrative calculation
The demo assumes these prices per million tokens:
- Flash (cheap/safe): input $0.30, output $2.50
- Pro (high‑capability): input $1.25, output $10.00
Example workload (illustrative): 5,000 requests/day, average input 30 tokens, average output 120 tokens (150 tokens/request). Over a 30‑day month that’s ~22.5M tokens (4.5M input, 18M output).
Costs if every request used the Pro model:
- Pro input cost: 4.5M tokens → 4.5 * $1.25 ≈ $5.63
- Pro output cost: 18M tokens → 18 * $10.00 = $180.00
- Total always‑Pro ≈ $185.63 / month
Costs if 70% of requests go to Flash and 30% to Pro:
- Flash handles 3.15M input and 12.6M output → flash cost ≈ $32.45
- Pro handles 1.35M input and 5.4M output → pro cost ≈ $55.69
- Total mixed routing ≈ $88.13 / month
That configuration yields ≈ $97.50 monthly savings (≈ 52% cost reduction) versus always‑Pro. These are illustrative numbers based on the demo pricing; your mileage will vary with request volume, token lengths, and routed percentages.
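To plug in your own traffic profile, the arithmetic above fits in a few lines. A sketch using the demo's illustrative prices; swap in your own volumes and routed split:

```python
# Back-of-envelope monthly cost model using the demo's per-1M-token prices.
PRICES = {"flash": (0.30, 2.50), "pro": (1.25, 10.00)}  # (input, output) USD

def monthly_cost(req_per_day, in_tok, out_tok, flash_share, days=30):
    in_m = req_per_day * in_tok * days / 1e6    # input tokens, millions
    out_m = req_per_day * out_tok * days / 1e6  # output tokens, millions
    total = 0.0
    for model, share in (("flash", flash_share), ("pro", 1 - flash_share)):
        p_in, p_out = PRICES[model]
        total += share * (in_m * p_in + out_m * p_out)
    return total

always_pro = monthly_cost(5000, 30, 120, flash_share=0.0)  # ≈ $185.63
mixed = monthly_cost(5000, 30, 120, flash_share=0.7)       # ≈ $88.13
print(f"savings: ${always_pro - mixed:.2f}/month")
```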
When this pattern makes sense — and when it doesn’t
- Good fit: Chat assistants, customer support triage, AI for sales scripts, code scaffolding, and workflows with many low‑complexity prompts and a smaller set of high‑complexity tasks.
- Not a fit: Life‑critical outputs, legal or compliance documents that require guaranteed high quality, or any scenario where misclassification carries unacceptable risk. In those cases, always use the high‑capability model.
Operational tradeoffs and mitigations
Centroid routing is cheap and interpretable, but it has failure modes you must guard against:
- Centroid drift: As product usage changes, centroids become less representative. Mitigation: monitor centroid distance distributions and retrain when average margin drops or misroute rates exceed a threshold.
- Adversarial/ambiguous prompts: Prompts near the decision boundary, or ones that need special capabilities, are easy to misroute. Mitigation: conservative defaults plus modifier detection (agentic, chain‑of‑thought, vision/tool requests) that force escalation whenever those capabilities are required.
- Proxy availability and security: The local proxy adds operational surface area. Plan for secrets management, TLS, key rotation, log encryption, and health checks.
- Misroutes and UX impact: Track false negative rate (complex queries sent to cheap model). Set SLOs (e.g., <1% critical misroutes) and roll back threshold changes via feature flags or traffic sampling if the SLO worsens.
Suggested monitoring and metrics
- Percent of requests routed to cheap model
- Median and P95 latency per routed model
- Cost per resolved request (USD)
- Misroute rate (human review or automated quality checks)
- Centroid margin distribution (histogram over time)
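Most of these fall out of the JSONL logs directly. For instance, the margin distribution can be summarized in a few lines (a sketch assuming the confidence_margin field listed below):

```python
# Sketch: summarize routing-confidence margins from the JSONL request logs.
# Assumes each line carries the confidence_margin field described below.
import json
import numpy as np

with open("requests.jsonl") as f:
    margins = np.array([json.loads(line)["confidence_margin"] for line in f])

print(f"p50 margin: {np.percentile(margins, 50):.3f}")
print(f"p5 margin:  {np.percentile(margins, 5):.3f}")
print(f"escalation share (margin < 0.06): {(margins < 0.06).mean():.1%}")
```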
Logging fields to capture (JSONL)
- timestamp
- request_id
- embedded_similarity_simple
- embedded_similarity_complex
- confidence_margin
- routed_model
- input_tokens, output_tokens
- latency_ms
- outcome_label (if available from human QA)
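Written from Python, a single record might look like this (field names follow the list above; values are illustrative, and NadirClaw's exact schema may differ):

```python
# Sketch: append one routing decision to a JSONL log.
# Field names follow this article's list; actual NadirClaw schema may differ.
import json, time, uuid

record = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "request_id": str(uuid.uuid4()),
    "embedded_similarity_simple": 0.71,
    "embedded_similarity_complex": 0.67,
    "confidence_margin": 0.04,   # below the 0.06 threshold -> escalated
    "routed_model": "pro",
    "input_tokens": 30,
    "output_tokens": 120,
    "latency_ms": 840,
    "outcome_label": None,       # filled in later by human QA, if available
}

with open("requests.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```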
“A low-confidence threshold can be tuned to force uncertain prompts up to the stronger model.”
Rollout & evaluation plan (A/B test design)
Start conservative and measure both cost and quality.
- Phase 0 — Shadow mode: run the classifier locally and log decisions without routing live traffic.
- Phase 1 — Partial rollout: route a small fraction (5–10%) of traffic through NadirClaw while the rest uses always‑Pro. Measure cost, latency, and misroute rate for 2–4 weeks.
- Phase 2 — Ramp: if misroute and quality metrics are acceptable, increase routing to 30–70% with canary checks and rollback triggers.
- Phase 3 — Full rollout with ongoing monitoring and automated retraining triggers.
Retraining & governance
Define retraining triggers and governance upfront:
- Retrain centroids when average confidence margin drops X% or misroute rate exceeds Y%.
- Keep a labeled validation set for periodic accuracy checks (precision/recall for the “complex” class).
- Set a retention policy for JSONL logs and ensure compliance with data handling rules; consider hashing or removing PII before embedding if policy requires it.
Quick implementation checklist for execs
- Decide routing policy: who, which product flows, SLA thresholds.
- Estimate cost model using your traffic profile and demo pricing assumptions.
- Run local CLI classification on a sample of traffic (no API key required) to estimate routability.
- Deploy proxy in canary, capture logs, and run the NadirClaw report to validate savings and accuracy.
- Define retraining cadence, security requirements, and an incident playbook for misroutes.
Common pitfalls & mitigations
- Pitfall: Centroids built from non‑representative samples. Mitigation: sample traffic stratified by product area and label a validation set.
- Pitfall: Modifier markers missed (e.g., tool/vision requests). Mitigation: conservative default to escalate when modifiers are detected or confidence is low.
- Pitfall: Over‑aggressive threshold tuning that harms UX. Mitigation: A/B test thresholds and monitor CSAT or downstream error rates.
When to choose always‑Pro
There are legitimate cases to skip routing: regulatory/legal outputs, high‑stakes decisioning, and contexts where degraded responses harm customers or compliance. Use routing where cost/latency matters and the cost of a misclassification is manageable.
Next steps — 5‑step quickstart
- Install nadirclaw and sentence‑transformers (Python packages) and fetch the demo repo from GitHub.
- Build representative simple/complex prompt sets and compute centroids with all‑MiniLM‑L6‑v2 (see the sketch after this list).
- Run the local classifier CLI to visualize similarity scores and decision boundaries.
- Start the OpenAI‑compatible proxy on port 8856 and point a small sample of traffic at it.
- Run mixed workloads, produce the JSONL logs, and use NadirClaw’s report command to estimate savings and validate routing.
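Step 2, computing the centroids, reduces to averaging embeddings per class. A sketch; the sample prompts are placeholders for stratified samples of your real traffic:

```python
# Sketch: build simple/complex centroids from labeled prompt sets.
# The prompt lists are placeholders; use representative samples of real traffic.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

simple_prompts = ["What are your opening hours?", "Summarize this email in one line."]
complex_prompts = ["Plan a multi-step refactor of our billing service and justify each step."]

def centroid(prompts):
    embs = encoder.encode(prompts, normalize_embeddings=True)
    return np.mean(embs, axis=0)  # the "average" embedding for the group

np.save("simple_centroid.npy", centroid(simple_prompts))
np.save("complex_centroid.npy", centroid(complex_prompts))
```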
Key takeaways
- Use local classification to avoid unnecessary LLM calls. A tiny encoder + centroid comparisons let you keep trivial requests off the expensive model.
- Centroid-based cosine similarity is interpretable. Visualize and audit routing decisions so product, finance, and compliance teams can align.
- Thresholds control the safety/cost tradeoff. Configure routing confidence to tune how often traffic escalates to the high‑capability model.
- Operational work remains. Plan for monitoring, retraining, secrets management, and rollback mechanisms before full rollout.
“Running the proxy locally lets you observe model selection, latency, token usage and estimate cost savings against an always‑Pro baseline.”
Try the demo on a laptop, run a shadow pass on your traffic, and quantify the cost/quality tradeoffs. For product teams building AI agents or ChatGPT‑style assistants, cost‑aware routing is one of the most practical levers to control spend while keeping capability where it matters.