How a Model‑Agnostic LLM Harness Made Cheap Models Act Premium
TL;DR: Poetiq built an automated, model‑agnostic orchestration layer—or “harness”—using only API access to Gemini 3.1 Pro, then applied that harness unchanged to other LLMs. Every model tested improved on LiveCodeBench Pro (LCB Pro), sometimes dramatically. The lesson for AI teams: smarter orchestration can be a cheaper, faster route to better performance than chasing or fine‑tuning bigger models.
What is a harness, and why does model‑agnostic matter?
A harness is a lightweight inference orchestration layer that wraps an LLM with structured prompt flows, multi‑call strategies (sending several prompts and combining the answers), assembly logic (stitching partial outputs together), and validators (unit tests and runtime/memory checks). “Model‑agnostic” means the harness never touches a model’s weights or internal activations; it uses only standard API calls. That lets you swap model backends in and out like plug‑and‑play components.
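To make that concrete, here is a minimal sketch (not Poetiq’s implementation; the function names and the majority‑vote assembly step are illustrative assumptions) of a harness that treats any model as a plain text‑in/text‑out API call:

```python
from collections import Counter
from typing import Callable, Sequence

# Any vendor's API can sit behind this: a prompt in, a completion out.
ModelFn = Callable[[str], str]

def harness(task: str, model: ModelFn,
            validators: Sequence[Callable[[str], bool]],
            n_candidates: int = 3) -> str | None:
    """Structured prompt -> multi-call -> validate -> assemble, with no access to weights."""
    prompt = f"Solve the following problem. Return only code.\n\n{task}"
    candidates = [model(prompt) for _ in range(n_candidates)]            # multi-call strategy
    passing = [c for c in candidates if all(v(c) for v in validators)]   # unit tests, limits
    if not passing:
        return None                                                      # caller can retry or escalate
    return Counter(passing).most_common(1)[0][0]                         # assembly: majority vote
```

Because the harness only ever sees a text‑in/text‑out callable, swapping Gemini for GPT or an open‑weights model is a one‑line change at the call site.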
LiveCodeBench Pro (LCB Pro) is a competitive‑programming benchmark of C++ problems with strict correctness, runtime, and memory limits, and with ground‑truth code withheld. It’s designed to resist dataset contamination, so gains on LCB Pro are more likely to reflect genuine procedural reasoning and engineering than memorization.
How Poetiq built and tested the Meta‑System
Poetiq’s Meta‑System automatically constructs and iteratively refines a task‑specific harness using recursive improvement. The process used Gemini 3.1 Pro as the optimization target and relied only on API calls—no fine‑tuning or internal model access. Once optimized, the harness was applied unchanged to a broad set of other models.
Key methodological notes and limits of public reporting:
- Poetiq reports the harness was generated via an automated search/optimization loop; specific compute budgets, iteration counts, or exact search algorithms were not disclosed publicly.
- Results reported are benchmark outcomes on LCB Pro; whether the harness was evaluated on private production codebases was not stated.
- For enterprises, an important next step is an ablation study: which harness components (decomposition, multi‑call, validator, assembly) drive most gains?
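One way to run that ablation (a hypothetical leave‑one‑out sketch; the component names mirror the list above, and `run_benchmark` is whatever evaluation loop you already have) is:

```python
COMPONENTS = ("decomposition", "multi_call", "validator", "assembly")

def ablation_report(run_benchmark, tasks):
    """Leave-one-out ablation. `run_benchmark(tasks, enabled)` should return a pass rate in [0, 1]."""
    full = run_benchmark(tasks, set(COMPONENTS))
    for dropped in COMPONENTS:
        reduced = run_benchmark(tasks, set(COMPONENTS) - {dropped})
        print(f"without {dropped}: {reduced:.1%}  (full harness: {full:.1%}, delta {full - reduced:+.1%})")
```

Components whose removal costs the most pass rate are the ones worth hardening first.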
Poetiq’s Meta‑System automatically constructs and refines a task‑specific orchestration layer around an LLM, without tuning the model itself.
Results snapshot (select models)
| Model | Before (LCB Pro) | After (with harness) | Delta (points) |
|---|---|---|---|
| GPT 5.5 High | 89.6% | 93.9% | +4.3 |
| Gemini 3.1 Pro (optimization target) | 78.6% | 90.9% | +12.3 |
| Gemini 3.0 Flash | 72.3% | 82.3% | +10.0 |
| Kimi K2.6 | ~50.0% | 79.9% | ≈+30.0 |
| Nemotron 3 Super 120B | reported baseline (value not listed) | reported +12.8 improvement | +12.8 |
On LCB Pro’s Hard tier the relative gains were especially large: Gemini 3.1 Pro rose from 7.7% → 58.3%, and GPT 5.5 High from 50.0% → 75.0%. Those jumps suggest the harness helps with stepwise procedural planning, validation against tight constraints, and iterative refinement—skills that matter in production code generation.
The harness was built using standard API calls only, and once optimized on Gemini 3.1 Pro, it improved every tested model when applied unchanged.
Why this matters for business
System engineering around models is now as strategically important as model selection. Practical implications:
- Cost efficiency: Smaller, cheaper models can behave like premium offerings when driven by a strong harness—lower inference and licensing costs for equivalent or better throughput.
- Vendor flexibility: A model‑agnostic harness reduces lock‑in. Swap vendors for price, latency, or compliance without re‑architecting your orchestration.
- Faster productization: Build coding assistants, automated code review, or CI integrations by investing in orchestration rather than expensive fine‑tuning or retraining cycles.
- Performance multiplier: The harness concept amplifies ROI on existing LLM investments—often with less engineering risk than full fine‑tuning pipelines.
Think of the harness as the conductor: the instruments (models) are important, but the conductor’s score and timing can make a small ensemble sound like a full orchestra.
Trade‑offs, risks and governance
- Benchmarks are proxies: LCB Pro is tougher than many tests, but success on a benchmark isn’t the same as reliability across messy, legacy codebases and deployment constraints.
- Operational overhead: Multi‑call orchestration increases latency and orchestration compute. You must measure cost‑per‑passing‑solution, not just raw accuracy.
- Brittleness and overfitting: Automatically generated strategies can exploit test harness quirks. Continuous monitoring and adversarial testing are essential.
- Legal and contractual boundaries: Using vendor APIs to optimize behavior across multiple models may raise licensing or terms‑of‑service questions—get legal counsel involved early.
- Explainability: Automatically synthesized harness logic can be harder to audit. Maintain versioning, human‑review gates, and detailed logs for regulatory use cases.
Practical 7‑step pilot checklist for engineering teams
- Choose representative tasks: Pick 5–20 CI tests, algorithmic problems, or code review patterns that reflect real workload diversity.
- Baseline with a cheap model: Measure pass‑rate, latency, error types, and cost using a small, inexpensive LLM as your control.
- Implement a minimal harness: Decomposer → N parallel/sequential model calls → validator (unit tests/constraints) → assembler (see the sketch after this checklist).
- Iterate the harness: Run 10–50 optimization cycles (automated or guided), logging passing rates and failure modes after each change.
- Compare economics: Calculate cost‑per‑passing‑solution and total latency; run an A/B test against a premium model if available.
- Govern and harden: Add logging, human approval gates for risky outputs, version control, and rollback procedures.
- Decide and scale: Roll to production, continue tuning, or pivot to a bigger model if ROI isn’t favorable.
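For step 3 and the iteration in step 4, a starting‑point sketch (hedged: `decompose`, `assemble`, the model call, and the validator are stand‑ins you would supply for your own stack) might look like this:

```python
import concurrent.futures
from typing import Callable, Sequence

def minimal_harness(task: str,
                    call_model: Callable[[str], str],
                    decompose: Callable[[str], Sequence[str]],
                    validate: Callable[[str], bool],
                    assemble: Callable[[Sequence[str]], str]) -> tuple[str, bool]:
    """Decompose -> parallel model calls -> assemble -> validate (unit tests/constraints)."""
    subtasks = decompose(task)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        partials = list(pool.map(call_model, subtasks))   # parallel model calls
    candidate = assemble(partials)                        # stitch partial outputs
    return candidate, validate(candidate)                 # pass/fail drives the next iteration

def pilot(tasks, call_model, decompose, validate, assemble):
    """Steps 2 and 4: compare baseline vs. harnessed pass rates on the same task set."""
    baseline = sum(validate(call_model(t)) for t in tasks) / len(tasks)
    harnessed = sum(minimal_harness(t, call_model, decompose, validate, assemble)[1]
                    for t in tasks) / len(tasks)
    print(f"baseline: {baseline:.0%}  harnessed: {harnessed:.0%}")
```

Log the two rates, the failure modes, and the call counts after every harness change; that log is the raw material for the economics comparison in step 5.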
Hypothetical ROI example (illustrative)
Example assumptions (hypothetical): small model call = $0.002, premium model call = $0.02, harness requires 3 small calls per task on average, premium model requires 1 call.
- Small model alone: 40% pass rate → average cost per attempt = $0.002; expected cost per pass = $0.002 / 0.40 = $0.005
- Small model + harness: 80% pass rate → 3 calls = $0.006 per attempt; expected cost per pass = $0.006 / 0.80 = $0.0075
- Premium model: 90% pass rate → 1 call = $0.02 per attempt; expected cost per pass = $0.02 / 0.90 ≈ $0.022
Under these numbers, even with extra calls, the harnessed small model can be materially cheaper per passing solution than the premium model. Swap in your real costs and pass‑rates to get an instant heuristic for investment decisions.
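To reuse that heuristic with your own measured numbers (a sketch; the prices and pass rates below are the hypothetical ones from this example, not real vendor pricing), the arithmetic is simply expected cost per attempt divided by pass rate:

```python
def cost_per_pass(price_per_call: float, calls_per_attempt: float, pass_rate: float) -> float:
    """Expected cost of one passing solution = cost per attempt / probability of passing."""
    return (price_per_call * calls_per_attempt) / pass_rate

print(cost_per_pass(0.002, 1, 0.40))  # small model alone      -> 0.005
print(cost_per_pass(0.002, 3, 0.80))  # small model + harness  -> 0.0075
print(cost_per_pass(0.02, 1, 0.90))   # premium model          -> ~0.022
```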
Decision checklist: when to build a harness vs fine‑tune vs buy bigger
- Build a harness when you need vendor flexibility, have strict runtime/validation constraints, or want faster time‑to‑value without curating datasets for fine‑tuning.
- Fine‑tune when you have large, high‑quality domain data, regulatory demands for bespoke performance, and the budget/governance to manage retraining.
- Buy bigger when latency is critical (single‑call required), or when the premium model delivers uniquely superior reasoning that orchestration can’t bridge.
Operationalizing safely
Plan for monitoring (per‑task pass rates), A/B experimentation, cost tracking (cost‑per‑pass), and an audit trail of harness changes. Version your harness like code, include human‑in‑the‑loop checkpoints for edge cases, and ensure validators reflect production constraints (memory, timeouts, security scans).
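As one example of a production‑shaped validator (a sketch assuming the candidate solution can be run as a local command; the timeout value and command are placeholders for your real constraints), checking a runtime limit alongside the expected output might look like this:

```python
import subprocess

def validate_candidate(cmd: list[str], stdin_text: str, expected: str,
                       timeout_s: float = 2.0) -> bool:
    """Run a candidate solution under a wall-clock limit and compare its output."""
    try:
        result = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False                                   # runtime constraint violated
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```

Memory ceilings, sandboxing, and security scans extend the same idea: the validator should fail a candidate for anything your production environment would fail it for.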
Key questions leaders will ask
- Can performance be improved without model fine‑tuning?
Yes. Poetiq improved multiple LLMs on LCB Pro using an API‑only, model‑agnostic harness—no fine‑tuning required.
- Does a harness optimized on one model transfer to others?
Yes. A harness optimized on Gemini 3.1 Pro reportedly improved every tested model when applied unchanged, demonstrating cross‑model transfer.
- Do gains hold on very hard algorithmic tasks?
Yes. The largest relative improvements were on LCB Pro’s Hard tier, where the harness moved some models from single‑digit to majority pass rates.
- What are the operational costs and risks?
Costs include additional API calls and latency; risks include brittle strategies and possible legal/licensing constraints. Governance and testing are essential.
- Will this work outside benchmarks?
Partially. Harnesses address orchestration gaps common in production, but real‑world validation against your codebase and CI is required.
Poetiq’s results point to a practical strategic pivot: invest in LLM orchestration and AI automation to extract more value from models you already have. For most organizations that want to improve AI for coding or engineering workflows without heavy data pipelines or retraining costs, a focused harness pilot is the fastest path to discoverable ROI.
Author: Saipien — AI strategy and implementation for leaders. Want a 6‑week pilot checklist tailored to your stack? Contact us for a one‑page plan that maps tests, cost assumptions, and success criteria.