Polar: Train LLM Agents on Your Production Harness Without Rewriting – Faster RL & SFT

Polar: Train LLM-based AI Agents Without Rewriting Your Production Harness

TL;DR: Polar is a model-call proxy that lets teams run reinforcement learning (RL) and supervised fine-tuning (SFT) on real production harnesses (SDKs, CLIs, tool orchestration) without reimplementing them. Point your harness at Polar’s gateway, capture token-level signals, and optionally stitch multi-turn traces with prefix_merging to get big performance and efficiency gains—especially when the base model hasn’t seen your runtime protocol before.

The problem: rewiring a production harness kills velocity

Your production “harness” is the glue: vendor SDKs, custom CLIs, or orchestration code that calls an LLM and passes outputs into tools. Traditionally, training agents with reinforcement learning required reimplementing that harness inside a research environment (env.step/reset), which is slow, error-prone, and often loses execution details that matter at runtime.

Polar flips the script: instead of ripping out the harness, insert a proxy at the model API boundary. That single change—pointing your model base URL to Polar—preserves the harness exactly while collecting the token-level signals researchers need for RL and SFT.

What Polar does (plain English)

Acts as a model API proxy: your harness continues to call the same model API shape, but the requests now go through Polar’s gateway.
Normalizes provider formats: Polar detects provider APIs (OpenAI Chat/Responses, Anthropic, Google generateContent, etc.) and converts them into a common internal shape for consistent capture.
Captures token-level metadata: prompt and response tokens, sampled token IDs, log probabilities (logprobs), and finish reasons—data used for loss computation and credit assignment.
Returns responses in the original provider schema so the harness sees no behavioral change.
Supports streaming: non-streaming upstream responses can be synthesized into provider-shaped streams so streaming harnesses remain compatible.
Orchestrates experiments: a rollout server schedules runs and gateway nodes handle session lifecycles, trajectory building, and evaluation.

Polar lets you run agent training by intercepting calls at the model border—no harness rewrite required.

How Polar reconstructs training data

Captured calls must be converted into trajectories for training. Polar provides two reconstruction modes:

per_request (simple): treat each model call as an independent trace. Lossless per call, but multi-turn sessions are fragmented into many small traces—useful when calls are atomic.
prefix_merging (efficient): merge ordered completions into longer trajectories using strict token-prefix checks. Only sampled assistant tokens are included in the trainable loss; canonical or interstitial tokens are masked. This creates longer, token-faithful traces aligned with runtime behavior.

Prefix_merging is the efficiency lever. In an ablation with the same model and hardware (three training steps), prefix_merging reduced trainer updates from 1,185 to 218, cut wall-clock time from 189.5 minutes to 35.2 minutes (≈5.4× speedup), and raised average rollout GPU utilization from 20.4% to 87.7%.

Evidence: impact on real benchmarks

Using Guided Reinforcement Policy Optimization (GRPO)—guided policy optimization for language agents—and the SWE-Bench / SkyRL harnesses, Polar produced notable performance gains when models had to learn harness-specific execution semantics:

Codex harness: baseline 3.8% → Polar RL 26.4% (+22.6 points)
Claude Code: 29.8% → 34.6% (+4.8 points)
Qwen Code: 34.6% → 35.2% (+0.6 points)
Pi: 34.2% → 40.4% (+6.2 points)

Polar also handled large-scale offline SFT generation. Example: Qwen3.5-122B-A10B on 8×H100 attempted 1,638 sessions, accepted 504 trajectories (30.8% acceptance) for ≈64 GPU-hours. Accepted trajectories averaged ~104 messages and 51 assistant turns—rich dialogs ready for supervised fine-tuning.

Why this matters for business

Training with token-faithful trajectories that mirror your production harness reduces the train-to-deploy gap. That matters when agents interact with vendor SDKs, CI/CD code-generation tools, or company-specific CLIs and toolchains. The most pronounced uplifts come when the base model hasn’t seen your harness protocol before—retraining with Polar teaches the model to behave under your exact runtime semantics.

Concrete business gains

Lower integration cost: swap a model URL instead of re-implementing a harness.
Faster experiments: prefix_merging reduces trainer churn and wall-clock time, lowering compute waste.
Higher fidelity data: token-level captures and streaming preservation create SFT corpora that reflect production behaviors and edge cases.
Risk-localized: reward/evaluator design remains a human responsibility—Polar centralizes integration friction so teams can focus on reward engineering and governance.

Where Polar fits in an AI automation stack

Polar is especially attractive when:

Your agent depends on third-party SDKs or custom CLIs that are expensive to reimplement.
You need high-fidelity SFT data from real interactions (long contexts, tool calls, streaming).
You want to experiment with RL for language agents but can’t disrupt production behavior.

How to run a 4–6 week Polar pilot

Prep (week 1): confirm your harness accepts a configurable model base URL and verify your serving stack exposes token IDs/logprobs or can return them via the provider. Plan security and retention policies.
Routing & infra (week 1–2): deploy Polar gateway nodes near your harness, configure the rollout server, and set up secure networking and logging controls.
Minimal evaluator (week 2): implement a simple completion-based reward (task success / correctness) for early signal; avoid complex weighted rewards until you see initial behaviors.
Data collection (week 3): run an online GRPO experiment or generate offline SFT trajectories to gather seed data—monitor for reward-hacking and distribution shift.
Iteration & SFT (week 4–6): refine evaluators, run prefix_merging for efficiency, perform SFT or RL updates, and validate on your production harness.

Minimal config change example: point your harness to Polar’s gateway instead of the provider endpoint. For example, update MODEL_BASE_URL from https://api.provider.com to https://polar-gateway.local.

Adoption checklist: engineering, security, governance

Engineering: harness must allow configurable model base URL; ensure proxy latency is acceptable for your SLAs; plan gateway scaling and failover.
Data capture: verify provider support for token IDs and logprobs; if unavailable, expect partial fidelity and adapt reward/evaluator design.
Security & privacy: encrypt traffic, apply strict RBAC to captured trajectories, redact or anonymize PII if needed, and set retention and audit policies.
Compliance: confirm provider terms allow proxying and storing token-level metadata; assess GDPR/CCPA concerns for user-generated content.
Evaluation & rewards: build robust evaluators and monitor for reward-hacking; use session normalization and PRM-style credit assignment where appropriate.

Quick cost & ROI sketch

Example from the Polar demo: ~64 GPU-hours produced 504 accepted trajectories for offline SFT. Cloud GPU pricing varies widely; at $5–$50 per GPU-hour that equates to roughly $320–$3,200 in raw GPU cost. Add engineering time and storage. The ROI comes from faster model improvements, lower engineering rework, and higher deployment fidelity—especially for high-impact applications like code generation in CI or revenue-impacting sales assistants.

Risks, mitigations, and when not to use Polar

Risk: Token-level capture may violate provider terms or data governance. Mitigation: legal review, selective redaction, consent and retention policies.
Risk: Harnesses that cannot change model base URL. Mitigation: work with harness owners or proxy at a network edge, but this adds complexity.
Risk: Providers that don’t expose token IDs/logprobs or that use encrypted channels. Mitigation: partial fidelity for training, rely on other signals, or run SFT generation on models you host.
When not to use Polar: trivial harnesses that you can reimplement quickly, environments that cannot expose required metadata, or cases where legal/regulatory constraints forbid capturing conversational content.

Practical gotchas

per_request mode can encourage reward-hacking because it fragments sessions—use prefix_merging once you have deployment-grade evaluators.
Streaming compatibility is preserved by synthesizing streams, but latency-sensitive production paths should be load-tested.
Store provenance: record model versions, gateway config, evaluator code, and timestamps for reproducibility and audits.

Experiment details & quick numbers

Prefix_merging ablation (same model/hardware, 3 training steps): trainer updates 1,185 → 218; wall-clock time 189.5 min → 35.2 min; rollout GPU utilization 20.4% → 87.7%.
GRPO results on Qwen3.5-4B (SkyRL, prefix_merging): Codex +22.6 pts, Claude Code +4.8 pts, Qwen Code +0.6 pts, Pi +6.2 pts.
Offline SFT demo: Qwen3.5-122B-A10B on 8×H100 — 1,638 attempts → 504 accepted (30.8%); ~64 GPU-hours total; ~104 messages/session on accepted trajectories.

Key takeaways and frequently asked questions

What is Polar’s core advantage?

It lets you train LLM-based agents by proxying model calls at the API boundary, so you avoid rewriting harness code. That preserves runtime behavior and captures token-level data needed for RL and SFT.

How does Polar capture the training signal?

It detects provider API formats, normalizes requests, records token IDs and log probabilities, and reconstructs token-faithful trajectories using per_request or prefix_merging for training.

Which trajectory mode should I pick?

Use per_request for simple, lossless per-call capture or prefix_merging for efficient, longer multi-turn traces. Prefix_merging significantly reduces trainer updates and wall-clock time when multi-turn fidelity matters.

Will Polar improve model behavior in production?

Yes—especially when the pre-trained base model hasn’t been exposed to your harness protocol. Experiments show large uplifts (example: a 22.6-point gain on Codex) when training matches execution semantics.

What are the main operational risks?

The proxy captures sensitive token-level data and depends on provider support for token IDs/logprobs. Teams must address security, privacy, provider terms, and robust reward/evaluator design to avoid reward-hacking and distribution shift.

Where to try it and next steps

Polar is open-source and registered as a NeMo Gym environment under an Apache-2.0 license. Explore the implementation and experiment configs on GitHub: ProRL-Agent-Server. Read the technical preprint on arXiv: arXiv:2605.24220. For NVIDIA NeMo resources, see NVIDIA NeMo.

If your team needs help turning Polar into a pilot—adoption checklist, a short ROI model, or an integration plan—consider starting with a scoped four-week pilot: validate token capture, deploy a gateway, run a minimal evaluator, and iterate to SFT or GRPO as signals mature.