OSCAR: INT2 KV-Cache Compression for Long-Context LLMs — 8× Memory Reduction, Near-BF16 Accuracy

OSCAR: Attention‑Aware INT2 Compression for Long‑Context LLM Serving

Executive summary: Long‑context LLM deployments are increasingly bottlenecked by KV‑cache memory and bandwidth. OSCAR (Offline Spectral Covariance‑Aware Rotation) compresses most of a model’s key–value (KV) cache to roughly 2 bits per element (INT2) while keeping accuracy close to BF16 (bfloat16) and leaving standard paged KV‑cache serving and prefix caching intact. The practical payoff on the evaluated models: ~8× KV memory reduction, up to ~3× decode speedup and 6–8× higher job‑level throughput at 100K context, and average accuracy changes ranging from +0.27 to −3.78 points, with the largest drop on the smallest model.

Why KV cache compression matters for long‑context LLMs

As prompt length grows, the KV cache (key–value cache) stored during autoregressive decoding becomes the dominant consumer of GPU memory and memory bandwidth, because every decode step has to read the cached keys and values for all earlier tokens. Memory per cached token, not raw FLOPs, therefore becomes the cost driver for supporting ChatGPT‑style long transcripts, codebases, or multi‑hour dialogues. Traditional quantization approaches either stop at INT4 or higher, or force custom memory layouts that break paged attention and prefix caching, creating operational friction for production services.
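To make the scaling concrete, here is a minimal back‑of‑the‑envelope sketch of BF16 KV‑cache size as a function of context length. The model shape (36 layers, 8 KV heads, head dimension 128) is a hypothetical example chosen for illustration, not a figure from the OSCAR evaluation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2.0, batch_size=1):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # Factor 2: each token stores one key vector and one value vector per head.
    elems_per_token = 2 * num_layers * num_kv_heads * head_dim
    return batch_size * context_len * elems_per_token * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

for ctx in (10_000, 100_000):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB per sequence in BF16")
```

The cost is linear in both context length and batch size, which is why long‑context, high‑batch serving runs out of memory long before it runs out of compute.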

TL;DR — what OSCAR does and why it’s different

OSCAR computes attention‑aware rotations offline from a short calibration pass, then combines them with a Hadamard equalizer and a bit‑reversal grouping so that quantization error lands in directions attention rarely reads. In practice, most of the token history is stored as INT2, while a BF16 sink and a small window of recent tokens stay in high precision. The KV cache then averages ~2.28 bits per element (~8× smaller than BF16) with no change to client APIs or paged‑KV semantics.

“OSCAR computes rotations from attention statistics so quantization error is steered into directions the attention mechanism is unlikely to read.”

Think of the rotation like turning a framed painting: you angle the canvas so the fine details face the viewer, and the noisy background, where loss is tolerable, is what gets compressed. The intuition is to steer error away from attention‑sensitive subspaces instead of naively squeezing every channel into four levels.
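The intuition rests on two linear‑algebra facts, illustrated in the toy sketch below: an orthogonal rotation applied to both queries and keys leaves attention scores unchanged, and the score error caused by a perturbation of a key depends on how much query energy lies along that perturbation. The data here is synthetic and the setup is an illustration of the principle, not OSCAR’s actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# 1) Rotating q and k by the same orthogonal matrix preserves q . k.
q = rng.standard_normal(d)
k = rng.standard_normal(d)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
score = q @ k
score_rot = (R.T @ q) @ (R.T @ k)  # q^T R R^T k = q^T k

# 2) Synthetic queries with a strongly anisotropic covariance.
scales = np.linspace(3.0, 0.05, d)
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q = (rng.standard_normal((4096, d)) * scales) @ basis.T  # rows are queries
cov = Q.T @ Q / len(Q)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# A unit-norm key error e changes each score by q . e.
# Its mean squared effect is e^T cov e, i.e. the eigenvalue along e.
e_low, e_high = eigvecs[:, 0], eigvecs[:, -1]
err_low = np.mean((Q @ e_low) ** 2)
err_high = np.mean((Q @ e_high) ** 2)

print(f"score drift under rotation: {abs(score - score_rot):.2e}")
print(f"mean squared score error, low-energy direction:  {err_low:.4f}")
print(f"mean squared score error, high-energy direction: {err_high:.4f}")
```

The same amount of key error costs far less when it sits in a direction the queries rarely excite, which is the effect the rotation is meant to exploit.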

How OSCAR works — a practical technical overview

Calibration and rotations: Run a small calibration pass to estimate attention statistics. For keys, compute query covariance (which channels co‑occur across queries). For values, compute a score‑weighted value covariance (weights reflect which value channels the attention actually uses). Derive one‑time rotation matrices from these covariances.
Hadamard equalizer: Apply a Walsh‑Hadamard transform as an equalizer to reduce per‑channel outliers and flatten dynamic range before quantization.
Permuted bit‑reversal grouping: Group rotated channels using a bit‑reversal permutation so the INT2 groups map compatibly to paged KV layouts. That preserves channel locality required by standard paged attention and prefix caching.
Mixed‑precision cache layout: Keep a small BF16 sink (S0 = 64 tokens) and a recent BF16 window (W = 256 tokens); store the remaining history in INT2 after rotation and clipping. Typical calibration clip percentiles are cK ≈ 0.96 (keys) and cV ≈ 0.92 (values). Default group size GK = 64 channels per INT2 group works well across evaluated models.
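The steps above can be sketched in NumPy for the key path. This is a simplified illustration under stated assumptions, not the OSCAR implementation: the way the eigenbasis and the Hadamard matrix are composed, the reading of the clip value as a symmetric quantile, and the per‑group asymmetric scale and offset are all assumptions, and the bit‑reversal grouping is omitted. Function names are hypothetical.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n is a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def key_rotation(calib_queries):
    """Assumed composition: eigenbasis of the query covariance, then a Hadamard mix."""
    cov = calib_queries.T @ calib_queries / len(calib_queries)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ hadamard(cov.shape[0])


def quantize_int2(x, group_size=64, clip=0.96):
    """Asymmetric 2-bit quantization per channel group with quantile clipping (assumed scheme)."""
    n, d = x.shape
    g = x.reshape(n, d // group_size, group_size)
    lo = np.quantile(g, 1.0 - clip, axis=-1, keepdims=True)
    hi = np.quantile(g, clip, axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # four levels span three steps
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    g = codes * scale + lo
    return g.reshape(g.shape[0], -1)


rng = np.random.default_rng(0)
d = 128
calib_q = rng.standard_normal((2048, d)) * np.linspace(2.0, 0.1, d)  # synthetic queries
keys = rng.standard_normal((512, d))                                  # synthetic keys

R = key_rotation(calib_q)
codes, scale, lo = quantize_int2(keys @ R)          # what the cache would store
keys_hat = dequantize_int2(codes, scale, lo) @ R.T  # reconstruction in the original basis
print("relative reconstruction error:",
      np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys))
```

In a real kernel the codes would be packed four to a byte and queries would be rotated instead of de‑rotating keys; both are left out here to keep the sketch short.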

“By combining attention‑aware eigen‑bases with a Hadamard equalizer and a bit‑reversal grouping, OSCAR achieves practical INT2 KV caches without custom paged layouts.”

Two practical touches make OSCAR production‑friendly: rotations and thresholds can be precomputed and published (see RotationZoo / ModelScope for supported models), and the value‑rotation step can be absorbed into model weights offline to remove runtime overhead entirely.
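Why the value rotation can be folded into the weights follows from the attention output being linear in the values. The single‑head sketch below uses random matrices and hypothetical names (`W_v`, `W_o`, `R`); a real model would apply the same idea per head, respecting its own weight layout.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 32, 16, 10

X = rng.standard_normal((n_tokens, d_model))   # hidden states
W_v = rng.standard_normal((d_model, d_head))   # value projection
W_o = rng.standard_normal((d_head, d_model))   # output projection
A = rng.random((n_tokens, n_tokens))
A /= A.sum(axis=1, keepdims=True)              # stand-in for attention weights

# Any orthogonal rotation of the value channels.
R, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))

# Fold R into the value projection and its inverse (R^T) into the output projection.
W_v_rot = W_v @ R
W_o_rot = R.T @ W_o

out_ref = A @ (X @ W_v) @ W_o
out_abs = A @ (X @ W_v_rot) @ W_o_rot          # values are cached already rotated

print("max abs difference:", np.abs(out_ref - out_abs).max())
```

Because `R @ R.T` is the identity, the two outputs agree up to floating‑point error, and the cache holds rotated values without any rotation being executed at serving time.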

Benchmarks & accuracy tradeoffs

Key measured outcomes across evaluated models (Qwen3 family and GLM‑4.7‑FP8):

Effective precision: ~2.28 bits per KV element. Against 16‑bit BF16 that is 16 / 2.28 ≈ 7.0× less KV memory; the ~8× headline figure matches the nominal 2‑bit payload (16 / 2 = 8).
Decode throughput: up to ~3× speedup at 100K context for single‑decoder scenarios. Job‑level throughput at 100K context, batch size 32: 6.17× (Qwen3‑4B‑Thinking) and 7.83× (GLM‑4.7‑FP8).
Accuracy (averaged across AIME25, GPQA‑Diamond, HumanEval, LiveCodeBench v6, MATH500 at 32K gen):
- Qwen3‑4B‑Thinking: −3.78 points vs BF16
- Qwen3‑8B: −1.42 points
- Qwen3‑32B: −0.02 points
- GLM‑4.7‑FP8 (358B): +0.27 points
Long‑context robustness: OSCAR matches BF16 on GLM‑4.7‑FP8 through 128K context in RULER‑NIAH tests.
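The relationship between bits per element and memory reduction can be checked with a few lines of arithmetic. The sketch below also folds in the BF16 sink and recent window from the default layout (S0 = 64, W = 256). Whether the reported 2.28 bits already includes those high‑precision tokens is not stated here, so the function treats 2.28 as the cost of the quantized region only; that is an assumption.

```python
def average_bits(context_len, sink=64, window=256, quant_bits=2.28, full_bits=16.0):
    """Average bits per KV element when sink + window tokens stay at full precision."""
    hi_prec = min(context_len, sink + window)
    low_prec = context_len - hi_prec
    return (hi_prec * full_bits + low_prec * quant_bits) / context_len


for ctx in (1_000, 10_000, 100_000):
    bits = average_bits(ctx)
    print(f"{ctx:>7} tokens: {bits:.3f} bits/element, {16.0 / bits:.2f}x smaller than BF16")
```

Under this assumption the fixed 320 high‑precision tokens matter at short contexts and become negligible at 100K, where the reduction approaches 16 / 2.28.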

Comparisons: naive INT2 without rotations typically fails outright (models produce incoherent output); Hadamard‑only rotations (as in QuaRot) help but are insufficient at INT2; and many per‑channel schemes break paged layouts. OSCAR outperforms these baselines while remaining compatible with SGLang’s OpenAI‑compatible serving API.

Deployment & integration

SGLang integration: OSCAR is implemented within SGLang’s serving stack, keeping standard paged KV‑cache semantics and API compatibility.
Runtime kernels: Fused Triton kernels handle INT2 writes, reads, and rotations. Triton here is the Python‑embedded language and compiler for writing GPU kernels, not the Triton Inference Server.
Precomputed rotations: RotationZoo (ModelScope) publishes rotation matrices and clipping thresholds for supported models (Qwen3‑4B/8B/32B, GLM‑4.7‑FP8, MiniMax‑M2.7), which reduces the calibration burden.
Hardware: H100 is recommended for best performance; A100 is supported. Non‑NVIDIA or lower‑end GPUs may need custom kernel support or will see reduced benefits.
Offline absorption: Value rotations can be baked into model weights offline to eliminate runtime rotation costs if desired.

Limitations, caveats and open questions

Calibration sensitivity: Rotations are computed offline from calibration data. Domain shift (heavy fine‑tuning, unusual token distributions, or niche languages) can reduce effectiveness. Recalibration or model‑specific rotations may be necessary.
Hardware & runtime support: OSCAR depends on efficient fused kernels (Triton). Lower‑end GPUs or non‑NVIDIA stacks may require alternate implementations with lower performance.
Edge cases: Rare‑token prompts, adversarial inputs, or workloads that rely heavily on small attention subvectors may reveal failure modes. Continuous monitoring and conservative rollout are recommended.
Generality: Precomputed rotations cover several popular models, but teams with proprietary or heavily modified models should run their own calibration and validation before production rollout.
Operational complexity: Adding INT2 history increases the need for observability: instrumentation for per‑prompt quality, drift detection after fine‑tuning, and quick rollback paths are essential.

Deployment checklist for engineering teams

Inventory: Confirm model(s) in use and whether they are in RotationZoo. If not, plan a calibration dataset representative of your workload.
Calibrate: Run a short calibration pass to compute query covariances and score‑weighted value covariances. Capture clip percentiles (expect cK≈0.96, cV≈0.92 as starting points).
Validate locally: Apply rotations, Hadamard equalization, and INT2 packing in a staging environment. Compare generation quality on held‑out prompts (including rare tokens) and measure decode latency.
Measure performance: Benchmark KV memory footprint, decode throughput at target contexts (10k, 50k, 100k), and job‑level throughput for your batch sizes on H100/A100.
Rollout strategy: Start with a small percentage of traffic for long‑context workloads. Monitor quality metrics, latency, and GPU memory pressure; use a fast rollback path to BF16 if needed.
Operationalize monitoring: Track per‑prompt quality deltas, shifts in output distributions or task scores relative to a BF16 reference (for example, KL divergence), and error rates. Recalibrate if model weights change significantly or if traffic distribution shifts.
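One way to turn the validation and rollback steps into an automated gate is sketched below. It compares per‑task scores and the KL divergence between next‑token distributions from a BF16 run and an INT2 run on the same prompts. The function names and the thresholds (`max_drop=2.0` points, `max_kl=0.05`) are hypothetical placeholders to be tuned per workload.

```python
import numpy as np


def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mean_kl(ref_logits, test_logits):
    """Mean KL(ref || test) over positions; both arrays are [positions, vocab]."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return float(np.mean(np.sum(np.exp(lp) * (lp - lq), axis=-1)))


def should_rollback(ref_scores, test_scores, kl, max_drop=2.0, max_kl=0.05):
    """True when any task drops by more than max_drop points or KL exceeds max_kl."""
    drops = {task: ref_scores[task] - test_scores[task] for task in ref_scores}
    return any(drop > max_drop for drop in drops.values()) or kl > max_kl


# Toy inputs standing in for real BF16 and INT2 evaluation runs.
rng = np.random.default_rng(0)
bf16_logits = rng.standard_normal((8, 100))
int2_logits = bf16_logits + 0.05 * rng.standard_normal((8, 100))
kl = mean_kl(bf16_logits, int2_logits)
print("mean KL:", kl)
print("rollback:", should_rollback({"math": 80.0, "code": 61.0},
                                   {"math": 79.1, "code": 60.4}, kl))
```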

What to monitor

Per‑task quality delta vs BF16 on a representative validation set.
Token‑level anomalies (spikes in perplexity or repetition) indicative of attention corruption.
Memory and throughput trends across contexts and batch sizes.
Drift after model updates or fine‑tuning; consider scheduled recalibration.
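For the token‑level repetition signal, a cheap proxy is the share of repeated n‑grams in each generated sequence. The helper below is a minimal sketch with a hypothetical name; the n‑gram length and any alerting threshold are assumptions to calibrate against BF16 baselines.

```python
from collections import Counter


def repeated_ngram_fraction(token_ids, n=4):
    """Fraction of n-grams in the sequence that duplicate an earlier n-gram."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


healthy = list(range(20))     # no repeated 4-grams
looping = [1, 2, 3, 4] * 5    # degenerate loop
print(repeated_ngram_fraction(healthy), repeated_ngram_fraction(looping))
```

Tracking this fraction per request, split by BF16 and INT2 traffic, gives an early warning that does not require reference answers.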

Key questions and quick answers

How aggressive is OSCAR’s compression and how much accuracy is lost?

OSCAR compresses KV storage to ~2.28 bits per element (~8× reduction vs BF16). Accuracy loss is modest on the evaluated models and shrinks with model size: −3.78 points on Qwen3‑4B‑Thinking, −1.42 on Qwen3‑8B, and −0.02 on Qwen3‑32B, averaged across the benchmark suite. GLM‑4.7‑FP8 averaged +0.27 points relative to BF16.

Does OSCAR break paged KV‑cache serving or prefix caching?

No. The bit‑reversal grouping keeps INT2 groups compatible with standard paged KV layouts, so paged attention and prefix caching continue to work unchanged.
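For readers unfamiliar with the term, the sketch below shows the textbook bit‑reversal permutation and what happens when the permuted channel order is cut into fixed‑size groups. The exact permutation and grouping OSCAR uses are not specified here, so this only illustrates the general construction; the function name is hypothetical.

```python
def bit_reverse_permutation(n):
    """Indices 0..n-1 with their binary representation reversed; n is a power of two >= 2."""
    bits = n.bit_length() - 1
    if n < 2 or (1 << bits) != n:
        raise ValueError("n must be a power of two >= 2")
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


perm = bit_reverse_permutation(8)
print(perm)  # [0, 4, 2, 6, 1, 5, 3, 7]

# Cutting the permuted order into groups of 4 channels:
group_size = 4
groups = [perm[i:i + group_size] for i in range(0, len(perm), group_size)]
print(groups)  # [[0, 4, 2, 6], [1, 5, 3, 7]]
```

Because the permutation is fixed and identical for every token, each token still occupies the same number of bytes at the same offsets within a page, which is what a paged allocator needs.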

What infrastructure is required to run OSCAR?

Integrate via SGLang with fused Triton kernels for best performance; H100 GPUs are recommended, A100 supported. Precomputed rotations in RotationZoo reduce calibration work for common models.

Is the rotation offline or adaptive?

OSCAR’s rotations and thresholds are computed offline from calibration data. Extending the idea to adaptive or online rotations is an open area for research and engineering.

Where to get the code and resources

Rotation matrices and thresholds for supported models are available in RotationZoo (ModelScope). Implementation details, fused kernels, and experiments are described in the OSCAR repository and the accompanying paper (see arXiv:2605.17757 and the FutureMLS‑Lab / OSCAR GitHub). These resources include examples for calibration, the default parameters (GK=64, S0=64, W=256), and guidance on absorbing value rotations into model weights.

For teams running long‑context workloads, OSCAR is an immediate lever to reduce KV memory footprint and improve throughput without rewriting client code or abandoning prefix caching. The pragmatic next step is a targeted calibration and staging benchmark to quantify your cost and latency savings under your workload.