P-EAGLE: Parallel Speculative Decoding Boosts Throughput, Eliminates Drafter Bottleneck

P-EAGLE: Parallel speculative decoding that eliminates the drafter bottleneck

Speculative decoding speeds up generation by letting a small, fast model guess likely next tokens while the large target model only verifies them. P-EAGLE makes that guessing step parallel: instead of K sequential drafter steps, it predicts K tokens in one pass, which raises throughput while leaving the final outputs unchanged.

What is speculative decoding?

Speculative decoding is a runtime trick: a lightweight drafter proposes candidate tokens and an expensive target model verifies those candidates, reducing the number of full forward passes the heavy model must run. EAGLE and EAGLE-3 are earlier variants that improved drafter accuracy and representation, but kept the drafter autoregressive—one forward pass per proposed token.

Definitions on first use: K = number of speculative tokens proposed at once. OTPS = output tokens per second (a throughput metric). vLLM = an inference server that supports speculative decoding. FP8 = 8-bit floating-point quantization used in some production benchmarks.

How P-EAGLE parallelizes drafting

P-EAGLE (Parallel-EAGLE) reframes drafting from a chain of K small steps into a single forward pass that produces K draft tokens simultaneously. It does this with two learned placeholders:

embmask — a learned embedding that stands in where previous-token embeddings would be unknown for future positions.
hshared — a shared hidden-state vector that supplies a consistent context for those missing positions.
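To make the role of the two placeholders concrete, here is a minimal sketch of how the inputs to a single parallel drafting pass might be assembled. The function name, the plain-list vector type, and the choice to give position 0 the real (known) embedding and hidden state are illustrative assumptions; the article does not describe the drafter's internals at this level of detail.

```python
from typing import List, Tuple

Vector = List[float]


def build_parallel_draft_inputs(
    last_token_emb: Vector,
    last_hidden: Vector,
    emb_mask: Vector,
    h_shared: Vector,
    k: int,
) -> List[Tuple[Vector, Vector]]:
    """Return k (embedding, hidden-state) pairs for one drafter forward pass.

    Hypothetical layout: position 0 sees the real previous-token embedding and
    hidden state; the remaining k - 1 positions, whose previous tokens are not
    known yet, receive the learned placeholders instead.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    inputs = [(last_token_emb, last_hidden)]
    inputs += [(emb_mask, h_shared) for _ in range(k - 1)]
    return inputs


# Toy 2-dimensional vectors, for illustration only.
inputs = build_parallel_draft_inputs(
    last_token_emb=[0.1, 0.2],
    last_hidden=[0.3, 0.4],
    emb_mask=[0.0, 0.0],
    h_shared=[1.0, 1.0],
    k=4,
)
```

Because all K positions are available up front, the drafter can process them in one batch rather than waiting for each drafted token before starting the next.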

Think of drafting as sticky notes placed ahead of the writer. Traditional drafting writes each sticky note one by one and asks the editor to check each. P-EAGLE fills a stack of sticky notes in one go; the editor (the target model) still checks each note, so the final text is unchanged.

Acceleration without changing model behavior: every draft token is still verified by the target model, so final outputs remain identical to native autoregressive output.
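The "same outputs" property can be illustrated with a toy greedy draft-and-verify loop. This is a simplified sketch, not the P-EAGLE or vLLM implementation: the models are hypothetical integer functions, the drafter is called sequentially for clarity (P-EAGLE would produce the K guesses in one pass), and verification is written as one target call per position, whereas a real system scores all draft positions in a single batched target pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def autoregressive(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative(target: NextToken, drafter: NextToken,
                prompt: List[Token], n_new: int, k: int) -> List[Token]:
    """Greedy draft-and-verify: only tokens the target agrees with are kept."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # Draft k guesses ahead of the verified sequence.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        # Verify left to right; always append the target's own token.
        for guess in draft:
            correct = target(seq)
            seq.append(correct)
            if correct != guess or len(seq) >= goal:
                break  # first mismatch (or length limit) ends this round
        else:
            # Every guess was accepted: the target also yields a bonus token.
            seq.append(target(seq))
    return seq


# Hypothetical toy models: the drafter is wrong at every fourth position.
def toy_target(seq: List[Token]) -> Token:
    return (31 * sum(seq) + len(seq)) % 50


def toy_drafter(seq: List[Token]) -> Token:
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 50


plain = autoregressive(toy_target, [1, 2, 3], 20)
spec = speculative(toy_target, toy_drafter, [1, 2, 3], 20, k=5)
```

Every token that reaches the output was produced by the target, so a poor drafter can slow the loop down but cannot change the result.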

Why this matters for AI for business

Speculative decoding is a practical lever for reducing inference cost and increasing tokens-per-dollar. P-EAGLE lowers drafter latency by making drafter cost roughly independent of K: more aggressive speculation no longer multiplies drafter forward passes. For services where throughput and cost-per-response matter (chat systems, automated code generation, bulk content synthesis), P-EAGLE lets teams push deeper speculation and convert that into fewer GPUs or higher capacity on the same cluster.

Example impact: if a baseline system produces ~300 OTPS and a P-EAGLE-accelerated setup produces ~1,200 OTPS for the same model under similar conditions, you can roughly quarter the number of GPU instances required to meet a fixed throughput target (other factors like latency tail and acceptance rate apply).
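The sizing arithmetic behind that example can be sketched in a few lines. The 12,000 OTPS fleet target is a hypothetical figure chosen for illustration, and the calculation assumes throughput scales linearly with instance count, which real deployments should verify.

```python
import math


def instances_needed(target_otps: float, otps_per_instance: float) -> int:
    """Instances required to hit a throughput target, assuming linear scaling."""
    if otps_per_instance <= 0:
        raise ValueError("otps_per_instance must be positive")
    return math.ceil(target_otps / otps_per_instance)


fleet_target = 12_000  # hypothetical aggregate OTPS requirement
baseline_fleet = instances_needed(fleet_target, 300)    # ~300 OTPS per instance
p_eagle_fleet = instances_needed(fleet_target, 1_200)   # ~1,200 OTPS per instance
```

Under these assumptions the fleet shrinks from 40 instances to 10, which is the "roughly quarter" figure in the example.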

Benchmarks — what the numbers mean

Benchmarks were run on Qwen3-Coder-30B-A3B-Instruct using NVIDIA B200 GPUs with FP8 quantization and report OTPS as the metric. Key headline results:

P-EAGLE vs EAGLE-3: about 1.05×–1.69× throughput improvement across MT-Bench, HumanEval, and SPEED-Bench.
P-EAGLE vs no-speculation baseline: in several workloads 2.1×–4.2× higher OTPS depending on concurrency and request mix.

Concrete examples (concurrency = 1, K = 11):

HumanEval: P-EAGLE ≈ 1,167 OTPS vs EAGLE-3 ≈ 955 OTPS; baseline ≈ 294 OTPS.
SPEED-Bench (Code): P-EAGLE ≈ 873 OTPS vs EAGLE-3 ≈ 612 OTPS; baseline ≈ 294 OTPS.
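As a quick sanity check, the speedup ratios implied by these two data points can be computed directly from the OTPS figures quoted above. Only the dictionary layout and the helper name are assumptions here.

```python
# OTPS figures as reported above (concurrency = 1, K = 11).
reported = {
    "HumanEval": {"p_eagle": 1167, "eagle3": 955, "baseline": 294},
    "SPEED-Bench (Code)": {"p_eagle": 873, "eagle3": 612, "baseline": 294},
}


def speedups(row: dict) -> dict:
    """Throughput ratios of P-EAGLE against EAGLE-3 and the no-speculation baseline."""
    return {
        "vs_eagle3": round(row["p_eagle"] / row["eagle3"], 2),
        "vs_baseline": round(row["p_eagle"] / row["baseline"], 2),
    }


ratios = {name: speedups(row) for name, row in reported.items()}
```

Both workloads land inside the headline ranges: about 1.22× and 1.43× over EAGLE-3, and about 3.97× and 2.97× over the baseline.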

P-EAGLE keeps its edge at higher concurrency settings (benchmarks reported scaling up to c = 128). These numbers show two things: (1) parallel drafting directly reduces drafter-side latency, and (2) overall gains depend on the acceptance rate (how many draft tokens the target model accepts without correction) and workload shape.

Experimental notes to consider when interpreting results: GPU type, quantization scheme (FP8), model family, request concurrency, token length, and adversarial or long-stream request patterns all affect OTPS and acceptance rates. Benchmark figures are useful directional evidence but should be validated on representative production traffic.

Deploying P-EAGLE on SageMaker JumpStart and vLLM

AWS has published pre-trained P-EAGLE drafter heads in SageMaker JumpStart for immediate use. Supported models at launch include GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B-Instruct, and Gemma-4-31B-IT. Two deployment paths:

One-click JumpStart deployment that provisions an endpoint with P-EAGLE drafter heads attached.
vLLM integration for custom deployments: enable parallel drafting by setting the SM_VLLM_SPECULATIVE_CONFIG environment variable. Example configuration: {"method":"eagle3","num_speculative_tokens":3,"parallel_drafting":true}
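Hand-written JSON in an environment variable is easy to get wrong (smart quotes are a common culprit), so one option is to generate the value programmatically. The helper below is a hypothetical convenience function; only the variable name and the three configuration keys come from the example above, and how the resulting mapping is passed to your endpoint depends on your deployment tooling.

```python
import json
from typing import Dict


def speculative_config_env(
    num_speculative_tokens: int,
    parallel_drafting: bool = True,
    method: str = "eagle3",
) -> Dict[str, str]:
    """Build the environment mapping for SM_VLLM_SPECULATIVE_CONFIG."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    config = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
        "parallel_drafting": parallel_drafting,
    }
    # Compact separators reproduce the single-line form shown in the example.
    return {"SM_VLLM_SPECULATIVE_CONFIG": json.dumps(config, separators=(",", ":"))}


env = speculative_config_env(num_speculative_tokens=3)
```

Round-tripping the value through json.loads before deploying is a cheap way to confirm that it parses.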

Operational tip: SageMaker real-time endpoints incur charges while running—delete endpoints when idle to avoid unnecessary costs.

When to use P-EAGLE — practical guidance

How to evaluate whether P-EAGLE is a good fit:

Measure baseline OTPS and latency tail (p50/p95/p99) for representative traffic.
Estimate acceptance rate sensitivity as you vary K: acceptance rate typically declines as K grows; find the sweet spot where OTPS improves without exploding verification overhead.
Prototype with JumpStart drafter heads or train a drafter head for your model and dataset if you need custom behavior or private data handling.
Test with real traffic including long-context sessions and adversarial patterns—speculative gains can shrink if verification frequency rises.
Monitor memory and GPU utilization: increasing K raises candidate storage and verification volume; ensure hardware sizing matches peak loads.
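When sweeping K, a rough analytical model helps set expectations before running benchmarks. The sketch below uses the standard simplification that each draft token is accepted independently with probability alpha, which gives an expected (1 - alpha^(K+1)) / (1 - alpha) tokens per verification step, including the token the target contributes itself. This independence assumption is not from the article, and since acceptance tends to fall at later draft positions, measured values will usually be lower.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha; the target always contributes one token of its own.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Sweep K at a hypothetical acceptance rate of 0.8.
sweep = {k: expected_tokens_per_step(0.8, k) for k in (1, 3, 5, 7, 11)}
```

Each additional draft position adds only alpha^K expected tokens, so the curve flattens as K grows. With a parallel drafter the extra positions are cheap to draft, but they still consume verification work.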

Training note: P-EAGLE drafter heads are trained on long sequences (supporting up to ~20K tokens) to avoid the acceptance-rate degradation that arises from short-sequence training. Matching training length to long production contexts helps acceptance rates hold up when your traffic includes long sessions.

Limitations, risks, and open questions

P-EAGLE is not a free lunch. Important considerations:

Verification still requires the full target model. If the acceptance rate drops, each target pass yields fewer accepted tokens, so more target passes are needed per output token and the gains erode.
K has diminishing returns: at very large K, later draft positions tend to be accepted less often, so the extra verification work can negate the throughput benefit.
Quantization and hardware matter: FP8 performance on NVIDIA B200 GPUs was used for published benchmarks; other setups may deliver different speedups.
Integration complexity: teams using retrieval-augmented generation (RAG), safety filters, or streaming APIs should validate interactions—parallel drafting can complicate token-level retrieval or real-time safety checks.
Operational cost: producing a custom drafter head requires training and validation work; include that engineering cost in ROI calculations.

Checklist for engineering and product teams

Establish baseline OTPS and latency SLAs.
Run a JumpStart P-EAGLE pilot on representative traffic (include adversarial and long-context samples).
Measure acceptance rate vs K and compute tokens-per-dollar improvement at target concurrency.
Validate interactions with RAG, safety pipelines, and streaming clients.
Plan endpoint lifecycle to avoid idle cloud costs and monitor 99th-percentile latency.
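For the tokens-per-dollar item in the checklist, the calculation is simple enough to script alongside your benchmark runs. The hourly price below is a placeholder, not a real instance price, and the OTPS inputs reuse the HumanEval figures reported earlier; substitute your own measurements and pricing.

```python
def tokens_per_dollar(otps: float, hourly_cost_usd: float) -> float:
    """Output tokens produced per dollar of instance time."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly_cost_usd must be positive")
    return otps * 3600 / hourly_cost_usd


HOURLY_COST_USD = 10.0  # hypothetical placeholder price per instance-hour

baseline_tpd = tokens_per_dollar(294, HOURLY_COST_USD)   # reported baseline OTPS
p_eagle_tpd = tokens_per_dollar(1167, HOURLY_COST_USD)   # reported P-EAGLE OTPS
improvement = p_eagle_tpd / baseline_tpd
```

At a fixed price the improvement equals the throughput ratio, so the price only matters when comparing across instance types.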

Key takeaways and questions

What is the core advantage of P-EAGLE?

It predicts multiple speculative tokens in a single drafter forward pass by using learned placeholders (embmask and hshared), removing the per-token serial drafting overhead.

How much faster is it in practice?

Benchmarks report roughly 1.05×–1.69× speedup versus EAGLE-3 and roughly 2.1×–4.2× higher throughput versus no-speculation baselines in several workloads, measured as OTPS on Qwen3-Coder-30B-A3B-Instruct with NVIDIA B200 GPUs and FP8.

Does it change model outputs or degrade quality?

No—the target model still verifies every drafted token, so final outputs match native autoregressive output. Quality is preserved; performance gains depend on acceptance rates and workload.

How do I enable P-EAGLE on SageMaker?

Use JumpStart's one-click deployment for supported models or configure vLLM with SM_VLLM_SPECULATIVE_CONFIG (e.g., {"method":"eagle3","num_speculative_tokens":3,"parallel_drafting":true}).

What should I watch out for operationally?

Performance depends on GPU, quantization, and request patterns; P-EAGLE requires trained drafter heads (ideally on long contexts up to 20K tokens) and careful testing with real traffic. Also remember to delete endpoints when not in use to avoid charges.

Next steps

P-EAGLE is a practical systems-level upgrade: small extra model components (the drafter head) for outsized runtime gains. For teams operating high-throughput generative services—chatbots, code assistants, or bulk content pipelines—it’s a low-risk lever to boost tokens-per-second while preserving the exact behavior of the foundation model. Try the one-click SageMaker JumpStart deployment or configure vLLM with parallel_drafting enabled as a short pilot to measure real-world gains for your workload.

Authors and contributors: Andy Peng, Daniel Quang, Siddharth Shah, Dan Ferguson. Acknowledgments to Kyle Ulrich, Hemant Singh, Ashish Khetan, Evan Kravitz, Mike James, Xu Deng, and Kareem Syed-Mohammed.