Disaggregated LLM Inference on AWS with llm-d: Faster, Cheaper Scaling for AI Agents

TL;DR: Disaggregated inference separates LLM work into a compute-heavy “prefill” stage and a memory-bound “decode” stage so you can match hardware and scaling to each phase. The open-source llm-d project, built on vLLM and integrated with AWS networking (EFA) and NVIDIA’s NIXL/UCX stack, can increase tokens/sec by as much as ~70% on long-context, high-concurrency workloads—at the cost of added operational complexity. Best fit: large models, long contexts, high prefix reuse, or Mixture-of-Experts (MoE) models.

Who should read this: ML platform engineers, CTOs and product leaders running agentic or multi-turn LLM workloads, and SREs evaluating inference cost/throughput tradeoffs for production AI.

Why disaggregated inference matters for AI agents

As AI agents move beyond single-question replies into multi-step reasoning and tool orchestration, token volumes explode and compute patterns become bursty and uneven. That means raw model quality often isn’t the limiting factor—efficiency of inference is. Treating every inference request the same wastes GPU cycles and memory bandwidth.

“LLM workloads now drive far more tokens and complex reasoning chains than single-shot replies, so inference efficiency is the new bottleneck for scale.”

The fix is surprisingly simple in concept: stop treating prefill and decode as the same thing.

Prefill vs decode: the plain-English explainer

Think of LLM inference like a restaurant. Prefill is the kitchen: you prep many plates in parallel (compute-heavy), building the model’s internal context (the KV cache—key-value cache). Decode is the server: it walks out plates one-by-one to customers, generating tokens autoregressively (memory-bandwidth-bound). If cooks and servers are the same people, you get bottlenecks: chefs are idle waiting to serve, and servers clog the kitchen. Separating roles improves throughput and reduces wasted labor.
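In toy form, the two phases look like this. The sketch below is illustrative Python, not vLLM internals: it fakes attention with string placeholders so the access pattern is visible — prefill builds the entire KV cache in one parallelizable pass, while each decode step rescans the whole cache and appends a single entry.

```python
# Toy illustration of the prefill/decode split (not real model code).
# "Attention" is faked with string placeholders; the point is the access
# pattern: prefill touches every prompt token once, decode reuses the cache.

def prefill(prompt_tokens):
    """Compute-heavy: build a KV entry for every prompt token in one pass."""
    return [(f"k_{t}", f"v_{t}") for t in prompt_tokens]  # parallelizable

def decode(kv_cache, steps):
    """Memory-bound: each new token reads the whole cache, appends one entry."""
    generated = []
    for i in range(steps):
        context_size = len(kv_cache)            # every step scans all cached KV
        tok = f"tok{i}(ctx={context_size})"
        generated.append(tok)
        kv_cache.append((f"k_{tok}", f"v_{tok}"))
    return generated

cache = prefill(["The", "quick", "brown", "fox"])
out = decode(cache, steps=3)
print(out)         # each decode step sees a context one token larger
print(len(cache))  # 4 prompt entries + 3 generated entries
```

Run on the same hardware, the first function wants lots of parallel compute; the second wants fast repeated reads of a growing cache — which is exactly why separating them pays off.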

Definitions on first use:

  • KV (key-value) cache: model context stored so decodes can reference prior tokens without re-running the whole sequence.
  • TTFT — time-to-first-token: latency until the first token is produced.
  • RDMA: remote direct memory access, for low-latency, high-bandwidth transfers.
  • EFA — Elastic Fabric Adapter: AWS’s RDMA-capable network interface.
  • UCX — Unified Communication X: an open-source communication framework used for RDMA-style transfers.
  • NIXL: NVIDIA Inference Xfer Library for efficient point-to-point KV transfers.
  • NCCL: NVIDIA’s collective communication library (optimized for training, not point-to-point inference transfers).
  • MoE: Mixture-of-Experts models, which route tokens sparsely to expert sub-networks.

How llm-d + vLLM + AWS fit together

llm-d is a Kubernetes-native orchestration layer that extends vLLM (the inference engine) with production scheduling, cache-aware routing, and multi-node serving patterns. On AWS, llm-d integrates with EFA-enabled instances and a UCX/libfabric transport stack. NIXL is used to move KV blocks directly between nodes over RDMA when needed. The combination lets you:

  • Run dedicated prefill servers and dedicated decode servers, each tuned for its workload.
  • Use cache-aware scheduling to route requests to the node that already holds the needed KV blocks (reducing recomputation).
  • Perform low-latency, point-to-point KV transfers with NIXL over UCX/libfabric using EFA for high throughput.
  • Support MoE and expert-parallel patterns that benefit from separating expert routing from autoregressive decoding.
  • Implement tiered prefix caching—offloading less-recent KV state from GPU memory to CPU RAM or disk to expand effective cache capacity.
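The tiered prefix caching in the last bullet can be sketched as a two-tier LRU: hot KV blocks live in a small “GPU” tier, and least-recently-used blocks spill to a larger “CPU” tier. The class below is a hedged toy model — the capacities, tier names, and promote-on-hit policy are assumptions for illustration, not llm-d’s actual mechanism.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier prefix cache: hot KV blocks in a small 'GPU' tier,
    least-recently-used blocks spilled to a larger 'CPU' tier.
    Illustrative only; capacities and policies are assumptions."""

    def __init__(self, gpu_blocks=2, cpu_blocks=4):
        self.gpu = OrderedDict()  # block_id -> kv payload (most recent last)
        self.cpu = OrderedDict()
        self.gpu_blocks, self.cpu_blocks = gpu_blocks, cpu_blocks

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_blocks:      # spill LRU block down a tier
            victim, payload = self.gpu.popitem(last=False)
            self.cpu[victim] = payload
            while len(self.cpu) > self.cpu_blocks:  # evict oldest from CPU tier
                self.cpu.popitem(last=False)

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.cpu:                    # hit in CPU: promote back up
            kv = self.cpu.pop(block_id)
            self.put(block_id, kv)
            return kv, "cpu"
        return None, "miss"                         # miss: recompute via prefill

cache = TieredKVCache()
for b in ["a", "b", "c"]:
    cache.put(b, f"kv-{b}")
print(cache.get("a"))  # "a" was spilled to the CPU tier when "c" arrived
```

The effective cache capacity becomes gpu_blocks + cpu_blocks instead of gpu_blocks alone — the same idea llm-d applies with real GPU memory, CPU RAM, and disk.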

At a glance:

  • vLLM runs model inference and manages KV cache.
  • llm-d orchestrates pods, performs cache-aware scheduling, and coordinates KV transfers.
  • NIXL + UCX/libfabric move KV blocks between nodes with minimal overhead.
  • AWS EFA supplies the RDMA-capable networking necessary for fast point-to-point transfers.
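Cache-aware scheduling can be approximated by hashing prompt prefixes at block granularity and routing each request to the pod that already owns a matching prefix. The router below is a simplified stand-in: the block size, hash scheme, and round-robin fallback are assumptions for illustration, not llm-d’s real algorithm.

```python
import hashlib

class PrefixAwareRouter:
    """Toy cache-aware router: requests sharing a prompt prefix go to the pod
    that already holds the matching KV blocks. Simplified stand-in only."""

    def __init__(self, pods, block_tokens=16):
        self.pods = pods
        self.block_tokens = block_tokens  # tokens per hashed prefix block
        self.prefix_owner = {}            # prefix hash -> pod name
        self.rr = 0                       # round-robin fallback cursor

    def _hash(self, tokens, n_blocks):
        prefix = " ".join(tokens[: n_blocks * self.block_tokens])
        return hashlib.sha1(prefix.encode()).hexdigest()

    def route(self, tokens):
        # Look for the longest already-cached prefix, longest block run first.
        for n in range(len(tokens) // self.block_tokens, 0, -1):
            h = self._hash(tokens, n)
            if h in self.prefix_owner:
                return self.prefix_owner[h], "cache-hit"
        # No pod holds this prefix: round-robin, then record ownership.
        pod = self.pods[self.rr % len(self.pods)]
        self.rr += 1
        full_blocks = len(tokens) // self.block_tokens
        if full_blocks:
            self.prefix_owner[self._hash(tokens, full_blocks)] = pod
        return pod, "cache-miss"

router = PrefixAwareRouter(["decode-0", "decode-1"], block_tokens=2)
sys_prompt = ["you", "are", "a", "helpful", "agent", "now"]
r1 = router.route(sys_prompt + ["q1"])  # first request: ('decode-0', 'cache-miss')
r2 = router.route(sys_prompt + ["q2"])  # shared system prompt: same pod, cache-hit
print(r1, r2)
```

The payoff is the second request: because it lands on the pod that already holds the shared system-prompt blocks, that prefill work is never repeated.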

The AWS-enabled llm-d image is available at ghcr.io/llm-d/llm-d-aws:v0.5.1, and llm-d integrates with Amazon EKS or SageMaker HyperPod for orchestration.

Benchmarks: the numbers and what they mean

Benchmarks are workload-dependent, but one representative test shows where disaggregation shines. On an ml.p6-b200.48xlarge fleet running a GPT-OSS model, the test compared a single-node vLLM layout against an llm-d disaggregated layout. Setup details:

  • Model family: GPT-OSS.
  • Single-node baseline: vLLM with tensor-parallel degree tp=4.
  • Disaggregated layout: four prefill pods at tp=1 and one decode pod at tp=4.
  • Test case: long-input/long-output prompts (input = 1,024 tokens, output = 1,024 tokens).
  • Concurrency ramp: up to 128 concurrent requests.

Result: as concurrency increased, the disaggregated path delivered up to roughly a 70% increase in tokens-per-second versus the single-node baseline. That gain reflects better GPU utilization across the two phases and fewer repeated prefill computations when the same prefixes are reused.

Important benchmarking notes for reproducibility: always include warm-up runs, report p50/p95 TTFT, tokens/sec, cache hit rate, instance types and counts, tensor parallel degrees, and standard deviation across runs. Compare both long-context and short single-shot scenarios—disaggregation favors the former.
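A minimal reporting harness along those lines, using only the standard library and synthetic latency numbers, might look like this:

```python
import statistics

def summarize(ttft_ms, tokens, wall_s, warmup=3):
    """Summarize one benchmark run as the text recommends: drop warm-up
    requests, then report p50/p95 TTFT, tokens/sec, and spread."""
    measured = sorted(ttft_ms[warmup:])   # exclude warm-up runs from stats
    p50 = statistics.median(measured)
    p95 = measured[min(len(measured) - 1, int(0.95 * len(measured)))]
    return {
        "p50_ttft_ms": p50,
        "p95_ttft_ms": p95,
        "tokens_per_sec": tokens / wall_s,
        "stdev_ttft_ms": statistics.stdev(measured),
    }

# Synthetic example: 13 requests, the first 3 are warm-up (note the slow
# cold-start TTFTs) and one later outlier that p95 should surface.
ttfts = [900, 850, 820, 210, 205, 220, 215, 230, 212, 208, 225, 600, 218]
report = summarize(ttfts, tokens=131_072, wall_s=64.0)
print(report)
```

Note how the warm-up exclusion keeps cold-start latencies from polluting p50, while the p95 still catches the 600 ms outlier — exactly the tail behavior a median-only report would hide.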

Operational tradeoffs and prerequisites

Disaggregation isn’t a free lunch. It requires:

  • EFA-enabled instances and driver support on AWS, plus UCX/libfabric stack tuning.
  • Kubernetes expertise (EKS or SageMaker HyperPod), Helm/helmfile for deployments, and gateway configuration (NGINX/Istio/Gateway API extensions for inference routing).
  • Robust observability for KV transfer latency, cache hit rates, pod queue depth, and EFA bandwidth utilization.
  • Security and compliance review: KV transfers move model context across nodes—consider data-in-flight protections and VPC configuration.

When it’s a fit:

  • Large models, long input sequences (≥512–1,024 tokens), and workloads with significant prefix reuse.
  • Sparse MoE architectures that benefit from expert parallelism.
  • High concurrency where GPU utilization without disaggregation becomes inefficient.

When it’s not:

  • Mostly short, single-shot queries where prefill overhead is minimal.
  • Teams without the infra engineering bandwidth to manage RDMA/EFA/UCX and multi-node orchestration.
  • Cost-sensitive environments with low overall token volume—extra infra complexity may not pay off.

Monitoring and metrics to treat as mission-critical

Key signals to collect and alert on:

  • Time-to-first-token (TTFT) — p50/p95. Alert if p95 increases unexpectedly.
  • Tokens/sec and tokens/sec per dollar for cost-efficiency analysis.
  • KV cache hit rate — low hit rates mean more costly transfers or recomputation.
  • KV transfer latency and EFA bandwidth utilization — watch for network saturation or errors.
  • GPU utilization and memory pressure on prefill and decode hosts.
  • Pod queue depth and request retry/error counts.
  • p95/p99 decode latency to track tail behavior affecting UX.
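These signals can feed simple alert rules. The sketch below uses illustrative thresholds — the 500 ms TTFT SLO, 60% cache-hit floor, and 85% EFA utilization ceiling are placeholders to tune against your own SLOs, not recommendations:

```python
def check_alerts(metrics, p95_ttft_slo_ms=500, min_cache_hit=0.60,
                 max_efa_util=0.85):
    """Evaluate the signals above against illustrative thresholds and
    return the names of any that should page."""
    alerts = []
    if metrics["p95_ttft_ms"] > p95_ttft_slo_ms:
        alerts.append("ttft_p95_regression")
    if metrics["kv_cache_hit_rate"] < min_cache_hit:
        alerts.append("low_cache_hit_rate")   # expect transfers/recomputation
    if metrics["efa_bandwidth_util"] > max_efa_util:
        alerts.append("efa_saturation")       # KV transfers about to queue
    return alerts

fired = check_alerts({
    "p95_ttft_ms": 620,          # above the 500 ms SLO
    "kv_cache_hit_rate": 0.72,   # healthy
    "efa_bandwidth_util": 0.91,  # near link saturation
})
print(fired)  # ['ttft_p95_regression', 'efa_saturation']
```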

Decision checklist: should you pilot disaggregation?

  • Pilot if:

    You handle large monthly token volumes, run many concurrent multi-turn sessions or agents, have frequent prefix reuse, or use MoE/sparse models—and you have basic EKS/SageMaker and networking expertise to manage EFA/UCX.

  • Skip or delay if:

    Your workload is dominated by short, single-shot queries, you lack infra bandwidth to manage RDMA/EFA, or projected token volume is too low for gains to offset added costs.
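The checklist can be encoded as a rough screening function. The thresholds below (1B tokens/month, 64 concurrent requests, 50% prefix reuse) are illustrative starting points, not hard rules:

```python
def should_pilot(monthly_tokens, avg_concurrency, prefix_reuse_rate,
                 is_moe, has_eks_and_efa_expertise):
    """Rough encoding of the decision checklist above.
    Thresholds are illustrative starting points, not hard rules."""
    if not has_eks_and_efa_expertise:
        return False                      # infra bandwidth is a prerequisite
    return (
        monthly_tokens >= 1_000_000_000   # large token volume
        or avg_concurrency >= 64          # many concurrent sessions/agents
        or prefix_reuse_rate >= 0.5       # frequent shared prefixes
        or is_moe                         # sparse MoE architectures
    )

print(should_pilot(5_000_000_000, 128, 0.7, False, True))  # True: pilot
print(should_pilot(50_000_000, 8, 0.1, False, True))       # False: skip/delay
```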

Pilot playbook — three practical steps

  1. Single-node baseline: Measure TTFT, tokens/sec, cache hit rate, and p95/p99 latency on a representative workload using vLLM on EKS or SageMaker HyperPod. This is your control.
  2. Two-node prototype: Deploy llm-d with one or two prefill pods and a decode pod. Use the same model and prompts as the baseline, enable KV-aware routing, and measure the delta in tokens/sec and TTFT. Tune tensor-parallel degrees (tp) on prefill vs decode.
  3. Scale test and cost analysis: Ramp concurrency to production-like levels, measure EFA bandwidth and KV transfer latencies, and run a cost-per-million-tokens estimate. If the throughput gains outweigh the incremental infra and operational costs, expand the rollout.

Cost/benefit framework (qualitative)

Estimate whether disaggregation pays by comparing incremental throughput gains against additional infrastructure and engineering costs. A basic formula:

Expected savings ≈ (token volume ÷ baseline tokens/sec) × GPU cost per unit time × (1 − 1 / throughput_gain) − additional infra & operational costs

Here throughput_gain is disaggregated tokens/sec divided by baseline tokens/sec, so the first factor is your baseline GPU spend for the token volume and (1 − 1/throughput_gain) is the fraction of that spend you no longer need.

Practical rule of thumb: if your monthly token volume and concurrency amplify the throughput gains such that per-token cost decreases after accounting for EFA instances and orchestration overhead, disaggregation is worth piloting.
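One way to operationalize that rule of thumb: compute baseline GPU spend for your token volume, take the fraction saved at a given throughput gain, and subtract the overhead. The fleet rate, token volume, and overhead figures below are assumptions for illustration, not measured AWS prices:

```python
def expected_monthly_savings(monthly_tokens, baseline_tps, gpu_cost_per_hour,
                             throughput_gain, added_monthly_costs):
    """Baseline GPU spend for the token volume, times the fraction of that
    spend a throughput gain removes, minus the added overhead.
    All figures are illustrative assumptions."""
    baseline_gpu_hours = monthly_tokens / baseline_tps / 3600.0
    baseline_cost = baseline_gpu_hours * gpu_cost_per_hour
    return baseline_cost * (1.0 - 1.0 / throughput_gain) - added_monthly_costs

# Illustrative: 10B tokens/month, 2,000 tok/s baseline, $40/hr fleet rate,
# 1.7x throughput gain (the ~70% benchmark above), $5k/month added overhead.
savings = expected_monthly_savings(10_000_000_000, 2_000, 40.0, 1.7, 5_000)
print(round(savings))  # positive here, so the pilot would pencil out
```

At low token volumes the fixed overhead dominates and the result goes negative — which is the quantitative version of the “skip or delay” cases above.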

Final thoughts and next steps

Disaggregated inference is an operational lever that unlocks real throughput and latency improvements for AI agents and retrieval-augmented flows. When your use cases involve long contexts, frequent prefix reuse, or sparse MoE models, separating prefill and decode and using cache-aware routing with RDMA-enabled KV transfers can yield substantial wins.

If you want a compact runbook or a tailored decision checklist for a specific maturity level—“small team pilot,” “platform team proof-of-concept,” or “enterprise rollout”—note which level you’re targeting and share a short profile of your workload (average context length, concurrency, model family), and a focused playbook can follow.