NVIDIA’s cuda-oxide: compile Rust GPU kernels straight to PTX
TL;DR: cuda-oxide from NVlabs lets teams write GPU kernels in idiomatic Rust and compile them directly to NVIDIA PTX (the GPU assembly-like language). It preserves CUDA’s SIMT model and device intrinsics while exposing Rust features like generics and pattern matching. Early benchmarks are promising, but the project is experimental, Linux-only, and requires pinned toolchains and LLVM 21+.
Why this matters for ML/AI teams
- Productivity: write custom operators and fused kernels in Rust instead of switching to C++/CUDA.
- CI friendliness: single-source builds under Cargo simplify packaging and deployment for Rust-first stacks.
- Performance potential: tuned kernels show real throughput (example: gemm_sol reached 868 TFLOPS on a B200—about 58% of cuBLAS Speed‑of‑Light).
- Tradeoffs: experimental status, pinned Rust nightly, LLVM and llc dependencies, and NVIDIA-centric output mean extra ops work and vendor lock-in risk.
What cuda-oxide actually is
cuda-oxide is an experimental rustc codegen backend from NVlabs that compiles Rust device functions directly into NVIDIA PTX. A few quick definitions up front:
- PTX — NVIDIA’s low-level GPU assembly-like language (device code format).
- NVPTX / llc — LLVM’s backend for producing NVPTX/PTX assembly; llc is the tool cuda-oxide uses to emit PTX.
- MIR — Rust’s Mid-level Intermediate Representation used inside rustc; cuda-oxide reads Stable MIR.
- MLIR — a modular compiler IR framework; cuda-oxide uses a Rust-native, MLIR-like IR framework called Pliron.
- SIMT — Single Instruction, Multiple Threads, the CUDA execution model.
- TMA — Tensor Memory Accelerator features for tensor cores.
- GEMM — General Matrix Multiply, often used as a performance benchmark for BLAS/cuBLAS.
NVlabs frames the effort as “bringing CUDA into Rust”—expressing CUDA’s SIMT model, device intrinsics, and kernel semantics natively in Rust rather than forcing Rust to mimic CUDA’s model.
How the pipeline works (high level)
Think of Pliron as an intermediate dialect layer sitting between Rust and LLVM: it keeps Rust semantics intact while speaking enough LLVM to let the standard toolchain finish the job.
- Rust source → rustc frontend → Stable MIR (via rustc_public)
- Stable MIR → Pliron dialects (MIR-like / LLVM-like / NVVM-like)
- Pliron → textual LLVM IR → llc (LLVM’s NVPTX backend) → PTX
Build tooling is Cargo-native: add `#[kernel]` to device functions and use the provided `cargo-oxide` subcommand to drive build, run, debug, and pipeline tracing. A single Cargo-driven build produces a host binary and a companion `.ptx` file.
Minimal kernel example
A tiny single-source illustration:

```rust
#[kernel]
fn vec_add(a: *const f32, b: *const f32, out: *mut f32, n: usize) {
    // One element per thread; guard against threads past the end.
    let i = thread_index();
    if i < n {
        unsafe { *out.add(i) = *a.add(i) + *b.add(i) };
    }
}
```
Build and inspect the pipeline with:

```shell
cargo oxide build
cargo oxide pipeline vec_add
```
Developer ergonomics: what Rust features work on-device?
- Generics via monomorphization
- Closures with scalar captures (they get scalarized)
- User structs and enums, pattern matching
- Rich GPU intrinsics: thread indexing, warp ops, shared memory, barriers, atomics, and tensor features (TMA)
- Lazy compilation of device functions from dependencies by reading Stable MIR from .rlib metadata
That means much of what Rust developers value—type safety, generics, and expressive control flow—can appear inside kernels without separate CUDA source files.
Safety model (practical)
Safety is organized into three tiers so teams can pick the right abstraction for correctness and performance:
- Tier 1 — Safe-by-construction: high-level APIs ensure race-free patterns. Example: DisjointSlice + ThreadIndex make per-thread writes safe without unsafe blocks.
- Tier 2 — Scoped unsafe: limited, auditable unsafe blocks for shared-memory patterns or warp intrinsics.
- Tier 3 — Raw intrinsics: low-level primitives exposed for performance-critical or experimental code; use with caution.
Example pattern (conceptual): with a DisjointSlice abstraction, the API prevents overlapping writes so each thread writes its own slot without data races—meaning fewer unsafe annotations for typical kernels.
Performance snapshot
Early signs are encouraging but nuanced. The gemm_sol demo reached 868 TFLOPS on an NVIDIA B200, roughly 58% of cuBLAS “Speed‑of‑Light”. That shows the Rust→PTX path can deliver serious throughput. Why not 100%? Vendor-tuned libraries like cuBLAS implement years of algorithmic and micro-kernel tuning; matching them requires time, fusion strategies, and low-level tweaks.
Bottom line: cuda-oxide can produce competitive kernels, but closing the last performance gap will still demand engineering effort equivalent to traditional CUDA tuning.
Platform, requirements, and known limitations
- Linux-only (tested on Ubuntu 24.04)
- Pinned Rust nightly required (repo pins nightly-2026-04-03) plus the `rust-src` and `rustc-dev` components
- CUDA Toolkit 12.x+, LLVM 21+ (for NVPTX & tensor features), and `clang-21` or equivalent for bindgen
- llc (LLVM’s NVPTX backend) is an external dependency used to turn LLVM IR into PTX
- Known v0.1.0 limitation: `index_2d(stride)` can be unsound for some stride uses; workaround: bind the stride once and reuse it until a fix lands
How cuda-oxide compares to other Rust GPU efforts
- cuda-oxide: goal is “bringing CUDA into Rust.” Keeps CUDA semantics, emits PTX via llc. NVIDIA-centric, good for teams who want native CUDA semantics with Rust ergonomics.
- rust-cuda: often described as “bringing Rust to NVIDIA GPUs” — higher-level Rust abstractions layered on CUDA; different design tradeoffs and goals.
- rust-gpu: targets SPIR-V (Vulkan) rather than NVPTX, enabling broader vendor reach (e.g., non-NVIDIA backends) but different programming model and ecosystem.
Each project targets different tradeoffs: portability vs. fidelity to CUDA vs. developer ergonomics. For NVIDIA-focused production where CUDA semantics matter, cuda-oxide is compelling. For cross-vendor portability, rust-gpu or alternative SPIR-V approaches remain relevant.
Operational considerations and enterprise checklist
- Pin toolchains in CI: rust nightly, LLVM 21+, CUDA SDK versions in reproducible container images.
- Containerize builds: provide standard Docker images with llc and clang-21 installed for CI and reproducible pipelines.
- Benchmark and validate numerics: compare outputs against cuBLAS/cuDNN for correctness and statistical parity.
- Pinned nightly risk mitigation: run nightly-compat tests and have a tooling cadence to update the pinned revision after validation.
- Fallback plan: keep equivalent C++/CUDA codepaths or vendor libraries for risk mitigation during rollout.
- Profiling and observability: check that your profilers and debugging tools integrate with the generated PTX and host binary.
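The toolchain pin in the first checklist item can be realized with a `rust-toolchain.toml` committed alongside the kernels. The channel and component names below are taken from the requirements section above; treat this as a sketch, not the project's official file.

```toml
# Pin the nightly and rustc-internals components cuda-oxide expects.
[toolchain]
channel = "nightly-2026-04-03"
components = ["rust-src", "rustc-dev"]
```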
Pilot plan (practical steps)
- Pick one hot operator (e.g., a fused matmul or a custom attention kernel).
- Implement the kernel in Rust using `#[kernel]`, build with `cargo-oxide`, and validate numerics.
- Benchmark throughput and latency vs. the current implementation (cuBLAS/cuDNN or C++ CUDA).
- Integrate into CI with pinned container images and nightly toolchain; monitor for regressions over time.
- Decide based on perf, maintenance cost, and team velocity whether to expand coverage.
Limitations and open questions
- Experimental status: not yet production-grade; expect API churn and a need to re-pin or revalidate with new rustc releases.
- NVIDIA-only: final emit depends on llc → NVPTX; no built-in ROCm/AMD or SPIR-V output today.
- Maintenance burden: rustc internals and pinned nightlies can impose long-term ops cost unless the project stabilizes.
- Tooling maturity: kernel-level debugging, profiling, and integration with existing vendor tools will need work for a smooth developer experience.
Key takeaways and questions
- What is cuda-oxide and who built it?
cuda-oxide is an experimental rustc codegen backend from NVlabs that compiles Rust kernels directly to NVIDIA PTX.
- How does Rust become PTX under cuda-oxide?
Rust source is processed to Stable MIR, converted into Pliron dialects, emitted as textual LLVM IR, and llc (LLVM’s NVPTX backend) turns that into PTX.
- Can I write idiomatic Rust in kernels?
Yes—generics (monomorphized), closures with scalar captures, structs/enums, and pattern matching are supported along with many GPU intrinsics.
- What are the safety guarantees?
Three tiers: Tier 1 for race-free-by-construction patterns, Tier 2 for scoped unsafe, and Tier 3 for raw intrinsics.
- What should teams pilot first?
Start with a single hot operator, validate numerics and perf against vendor libraries, then measure ops cost for toolchain pinning and CI images.
Small, measurable, CI-anchored pilots are the fastest way to see whether this Rust→PTX path pays off for your workload; a side-by-side technical comparison of cuda-oxide, rust-cuda, and rust-gpu against your own kernels is a sensible next step.