Hexagon cuts point-cloud model training 95% — 80 days to 4 with Amazon SageMaker HyperPod


TL;DR: Hexagon rebuilt its training stack on AWS and used Amazon SageMaker HyperPod, Elastic Fabric Adapter (EFA), and a high‑throughput S3 + FSx for Lustre pipeline to cut a long point‑cloud training run from roughly 80 days to about 4, a ~95% wall‑clock reduction. The architecture combined scalable multi‑GPU compute (NVIDIA H100s), low‑latency networking, lazy data streaming, automated orchestration, and centralized observability to make large, I/O‑heavy model training predictable, resilient, and fast.

Why point‑cloud training stalls progress

Point‑cloud datasets — the 3D scans and LiDAR captures used for digital twins, mining, construction, and autonomy — are large and I/O bound. That means the limitation is often not raw compute but how quickly terabytes of samples can reach GPUs. When GPUs sit idle waiting for data, iteration cadence slows: experiments that should take days turn into weeks or months, feature rollouts slip, and competitive advantage erodes.

Think of a GPU as a race car. Without a fast pit lane to refuel and reload data, the car spends more time idling in the garage than on the track.

Architecture at a glance

  • Compute: 6 × ml.p5.48xlarge instances (each with eight NVIDIA H100 GPUs), using EFA for low‑latency inter‑node communication.
  • Data pipeline: Amazon S3 for permanent storage + Amazon FSx for Lustre with Data Repository Association (DRA) for lazy loading and automatic checkpoint export.
  • Orchestration: Amazon SageMaker HyperPod for managed multi‑node distributed GPU training with self‑healing and checkpoint resume.
  • Observability: Per‑GPU telemetry from NVIDIA Data Center GPU Manager (DCGM) aggregated into Amazon Managed Service for Prometheus and visualized in Amazon Managed Grafana.
  • Experiment tracking: MLflow integrated with SageMaker for parameter/metric/artifact/lineage tracking.
  • Capacity model: SageMaker training plans with flexible reservations from 1 day up to 6 months for predictable capacity and pricing.

Key components, explained for leaders

Distributed GPU training (ml.p5 + NVIDIA H100)

Hexagon ran six ml.p5.48xlarge instances, each packing eight H100 GPUs, for 48 GPUs in total. That scale lets teams increase effective batch sizes and run larger parallel workloads, which often improves model convergence and final accuracy for domain‑specialized point‑cloud models.
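The scaling arithmetic behind "increase effective batch sizes" is simple: the global batch is the per‑GPU micro‑batch multiplied by the total GPU count. A minimal sketch, using the cluster shape from the article; the per‑GPU batch size is an illustrative assumption, not a figure Hexagon reported:

```python
# Effective global batch size for data-parallel training.
# Cluster shape (6 nodes x 8 H100s) is from the article;
# the per-GPU micro-batch below is an illustrative assumption.
nodes = 6
gpus_per_node = 8
per_gpu_batch = 4          # assumed micro-batch per GPU

total_gpus = nodes * gpus_per_node
global_batch = total_gpus * per_gpu_batch

print(f"{total_gpus} GPUs -> global batch size {global_batch}")
# 48 GPUs -> global batch size 192
```

Larger global batches are one reason multi‑node scaling can improve convergence, but they usually require retuning the learning rate, which a pilot should verify empirically.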

Elastic Fabric Adapter (EFA)

Elastic Fabric Adapter is AWS's network interface for HPC and ML workloads; it reduces latency and increases bandwidth between GPUs across nodes. For tightly coupled distributed training, EFA avoids the communication bottlenecks that can make multi‑node scaling inefficient.

S3 + FSx for Lustre with Data Repository Association (DRA)

Training data stays in S3 while FSx for Lustre streams needed files to compute nodes on demand. The DRA (Data Repository Association) lets the cluster lazily load terabytes of data at multi‑GB/s rates and automatically export checkpoints back to S3 — keeping the GPUs fed without manual data staging.
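From the training code's point of view, a DRA‑backed Lustre mount just looks like a local directory: objects in S3 appear as files and are hydrated on first read, so a loader can iterate paths without any AWS API calls or staging step. A minimal pure‑Python sketch of that streaming pattern; the mount path and `.las` file layout are assumptions for illustration:

```python
import os
from typing import Iterator

def stream_point_clouds(mount_root: str) -> Iterator[bytes]:
    """Yield raw point-cloud files one at a time from a Lustre mount.

    With an S3-backed Data Repository Association, files under
    mount_root are hydrated from S3 on first read, so this loop can
    stream terabytes without staging them up front. The layout
    ('.las' files under mount_root) is an illustrative assumption.
    """
    for dirpath, _dirs, files in os.walk(mount_root):
        for name in sorted(files):
            if name.endswith(".las"):
                with open(os.path.join(dirpath, name), "rb") as f:
                    yield f.read()  # first read triggers lazy hydration
```

A real data loader would wrap this in prefetching and decoding workers so reads overlap with GPU compute, but the key point stands: the code reads ordinary files, and Lustre handles the S3 traffic.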

SageMaker HyperPod

HyperPod provides the managed orchestration layer: lifecycle scripting, Helm support, DLAMI (Deep Learning AMI)‑based images, automated health checks, checkpoint‑based job resumption, and single‑spine topologies tuned for EFA. Those features convert fragile, long‑running experiments into resilient jobs that can recover from node failures without human babysitting.
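Checkpoint‑based resumption reduces to two operations: periodically persist the training step and model state to durable storage, and on any (re)start load the newest checkpoint instead of beginning at step 0. A hedged stdlib sketch of that contract, with JSON standing in for real tensor state and paths chosen for illustration:

```python
import glob
import json
import os

def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> None:
    """Atomically write a checkpoint file.

    Written via a temp file + os.replace so a crash mid-write never
    leaves a corrupt checkpoint. On a DRA-backed mount, files written
    here can be exported back to S3 automatically.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp = os.path.join(ckpt_dir, f".tmp-{step}")
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, os.path.join(ckpt_dir, f"ckpt-{step:08d}.json"))

def load_latest(ckpt_dir: str):
    """Return (step, state) from the newest checkpoint, or (0, {})."""
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt-*.json")))
    if not ckpts:
        return 0, {}
    with open(ckpts[-1]) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]
```

The zero‑padded step in the filename makes lexicographic sort equal to numeric sort, which is what lets `load_latest` pick the newest checkpoint with a plain `sorted`.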

Observability and experiment tracking

Telemetry from NVIDIA DCGM, Kubernetes, EFA, and the file system is routed to Amazon Managed Service for Prometheus and Amazon Managed Grafana. MLflow captures parameters, metrics, artifacts, and lineage so experiments are reproducible and searchable. Together, observability and tracking let teams know exactly where a run is stalled — network, I/O, or compute — and measure improvements.
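Once GPU, network, and file-system telemetry land in one place, stall diagnosis can start as a simple rule: high GPU utilization means compute‑bound; low utilization with heavy time in collectives points at the network; low utilization with weak data throughput points at I/O. A sketch of that triage logic; the thresholds are illustrative assumptions, not tuned values from the article:

```python
def classify_stall(gpu_util: float, io_gbps: float, net_wait_frac: float) -> str:
    """Crude triage of where a distributed training run is stalled.

    gpu_util:      mean GPU utilization in [0, 1] (e.g. from DCGM)
    io_gbps:       data throughput to the node in GB/s
    net_wait_frac: fraction of step time spent in collectives, [0, 1]

    Thresholds below are illustrative, not tuned values.
    """
    if gpu_util >= 0.85:
        return "compute-bound"
    if net_wait_frac >= 0.3:
        return "network-bound"
    if io_gbps < 1.0:
        return "io-bound"
    return "mixed"
```

In practice this rule would run as a Grafana alert or a dashboard annotation over the aggregated Prometheus metrics, but the decision logic is no more complicated than this.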

Hexagon’s approach: specialized, domain‑focused AI models often outperform large general‑purpose ones for point‑cloud tasks.

Infrastructure benefit: HyperPod’s self‑healing and checkpoint resume capabilities let long training runs continue without manual intervention.

Deployment speed: Hexagon completed its first HyperPod‑based training deployment within hours, simplifying setup and governance.

Concrete results and business outcomes

  • Training time: A particular network/configuration dropped from ~80 days on‑premises to ~4 days on HyperPod — about a 95% reduction in wall‑clock time.
  • Throughput: Multi‑GB/s streaming to GPUs eliminated data starvation, letting teams scale batch sizes and improve model training stability.
  • Resilience: Automated health checks and checkpoint resume reduced manual intervention and mean‑time‑to‑recover (MTTR) for multi‑day runs.
  • Operational efficiency: Integrated observability and MLflow cut time spent debugging experiments and improved reproducibility.
  • Faster time‑to‑market: Iteration cycles shortened from months to weeks, enabling quicker feature delivery for aerospace, automotive, construction, mining, manufacturing, and precision agriculture.

Before vs after — the quick snapshot

  • Before: On‑prem training runs ~80 days; GPUs often underutilized due to I/O constraints; fragile long runs with manual recovery.
  • After: HyperPod + FSx streaming reduced runs to ~4 days; GPUs kept near full utilization; jobs auto‑recover and checkpoints auto‑persist to S3.

Key questions leaders should ask (and short answers)

  • How big was the speedup, and what hardware made it possible?

    Hexagon reported about a 95% wall‑clock reduction (from ~80 days to ~4 days) for a specific workload using six ml.p5.48xlarge instances (6 × 8 NVIDIA H100 GPUs) connected with EFA.

  • How did they eliminate data I/O bottlenecks?

    Data remained in Amazon S3 and was streamed via Amazon FSx for Lustre with DRA to deliver multi‑GB/s throughput so GPUs were rarely starved for data.

  • Can multi‑node training be resilient for long runs?

    Yes — HyperPod’s automated health checks, self‑healing, and checkpoint resume let long distributed jobs continue without continuous human oversight.

  • How were experiments and metrics managed?

    MLflow tracked experiments, while DCGM + Kubernetes + EFA metrics were aggregated to Amazon Managed Service for Prometheus and visualized in Amazon Managed Grafana for unified MLOps observability.

Risks, tradeoffs, and what to validate

This approach is powerful but not automatic. Teams should validate these items before committing:

  • Total cost of ownership: Compare reserved cloud capacity vs capital and operational costs of on‑prem clusters. Include storage, data transfer, and long‑term model hosting costs.
  • Energy and carbon footprint: Use provider carbon tools or measure kWh per run to compare cloud vs on‑prem sustainability metrics.
  • Data governance and residency: Geospatial and point‑cloud datasets can be sensitive; check encryption at rest/in transit, access controls, and regulatory requirements.
  • Vendor lock‑in and portability: Ensure containerized workflows, model formats (ONNX), and data versioning make it feasible to rehost or hybridize later.
  • Validation of model gains: Confirm accuracy improvements across production datasets and monitoring pipelines to catch regressions early.

Metrics you should measure during a pilot

  • GPU utilization (%) before and after
  • Data throughput to GPUs (GB/s)
  • Epoch time and total wall‑clock training time
  • Checkpoint and restore time (impact on MTTR)
  • Number of node failures and automated recoveries
  • Cost per training run and per‑GB storage/egress
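Most of these metrics fall out of a handful of counters and timestamps: throughput is bytes streamed divided by wall time, epoch time is wall time over epochs, and recovery overhead is restore time times failure count. A small sketch of a pilot scorecard, with all field names and numbers as illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PilotRun:
    """Raw counters from one pilot training run (all values illustrative)."""
    bytes_to_gpus: float    # total bytes streamed to GPUs
    wall_seconds: float     # total wall-clock training time
    epochs: int
    restore_seconds: float  # time to restore one checkpoint
    failures: int           # node failures during the run
    cost_usd: float

    def throughput_gbs(self) -> float:
        """Average data throughput to GPUs in GB/s."""
        return self.bytes_to_gpus / self.wall_seconds / 1e9

    def epoch_seconds(self) -> float:
        return self.wall_seconds / self.epochs

    def recovery_overhead_seconds(self) -> float:
        """Wall-clock time lost to checkpoint restores (MTTR impact)."""
        return self.restore_seconds * self.failures

    def cost_per_epoch(self) -> float:
        return self.cost_usd / self.epochs
```

Computing the same scorecard for the on‑prem baseline and the cloud pilot gives an apples‑to‑apples comparison for the before/after decision.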

Practical next steps — a pilot checklist

Two‑week pilot (recommended for CTOs and platform teams):

  1. Reserve a small p5 cluster (e.g., 2–3 nodes) via a short SageMaker training plan window.
  2. Stream a representative subset of one point‑cloud dataset using S3 + FSx for Lustre DRA and measure GB/s to GPU.
  3. Deploy HyperPod with DLAMI‑based images and an MLflow experiment to run a known training recipe.
  4. Instrument Prometheus + Grafana to capture DCGM metrics, network latency, and file system throughput.
  5. Compare GPU utilization, epoch time, and cost per run to your on‑prem baseline; iterate on batch size and distributed config.
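Step 2's GB/s measurement can start as simply as timing sequential reads from the mount before any GPU code runs. A stdlib sketch of that baseline check; a real pilot should use representative file sizes and parallel readers, since a single thread only gives a lower bound:

```python
import time

def measure_read_gbs(paths, block_size=8 << 20) -> float:
    """Time sequential reads over the given files and return GB/s.

    Single-threaded, so this is a lower bound on the file system's
    throughput; add concurrent readers to approach the real ceiling.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(block_size):
                total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9
```

Running this against files on the FSx for Lustre mount (first cold, then warm) also shows the lazy‑hydration cost of the first read versus subsequent ones.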

What platform teams need to try first

  • Validate checkpoint/export workflow to S3 and ensure resume works from recent checkpoints.
  • Test lifecycle scripts and Helm charts for predictable job startup and teardown.
  • Automate metrics export from DCGM and Kubernetes to Prometheus for immediate visibility.

How to decide whether to move

If your team is spending more time waiting for runs to finish than analyzing results, or if your iteration cadence is measured in months rather than days, that’s a strong signal to evaluate streaming data pipelines and managed distributed training. The productivity gains — faster experiments, better reproducibility, and lower ops overhead — often justify a careful migration or hybrid burst plan.

For executives: commission a 2‑week pilot and a TCO/risks analysis comparing cloud reserved capacity vs upgrading on‑prem storage and networking.

For ML leaders: baseline your current experiment cadence, GPU utilization, and epoch times. Set target improvements (for example, reduce wall‑clock training time by 80% or push GPU utilization above 80%).

For platform teams: build a portable PoC using containerized training code, MLflow tracking, and a scripted FSx DRA import to avoid long manual steps later.

Final thought

Specialized, domain‑focused models for point‑cloud processing can outperform one‑size‑fits‑all models — but they need an infrastructure that keeps GPUs busy and experiments observable. Hexagon’s move illustrates how combining distributed GPU training, low‑latency networking (EFA), high‑bandwidth data streaming (S3 + FSx for Lustre DRA), and managed orchestration (SageMaker HyperPod) turns long, fragile runs into rapid, repeatable pipelines that accelerate product delivery across industries.

If your organization is constrained by I/O, brittle long runs, or slow iteration cycles, use the pilot checklist above and measure the five metrics listed. You’ll quickly know whether a managed distributed training approach is worth the investment.