Fine‑tune an LLM with Databricks Unity Catalog and Amazon SageMaker AI
TL;DR — The problem and the fix
Enterprises want the speed and scale of cloud ML while keeping tight data governance. The problem: when preprocessing or training runs outside a centralized catalog, you can lose access controls, audit trails, and lineage—turning a production-ready model into an audit headache. This pattern stitches Databricks Unity Catalog (the governance plane) to AWS compute (EMR Serverless for Spark preprocessing and Amazon SageMaker AI for training) so teams can fine‑tune an instruction‑following LLM on governed S3 data without losing authorization, visibility, or reproducibility.
Why governance matters for fine‑tuning LLMs
Think of Unity Catalog as a secure vault with a logbook. SageMaker or EMR can borrow the vault’s contents, but you need a protocol so the borrowing is recorded and governed. When you fail to record who accessed what data to train which model, you create compliance and operational risk—especially in regulated industries (finance, healthcare, legal). This pattern preserves fine‑grained authorization and produces an auditable lineage from raw data to model artifact.
“You must preserve Unity Catalog’s fine‑grained authorization when SageMaker Training accesses S3, otherwise you lose visibility and create compliance risk.”
High‑level architecture: Unity Catalog + EMR Serverless + Amazon SageMaker
At a glance, the flow looks like this: Unity Catalog manages external locations and tables on S3 → EMR Serverless runs Spark jobs that read Delta Lake tables and write processed Delta outputs → SageMaker AI runs the fine‑tune (Ministral‑3‑3B‑Instruct in the reference) using memory‑efficient techniques → model artifacts are stored back to S3 and registered into Unity Catalog’s MLflow registry → External Metadata/Lineage APIs ingest pipeline events so the catalog shows raw → preprocess → model lineage.
Key components (one‑line descriptions):
- Databricks Unity Catalog — centralized metadata, fine‑grained access control, and MLflow model registry.
- Amazon S3 — object store for raw and processed Delta tables and model artifacts.
- EMR Serverless (Elastic MapReduce Serverless) — serverless Spark runtime for preprocessing, with no clusters to manage.
- Amazon SageMaker AI — managed training runtime for fine‑tuning LLMs at scale.
- Delta Lake — table format used for raw and processed datasets on S3.
- Databricks OAuth M2M service principal — enables programmatic, auditable access from AWS jobs back to Unity Catalog.
- AWS Secrets Manager & IAM — secure secret storage and role-based access for EMR and SageMaker.
- External Metadata & External Lineage APIs — bring off‑platform provenance into Unity Catalog.
Example use case
A financial firm fine‑tunes Ministral‑3‑3B‑Instruct (an instruction‑following LLM from Hugging Face) on SEC EDGAR risk factors (10‑K/10‑Q text for S&P 500 firms, 2023–2024) stored as Delta Lake tables in S3. The company needs an explainable audit trail that ties model behavior to specific source documents for compliance and model risk management. Using this pattern, preprocessing runs on EMR Serverless, training runs on SageMaker, and Unity Catalog retains control and lineage metadata throughout.
Step‑by‑step implementation (executive summary + technical checklist)
Short executive view: configure Unity Catalog to govern an S3 external location, create a service principal and store its credentials in AWS Secrets Manager, run preprocessing on EMR Serverless with Spark to produce Delta outputs, launch a SageMaker training job that uses LoRA and FP8 quantization to reduce memory, write artifacts back to S3, register the model in MLflow, and call the Databricks External Metadata/Lineage APIs to record provenance.
- Create the governance surface
Register the S3 bucket as an External Location in Unity Catalog and create external tables (Delta Lake). Grant the service principal least‑privilege read access to those external locations.
- Provision a Databricks OAuth M2M service principal
Generate client id/secret on Databricks (M2M), and store them in AWS Secrets Manager. Use expiration and rotation policies.
- Configure EMR Serverless preprocessing
Run Spark jobs that authenticate to Databricks via the stored OAuth credentials (the job fetches Delta table data from S3 through the governed external location); a token-exchange sketch follows this list. EMR must be able to fetch Delta Lake JARs at init—either allow internet egress or prepackage the JARs.
- Preprocess and write Delta outputs
Produce cleaned, labeled instruction‑tuning examples as Delta tables on S3 under the same External Location so Unity Catalog can see them.
- Run SageMaker training
Launch a SageMaker AI training job that pulls the processed Delta data (via S3) and uses memory‑efficient methods like LoRA and FP8 quantization. Provide the Databricks M2M secret in Secrets Manager so the training job can register artifacts back to Unity Catalog.
- Register model and record lineage
Push the resulting model artifacts to S3, then register into Databricks Managed MLflow (using the OAuth service principal). Call the External Metadata/Lineage API to create lineage records linking source datasets, preprocessing jobs, and the model artifact.
- Audit, monitor, and clean up
Enable CloudWatch logging for EMR and SageMaker, monitor Secrets Manager rotations, and implement cleanup scripts to remove test resources when done.
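The credential handoff in steps 2 and 3 is the linchpin of the pattern. The sketch below shows one way an EMR Serverless or SageMaker job can fetch the M2M credentials and exchange them for a short-lived Databricks OAuth token; the secret name and workspace URL are hypothetical placeholders.

```python
# Minimal sketch: fetch Databricks M2M credentials from AWS Secrets Manager and
# exchange them for a short-lived OAuth token. Secret name and workspace URL
# are hypothetical placeholders, not values from the reference implementation.
import json

import boto3
import requests

WORKSPACE_URL = "https://your-workspace.cloud.databricks.com"  # assumption
SECRET_NAME = "databricks/m2m-service-principal"               # assumption

def get_databricks_token() -> str:
    # The job's IAM role must allow secretsmanager:GetSecretValue on this secret.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=SECRET_NAME)
    creds = json.loads(secret["SecretString"])  # expects {"client_id": ..., "client_secret": ...}

    # Databricks OAuth M2M uses the client-credentials grant against the
    # workspace token endpoint, with the client id/secret as HTTP Basic auth.
    resp = requests.post(
        f"{WORKSPACE_URL}/oidc/v1/token",
        auth=(creds["client_id"], creds["client_secret"]),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
        timeout=30,
    )
    resp.raise_for_status()  # fail fast if the secret has rotated mid-run
    return resp.json()["access_token"]
```

The returned bearer token is attached to subsequent Unity Catalog, MLflow, and lineage API calls, so every action traces back to the service principal.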
Minimal example: External Lineage payload (shape)
```json
{
  "sourceDatasetId": "unity_catalog:/prod/edgar/risk_factors:v1",
  "transformationJobId": "aws:emr-serverless:job-12345",
  "targetModelId": "unity_catalog:/models/ministral-risk-v1",
  "principal": "svc-account@databricks",
  "timestamp": "2026-04-25T15:00:00Z",
  "details": {
    "commit": "sha256:abcdef",
    "trainingInstance": "ml.g5.16xlarge",
    "tuningMethod": "LoRA+FP8"
  }
}
```
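Submitting a record of this shape is a single authenticated POST. The sketch below reuses the token helper from the credential sketch above; the endpoint route is an assumption (the External Lineage API is in preview), so confirm the exact path and payload contract against the current Databricks docs.

```python
# Illustrative sketch: push the lineage record above into Unity Catalog.
# The route below is an assumed placeholder; verify it against the docs.
import requests

token = get_databricks_token()  # helper from the credential sketch above

lineage_event = {
    "sourceDatasetId": "unity_catalog:/prod/edgar/risk_factors:v1",
    "transformationJobId": "aws:emr-serverless:job-12345",
    "targetModelId": "unity_catalog:/models/ministral-risk-v1",
    "principal": "svc-account@databricks",
    "timestamp": "2026-04-25T15:00:00Z",
    "details": {
        "commit": "sha256:abcdef",
        "trainingInstance": "ml.g5.16xlarge",
        "tuningMethod": "LoRA+FP8",
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/lineage-tracking/external-lineage",  # assumed route
    headers={"Authorization": f"Bearer {token}"},
    json=lineage_event,
    timeout=30,
)
resp.raise_for_status()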
Technical notes and gotchas
- Acronyms explained: LLM = large language model; S3 = Amazon Simple Storage Service; FP8 = 8‑bit floating‑point format (used here to quantize weights and activations); LoRA = Low‑Rank Adapters (a fine‑tuning technique that updates a small fraction of weights); M2M = machine‑to‑machine (OAuth client‑credentials flow); EMR = Amazon Elastic MapReduce (the Serverless variant runs Spark without managing clusters).
- Delta JARs on EMR: EMR Serverless does not include Delta Lake JARs by default (Feb 2026). EMR needs internet access to fetch the JARs from Maven during init, or you must bundle them in init scripts or container images (see the job-submission sketch after these notes).
- Authentication flow: EMR and SageMaker jobs assume IAM roles that access Secrets Manager. Those roles retrieve the Databricks M2M client id/secret, which the job uses to call Unity Catalog/MLflow APIs. This ties every action back to an auditable principal.
- Lineage ingestion: the External Metadata/Lineage APIs (Databricks public preview as of April 2026) let you bring non‑Databricks job metadata into Unity Catalog so users can query “which datasets trained model X?”
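To make the Delta JAR note concrete, here is one way to submit the preprocessing job so Delta Lake is resolved from Maven at startup (the internet-egress option). The application ID, role ARN, S3 paths, and Delta version are illustrative placeholders; pin a Delta version that matches your EMR release.

```python
# Sketch: submit the Spark preprocessing job to EMR Serverless with Delta Lake
# resolved from Maven at startup (requires outbound network access). All IDs,
# ARNs, paths, and the Delta version are illustrative placeholders.
import boto3

emr = boto3.client("emr-serverless")

run = emr.start_job_run(
    applicationId="<emr-serverless-app-id>",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-preprocess-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/preprocess_risk_factors.py",
            "sparkSubmitParameters": (
                "--packages io.delta:delta-spark_2.12:3.2.0 "
                "--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension "
                "--conf spark.sql.catalog.spark_catalog="
                "org.apache.spark.sql.delta.catalog.DeltaCatalog"
            ),
        }
    },
)
print("Job run started:", run["jobRunId"])
```

Inside preprocess_risk_factors.py, the job would read the raw table with spark.read.format("delta").load(...) and write cleaned instruction-tuning examples back with df.write.format("delta").save(...) under the same governed External Location.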
FP8 and LoRA — a concise primer
FP8 quantization reduces the numeric precision of model weights and activations to shrink memory footprint and speed up training. LoRA (Low‑Rank Adapters) freezes most model weights and injects small adapter matrices to learn task‑specific behavior. Combined, these techniques let teams fine‑tune a 3B‑parameter instruction model while training only ~1–2% of the parameters (the LoRA adapters)—often enough to fit on a single high‑end GPU instance (example: ml.g5.16xlarge) and to cut cost dramatically versus full fine‑tuning.
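A minimal configuration sketch using the Hugging Face PEFT library follows; the rank, target modules, and checkpoint path are illustrative assumptions that depend on the model architecture, and FP8 quantization would be applied separately by the training stack.

```python
# Sketch: attach LoRA adapters with Hugging Face PEFT. Rank, alpha, and target
# modules are illustrative; the checkpoint path is a placeholder.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("<ministral-3-3b-instruct-checkpoint>")

lora_config = LoraConfig(
    r=16,                                  # adapter rank (low-rank dimension)
    lora_alpha=32,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically reports ~1-2% of weights trainable
```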
When to avoid them: tasks that demand full model rewiring or fundamental architecture changes (e.g., training a model to learn entirely new modalities) may require full fine‑tuning or larger architectures and distributed training.
Costs, tradeoffs, and scaling
Tradeoffs are straightforward: this pattern adds operational work (service principals, secret lifecycle, IAM policies, VPC setup) for better compliance and reproducibility. Expect higher engineering overhead up front but lower model‑risk overhead later. Cost components to budget for include EMR Serverless job runtime, SageMaker training instances, S3 storage and egress, ECR for custom containers, and logging (CloudWatch).
Ballpark economics: fine‑tuning a 3B model with LoRA+FP8 on a single GPU typically costs on the order of hundreds to a few thousand dollars per run depending on dataset size and runtime. Moving to 10B+ parameter models forces distributed training, sharding, and more complex orchestration—costs jump accordingly and you must factor in multi‑node networking and checkpoint storage.
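A back-of-envelope calculation makes the ballpark concrete. Both inputs below are illustrative assumptions, not quoted AWS prices; check current SageMaker pricing for your region and instance type.

```python
# Back-of-envelope run cost. Both inputs are illustrative assumptions;
# check current SageMaker pricing for your region and instance type.
assumed_hourly_rate_usd = 5.00   # hypothetical ml.g5.16xlarge on-demand rate
assumed_runtime_hours = 24       # hypothetical LoRA+FP8 run on a 3B model

cost_per_run = assumed_hourly_rate_usd * assumed_runtime_hours
print(f"~${cost_per_run:,.0f} per run")  # ~$120 under these assumptions;
# dataset size, epochs, and retries push real runs into the hundreds or more
```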
Scaling considerations:
- For 10B+ models, adopt multi‑node training frameworks, sharded checkpoints, and high‑throughput networking (e.g., AWS Elastic Fabric Adapter).
- Consider hybrid strategies: pre‑filter and reduce dataset size in Spark, train LoRA adapters first, then selectively scale up for final full‑model tuning if needed.
- Measure ROI: faster model iteration from cheaper LoRA runs often delivers better product outcomes than rare, expensive full‑tune runs.
Security, auditability, and operational best practices
- Least privilege and separation of duties: grant the OAuth service principal only the Unity Catalog privileges it needs (read external location, write model registry). Use AWS IAM roles for compute jobs with minimal S3/SecretsManager policies.
- Secrets lifecycle: rotate M2M client secrets regularly, and use Secrets Manager rotation features. Ensure training jobs fail fast and surface errors when credentials expire mid‑run.
- Logging & monitoring: centralize EMR and SageMaker logs in CloudWatch and ship to SIEM. Record Databricks API calls and lineage events so model risk teams can reconstruct training provenance.
- Lineage fidelity: include timestamps, principal identifiers, job IDs, and commit hashes in lineage payloads. Audit processes should verify that recorded lineage matches S3 object versions/checksums.
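The last point is easy to automate. A minimal sketch, assuming the object's ETag and version ID were captured in the lineage payload at training time:

```python
# Sketch: verify a lineage record still matches the S3 object it references.
# Assumes the ETag and version ID were recorded in the lineage payload.
import boto3

s3 = boto3.client("s3")

def lineage_matches_object(bucket: str, key: str,
                           recorded_etag: str, recorded_version: str) -> bool:
    head = s3.head_object(Bucket=bucket, Key=key)
    live_etag = head["ETag"].strip('"')    # S3 returns the ETag quoted
    live_version = head.get("VersionId")   # present when bucket versioning is on
    return live_etag == recorded_etag and live_version == recorded_version
```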
Quick engineer checklist
- Register S3 as Unity Catalog External Location and create Delta external tables.
- Create Databricks OAuth M2M client; store client id/secret in AWS Secrets Manager with rotation.
- Create IAM roles for EMR Serverless and SageMaker with least privilege to S3, Secrets Manager, ECR, CloudWatch.
- Prepare EMR init script to ensure Delta JARs are available (or allow Maven egress).
- Write Spark preprocessing jobs to read/write Delta on S3.
- Configure SageMaker training with LoRA + FP8; mount S3 artifact paths for output (see the launch sketch after this checklist).
- Register artifacts to Databricks MLflow (using M2M creds) and call External Lineage API to record provenance.
- Automate cleanup for test resources to avoid surprise costs.
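The SageMaker step in this checklist might look like the sketch below, using the SageMaker Python SDK's Hugging Face estimator. The entry point, framework versions, hyperparameters, and paths are assumptions; the training script itself would apply the LoRA configuration shown earlier.

```python
# Sketch: launch the fine-tune as a SageMaker training job. Entry point,
# framework versions, hyperparameters, and paths are illustrative assumptions.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train_lora.py",          # your LoRA+FP8 training script
    source_dir="src",
    role="arn:aws:iam::123456789012:role/sagemaker-training-role",
    instance_type="ml.g5.16xlarge",       # single-GPU-node fit for a 3B model
    instance_count=1,
    transformers_version="4.36",          # illustrative DLC versions
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={"lora_r": 16, "quantization": "fp8", "epochs": 3},
    environment={"DATABRICKS_SECRET_NAME": "databricks/m2m-service-principal"},
    output_path="s3://my-bucket/models/ministral-risk-v1/",
)

# The "train" channel points at the processed Delta output on S3.
estimator.fit({"train": "s3://my-bucket/processed/risk_factors_delta/"})
```

After the job completes, the training script (or a follow-on step) can register the artifact against Unity Catalog with MLflow, for example mlflow.set_registry_uri("databricks-uc") followed by mlflow.register_model(...), authenticated with the same M2M credentials.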
Key takeaways and common questions
- How do you preserve Unity Catalog’s authorization when SageMaker reads S3?
Use a Databricks OAuth M2M service principal whose client id/secret lives in AWS Secrets Manager. EMR and SageMaker jobs assume IAM roles that fetch those secrets and authenticate to Databricks APIs—ensuring access is traceable to the catalog’s permissions model.
- Can you get end‑to‑end lineage when processing and training happen outside Databricks?
Yes. Use Databricks External Metadata and External Lineage APIs to push lineage objects from EMR and SageMaker into Unity Catalog so the catalog shows raw → preprocess → model provenance.
- How can resource requirements for fine‑tuning be reduced?
Apply FP8 quantization and LoRA so you fine‑tune only ~1–2% of parameters. That often lets teams use a single GPU instance for a 3B model instead of a distributed cluster.
- What operational pain points should you expect?
Service principal lifecycle, complex IAM policies, VPC/network configuration (EMR fetching Delta JARs), and building reliable lineage ingestion are common sources of friction.
- Will this pattern scale to much larger models?
Conceptually yes—the governance pattern scales. Practically, training 10B+ models requires distributed compute, sharded checkpoints, and higher networking demands; expect architecture and cost changes.
Next steps and resources
Teams that want to adopt this pattern can start with a small pilot: a controlled dataset, an EMR Serverless preprocessing job, and a single SageMaker run with LoRA+FP8 on a 3B model. Use the provided runnable Jupyter notebook (sample repo) as a template to accelerate setup, and instrument lineage calls early so audits are possible from day one.
Helpful links:
- Databricks Unity Catalog docs
- Databricks MLflow model registry
- EMR Serverless
- Amazon SageMaker AI
- Delta Lake
- Hugging Face (Models)
FAQ
- What if Secrets Manager credentials expire mid‑training?
Design training jobs to retrieve credentials at start and cache short‑lived tokens if needed. Coordinate rotation schedules with running jobs, and use monitoring to fail fast and alert on authentication errors.
- Does this increase vendor lock‑in?
There is some coupling between Databricks metadata and AWS runtime artifacts. The tradeoff is centralized governance and auditability—teams should weigh portability against compliance, and can mitigate lock‑in by standardizing metadata schemas and using open formats (Delta, MLflow).
- How mature is external lineage ingestion?
The External Lineage APIs were in public preview as of April 2026. Treat them with caution in production—validate lineage fidelity in pilots and include reconciliation steps in governance processes.
Governance need not slow innovation. With the right integration pattern, teams can fine‑tune LLMs on enterprise data, leverage best‑of‑breed AWS compute, and still maintain the audit trail and controls that risk and compliance teams demand. If you’d like a compact implementation checklist or scaling guidance—from a single ml.g5.16xlarge run to distributed training—those are straightforward next steps to bake into your pilot plan.