Fine-tune Amazon Nova with Nova Forge SDK and Data Mixing to Avoid Catastrophic Forgetting

How to fine-tune Amazon Nova without throwing away its general smarts

Fine‑tuning can make a model brilliant at your task — and cause it to forget everything else. Data mixing — blending your proprietary examples with Amazon‑curated examples every training batch — preserves broad capabilities while teaching domain specifics. Using the Nova Forge SDK practical guide with SageMaker HyperPod and LoRA, teams can iterate fast, control cost, and avoid catastrophic forgetting.

TL;DR — What you’ll get

  • Practical, reproducible workflow for fine‑tuning Amazon Nova and Nova Lite 2 with the Nova Forge SDK.
  • Data mixing as the main engineering pattern: per‑batch blends of customer + Nova curated data to retain general knowledge.
  • Hands‑on safety checks: dataset sanitization rules, short validation runs to catch forgetting, MLflow tracking and multi‑axis evaluation (MMLU, IFEval, BYOD).
  • Start with LoRA (parameter‑efficient) and only escalate to full‑rank SFT if necessary.

Quick meta description (for SEO)

Practical walkthrough for fine‑tuning Amazon Nova using Nova Forge SDK and data mixing to avoid catastrophic forgetting while improving domain accuracy.

Glossary (quick definitions)

  • Nova Forge SDK: High‑level SDK (aws/nova-forge-sdk) to prepare data, launch fine‑tuning jobs, and evaluate Nova models.
  • Amazon Nova / Nova Lite 2: The target family of LLMs being customized.
  • LoRA (Low‑Rank Adaptation): A parameter‑efficient adapter technique that adds low‑rank updates to model weights — faster and cheaper than full fine‑tuning.
  • SFT (Supervised Fine‑Tuning): Training method that updates many or all model parameters; “full‑rank SFT” means tuning full model weights.
  • SageMaker HyperPod: AWS runtime for distributed GPU training (example uses ml.p5.48xlarge instances).
  • MMLU: Public benchmark measuring broad knowledge and reasoning; useful to detect catastrophic forgetting.
  • IFEval: Benchmark for instruction‑following evaluation.
  • BYOD: Hold‑out domain test (Bring Your Own Data) for real KPIs.

Business problem — briefly

Teams want models that excel on narrow tasks (medical Q&A, customer sentiment, fraud detection) but still answer general queries, follow instructions, and reason reliably. Naive fine‑tuning on only proprietary data often produces catastrophic forgetting: the model becomes a specialist that fails general customer support, hallucination checks, or regulatory prompts — a costly product risk and a UX disaster.

Data mixing lets you fine‑tune on domain data without giving up the model’s general capabilities.

Stepwise workflow (what to do)

  1. Prereqs & infra checklist
    • AWS account with SageMaker quotas and IAM roles for training and S3 access.
    • S3 bucket for datasets and artifacts; use KMS encryption for PII/PHI.
    • MLflow (SageMaker MLflow integration recommended) for experiment tracking.
    • SageMaker HyperPod cluster (example config below uses 4 × ml.p5.48xlarge for train, 1 × ml.p5.48xlarge for eval).

    What this means for business: Expect upfront cloud configuration and role setup, but once in place you can reproduce runs and enforce governance.

  2. Dataset sanitization & formatting (do this first)

    Domain data often includes tokens or chat markers that clash with Nova chat templates. Clean data before uploading to S3. Typical reserved tokens to neutralize: System:, Assistant:, [EOS], <image>.

    Quick fixes:

    • Insert a space before colons in reserved markers (e.g., change "System:" to "System :").
    • Strip or replace inline tags like <image> if visual context isn’t supported.
    • Normalize quotes and whitespace; de‑duplicate near‑duplicates that leak target labels.

    Before/after JSONL example (one record):

    Before: {"input":"System: Provide answer.","output":"Diagnosis: ..."}

    After:  {"input":"System : Provide answer.","output":"Diagnosis: ..."}

    Why it matters: A malformed JSONL or an unchecked reserved token can break parsing at runtime and waste expensive GPU hours.
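    The quick fixes above can be scripted before upload. A minimal sketch, assuming plain JSONL records with string fields and the reserved-token list from this section (the helper names are illustrative, not SDK functions):

    ```python
    import json
    import re

    # Reserved markers from this section; extend for your own chat templates.
    RESERVED = ["System:", "Assistant:", "[EOS]", "<image>"]

    def sanitize(text: str) -> str:
        """Neutralize reserved chat markers and normalize whitespace/quotes."""
        for token in RESERVED:
            if token.endswith(":"):
                # Insert a space before the colon: "System:" -> "System :"
                text = text.replace(token, token[:-1] + " :")
            else:
                # Strip bracketed/tag tokens entirely.
                text = text.replace(token, "")
        # Normalize curly quotes, then collapse repeated spaces/tabs.
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        return re.sub(r"[ \t]+", " ", text).strip()

    def sanitize_jsonl_line(line: str) -> str:
        """Sanitize every string field of one JSONL record."""
        record = json.loads(line)
        return json.dumps({k: sanitize(v) for k, v in record.items()})
    ```

    Running this over every line of the training file before the S3 upload catches the parsing failures described above without spending GPU time.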

  3. Data mixing pattern and controls

    Each training batch is sampled from two pools: customer examples and Nova curated examples (reasoning, instruction prompts, factuality checks, safety/RAI). The Nova Forge SDK exposes knobs like customer_data_percent and per‑subcategory Nova weights.

    Sampling logic (per batch, illustrative Python):

    # Python sketch of the per-batch mixing decision
    import random

    def sample_batch(customer_pool, nova_pool, customer_data_percent, n):
        if random.random() < customer_data_percent:
            return random.sample(customer_pool, n)
        # Otherwise draw from the Nova curated pool (subcategories weighted)
        return random.sample(nova_pool, n)

    Practical starting points:

    • Baseline mix: 50% customer / 50% Nova.
    • A common successful pilot mix — 75% customer / 25% Nova — preserved general skills while boosting domain F1.
    • If you see forgetting, increase Nova proportion first before changing learning rates or steps.

    A blended training mix recovered nearly all general knowledge while still delivering a double‑digit F1 lift on a complex classification task.

    Business impact: Small changes in the mix are cheap ways to fix catastrophic forgetting compared with re‑training or expensive model replacements.

  4. Training config — LoRA first

    Start with LoRA for fast iteration and cost control. Typical defaults to try:

    • Learning rate: 1e‑5
    • Global batch size: 32
    • Context length: max_length = 65536 (if long context is needed)
    • LoRA rank (r): 8–32 (start with 16)
    • Alpha (LoRA scaling): 8–32
    • Sanity validation: max_steps = 5 for an initial quick check

    Decision guide — LoRA vs full‑rank SFT:

    • Use LoRA if you need fast iteration, limited budget, or to preserve base model behavior.
    • Consider full‑rank SFT if LoRA cannot reach accuracy targets after mix tuning and you have budget and strong guardrails for forgetting.

    Compute example: Pilot used 4 × ml.p5.48xlarge for training and 1 × ml.p5.48xlarge for evaluation. Always run a short validation pass before committing to full training to avoid wasted GPU hours.

  5. Evaluation — multi‑axis checks

    Track both domain and general metrics. Minimum evaluation suite:

    • MMLU — broad knowledge and reasoning (sensitive to catastrophic forgetting).
    • IFEval — instruction‑following fidelity.
    • BYOD — held‑out domain test that matches production KPIs (F1, accuracy, AUC, latency).
    • Optional: LLM‑as‑judge for complex ranking tasks and human annotation for nuanced cases.

    Example experimental results (reproducible pattern):

    • Baseline MMLU: 0.75
    • Customer‑only fine‑tune (no mixing): MMLU → 0.47 (catastrophic forgetting)
    • Mixed 75% customer / 25% Nova: MMLU → 0.74, Voice‑of‑Customer F1 → +12 points

    What this means: The right mix can regain almost all general capability while delivering significant domain gains — a direct win for product quality and regulatory safety.
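    A simple regression gate over these axes can flag forgetting automatically after each run. A minimal sketch; the metric keys and the 2-point tolerance are placeholders to adapt to your own KPIs:

    ```python
    # Fail a run if any general benchmark drops more than `tolerance`
    # below baseline, even when the domain metric improves.
    def check_forgetting(baseline: dict, candidate: dict,
                         tolerance: float = 0.02) -> list:
        """Return the general metrics that regressed beyond tolerance."""
        general_axes = ["mmlu", "ifeval"]  # BYOD is judged on domain KPIs instead
        return [
            axis for axis in general_axes
            if baseline[axis] - candidate[axis] > tolerance
        ]
    ```

    With the numbers above, a customer-only run (MMLU 0.47 vs. a 0.75 baseline) fails the gate, while the 75/25 mix (MMLU 0.74) passes.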

  6. Operational, security & governance checklist
    • PII/PHI scrubbing and consent audit before any data leaves secure storage.
    • Encrypt S3 artifacts with KMS and restrict SageMaker notebook/role access via IAM.
    • Use VPC endpoints for SageMaker and guard training containers.
    • Track runs with MLflow: tag seed, mix_ratio, lora_rank, lora_lr, steps, dataset_version, commit_hash.
    • Retain logs and model artifacts for audits and incident response.

    Suggested MLflow tagging schema:

    project: nova-domain
    experiment: medreason-sft
    mix_ratio: 0.75
    lora_rank: 16
    lr: 1e-5
    seed: 42
    hyperpod: 4x-p5.48xlarge

Troubleshooting & common failure modes

  • Large MMLU drop after a run: Check mix_ratio first; if customer_data_percent = 1.0, the model has likely lost base skills. Restart with 50%+ Nova samples and run a short validation pass.
  • Parsing or JSONL errors: Look for reserved tokens (System:, Assistant:, [EOS], <image>) and inconsistent prompt templates.
  • Slow convergence or overfitting: Reduce LoRA rank or steps, add Nova reasoning examples, or increase regularization.
  • Unexpected domain drift: Inspect curated Nova subset weights — some Nova curated data may introduce domain‑irrelevant styles; filter if needed.

Reproducibility checklist (pre-flight)

  • S3 paths and versions for training/dev/test splits committed.
  • MLflow experiment created with tags (see schema above).
  • Seed and tokenizer version recorded.
  • Short validation run configured (max_steps=5) to catch basic failures.
  • Access controls and KMS encryption enabled for artifacts.
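The tagging portion of this checklist can be enforced programmatically before launch. A minimal sketch that validates a run's tag dict against the MLflow schema suggested above (the helper itself is illustrative):

```python
# Required tags, mirroring the suggested MLflow tagging schema.
REQUIRED_TAGS = [
    "seed", "mix_ratio", "lora_rank", "lora_lr",
    "steps", "dataset_version", "commit_hash",
]

def preflight_tags(tags: dict) -> list:
    """Return missing or empty tags; an empty list means the run may launch."""
    return [t for t in REQUIRED_TAGS if not str(tags.get(t, "")).strip()]
```

Calling this at the top of the launch script turns a silent reproducibility gap into an immediate, cheap failure.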

FAQs — quick answers for leaders

How many customer examples do I need?

It varies by task. Low‑shot tasks can benefit from a few thousand high‑quality examples; complex classification often needs tens of thousands (MedReason: ~32.7k Q‑A pairs used in a sample). Start small with LoRA and iterate on mix and data quality.

When should I use full‑rank SFT?

Only after LoRA + mixing fails to meet KPIs and you have budget and guardrails. Full SFT costs more and increases forgetting risk, so treat it as a last‑mile optimization.

Is LLM‑as‑judge reliable?

LLM judges are useful for scaling evaluation but should be calibrated against human raters for nuanced tasks, especially in regulated domains.

Next steps & resources

Key takeaways

  • Data mixing is the pragmatic lever to specialize Amazon Nova without catastrophic forgetting: blend domain examples and curated Nova samples per batch.
  • Start with LoRA and conservative mix ratios (50/50 baseline), validate quickly (max_steps = 5), and track experiments in MLflow.
  • Always measure multi‑axis: MMLU, IFEval, and BYOD are minimal checks to ensure domain gains haven’t erased general competence.

Always measure both domain gains and general benchmarks—if you only measure one, you can’t tell whether the mixing is working.

Authors and contributors behind these patterns include AWS practitioners and the Nova Forge SDK maintainers. Use the SDK helpers (NovaModelCustomizer, JSONLDatasetLoader) and the sample notebooks as a starting point. If you want a ready‑to‑run notebook, an experimental mixing grid, or a costed pilot plan tailored to your domain, a short pilot will quickly reveal the right mix and budget tradeoffs for your business.