The art and science of hyperparameter optimization on Amazon Nova Forge
TL;DR: Use Nova Forge defaults, iterate fast with LoRA for supervised fine‑tuning (SFT), and only escalate to Full Rank when validated metric gains justify the cost. Prioritize curated data and reward engineering—those move the needle more than clever hyperparameter hacking. For most production use cases follow SFT → RFT, use data mixing to prevent forgetting, and instrument reward and general‑capability regressions from day one.
Why customization matters for AI agents and AI for business
Generic foundation models are great starting points, but business tasks require domain knowledge, tone, safety constraints, and measurable objectives. Amazon Nova Forge provides a managed pipeline to customize frontier language models using three stages: CPT (continued pre‑training), SFT (supervised fine‑tuning), and RFT (reinforcement fine‑tuning). Each stage serves a purpose: CPT shifts base priors using unlabeled text, SFT teaches instruction following with high‑quality examples, and RFT pushes model behavior toward your reward metrics.
Think of model customization like building a precision instrument in a workshop: the blueprint is your data and reward design; the raw material is the checkpoint you start from; and the tools and settings are the hyperparameters you tune. Get the blueprint and material right first—those are the high‑leverage levers.
“Data and reward quality exceed any hyperparameter in importance.”
Pipeline overview: when to use CPT, SFT, and RFT
- CPT (continued pre‑training) — use when you have massive unlabeled corpora (millions to billions of tokens) or need to shift foundational priors before supervised work.
- SFT (supervised fine‑tuning) — the highest‑leverage step for most domain tasks. Curate 1k–10k high‑quality demonstrations to teach instruction following and reasoning patterns.
- RFT (reinforcement fine‑tuning) — refine behavior using a reward function. Effective only if your baseline model already achieves a meaningful positive‑reward rate (roughly >5%).
“RFT refines existing capability but cannot recover from a low baseline.”
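The stage guidance above can be sketched as a small gating function. This is only an illustration: the names `ProjectState` and `recommend_stages` are hypothetical, and the token and example minimums are assumptions read off the rough ranges in the list (millions of tokens for CPT, about 1k demonstrations for SFT, a positive-reward rate above roughly 5% for RFT).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectState:
    unlabeled_tokens: int        # size of the raw domain corpus, in tokens
    labeled_examples: int        # curated SFT demonstrations available
    positive_reward_rate: float  # fraction of baseline samples earning a positive reward


def recommend_stages(state: ProjectState,
                     cpt_min_tokens: int = 1_000_000,   # assumed lower bound for "millions of tokens"
                     sft_min_examples: int = 1_000,     # lower end of the 1k-10k range
                     rft_min_reward_rate: float = 0.05  # the ~5% baseline gate for RFT
                     ) -> List[str]:
    """Return the customization stages that look worthwhile, in pipeline order."""
    stages = []
    if state.unlabeled_tokens >= cpt_min_tokens:
        stages.append("CPT")
    if state.labeled_examples >= sft_min_examples:
        stages.append("SFT")
    # RFT is gated on the measured baseline, not on how much data is available.
    if state.positive_reward_rate > rft_min_reward_rate:
        stages.append("RFT")
    return stages
```

In practice the positive-reward rate passed in would be measured on the model you intend to start RFT from, which is usually the SFT output rather than the untouched base checkpoint.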
Core failure modes and how they show up in business systems
The three recurring technical challenges are:
- Catastrophic forgetting — overly aggressive CPT/SFT or poor data mixing can erode general capabilities. Business risk: your domain assistant starts hallucinating or loses general reasoning, undermining trust and compliance.
- Learning‑rate instability — choosing the wrong learning rate can make training diverge or forget. Business risk: wasted compute and degraded model behavior.
- RFT baseline constraints — RFT can only polish what already exists. Business risk: investing in reward tuning that yields no improvement because the model never earns positive rewards.
Practical defaults and a reproducible validation recipe
Start with service defaults; they exist because they work across many domains. The following table summarizes recommended starting points and when to use them.
| Context | Learning rate (LR) | LoRA settings (rank / alpha) | Warmup | Tokens / step | When to use |
|---|---|---|---|---|---|
| CPT (continued pre‑training) | up to ~1e‑4 (large token budgets); ramp down toward ~1e‑6 before SFT | N/A | ~15% of steps | 2–20M tokens per step (use 20M for very large budgets) | Large unlabeled corpora; foundational shift |
| SFT — LoRA (fast iteration) | ~1e‑5 | rank 32–64, alpha 64 | ~15% | Depends on batch; small‑to‑medium datasets | Validate quickly, low compute, on‑demand inference |
| SFT — Full Rank (production) | ~5e‑6 | Full parameter update | ~15% | Depends on batch; larger compute | When LoRA gains justify full‑parameter cost |
| RFT (reinforcement fine‑tuning) | Follow SFT schedule; monitor KL term | Depends on base (typically after LoRA/Full Rank) | ~15% | N/A | When baseline positive‑reward rate ≳5% |
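The warmup and ramp-down entries in the table can be sketched as a schedule function. This is a minimal illustration, not Nova Forge's scheduler: the name `lr_at_step` is hypothetical, and the cosine decay shape is an assumption, since the guidance only says to warm up for about 15% of steps and, for CPT, to ramp down toward ~1e-6 before SFT.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               final_lr: float, warmup_frac: float = 0.15) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_lr (decay shape is assumed)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps - 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


# CPT-style schedule: peak 1e-4, ramping down to 1e-6 over 1,000 steps.
schedule = [lr_at_step(s, 1000, peak_lr=1e-4, final_lr=1e-6) for s in range(1000)]
```

For SFT the same function could be called with the table's peak values (1e-5 for LoRA, 5e-6 for Full Rank) and whatever floor your run uses.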
How to run your first validation experiment (minimal reproducible recipe)
- Dataset: 2k labeled SFT examples (balanced across target intents, include a “reasoning/instruction‑following” slice).
- LoRA settings: rank = 32, alpha = 64, LR = 1e‑5, warmup = 15% steps, epochs = 3.
- Eval: use task accuracy/F1 and measure your business reward to compute a positive‑reward rate.
- Decision rule: if relative lift > ~10% on your primary metric and positive‑reward rate is above your safety threshold, consider a Full Rank run or RFT.
- Compute expectations: LoRA runs can often be validated on a single GPU (or small cluster) and are inexpensive; Full Rank commonly requires multi‑GPU training and is typically several times more costly.
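The decision rule in the recipe can be written down explicitly. A minimal sketch, assuming hypothetical function names and a 5% reward-rate threshold borrowed from the RFT guidance; substitute your own safety threshold.

```python
from typing import Sequence


def relative_lift(baseline: float, tuned: float) -> float:
    """Relative improvement of the tuned metric over the baseline metric."""
    if baseline <= 0:
        raise ValueError("baseline metric must be positive")
    return (tuned - baseline) / baseline


def positive_reward_rate(rewards: Sequence[float]) -> float:
    """Fraction of sampled outputs whose reward is strictly positive."""
    if not rewards:
        return 0.0
    return sum(1 for r in rewards if r > 0) / len(rewards)


def next_step(baseline_metric: float, tuned_metric: float,
              rewards: Sequence[float],
              min_lift: float = 0.10,          # the ~10% relative-lift bar
              min_reward_rate: float = 0.05    # assumed threshold; set your own
              ) -> str:
    """Return "escalate" (Full Rank or RFT) or "iterate" (more data/reward work)."""
    lift = relative_lift(baseline_metric, tuned_metric)
    rate = positive_reward_rate(rewards)
    if lift > min_lift and rate > min_reward_rate:
        return "escalate"
    return "iterate"
```

For example, a hypothetical run that moves F1 from 0.60 to 0.68 with 2 positive rewards in 10 samples clears both bars, while a move from 0.60 to 0.63 does not clear the lift bar.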
LoRA vs Full Rank: choose the right path for AI automation
LoRA (Low‑Rank Adaptation) is your fast feedback loop. It adjusts adapter weights instead of the entire model, so iterations are cheap and inference can be served on‑demand with a small adapter. Default alpha = 64 and ranks 32–64 work broadly. Full Rank updates all parameters and unlocks higher capacity for difficult distribution shifts, but it increases compute, cost, and deployment complexity—Full Rank runs are often 3–5× the compute cost of LoRA and may require provisioned throughput for consistent latency.
Rule of thumb: validate with LoRA, escalate to Full Rank when the measured improvement justifies operational cost and latency tradeoffs.
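To make the rank and alpha settings concrete, here is a sketch of the arithmetic in the standard LoRA formulation, where a weight matrix of shape d_out × d_in gets two low-rank factors and the update is scaled by alpha / rank. Whether Nova Forge applies exactly this scaling is an assumption, and the 4096 hidden size is a hypothetical value chosen for illustration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Scale applied to the low-rank update in the standard formulation."""
    return alpha / rank


d = 4096                      # hypothetical hidden size
full = d * d                  # parameters in one full d x d matrix
for rank in (32, 64):
    added = lora_trainable_params(d, d, rank)
    print(f"rank={rank}: {added} adapter params "
          f"({added / full:.4%} of the full matrix), "
          f"scaling={lora_scaling(64, rank)}")
```

With alpha fixed at 64, moving from rank 32 to rank 64 doubles the adapter parameters for this matrix and halves the scaling factor from 2.0 to 1.0. These per-matrix ratios describe trainable parameters only and should not be read as end-to-end compute or cost ratios.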
Data mixing and checkpoint selection: the highest‑impact levers
“Checkpoint selection is the most impactful decision you will make for continued pre‑training.”
Checkpoint choice matters more than most hyperparameters. Pick a checkpoint whose prior aligns with your domain style (instruction‑tuned vs base, multimodal vs text). Data mixing—blending domain data with general reasoning/instruction examples—is essential. For many tasks a 50/50 split between domain-specific and general/instruction data prevents forgetting while teaching domain behavior. Always include a “reasoning/instruction‑following” category in SFT mixes to preserve instruction quality.
“Data mixing should be treated as essential for production workloads, not optional.”
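A 50/50 mix can be built with a few lines of sampling code. This is a sketch under stated assumptions: `mix_datasets` is a hypothetical helper, examples are held in plain Python lists, and sampling is without replacement, so the mix size is limited by the smaller side.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(domain: Sequence[T], general: Sequence[T],
                 domain_fraction: float = 0.5,
                 total: Optional[int] = None, seed: int = 0) -> List[T]:
    """Sample a shuffled training mix with the requested share of domain examples."""
    if not 0 < domain_fraction < 1:
        raise ValueError("domain_fraction must be strictly between 0 and 1")
    if total is None:
        # Largest mix achievable at this ratio without repeating any example.
        total = int(min(len(domain) / domain_fraction,
                        len(general) / (1 - domain_fraction)))
    n_domain = round(total * domain_fraction)
    n_general = total - n_domain
    if n_domain > len(domain) or n_general > len(general):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(list(domain), n_domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


domain = [("domain", i) for i in range(100)]
general = [("general", i) for i in range(300)]
mixed = mix_datasets(domain, general)  # 100 domain + 100 general examples
```

Keeping the reasoning/instruction‑following slice as its own list and passing it in as (part of) `general` makes it easy to confirm that slice actually survives into the final mix.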
Reward engineering and RFT: do it only once the baseline is solid
RFT optimizes behavior against a reward function, but a weak reward or a low baseline makes RFT brittle. Before RFT:
- Confirm baseline positive‑reward rate (aim for >5%).
- Design rewards that balance correctness, concision, and safety—avoid overly sparse or easily gamed signals.
- Track KL divergence during RFT to limit policy drift away from the reference model.
Key RFT knobs include number_generation (how many samples to evaluate per prompt), KL‑loss coefficient (penalizes policy deviation), reasoning_effort (controls sampling depth), and concurrency limits when invoking external reward evaluators (e.g., AWS Lambda). Instrument reward value distributions and surface cases where rewards spike for undesirable behavior—reward hacking is a real operational hazard.
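The role of the KL‑loss coefficient can be illustrated with a generic KL‑penalized objective. This is a textbook-style sketch, not a description of Nova Forge internals: the function names are hypothetical, the distributions are toy next-token probabilities, and the exact form of the service's KL term is not specified in this guidance.

```python
import math
from typing import Sequence


def kl_divergence(policy: Sequence[float], reference: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(policy || reference) for two discrete distributions over the same support."""
    if len(policy) != len(reference):
        raise ValueError("distributions must have the same length")
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(policy, reference) if p > 0)


def penalized_reward(mean_reward: float, kl: float, kl_coef: float) -> float:
    """Reward after subtracting a penalty for drifting from the reference model."""
    return mean_reward - kl_coef * kl


reference = [0.5, 0.5]
drifted = [0.9, 0.1]
kl = kl_divergence(drifted, reference)  # positive: the policy has moved
print(penalized_reward(mean_reward=0.8, kl=kl, kl_coef=0.1))
```

A larger coefficient makes the same amount of drift cost more reward, which is why the coefficient acts as a brake on policy changes that chase reward at the expense of staying close to the reference model.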
Representative experiment highlights (context and caveats)
Representative internal experiments illustrate typical gains when disciplined workflows are followed. For context:
- MedReason — a medical reasoning benchmark. A base model at ~57.4% rose to ~63.5% after SFT using LoRA (LR 1e‑5, rank 32), a gain of about 6.1 points, or roughly 10.6% relative to the baseline. This shows how focused SFT can improve clinical reasoning without wholesale CPT.
- LLaVA‑CoT multimodal task — a chain‑of‑thought visual reasoning benchmark. SFT with LoRA (rank 64, alpha 64) lifted a base 16.2% to ~68.5% on the task—an extreme case where high‑quality demonstrations and multimodal alignment unlock large gains.
- Applied example (representative internal test) — an AWS China applied‑science team saw ~17% F1 uplift on a Voice‑of‑Customer classification task while maintaining near‑baseline MMLU scores, illustrating that careful mixing and selection can improve domain metrics without destroying general knowledge.
These numbers are representative internal experiments and depend heavily on dataset sizes, label quality, and compute setup. Use them as guidance, not guarantees.
Operational considerations: deployment, cost, and monitoring
Plan for three operational axes: latency, cost, and observability.
- Latency: LoRA adapters support on‑demand inference patterns; Full Rank models may require provisioned throughput solutions for consistent latency (e.g., Bedrock Provisioned Throughput on AWS).
- Cost: Expect LoRA validation runs to be inexpensive and fast. Full Rank is costlier—budget for higher training and possibly inference infrastructure.
- Observability: Monitor reward distribution, positive‑reward rate, KL divergence, and a small suite of critical QA prompts (e.g., MMLU slices, safety prompts) to detect regressions. Automate alerts when KL or negative drift exceeds thresholds.
Maintain rollback plans: keep original checkpoints and an evaluation suite so you can quickly revert tuning runs that introduce harmful behavior.
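The observability checks can be reduced to a small alerting function run after each evaluation pass. A minimal sketch: `Thresholds` and `check_run` are hypothetical names, and the KL and regression limits are placeholder values you would calibrate for your own workload (only the 5% reward-rate floor comes from the guidance here).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    max_kl: float = 0.5                      # placeholder drift limit
    min_positive_reward_rate: float = 0.05   # the ~5% floor used for RFT
    max_regression_drop: float = 0.02        # placeholder allowed drop on control prompts


def check_run(kl: float, positive_reward_rate: float,
              baseline_score: float, current_score: float,
              thresholds: Optional[Thresholds] = None) -> List[str]:
    """Return the names of all alerts triggered by the latest evaluation."""
    t = thresholds or Thresholds()
    alerts = []
    if kl > t.max_kl:
        alerts.append("kl_drift")
    if positive_reward_rate < t.min_positive_reward_rate:
        alerts.append("low_positive_reward_rate")
    # Regression suite score (e.g. an MMLU slice) compared against the original checkpoint.
    if baseline_score - current_score > t.max_regression_drop:
        alerts.append("capability_regression")
    return alerts
```

Any non-empty result is a candidate trigger for the rollback plan: revert to the retained checkpoint, then investigate the data mix, learning rate, or reward function.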
Troubleshooting decision tree (quick)
- Symptom: Low positive‑reward rate during RFT → Action: Run SFT first with curated examples; make the reward less sparse and improve its signal quality.
- Symptom: General capability regressions after tuning → Action: Increase general data in your mixing ratio, reduce LR, or reduce CPT epochs; revert to checkpoint and re‑run with a 50/50 mix.
- Symptom: Training instability or divergence → Action: Lower the learning rate, confirm warmup covers at least ~15% of steps (lengthen it if instability persists), and validate on smaller runs before full‑scale training.
- Symptom: Sudden reward spikes with poor outputs → Action: Inspect reward function for loopholes; add adversarial examples or constrain reward features.
Quick checklist for your first Nova Forge customization
- Choose a checkpoint aligned with your domain (instruction‑tuned vs base).
- Curate 1k–10k high‑quality SFT examples; include reasoning/instruction examples.
- Validate with LoRA (rank 32–64, alpha 64, LR ≈ 1e‑5, warmup ≈ 15%).
- Use data mixing ≈50% domain + 50% general/instruction to avoid forgetting.
- Measure positive‑reward rate, KL divergence, and regression tests (MMLU or domain control prompts).
- If lift is validated and cost is justified, run Full Rank or begin RFT with guarded KL coefficients.
- Instrument production with alerts for drift, and retain rollback checkpoints.
Common questions
When should I use RFT?
Only after your baseline model shows a meaningful positive‑reward rate (roughly above 5%). If the model rarely earns a positive reward out of the box, invest in SFT to raise the baseline first.
LoRA or Full Rank—what should I pick first?
Start with LoRA for fast, low‑cost validation. Move to Full Rank when validated gains require the full parameter capacity and you can accept higher compute and deployment costs.
How aggressive should my learning rate be?
Use service defaults as anchors: LoRA SFT ≈ 1e‑5, Full Rank SFT ≈ 5e‑6. CPT can be as high as ~1e‑4 when you have very large token budgets, but ramp down toward ~1e‑6 before SFT.
How do I avoid catastrophic forgetting?
Mix domain data with general instruction/reasoning data (around 50% domain is common), include reasoning/instruction examples, and pick a checkpoint aligned to your domain. Treat data mixing as essential for production.
What’s the single biggest lever?
Checkpoint selection and the quality of your training data and rewards—these outweigh most hyperparameter tweaking.
Two‑week pilot plan for executives and ML teams
- Week 0 (planning): Pick checkpoint, curate 1–2k SFT examples, sketch a reward function, and define evaluation metrics and safety checks.
- Week 1 (validate with LoRA): Run LoRA SFT experiments using service defaults, measure lift on primary business metrics and positive‑reward rate, and monitor for regressions.
- Week 2 (escalate or iterate): If lift is promising, run a small Full Rank job or begin cautious RFT with KL constraints; if not, iterate on data quality and reward design.
Limitations, risks, and guardrails
Customization brings risks: data privacy and PII leakage, model drift over time, reward hacking, and potential regulatory compliance issues in sensitive domains (health, finance, legal). Mitigate these with careful data governance, human‑in‑the‑loop checks on reward edge cases, continuous monitoring for regressions, and a robust rollback plan.
Final note for leaders
Nova Forge packages best practices—CPT, SFT, RFT—with managed tooling to reduce operational overhead. The fastest path to value is pragmatic: invest first in high‑quality examples and reward design, validate quickly with LoRA, use data mixing to protect general capabilities, and only pay for Full Rank when metrics and ROI are clear. With that discipline, teams can convert a general Nova model into a reliable AI agent that delivers business outcomes without sacrificing the model’s broader utility.
These recommendations come from AWS solutions architects and customization specialists who ran benchmarks such as CoCoHD, MedReason, and LLaVA‑CoT in collaboration with customization science teams, so the lessons are grounded in applied experimentation.