How Nova Forge takes LLM accuracy from 13% to ~80% — without the DevOps pain
TL;DR
- Using the Nova Forge SDK, a small experiment took an Amazon Nova model from a 13% exact‑match baseline to ~79% after supervised fine‑tuning (SFT) and ~80.6% after adding reinforcement fine‑tuning (RFT).
- Pipeline: baseline evaluation → parameter‑efficient SFT (LoRA adapters) → RFT with an AWS Lambda reward (+1/−1) and KL regularization → evaluation → deploy to SageMaker or Bedrock.
- Nova Forge SDK packages dataset validation, training orchestration (SageMaker), evaluation, logging, and deployment hooks so teams can focus on model behavior, not infra plumbing.
Why this matters for business teams
Enterprises need LLMs that follow business rules, produce tightly formatted outputs, and make reliable decisions. Building the MLOps stack for fine‑tuning and deployment is what usually slows projects down. Nova Forge SDK reduces that overhead: it wires data transforms, managed training jobs, logging, evaluation, and deployment into an opinionated workflow so smaller teams can ship a domain‑specific LLM without building the entire infra from scratch.
Dataset, model, and experimental setup
Use case: classify Stack Overflow questions into three quality labels — HQ (high quality), LQ_EDIT (low quality but fixable with edits), and LQ_CLOSE (low quality and should be closed).
Data: a sample of ~4,700 rows from a larger Kaggle Stack Overflow questions dataset (60k). Split roughly as:
- ~3,500 rows for SFT training (~75%)
- ~500 rows for evaluation (~10%)
- ~700 rows for RFT (plus the SFT samples reused during RFT; ~15%)
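The split above is easy to reproduce with a seeded shuffle. A minimal sketch, assuming the Kaggle data has already been sampled down to ~4,700 rows (the fractions match the proportions listed above):

```python
import random

def split_rows(rows, sft_frac=0.75, eval_frac=0.10, seed=42):
    """Shuffle deterministically, then carve out SFT / eval / RFT slices."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_sft = int(n * sft_frac)
    n_eval = int(n * eval_frac)
    sft = shuffled[:n_sft]
    eval_set = shuffled[n_sft:n_sft + n_eval]
    rft = shuffled[n_sft + n_eval:]          # remainder (~15%) goes to RFT
    return sft, eval_set, rft

rows = list(range(4700))                      # stand-in for the dataset rows
sft, eval_set, rft = split_rows(rows)
print(len(sft), len(eval_set), len(rft))      # 3525 470 705
```

Recording the seed alongside the split sizes is what makes the experiment reproducible later (see the reproducibility checklist below).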
Model & compute highlights used in the demo:
- Foundation model family: Amazon Nova (Model.NOVA_LITE_2)
- SageMaker training instances: ml.p5.48xlarge — SFT used 4 instances; RFT used 2 instances
- RFT reward hosted as an AWS Lambda function (+1 for correct, −1 for incorrect); SageMaker granted permission to invoke the function
Baseline evaluation: start with a measurement
Always begin with a baseline. The pretrained Nova model produced a surprisingly low exact‑match (EM) accuracy of 13.0% on the strict three‑way label extraction. That looks worse than random guessing (~33%) only because exact‑match enforces tight formatting: the model often understood intent but didn’t match the exact label string the evaluator expected.
If you relax the extraction rules and accept intent-equivalent outputs, accuracy climbs to 52.2%. The baseline shows the model “knows” a lot but needs instruction and formatting taught explicitly.
Example (illustrative):
Input: “How do I fix a NullPointerException when calling get() on a map in Java?”
Desired output (exact match): HQ
Model produced: “This looks like a bug — needs more info” → parsed as LQ_EDIT (a plausible free-text judgment, but not the exact label string the evaluator expects)
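The gap between strict and relaxed scoring can be made concrete with two scoring functions. A minimal sketch — the phrase-to-label hints are illustrative assumptions, not the evaluator actually used in the experiment:

```python
VALID_LABELS = {"HQ", "LQ_EDIT", "LQ_CLOSE"}

# Illustrative phrase -> label mapping for relaxed extraction (an assumption).
RELAXED_HINTS = {
    "high quality": "HQ",
    "needs more info": "LQ_EDIT",
    "needs editing": "LQ_EDIT",
    "should be closed": "LQ_CLOSE",
}

def exact_match(output: str, gold: str) -> bool:
    """Strict: the output must be exactly the gold label string."""
    return output.strip() == gold

def relaxed_match(output: str, gold: str) -> bool:
    """Relaxed: accept a bare label or a recognizable paraphrase."""
    text = output.strip()
    if text in VALID_LABELS:
        return text == gold
    lowered = text.lower()
    for phrase, label in RELAXED_HINTS.items():
        if phrase in lowered:
            return label == gold
    return False

print(exact_match("This needs more info", "LQ_EDIT"))    # False
print(relaxed_match("This needs more info", "LQ_EDIT"))  # True
```

The 13.0% vs 52.2% gap in the baseline is exactly the difference between these two scoring regimes.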
Business takeaway: never skip a baseline evaluation. It tells you whether you need prompt engineering only, or a true customization pipeline (SFT ± RFT).
Stage 1 — Supervised fine‑tuning (SFT) with parameter‑efficient adapters
Supervised fine‑tuning (SFT) means training the model on labeled examples so it learns the task structure and the exact output format. To keep costs and memory practical, the experiment used parameter‑efficient adapters (LoRA‑style). LoRA injects small trainable layers while keeping most of the base model frozen — a cost‑effective way to teach new behavior without retraining everything.
Result: SFT produced the biggest jump — Exact Match rose to 77.2% and classification‑extracted accuracy landed near 79%. That’s where the model learns:
- the output vocabulary (exact label tokens),
- the expected format, and
- mapping ambiguous inputs to the right label when training examples demonstrate the rule.
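The cost argument for LoRA is simple arithmetic: instead of training a dense d_out × d_in weight update per layer, a rank-r adapter trains only r × (d_in + d_out) parameters. A sketch with an illustrative hidden size (not Nova's actual width):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA replaces the dense update dW (d_out x d_in) with B @ A,
    # where A is (rank x d_in) and B is (d_out x rank).
    return rank * d_in + d_out * rank

d = 4096                                      # illustrative hidden size
full = d * d                                  # dense update: 16,777,216 params
lora = lora_trainable_params(d, d, rank=8)    # adapter: 65,536 params
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")  # 256x
```

At rank 8 the adapter trains roughly 0.4% of the parameters a full per-layer update would, which is why SFT with adapters fits on far less compute.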
Business impact: a classifier with SFT‑level performance is often good enough for automated triage, routing, or first‑pass moderation, drastically reducing manual review load.
Stage 2 — Reinforcement fine‑tuning (RFT) to sharpen decisions
Reinforcement fine‑tuning (RFT) uses a reward signal to optimize for the decision you actually care about. In the demo, the reward function is binary (+1 correct, −1 incorrect) and lives in a Lambda function that scores normalized outputs. During RFT, a KL penalty keeps the policy anchored to the SFT checkpoint so the model doesn’t drift away from the correct output format while chasing reward.
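A binary reward of this shape is a few lines of Lambda code. A minimal sketch — the event field names (`model_output`, `ground_truth`) are assumptions, since the actual contract between the RFT job and the reward function is defined by your job configuration:

```python
def lambda_handler(event, context):
    """Score one completion: +1 if the normalized label matches, else -1."""
    predicted = str(event.get("model_output", "")).strip().upper()
    gold = str(event.get("ground_truth", "")).strip().upper()
    reward = 1.0 if predicted == gold else -1.0
    return {"reward": reward}

# Local smoke test; no Lambda runtime needed:
print(lambda_handler({"model_output": " hq ", "ground_truth": "HQ"}, None))
print(lambda_handler({"model_output": "LQ_EDIT", "ground_truth": "HQ"}, None))
```

Normalizing before comparison (strip, uppercase) keeps the reward from punishing trivial formatting noise, which would otherwise add variance to the training signal.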
SFT teaches structure and format; RFT optimizes decision calibration when you can encode success as a reward.
Result: RFT gave a modest but useful lift — Exact Match improved to 78.8% and quasi‑metrics reported near 80.6%. That extra polish can matter when you need higher precision on critical classes or when subtle policy preferences are better expressed as rewards than labels.
When to try RFT:
- If you can write a reliable reward function (low noise, clear rule).
- If gains from SFT have plateaued but business value increases with small accuracy improvements.
- When you need to encode preferences that are hard to label exhaustively.
Watch outs: reward collapse (gaming the reward), unstable training signals, and drift — mitigate these with KL regularization, reward validation, and careful monitoring.
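The KL anchor mentioned above is, in effect, a penalty subtracted from the task reward. A per-sequence sketch, assuming you have log-probabilities of the sampled tokens under both the current policy and the frozen SFT reference:

```python
def kl_penalized_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Subtract beta * KL(policy || reference), estimated per token as the
    difference of log-probs on the sampled tokens."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# A policy that drifted far from the SFT reference pays a larger penalty:
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.5], beta=0.1)
anchored = kl_penalized_reward(1.0, [-0.9, -1.4], [-1.0, -1.5], beta=0.1)
print(round(drifted, 3), round(anchored, 3))   # 0.78 0.98
```

Raising `beta` tightens the anchor (less drift, slower reward gains); lowering it trades stability for faster optimization, which is why the hyperparameter guide below suggests starting conservative.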
Metrics that matter (and which to avoid)
For short, constrained outputs like label strings, choose evaluation metrics that reflect decision correctness:
- Exact Match (EM) — strict token‑level match; useful for format‑sensitive tasks.
- Quasi‑EM — relaxed EM that tolerates minor formatting differences but still requires intent match.
- F1 / precision / recall — class‑level performance, important for unbalanced classes.
- Confusion matrix — reveals systematic misclassifications (for example, editable vs closeable posts are often borderline).
Avoid relying on BLEU or ROUGE for token‑limited label classification — they’re designed for long‑form generation and can be misleading here.
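For a three-label task, the class-level metrics above are cheap to compute without any evaluation framework. A minimal sketch:

```python
from collections import Counter

LABELS = ["HQ", "LQ_EDIT", "LQ_CLOSE"]

def per_class_f1(gold, pred):
    """Return {label: (precision, recall, f1)} plus the confusion counts."""
    confusion = Counter(zip(gold, pred))          # (gold, pred) -> count
    scores = {}
    for label in LABELS:
        tp = confusion[(label, label)]
        fp = sum(confusion[(g, label)] for g in LABELS if g != label)
        fn = sum(confusion[(label, p)] for p in LABELS if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores, confusion

gold = ["HQ", "HQ", "LQ_EDIT", "LQ_CLOSE", "LQ_EDIT"]
pred = ["HQ", "LQ_EDIT", "LQ_EDIT", "LQ_CLOSE", "LQ_CLOSE"]
scores, confusion = per_class_f1(gold, pred)
p, r, f1 = scores["HQ"]
print(round(p, 2), round(r, 2), round(f1, 2))   # 1.0 0.5 0.67
```

The `confusion` counter doubles as the confusion matrix, so one pass over the eval set yields everything listed above except the quasi-EM relaxation.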
Nova Forge SDK: what it automates for teams
The Nova Forge SDK wraps the repetitive plumbing so teams can iterate on datasets, rewards, and checkpoints:
- CSVDatasetLoader — validation and transformation helpers
- SMTJRuntimeManager — orchestrate SageMaker training jobs
- NovaModelCustomizer — launch training, evaluation, and deploy flows
- EvaluationTask (GEN_QA) — evaluation harness for classification‑style tasks
- CloudWatchLogMonitor and dry_run — validate infra and surface logs during runs
- Deployment adapters — target Amazon Bedrock (serverless/provisioned) or Amazon SageMaker AI Inference
Practical notes: the demo deployed to SageMaker for fine‑grained control, but Bedrock is an option when you prefer a serverless or provisioned managed inference target.
Operational considerations: cost, governance, and cleanup
- Costs: parameter‑efficient SFT reduces compute, but expect multi‑GPU hours for SFT and RFT. Use small pilots first and scale with validated gains.
- Security & IAM: reward Lambdas and SageMaker jobs need least‑privilege IAM roles. Audit which roles can invoke reward functions and access datasets.
- Monitoring: log training metrics, sample outputs, and reward distributions. Watch for signs of reward gaming.
- Cleanup: delete endpoints, training jobs, and unused IAM roles to avoid surprise charges.
Suggested starting hyperparameters (practical guide)
These are sensible starting points for similar experiments; tune them to your dataset and budget:
- LoRA rank: 8–16
- Learning rate (SFT): 1e‑4 to 3e‑4
- Batch size: as large as fits memory (gradient accumulation if needed)
- Epochs (SFT): 3–10 depending on dataset size
- RFT reward shaping: start binary (+1/−1), then consider margin or continuous scores if you need finer signal
- KL penalty: begin with a conservative weight and increase only if you see drift
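Collected into a single starting config for a pilot run (the values are picked from the ranges above; they are a starting point, not a tuned recipe):

```python
# Illustrative pilot configuration; every value should be tuned
# against your own dataset size and budget.
STARTING_CONFIG = {
    "lora_rank": 8,                  # try 8-16
    "sft_learning_rate": 2e-4,       # within the 1e-4 to 3e-4 range
    "gradient_accumulation": True,   # when the batch doesn't fit memory
    "sft_epochs": 3,                 # 3-10 depending on dataset size
    "rft_reward": "binary",          # +1 / -1 to start
    "kl_weight": 0.05,               # conservative; raise only on drift
}
```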
Reproducibility checklist for teams
- Record dataset split seeds and exact row counts used for SFT/RFT/eval.
- Store the SFT checkpoint used as the RFT init point.
- Version control your reward function and document its logic and edge cases.
- Log sample outputs and misclassifications for human review during training.
- Keep cost and runtime logs per job to estimate TCO for scaling.
Key questions & answers
- How much did customization improve performance?
  Exact‑match rose from 13.0% (pretrained Nova baseline) to 77.2% after parameter‑efficient SFT, and to 78.8% after adding RFT (quasi‑metrics reported near 80.6%).
- What pipeline produced those gains?
  Baseline evaluation → parameter‑efficient SFT (LoRA adapters) → RFT using an AWS Lambda reward (+1/−1) with KL regularization → evaluation → deployment (SageMaker or Bedrock).
- Is RFT always worth it?
  RFT is worth it when you have a reliable reward signal and need incremental calibration beyond SFT. It’s not a substitute for high‑quality labeled data and careful SFT.
- What tooling lowered the barrier?
  Nova Forge SDK components (CSVDatasetLoader, SMTJRuntimeManager, NovaModelCustomizer, evaluation harness, CloudWatch helpers) reduce the infra and orchestration burden so teams can iterate faster.
Risks and governance
Customization unlocks value but also increases responsibility: data lineage, training data privacy, role‑based access, and model drift monitoring are critical controls. Require approval for reward function changes, keep immutable logs of training runs, and set alerts for sudden shifts in evaluation metrics.
Next steps for practitioners
Clone the GitHub repo nova‑customization‑sdk (contains SDK, docs, and examples used in these experiments), run a small SFT pilot with LoRA adapters, validate baseline vs SFT gains, and only then experiment with RFT if you can define a stable reward. Keep cost and governance checks in your sprint plan.
Practical reminder from the experimenters: Nova Forge ties the plumbing together so teams can program the faucet, not the pipes.
Experiment contributors: Mahima Chaudhary, Anupam Dewan, and Swapneil Singh (Amazon Nova team).