From late entry to regional champion: fine‑tuning LLMs and business lessons from the AWS AI League ASEAN finals
Blix D. Foryasen showed that championships aren’t won by raw compute alone. Arriving late to the AWS AI League ASEAN, he turned careful data design, surgical hyperparameter sweeps and an adapter‑based tuning strategy into a regional win at the Singapore grand finale on May 29, 2025. The story is a compact case study in fine‑tuning LLMs under real constraints — exactly the situation many enterprise teams face when building AI agents, AI automation, or bespoke models for sales and support.
Why the AWS AI League matters for AI teams
The league packed hands‑on LLM fine‑tuning into a competition format across six ASEAN countries (Singapore, Indonesia, Malaysia, Thailand, Vietnam and the Philippines). It used Amazon SageMaker JumpStart and PartyRock (on Amazon Bedrock) to expose students to foundation models, responsible AI and prompt engineering. That setup mirrors practical enterprise tradeoffs: limited compute budgets, strict submission windows, and the need to optimize for both automated and human evaluation. The lessons are directly portable to teams deciding whether to tune small, private models or keep calling large foundation models via API.
The constraints and scoring that shaped strategy
- Tools: SageMaker JumpStart (training), PartyRock/Amazon Bedrock (data generation and model access).
- Target student model: Llama 3.2 3B Instruct (a small, cost‑efficient base).
- Teacher/evaluator models: Claude 3.5 Sonnet for question generation, DeepSeek R1 for long answers, and other Bedrock models available during the competition.
- Compute limits: initial cap was 5 hours per participant (later raised to 30 hours), and the number of submissions was capped.
- Finale scoring split: 40% LLM‑as‑judge (automated evaluator), 40% expert judges, 20% live audience.
- Live finale constraints: 200‑token response cap and limited inference knobs (temperature, top‑p, context length, system prompts).
Definitions for quick reference:
- LoRA — a lightweight adapter technique that changes only small, low‑rank parts of a model instead of retraining the whole thing; a minimal numeric sketch follows this list.
- Adapter / target_modules — the specific layers (e.g., attention projections) where LoRA applies its small parameter matrices.
- Epochs / learning rate — how many passes over the data and how big each model update is; both are the main levers for tuning under compute limits.
- Teacher‑student pipeline — generate high‑quality labels from a stronger “teacher” model, then fine‑tune a smaller “student” model with those labels.
- LLM‑as‑judge — using a large model to score or rank submissions automatically.
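To make the LoRA definition concrete, here is a minimal numpy sketch of the low‑rank update: instead of retraining a full weight matrix W, LoRA learns two small matrices A and B and adds their scaled product to the frozen W. The dimensions and rank below are illustrative, not the competition configuration.

```python
import numpy as np

# Frozen base weight of one attention projection (e.g., q_proj): d_out x d_in
d_out, d_in = 3072, 3072
W = np.random.randn(d_out, d_in).astype(np.float32)

# LoRA adds a low-rank update: only A and B are trained.
r, lora_alpha = 16, 32                                    # illustrative values, not the winning run
A = np.random.randn(r, d_in).astype(np.float32) * 0.01    # "down" projection
B = np.zeros((d_out, r), dtype=np.float32)                # "up" projection, initialized to zero

scaling = lora_alpha / r
W_effective = W + scaling * (B @ A)                       # what the model actually applies

# Trainable parameters shrink from d_out * d_in to r * (d_in + d_out).
print(W.size, A.size + B.size)                            # 9,437,184 vs 98,304
```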
The teacher‑student playbook and practical tooling
Blix adopted a common, production‑grade pattern: use stronger teacher models to synthesize reasoning‑rich answers, then fine‑tune a smaller model via LoRA for deployment efficiency. For question generation he used PartyRock (Claude 3.5 Sonnet). For long, chain‑of‑thought answers he relied on DeepSeek R1, producing responses that averaged ~900 tokens — a deliberately higher‑quality target than short labels would provide.
Collecting large teacher outputs can be rate‑limited by provider APIs. Blix managed throughput by batching requests, using available paid tiers, and prioritizing which examples needed long, reasoning‑rich labels. Enterprise teams should make the same choice: pay for capacity or batch the work rather than trying to bypass limits.
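A minimal sketch of the teacher step, assuming the AWS SDK for Python (boto3) and Bedrock's Converse API; the model ID, prompt and region below are placeholders rather than the exact values used in the competition.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder model ID; substitute whichever teacher model your account can access.
TEACHER_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def generate_teacher_answer(question: str) -> str:
    """Ask the stronger teacher model for a long, reasoning-rich answer."""
    response = bedrock.converse(
        modelId=TEACHER_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": question}]}],
        system=[{"text": "Answer step by step, showing your reasoning before the final answer."}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.7},
    )
    return response["output"]["message"]["content"][0]["text"]

# Batch questions to stay within provider rate limits instead of hammering the API.
questions = ["How many times does the letter 'r' appear in 'strawberry'?"]
pairs = [{"question": q, "answer": generate_teacher_answer(q)} for q in questions]
```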
What moved the needle: dataset design, hyperparameters, and LoRA
Dataset size alone didn’t win. Blix’s runs progressed through these dataset sizes: 1,500 → 3,500 → 5,500 → 6,700 → 8,500 → 12,000 rows. Judge‑score percentages on the elimination leaderboard moved roughly as follows: an early 53% (13th submission), up to 57% (16th), down to 42% when quantity outpaced label quality, and eventually up to 65% and then 77% to top the elimination leaderboard. (Those percentages are the competition judge scores used to rank submissions.)
Key turning points:
- Shift from high‑volume, shorter answers to fewer but longer, reasoning‑rich answers generated by DeepSeek R1 (~900 tokens); a data‑filtering sketch follows this list.
- Lowering the learning rate and increasing effective training time: final successful runs used LR ≈ 0.00008 instead of earlier, higher rates.
- Expanding LoRA capacity and scope: larger lora_r and lora_alpha values and targeting more attention/FFN modules gave the 3B model more expressivity without full fine‑tuning.
- Prompt engineering for the live finale: compact, explicit formats and chain‑of‑thought hints helped models succeed under the 200‑token cap.
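A sketch of that quality‑over‑quantity step under assumptions about the record layout: keep only teacher answers that are long enough and show visible reasoning, then write the survivors as JSONL for the fine‑tuning job. The thresholds, heuristics and field names are illustrative; `pairs` is the list of question/answer records from the teacher step above.

```python
import json

MIN_ANSWER_TOKENS = 400          # crude proxy for "reasoning-rich"; tune to your data

def rough_token_count(text: str) -> int:
    # Whitespace split is a rough stand-in for the real tokenizer.
    return len(text.split())

def keep(example: dict) -> bool:
    answer = example["answer"]
    long_enough = rough_token_count(answer) >= MIN_ANSWER_TOKENS
    has_reasoning = "step" in answer.lower() or "because" in answer.lower()
    return long_enough and has_reasoning

filtered = [ex for ex in pairs if keep(ex)]

with open("train.jsonl", "w") as f:
    for ex in filtered:
        # Instruction-style record; match whatever format your fine-tuning job expects.
        f.write(json.dumps({"instruction": ex["question"], "response": ex["answer"]}) + "\n")

print(f"kept {len(filtered)} of {len(pairs)} teacher examples")
```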
“Prioritizing high‑quality, well‑structured training examples beats simply adding more rows.” — Blix D. Foryasen
Advanced hyperparameter snapshot
- Epochs explored: 1–4 (practical sweet spots depended on dataset size).
- Learning rates tried: 0.0001–0.0004; initial best was 2 epochs + LR 0.0003; final winner used LR 0.00008.
- LoRA settings on the winning run: lora_r = 256, lora_alpha = 256, with expanded target_modules including q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
- Synthetic answer strategy: teacher answers averaged ~900 tokens; question generation used Claude 3.5 Sonnet.
Practical LoRA rules of thumb: lora_alpha is often set to ~1–2× lora_r, and target_modules usually focus on attention and feed‑forward layers where representational power matters most.
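The reported winning settings map onto Hugging Face PEFT roughly as follows. Only lora_r, lora_alpha, the target modules, the final learning rate and the explored epoch range come from the write‑up above; the base model ID (gated on the Hugging Face Hub) and the dropout value are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

lora_config = LoraConfig(
    r=256,                       # lora_r from the winning run
    lora_alpha=256,              # lora_alpha from the winning run
    lora_dropout=0.05,           # assumption; not reported
    target_modules=[             # expanded attention + FFN modules from the winning run
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# The final run reportedly used a low learning rate (~8e-5), with 1-4 epochs explored.
LEARNING_RATE = 8e-5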
Live‑final tactics and prompt engineering
The 200‑token cap in Singapore forced teams to compress reasoning into compact formats. For character‑level challenges (the “Strawberry Problem” type), the tactics that worked (a prompt sketch follows the list) included:
- Explicit separators and spelled‑out tokens to remove ambiguity.
- Short chain‑of‑thought cues: a one‑line reasoning scaffold leading to the final answer.
- Assertive format hints: e.g., “Answer in this exact format: X;Y;Z” to reduce hallucinations.
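A sketch of the kind of compact prompt these tactics suggest; the exact wording used in the finale is not published, so the template, system prompt and inference settings below are illustrative.

```python
SYSTEM_PROMPT = (
    "You are a precise assistant. Think in at most two short sentences, "
    "then answer in this exact format: ANSWER: <value>"
)

def build_prompt(question: str) -> str:
    # Spell out the word with explicit separators to remove character-level ambiguity.
    word = "strawberry"
    spelled = " - ".join(word.upper())
    return (
        f"{question}\n"
        f"The word spelled out: {spelled}\n"
        "Count carefully, then reply in the required format."
    )

# Inference knobs kept conservative to fit the 200-token response cap.
generation_kwargs = {"max_new_tokens": 200, "temperature": 0.2, "top_p": 0.9}

print(build_prompt("How many times does the letter 'r' appear in 'strawberry'?"))
```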
These prompt patterns are useful for product teams shipping customer‑facing agents where predictable, auditable output formats matter as much as raw intelligence.
When LLM evaluators and humans disagree
LLM‑as‑judge tends to reward comprehensiveness, explicit structure, and coverage of edge cases. Human judges (and customers) often value brevity, tone and relatability. That mismatch shaped dataset design: many teams wrote answers optimized for automated scoring, which sometimes reduced human appeal.
Example (illustrative):
LLM evaluator preference: “Step 1: Identify ingredients. Step 2: Count characters A–Z. Step 3: Output counts in table form.”
Human preference: “You need the counts only — here’s a short, friendly tally.”
Recommendation: always validate with both automated evaluators and human review. Use LLM scoring for fast iteration, but keep a human‑sample audit to catch tone, bias and usability issues before deployment.
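A sketch of how a team might wire that recommendation together: score every example with an LLM judge for fast iteration, and route a random sample to human reviewers. The judge model ID, rubric and sample rate are assumptions, and the number‑parsing step is a simplification.

```python
import random
import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # placeholder judge model

RUBRIC = (
    "Score the answer from 0 to 100 for correctness, structure and coverage. "
    "Reply with the number only."
)

def llm_judge_score(question: str, answer: str) -> float:
    """Ask a strong model to grade one answer against the rubric."""
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user",
                   "content": [{"text": f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0

def evaluate(examples: list[dict], human_sample_rate: float = 0.1):
    scores, human_queue = [], []
    for ex in examples:
        scores.append(llm_judge_score(ex["question"], ex["answer"]))
        # Route a random sample to human reviewers for tone, bias and usability checks.
        if random.random() < human_sample_rate:
            human_queue.append(ex)
    return sum(scores) / len(scores), human_queue
```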
Business implications: LoRA vs full fine‑tuning vs API calls
Which approach is right depends on cost, latency, governance and scale:
- LoRA / adapter tuning — Best when you need a private, low‑latency, cost‑predictable model tuned to domain tasks. Lower training cost than full fine‑tuning and better governance control than third‑party APIs.
- Full fine‑tuning — Use when you can afford heavy compute and need deep changes to model behavior or custom safety mechanisms baked into weights.
- Large model APIs — Fastest to market and great for varied or one‑off tasks. Economically attractive until call volume and latency requirements favor a tuned student model.
Quick decision guide: if you expect high monthly call volume, strict privacy or low latency, favor adapter‑based tuning (LoRA). If you need occasional complex reasoning and don’t want operational overhead, prefer API access to a large model.
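A back‑of‑envelope comparison teams can adapt to their own situation; every number below is an assumption to be replaced with your actual provider pricing and hosting costs.

```python
# All figures are illustrative assumptions; plug in your own pricing.
calls_per_month = 2_000_000
avg_tokens_per_call = 1_000                      # input + output combined

api_cost_per_1k_tokens = 0.01                    # large-model API, blended rate
tuned_hosting_per_month = 1_200.00               # one small GPU endpoint for the 3B student
one_time_tuning_cost = 300.00                    # single LoRA training job

api_monthly = calls_per_month * avg_tokens_per_call / 1_000 * api_cost_per_1k_tokens
tuned_monthly = tuned_hosting_per_month          # marginal per-call cost assumed negligible

# Guard against the case where the tuned endpoint is not cheaper at this volume.
months_to_break_even = one_time_tuning_cost / max(api_monthly - tuned_monthly, 1e-9)

print(f"API: ${api_monthly:,.0f}/mo  tuned: ${tuned_monthly:,.0f}/mo  "
      f"break-even after {months_to_break_even:.2f} months")
```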
Practical playbook for teams (actionable)
- Start with a teacher‑student prototype: generate 500–2,000 high‑quality Q&A pairs from a stronger model before scaling up.
- Prioritize label quality: long, reasoning‑rich answers beat noisy short labels; sample and human‑review teacher outputs.
- Sweep small, not wide: tune epochs, learning rate and LoRA r/alpha in narrow ranges under your compute cap.
- Track experiments: log dataset version, prompts, hyperparameters and judge scores so improvements are reproducible (see the logging sketch after this list).
- Combine evaluators: use LLM scoring for fast feedback and human audits for UX and alignment checks.
- Plan for governance: capture origin, model versions, and review samples of synthetic labels for traceability.
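A minimal experiment‑logging sketch using plain JSON files; swap in whatever tracking tool the team already uses. The field names and example values are illustrative.

```python
import json
import time
from pathlib import Path

def log_run(dataset_version: str, prompt_template: str,
            hyperparameters: dict, judge_score: float,
            log_dir: str = "experiments") -> Path:
    """Append one reproducible record per fine-tuning run."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "dataset_version": dataset_version,
        "prompt_template": prompt_template,
        "hyperparameters": hyperparameters,
        "judge_score": judge_score,
    }
    path = Path(log_dir)
    path.mkdir(exist_ok=True)
    out = path / f"run_{int(time.time())}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

log_run(
    dataset_version="v6-12000-rows",
    prompt_template="compact-cot-v3",
    hyperparameters={"epochs": 2, "learning_rate": 8e-5, "lora_r": 256, "lora_alpha": 256},
    judge_score=77.0,
)
```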
Responsible AI and operational checklist
- Record provenance for synthetic labels (model, prompt, timestamp, and sampling parameters); a provenance record sketch follows this checklist.
- Sample teacher outputs for human review to catch bias and hallucinations.
- Use provider‑approved quotas or paid tiers to scale label generation ethically.
- Run bias and safety tests on both teacher and student outputs; document mitigations.
- Keep experiment logs and model cards that record intended use and limitations.
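For the provenance item, a record like the one below can be attached to every synthetic label and stored alongside each JSONL row; the field and class names are illustrative, and the model ID is a placeholder.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LabelProvenance:
    teacher_model: str          # exact model ID/version that produced the label
    prompt: str                 # prompt or prompt-template ID used
    timestamp: str              # when the label was generated
    temperature: float          # sampling parameters, kept for auditability
    max_tokens: int
    reviewed_by_human: bool = False

prov = LabelProvenance(
    teacher_model="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder ID
    prompt="long-reasoning-qa-v2",
    timestamp=datetime.now(timezone.utc).isoformat(),
    temperature=0.7,
    max_tokens=1024,
)
print(asdict(prov))
```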
“Exchanging tactics with other finalists was one of the most valuable moves — collaboration amplified practical learning.” — Blix D. Foryasen
Blix’s path from late entry to champion is a useful microcosm for product teams: tooling matters, but discipline matters more. With SageMaker JumpStart, Amazon Bedrock/PartyRock and techniques like LoRA, small teams can build tuned, private models that outperform off‑the‑shelf calls — if they invest in dataset quality, careful hyperparameter tuning, experiment tracking and human evaluation.
Short term: run a small teacher‑student pilot and measure both automated scores and human satisfaction. Medium term: pick an adapter strategy when volume or governance favors it. Long term: build reproducible pipelines that balance synthetic label generation with human audits. That’s how you turn a constrained experiment into repeatable business value.