Train AI agents to search, not transcribe — MMProLong’s QA approach for long-document retrieval

Teach models to search, not transcribe: how MMProLong tames very long documents

TL;DR

  • MMProLong (ByteDance Seed + HKUST) trains multimodal, long‑context models with synthetic question–answer (QA) pairs instead of page‑by‑page transcription, improving the model’s ability to locate relevant passages (an average +29.4 points on Needle‑in‑a‑Haystack over the Qwen2.5‑VL‑7B base).
  • QA‑style supervision teaches retrieval/navigation inside long documents—where most failures happen—while OCR/transcription objectives can actually harm long‑context performance.
  • Trained at a modest context length (~128k tokens, roughly tens to a few hundred pages or ~10 hours of spoken transcript), MMProLong stayed stable at 256k–512k token inputs and beat much larger open models on the Needle‑in‑a‑Haystack benchmark.
  • For product teams building enterprise search, legal automation, or video indexing, investing in QA‑style labels and a mixed-length training set is often a higher‑ROI lever than bigger models or complex token‑compression architectures.

Why long documents still break AI agents

Enterprise use cases (contracts, compliance reports, training libraries, long meeting transcripts, and videos) demand that AI agents find a single relevant passage inside hundreds or thousands of pages. That’s a different skill from reading or reasoning about a short snippet. Most vision‑language models (multimodal models that understand both images and text) struggle not because they can’t reason, but because they can’t reliably find the right place to reason about.

Quick definitions: LMM = large multimodal model (handles text + images; the focus here is on long contexts). QA = question–answer training, where the model is trained to locate and extract answers from a document. OCR = optical character recognition; as a training objective, it means transcribing page content character by character. RAG = retrieval‑augmented generation, where a model uses an external search or vector store to pull context at inference.

The core insight: train for search, not full transcription

ByteDance Seed and HKUST built MMProLong on Alibaba’s Qwen2.5‑VL backbone and used Seed 2.0 to auto‑generate QA pairs across documents. Instead of forcing the model to transcribe every page (an OCR objective), they trained it to answer targeted questions that require locating the correct passage. The difference is like teaching a lawyer to cite the clause you need, rather than training them to copy out every page of a 100‑page contract.

“Transcribing every page doesn’t teach a model where to look in a 100‑page document—asking targeted questions does.”

The result: MMProLong averaged a +29.4 point lift on the Needle‑in‑a‑Haystack long‑context benchmark compared to the Qwen2.5‑VL‑7B base. It also remained stable when fed input sizes of 256k and 512k tokens—regimes where the base model’s performance dropped off. Those gains transferred to other models (e.g., Qwen3‑VL‑8B) and to use cases like long‑video understanding.
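For orientation, a needle‑in‑a‑haystack probe hides one known fact at a chosen depth inside long filler and checks whether the model can retrieve it. The sketch below is a text‑only simplification (the benchmark discussed here is multimodal), and every name in it is hypothetical: `answer_fn` stands in for whatever model call you use, and `keyword_baseline` is a trivial placeholder, not a real model.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler lines."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return "\n".join(filler[:position] + [needle] + filler[position:])


def run_probe(
    answer_fn: Callable[[str, str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """Return, for each depth, whether the reply contains the expected answer."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        reply = answer_fn(context, question)
        results[depth] = expected.lower() in reply.lower()
    return results


def keyword_baseline(context: str, question: str) -> str:
    """Placeholder for a model call (assumption): returns the first line with a keyword."""
    for line in context.split("\n"):
        if "termination notice" in line.lower():
            return line
    return ""


filler = [f"Section {i}: general provisions apply." for i in range(1000)]
needle = "The termination notice period is 45 days."
scores = run_probe(
    keyword_baseline,
    filler,
    needle,
    question="What is the termination notice period?",
    expected="45 days",
    depths=[0.0, 0.5, 1.0],
)
```

Sweeping both the depth and the haystack length gives a grid of pass/fail cells, which makes it easy to see at what context size retrieval starts to degrade.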

Why QA supervision works

  • It teaches retrieval behavior: QA tasks force the model to locate the relevant span inside long context windows, building the search/navigation skill that’s the real bottleneck.
  • Signal is focused: Instead of punishing every character‑level mistake across dozens of pages, QA rewards finding the precise passage that answers a question.
  • Data diversity matters: Training on a mix of short and long examples produces more robust long‑context behavior than training only on maximal‑length documents.
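The data‑diversity point can be sketched as a length‑bucketed sampler. This is a minimal illustration, not the recipe used in the study: the bucket thresholds, the sampling weights, and the `num_tokens` field are all assumptions to tune for your own corpus.

```python
import random
from typing import Dict, List


def bucket_by_length(
    docs: List[dict], short_max: int = 8_000, medium_max: int = 64_000
) -> Dict[str, List[dict]]:
    """Group documents into short/medium/long buckets by token count (thresholds are assumptions)."""
    buckets: Dict[str, List[dict]] = {"short": [], "medium": [], "long": []}
    for doc in docs:
        n = doc["num_tokens"]
        if n <= short_max:
            buckets["short"].append(doc)
        elif n <= medium_max:
            buckets["medium"].append(doc)
        else:
            buckets["long"].append(doc)
    return buckets


def sample_mixed(
    buckets: Dict[str, List[dict]], weights: Dict[str, float], n: int, seed: int = 0
) -> List[dict]:
    """Draw n training examples, choosing a bucket by weight and then a document uniformly."""
    rng = random.Random(seed)
    names = [name for name in weights if buckets.get(name)]  # skip empty buckets
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(buckets[name]))
    return batch


docs = [
    {"id": i, "num_tokens": t}
    for i, t in enumerate([500, 2_000, 30_000, 120_000, 200_000])
]
buckets = bucket_by_length(docs)
# Illustrative weights only: half short, the rest split between medium and long.
batch = sample_mixed(buckets, {"short": 0.5, "medium": 0.25, "long": 0.25}, n=8)
```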

“A model’s failure point is locating the relevant passage, not reasoning about it once found.”

What product teams should do today

For any team building AI for document understanding—contract analysis, compliance monitoring, enterprise search, or long‑form video indexing—this research points to a practical roadmap: prioritize label and task design over immediate architectural rewrites. Smaller models with the right supervision can often outperform much larger ones on long‑document tasks, lowering inference cost and speeding productization.

How to implement: a 7‑step checklist

  1. Build or fine‑tune a synthetic QA generator: start with a small LLM to produce context‑aware QAs for your document corpus (Seed 2.0 played that role in the study).
  2. Create a balanced corpus: include short, medium, and long documents rather than only maximal‑length examples. As a starting point, keep short examples at roughly 30–70% of the mix and tune the ratio through ablations.
  3. Weight extraction tasks higher: prioritize span‑extraction/QA over calculation or generative objectives during fine‑tuning.
  4. Validate labels with humans: sample 1–5% of synthetic QAs for human review to check domain fit and reduce hallucination risk.
  5. Run ablations early: compare QA‑only vs. OCR‑only vs. mixed objectives at several context sizes (64k/128k/256k tokens).
  6. Integrate lightweight RAG: for production, combine the model’s long‑context skill with vector stores or index sharding to handle extreme scale or low‑latency requirements.
  7. Monitor retrieval KPIs: track span‑extraction F1, retrieval precision@k, end‑to‑end task success, latency, and cost per inference.
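Steps 4 and 5 of the checklist are easy to script up front. The sketch below only plans the work: it enumerates the objective × context‑size ablation grid and draws a reproducible review sample. The function names, the config keys, and the 2% default are assumptions for illustration, and the actual training and evaluation calls are left out.

```python
import itertools
import random
from typing import List

OBJECTIVES = ["qa_only", "ocr_only", "mixed"]
CONTEXT_SIZES = [64_000, 128_000, 256_000]  # tokens, as in step 5


def ablation_grid(objectives: List[str], context_sizes: List[int]) -> List[dict]:
    """Enumerate every (objective, context size) pair as a run configuration."""
    return [
        {"objective": obj, "context_tokens": ctx}
        for obj, ctx in itertools.product(objectives, context_sizes)
    ]


def review_sample(qa_pairs: List[dict], fraction: float = 0.02, seed: int = 0) -> List[dict]:
    """Pick a reproducible random subset of synthetic QAs for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not qa_pairs:
        return []
    k = max(1, round(fraction * len(qa_pairs)))
    return random.Random(seed).sample(qa_pairs, k)


runs = ablation_grid(OBJECTIVES, CONTEXT_SIZES)
qa_pairs = [{"id": i, "question": f"q{i}", "answer": f"a{i}"} for i in range(1000)]
to_review = review_sample(qa_pairs, fraction=0.02)
```

Fixing the seed keeps the review set stable across reruns, so reviewer feedback can be compared between label‑generation iterations.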

Two prompt templates for synthetic QA generation (examples)

  • Extraction template: “Given the following document excerpt, write a question whose answer is a single contiguous sentence in the text, and provide that exact sentence as the answer.”
  • Locational template: “Create a question whose answer refers to a specific clause or figure in the document. Indicate the minimal passage needed to answer and return the passage as the answer.”
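One way to use the extraction template is to pair it with a cheap grounding check: keep a generated QA only if its answer appears verbatim in the excerpt. The sketch below assumes the generator's output has already been parsed into dictionaries with `question` and `answer` keys; the generator call itself is omitted, and the helper names are hypothetical.

```python
from typing import List

EXTRACTION_TEMPLATE = (
    "Given the following document excerpt, write a question whose answer is a "
    "single contiguous sentence in the text, and provide that exact sentence as "
    "the answer.\n\nExcerpt:\n{excerpt}"
)


def build_prompt(excerpt: str) -> str:
    """Fill the extraction template with one document excerpt."""
    return EXTRACTION_TEMPLATE.format(excerpt=excerpt)


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def is_grounded(qa: dict, excerpt: str) -> bool:
    """True if the answer is non-empty and appears verbatim in the excerpt."""
    answer = normalize(qa.get("answer", ""))
    return bool(answer) and answer in normalize(excerpt)


def filter_grounded(candidates: List[dict], excerpt: str) -> List[dict]:
    """Drop synthetic QAs whose answers are not exact spans of the excerpt."""
    return [qa for qa in candidates if is_grounded(qa, excerpt)]


excerpt = (
    "Either party may terminate this agreement with 45 days of written notice.\n"
    "Fees are invoiced quarterly."
)
candidates = [
    {"question": "How are fees invoiced?", "answer": "Fees are invoiced quarterly."},
    {"question": "What is the notice period?", "answer": "Notice is about six weeks."},
]
prompt = build_prompt(excerpt)
kept = filter_grounded(candidates, excerpt)
```

An exact‑match filter is deliberately strict: it discards paraphrased answers, which is the behavior you want when the goal is span extraction rather than free‑form generation.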

Short case studies — how QA supervision changes outcomes

Contract review (legal automation)

Before: an enterprise search agent transcribed entire contracts and tried to reason across them, returning low‑precision results and requiring heavy human triage. After: a QA‑trained model quickly locates the exact clause that answers a compliance question, reducing attorney review time and lowering false positives. The business win: faster turnaround and lower billable hours without upgrading the model size.

Long‑form video indexing

Before: indexing full video transcripts produced noisy, slow search. After: synthetic QA on scene transcripts teaches the model to jump to exact timestamps and frames that answer queries, improving retrieval precision and enabling better downstream summarization and clip extraction.
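Jumping to a timestamp can be as simple as mapping an extracted answer span back to the transcript segment it starts in. The sketch below assumes a hypothetical segment format (`start_seconds` plus `text`) and an answer that appears verbatim in the transcript; it covers text only, not frame lookup.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def build_index(segments: List[dict]) -> Tuple[str, List[int]]:
    """Join segment texts with spaces and record each segment's starting character offset."""
    offsets, parts, cursor = [], [], 0
    for seg in segments:
        offsets.append(cursor)
        parts.append(seg["text"])
        cursor += len(seg["text"]) + 1  # +1 for the joining space
    return " ".join(parts), offsets


def timestamp_for_answer(segments: List[dict], answer: str) -> Optional[float]:
    """Return the start time of the segment where the answer span begins, or None."""
    text, offsets = build_index(segments)
    pos = text.find(answer) if answer else -1
    if pos < 0:
        return None
    seg_idx = bisect_right(offsets, pos) - 1  # last segment starting at or before pos
    return segments[seg_idx]["start_seconds"]


segments = [
    {"start_seconds": 0.0, "text": "Welcome to the onboarding session."},
    {"start_seconds": 12.5, "text": "Badge access is granted after security training."},
    {"start_seconds": 31.0, "text": "Questions go to the help desk."},
]
hit = timestamp_for_answer(segments, "Badge access is granted")   # 12.5
miss = timestamp_for_answer(segments, "parking permits")          # None
```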

When QA supervision isn’t enough (limitations and governance)

  • Domain mismatch: Synthetic QA quality matters. For specialized or adversarial corpora (legal, medical, regulated data), automatic labels need careful domain tuning and human validation to avoid noisy or misleading supervision.
  • PII and privacy: Generating QA pairs from sensitive documents can expose private data in training artifacts—apply redaction and strict data governance before generating synthetic labels.
  • Extreme scale and cost tradeoffs: For >1M token windows or when raw token bandwidth is the bottleneck, architecture‑level compression (visual‑token reordering/compression) may still be necessary alongside data‑centric methods.
  • Label noise and hallucinations: Monitor for hallucinated answers in synthetic QAs; maintain human‑in‑the‑loop validation for high‑risk outputs.
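For the PII point above, a redaction pass can run before any text reaches the QA generator. The patterns below are illustrative assumptions (email addresses plus US‑style phone and social security numbers) and are nowhere near exhaustive; a regulated deployment would need a dedicated PII detection tool and review.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only (assumption): real corpora need broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count replacements per label."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


raw = "Contact jane.doe@example.com or 555-123-4567. SSN 123-45-6789."
clean, counts = redact(raw)
# clean == "Contact [EMAIL] or [PHONE]. SSN [SSN]."
```

Logging the per‑label counts gives a quick audit trail of how much was redacted from each document before label generation.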

Suggested KPIs for a production rollout

  • Span‑extraction F1 on held‑out long documents
  • Retrieval precision@k for needle‑finding queries
  • End‑to‑end task success rate (user task completion)
  • Latency and cost per query at target context sizes
  • Human correction rate (for high‑risk docs)
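The first two KPIs are straightforward to compute. The sketch below uses a common token‑overlap definition of span F1 (whitespace tokens, lowercased) and the standard precision@k; the exact tokenization and the passage‑ID format are assumptions, so align them with your own evaluation harness.

```python
from collections import Counter
from typing import List, Set


def span_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and the gold span."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved passages that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for pid in retrieved_ids[:k] if pid in relevant_ids)
    return hits / k


f1 = span_f1("notice period is 45 days", "the notice period is 45 days")  # 10/11
p_at_2 = precision_at_k(["p3", "p9", "p1"], {"p1", "p3"}, k=2)            # 0.5
```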

Key takeaways

  • Teaching models to search (QA/extraction) builds the navigation skills needed for long‑context multimodal tasks; transcription alone doesn’t.
  • With a focused, data‑centric recipe and a balanced corpus, even smaller models can remain robust at very large input sizes (256k–512k tokens) and outperform bigger models on long‑document benchmarks.
  • Start with synthetic QA generation, validate with humans for domain fit, and integrate RAG for production scale. Prioritize these steps before investing heavily in model size or complex token compression.

Next step

Pick one high‑value document workflow—an outstanding contract backlog, a compliance audit set, or a library of long training videos—and run a two‑week proof of value: generate synthetic QAs, fine‑tune a small LMM using the checklist above, and compare retrieval precision and task success against your current pipeline. Often the fastest path to reliable long‑document AI is smarter supervision, not a bigger model.

“With a focused training recipe, even smaller models can stay robust at very large input sizes—up to half a million tokens.”

Further reading: Look for the MMProLong work from ByteDance Seed and HKUST, materials on Qwen vision‑language models, and benchmarks like Needle‑in‑a‑Haystack to compare long‑context performance.