PaperOrchestra: Google Cloud AI agents draft submission-ready LaTeX research papers

TL;DR: PaperOrchestra is a multi-agent AI system from Google Cloud AI Research that converts messy idea notes and raw experiment logs into formatted LaTeX drafts. It coordinates five specialized AI agents (outline, plots, literature review, section writing, and iterative refinement), verifies citations via Semantic Scholar, and runs a simulated peer-review loop. Results on a 200-paper benchmark show citation coverage and manuscript quality much closer to human norms than prior automated systems. It’s a powerful drafting assistant—fast and efficient—but must be used with strict human verification and governance.

Quick glossary

  • LLM — Large language model (e.g., GPT, Gemini).
  • VLM — Vision-and-language model, for checking and critiquing images.
  • Ablation — A study that removes a component to measure its impact.
  • SxS — Side‑by‑side automated comparisons between two drafts.
  • P1 Recall — Recall over the set of references judged “good-to-cite”; higher values mean more of the genuinely relevant literature was retrieved.
  • LaTeX / BibTeX — Standard formatting and bibliography systems used in academic submissions.
  • AgentReview — The simulated peer‑review environment used by PaperOrchestra’s refinement agent.

Why this matters for R&D leaders

Researchers don’t hand over perfect datasets and BibTeX files. They hand over half‑formed ideas, messy plots, and experimental logs. That’s the practical gap PaperOrchestra tackles: a modular ensemble of AI agents designed to slot into real research workflows and produce near‑submission drafts in under an hour. For R&D managers and C-suite leaders, that translates into faster drafting cycles, more consistent citation practices, and a measurable reduction in the grunt work of manuscript assembly—provided governance keeps pace.

PaperOrchestra is built to transform the messy pre-writing artifacts researchers actually produce into a submission-ready manuscript, without running any experiments itself.

How the AI agents coordinate (the pipeline)

The pipeline splits the job into five focused agents that communicate and iterate:

  • Outline agent — Structures the paper (sections, headings, and narrative flow).
  • Plotting agent — Uses PaperBanana plus a VLM critic (an image-and-text model) to generate and refine publication-quality figures.
  • Literature Review agent — Searches the web and queries the Semantic Scholar API, applies fuzzy matching and temporal cutoffs, and enforces that most retrieved papers are actually cited.
  • Section Writing agent — Generates LaTeX‑formatted prose for each section from the outline and inputs.
  • Content Refinement agent — Runs AgentReview, a simulated peer-review loop, and accepts only those revisions that do not reduce the overall score.
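
As an illustration, the Literature Review agent’s filtering step (fuzzy title matching plus a temporal cutoff) might look roughly like the sketch below. This is a minimal Python illustration, not the actual implementation: the record fields and thresholds are assumptions, and the candidates are shown as already-fetched records rather than live Semantic Scholar API responses.

```python
from difflib import SequenceMatcher

def filter_candidates(candidates, query_title, year_cutoff, min_similarity=0.6):
    """Keep candidates published on or before the cutoff whose titles
    fuzzily match the query above a similarity threshold."""
    kept = []
    for paper in candidates:
        if paper["year"] > year_cutoff:  # temporal cutoff: drop too-recent work
            continue
        sim = SequenceMatcher(None, query_title.lower(),
                              paper["title"].lower()).ratio()
        if sim >= min_similarity:
            kept.append((sim, paper))
    # Highest-similarity matches first
    return [p for _, p in sorted(kept, key=lambda x: -x[0])]

# Toy candidate records (fields are assumed, not the real API schema)
candidates = [
    {"title": "Multi-Agent Systems for Paper Drafting", "year": 2024},
    {"title": "Multi-Agent Paper Drafting Systems", "year": 2026},  # past cutoff
    {"title": "Protein Folding at Scale", "year": 2023},            # off-topic
]
matches = filter_candidates(candidates, "multi-agent systems for paper drafting", 2025)
```

In practice the candidate list would come from web search and the Semantic Scholar API, and the surviving matches would feed the coverage check that most retrieved papers are actually cited.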

Two agents run in parallel to speed throughput. The modular design is deliberate: specialization beats a single monolithic prompt when the task demands different skills (drafting, figure design, citation hunting, and critique).

Multi-agent specialization outperforms single large-prompt solutions; the pipeline’s modular agents and iterative loop achieve results a single prompt cannot.
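
The Content Refinement agent’s accept-only-non-regressing rule can be sketched as a simple hill-climbing loop. The scorer and revision function below are toy stand-ins assumed for illustration; in PaperOrchestra the score would come from the AgentReview simulated-review loop.

```python
def refine(draft, propose_revision, score, max_rounds=3):
    """Iteratively propose revisions, keeping each one only if the
    simulated-review score does not decrease (hill-climbing style)."""
    best_score = score(draft)
    for _ in range(max_rounds):
        candidate = propose_revision(draft)
        candidate_score = score(candidate)
        if candidate_score >= best_score:  # accept only non-regressing edits
            draft, best_score = candidate, candidate_score
    return draft, best_score

# Toy stand-ins: the "score" is word count capped at 10, and each
# "revision" appends one word. Real scoring would be a review simulation.
score = lambda d: min(len(d.split()), 10)
revise = lambda d: d + " detail"
final, s = refine("initial draft text", revise, score, max_rounds=4)
```

The key design choice is the non-strict inequality: revisions that merely hold the score steady are still accepted, which lets stylistic edits through while blocking regressions.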

A simple example

Scenario: a postdoc hands the system a 200‑word idea summary and two messy CSVs of results. In roughly 40 minutes and about 60–70 LLM API calls, PaperOrchestra can produce a formatted LaTeX draft with figures, ~45–48 citations, and a refined manuscript that has passed an internal simulated-review checkpoint, ready for the authors to verify experiments, refine claims, and finalize the submission in Overleaf.

Evidence and metrics that matter

PaperOrchestra was evaluated on PaperWritingBench, a 200‑paper benchmark composed of anonymized accepted CVPR 2025 and ICLR 2025 manuscripts. Each target paper was paired with two reconstructed inputs: a Sparse Idea Summary and a Dense Idea Summary plus experimental logs.

  • Citation coverage: PaperOrchestra produced ~45.7–48.0 citations per paper (AI baselines averaged ~9.8–14.2; human papers ~59). P1 Recall improved by ~12.6%–13.8% over the best baselines—meaning the system retrieves more of the genuinely relevant literature.
  • Manuscript quality: In automated side‑by‑side judging with Gemini‑3.1‑Pro and GPT‑5, the system’s literature reviews achieved win rates of ~88%–99% against AI baselines; overall manuscript quality beat strong baselines by large margins (e.g., ~39%–86% vs. AI Scientist‑v2).
  • Human evaluation: Eleven AI researchers judged 180 pairwise comparisons. PaperOrchestra’s literature reviews were preferred by ~50%–68% margins over baselines; overall manuscripts were preferred by ~14%–38%.
  • Simulated acceptance: Using ScholarPeer, PaperOrchestra achieved simulated acceptance rates of 84% (CVPR) and 81% (ICLR). For comparison, the human-authored ground truth in the simulation was 86% and 94% respectively.
  • Refinement is critical: Ablations show the refinement (simulated peer-review) step adds +19% (CVPR) and +22% (ICLR) to simulated acceptance probability; refined drafts beat unrefined ones in ~79%–81% of comparisons.
  • Input sensitivity: Dense idea summaries with logs yielded much higher overall quality (win rates ~43%–56%) than Sparse summaries (18%–24%). Literature review quality was more stable across input quality, suggesting the citation agent is robust even from thin starts.
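
For readers unfamiliar with recall-style citation metrics, the computation behind a metric like P1 Recall reduces to set overlap. The sketch below is a generic recall, not the paper’s exact definition:

```python
def citation_recall(retrieved, relevant):
    """Fraction of the 'good-to-cite' reference set that the system
    actually retrieved (generic recall; the paper's P1 Recall may
    differ in detail)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

# Toy example: 3 of 4 genuinely relevant papers retrieved -> recall 0.75
r = citation_recall({"p1", "p2", "p3", "p9"}, {"p1", "p2", "p3", "p4"})
```

Note that retrieving extra irrelevant papers (like "p9" above) does not lower recall; that is what a companion precision metric would penalize.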

What this does—and doesn’t—do

The system emphasizes citation quality and breadth, producing citation counts and coverage much closer to human-written papers than the short, must-cite-only lists typical of other automated systems.

PaperOrchestra does not run experiments for you. It synthesizes and surfaces results based on logs supplied by humans. It is designed as an assistive drafting tool that automates formatting, figure polish, literature discovery, and iterative critique, while leaving experimental validation and claims verification to the researcher.

Limitations and open questions

  • Proxy evaluations: Automated judges and simulated acceptance are helpful but imperfect substitutes for real program committees. High simulated acceptance doesn’t guarantee acceptance at a specific conference or journal.
  • Citation reliability in edge cases: Fields with poor indexing, new preprints, or paywalled metadata will stress citation verification systems that rely on Semantic Scholar and web search.
  • Fabrication risks: Any system that drafts papers raises the risk of fabricated experimental claims; strict human-in-the-loop policies are essential.
  • Cost and scale: The pipeline uses ~60–70 LLM calls and runs in about 39–40 minutes per paper. Depending on model selection and API pricing, expect modest compute costs—likely in the tens of dollars per draft on common commercial pricing—but that will vary with token usage and vendor rates.
  • Ethical and legal questions: Authorship, IP, and how much AI assistance must be disclosed remain unsettled across conferences and journals.
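
The cost point above is simple arithmetic. A back-of-envelope sketch, where every token count and price is an illustrative assumption rather than a vendor quote:

```python
def draft_cost_usd(n_calls, avg_in_tokens, avg_out_tokens,
                   price_in_per_m, price_out_per_m):
    """Rough per-draft API cost: calls x tokens x price-per-million-tokens."""
    total_in = n_calls * avg_in_tokens
    total_out = n_calls * avg_out_tokens
    return (total_in / 1e6) * price_in_per_m + (total_out / 1e6) * price_out_per_m

# Illustrative numbers only: 65 calls, 20k input / 4k output tokens per call,
# $10 / $30 per million tokens (assumed pricing, not a vendor quote).
cost = draft_cost_usd(65, 20_000, 4_000, 10.0, 30.0)  # ~ $20.80
```

Under these assumed numbers a draft lands around $21, consistent with the “tens of dollars” estimate; halving the per-call context or switching model tiers moves the figure substantially.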

Practical adoption checklist for R&D teams

  • Start with non‑critical drafts: pilot the system on workshop or internal papers before core conference submissions.
  • Require raw data and reproducible notebooks for any experimental claim that comes from an AI‑drafted section.
  • Enforce provenance tracking: export a manifest of all queried sources, retrieved PDFs, and citation matches (keep checksums where practical).
  • Integrate with existing toolchains: connect to Overleaf/Git, Semantic Scholar API keys, and internal storage for logs and datasets.
  • Set mandatory human checkpoints: at minimum, require an experimental lead to sign off on claims and a senior author to validate citations and figures.
  • Budget for compute and review: track API usage and reviewer time saved vs. time spent verifying AI drafts.
  • Mandate disclosure: adopt a standard sentence in the acknowledgments or methods noting the role of AI-assisted drafting.
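
A provenance manifest like the one described above can be as simple as a list of checksummed records. A minimal sketch, assuming a hypothetical JSON format (the field names are made up for illustration):

```python
import hashlib
import json

def manifest_entry(name, content_bytes, source_url=None):
    """One provenance record: artifact name, SHA-256 checksum, and
    (optionally) where it was retrieved from."""
    return {
        "name": name,
        "sha256": hashlib.sha256(content_bytes).hexdigest(),
        "source": source_url,
    }

# Hypothetical artifacts; in practice these would be retrieved PDFs,
# citation-match records, and raw experiment logs.
entries = [
    manifest_entry("results.csv", b"epoch,loss\n1,0.42\n"),
    manifest_entry("smith2024.pdf", b"%PDF-1.5 ...", "https://example.org/smith2024"),
]
manifest_json = json.dumps(entries, indent=2)
```

Checksums make later audits cheap: any artifact that no longer hashes to its manifest entry has been altered since the draft was generated.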

Sample disclosure language

“Portions of the manuscript (outline, figure polishing, literature survey) were generated or assisted by an automated drafting system (PaperOrchestra). All experimental claims were verified by the authors and source data is available at [link].”

How to run a pragmatic 4‑week pilot

  1. Week 1: Select two non-critical papers; provision Semantic Scholar API keys and set up Overleaf/Git integration.
  2. Week 2: Run PaperOrchestra on a Dense and a Sparse input for each paper. Log API calls, runtime, and reviewer time spent validating drafts.
  3. Week 3: Conduct internal reproducibility checks on one AI-assisted paper: re-run analysis notebooks, verify figures and metrics.
  4. Week 4: Review outcomes: time saved, quality of literature review, mis-citations found, and whether experimental claims required major edits. Decide on broader rollout or tighter guardrails.

Governance & ethical guardrails

  • Require claim-level sign-off: the researcher responsible for each result must attest to its validity.
  • Archival of raw outputs: store the original logs, intermediate drafts, citation matches, and the final LaTeX source.
  • Reproducibility badge: adopt internal badges for drafts that pass automated workflow checks and separate badges for human-verified reproducibility.
  • Transparency to reviewers: disclose AI assistance per venue policy and include a provenance appendix with the submission when allowed.

What’s next and strategic view

PaperOrchestra signals a shift: AI agents are becoming practical automation tools for research workflows, not just clever text generators. The modular, multi-agent approach—outline, figures, literature, writing, and simulated peer critique—solves a real, recurring problem: turning messy researcher artifacts into submission-ready drafts. For leaders, the strategic play is clear: experiment, govern, and integrate.

The refinement (simulated peer-review) stage is essential — it converts functional drafts into submission-ready papers and materially improves simulated acceptance rates.

Adopt cautiously: run pilots, require human verification of experiments, and update authorship and disclosure policies. Done right, AI-assisted writing can free researchers to do more experiments, not more formatting—and shift attention back to novelty, replication, and scientific insight rather than polishing citations and LaTeX minutiae.

Actionable next steps

  • Authorize a 4‑week pilot with two teams and a dedicated project lead.
  • Create a verification template that each team must complete before submission.
  • Track time saved, citation accuracy, and reviewer edits across pilots to inform wider adoption.