Unlabeled Video: How RAE + MoE Unlock Multimodal AI Agents for Business

Unlabeled video: the new data frontier for multimodal AI

TL;DR

  • High-quality web text is becoming a scarce training resource. Vast amounts of unlabeled video are a practical, powerful alternative for training multimodal models.
  • A single visual encoder (a representation autoencoder, or RAE) can support both image generation and comprehension, simplifying architecture and engineering.
  • Mixture-of-Experts (MoE) architectures help reconcile vision’s heavy data needs with language’s different scaling behavior, making large-scale multimodal systems more tractable for businesses.

Why unlabeled video matters for multimodal AI

Text-only training is hitting limits. The Chinchilla-style insight—that at certain scales more data is worth more than more parameters—made sense for language. But text is a compressed, lossy summary of reality. Video supplies raw, continuous signals: motion, depth cues, temporal causality and object interactions that words rarely capture. That makes unlabeled video an attractive next data frontier for multimodal AI and AI agents that need to reason about the physical world.

The recent experiments from Meta FAIR and New York University trained a single multimodal model from scratch (no preloaded language priors) on three data types: plain text, image+caption pairs, and large volumes of unlabeled video (including action clips). For readers unfamiliar with some terms:

  • Representation Autoencoder (RAE) — a model that compresses images into compact codes and reconstructs them from those codes.
  • Variational Autoencoder (VAE) — a common generative encoder; the paper found RAE outperformed VAE for both generation and understanding.
  • Flow-matching diffusion — a recent generative technique used to learn how to generate images/video by modeling how simple noise transforms into data; think of it as a robust way to teach a model to produce realistic visuals.
  • Mixture-of-Experts (MoE) — an architecture that routes each token to a subset of specialized submodels (experts), letting the network scale capacity efficiently.
  • Chinchilla-style scaling — the observation from language models that optimal performance balances training data and model size; it’s a useful baseline but not the whole story once vision joins the party.
  • SigLIP 2 — an image model used as a base to build the RAE in the experiments.
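To make the RAE idea concrete, here is a minimal sketch in numpy: a linear autoencoder that compresses toy low-rank data into a 4-dimensional code and reconstructs it. This illustrates only the compress-and-reconstruct loop; it is not the paper's architecture (the actual RAE is built on SigLIP 2), and all sizes and names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 16, 4   # "pixel" dimension and compact code size (made up)

# Synthetic low-rank "images": structure a 4-dim code can actually capture.
basis = rng.standard_normal((d_code, d_in))
X = rng.standard_normal((256, d_code)) @ basis

W_enc = 0.1 * rng.standard_normal((d_in, d_code))   # encoder: image -> code
W_dec = 0.1 * rng.standard_normal((d_code, d_in))   # decoder: code -> image

def recon_error():
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

err_before = recon_error()
lr = 0.01
for _ in range(500):
    Z = X @ W_enc                        # encode to compact codes
    G = 2 * (Z @ W_dec - X) / len(X)     # gradient of the mean squared error
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
err_after = recon_error()                # reconstruction improves with training
```

The same compact codes that make reconstruction possible are what a unified encoder reuses for understanding tasks, which is the architectural simplification the findings below describe.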

Key technical findings (high-level)

  • One visual encoder suffices. A RAE built on SigLIP 2 handled both image generation and image understanding tasks, eliminating the need for separate encoders for “comprehension” and “generation.”
  • Raw video doesn’t hurt language learning. Adding large amounts of unlabeled video did not degrade language capability; the multimodal model slightly exceeded a text-only baseline on validation.
  • Emergent world-modeling. From broad multimodal exposure the model learned to predict next visual states from an image plus textual actions, becoming competitive on predictive tasks with only ~1% task-specific data.
  • Diversity beats homogeneity. A training mix with roughly 20B VQA tokens plus ~80B tokens from video, image-text, and plain text outperformed a model trained on 100B pure VQA tokens—diversity of modalities matters.
  • Vision and language scale differently. Vision needs disproportionately more data as models grow: relative vision-data requirements rise to ~14× at 100B parameters and ~51× at 1T parameters (starting from a 1B base).
  • MoE reduces the mismatch. A 13.5B-parameter MoE model (about 1.5B active parameters per token) outperformed comparable dense models. Experts self-specialized—early layers leaned language, deeper layers became vision/multimodal specialists—and MoE roughly halved the modality-scaling asymmetry compared to dense architectures.
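The scaling asymmetry admits a quick back-of-envelope check: if the reported ratios (~14× at 100B parameters, ~51× at 1T, relative to a 1B base) follow a single power law, both data points should imply roughly the same exponent. This is an illustrative fit of the two published numbers, not the paper's own scaling formula.

```python
import math

# Reported relative vision-data requirements vs. a 1B-parameter base.
base_params = 1e9
reported = {100e9: 14.0, 1e12: 51.0}   # params -> data ratio (from the findings)

exponents = {p: math.log(r) / math.log(p / base_params)
             for p, r in reported.items()}
# Both points give an exponent near 0.57, consistent with one power law:
#   ratio ≈ (params / 1B) ** 0.57
```

That the two points agree so closely suggests the asymmetry grows smoothly with model size rather than appearing abruptly at some scale.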

“Text is ultimately a lossy compression of reality—models trained only on words are learning shadows rather than the objects themselves.” (paraphrase of the paper)

Technical deep-dive (short)

Think of VAEs as an older camera: they compress images but struggle to capture fine detail for generation. RAEs paired with modern generative diffusion methods (flow-matching) act more like a high-resolution video codec: they produce richer, more useful visual codes that serve both downstream understanding and high-quality generation. MoE then functions like a smart staffing system—only the right experts are called for each token, so the model can grow capacity without wasteful full-network compute.
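The staffing analogy maps directly onto top-k gated routing. The numpy sketch below (with made-up sizes, not the paper's 13.5B-parameter model) routes a single token through 2 of 4 experts and mixes their outputs, so most expert parameters sit idle for any given token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a stand-in linear map; the gate scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    """Route one token to its top-k experts and mix their outputs."""
    logits = x @ gate
    chosen = np.argsort(logits)[-top_k:]           # indices of selected experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                   # softmax over selected only
    # Only the selected experts run; the rest do no work for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen)), chosen

token = rng.standard_normal(d_model)
out, chosen = moe_forward(token)
```

Because only `top_k / n_experts` of the expert parameters are active per token, total capacity can grow much faster than per-token compute, which is the property that lets MoE absorb vision's heavier data appetite.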

Why this changes the business playbook

If unlabeled video becomes a cornerstone training signal, the battleground shifts from model research to data strategy and infrastructure. That has immediate implications for companies across industries.

  • Retail and physical stores. Point-of-sale and CCTV streams can train agents for inventory tracking, loss-prevention, planogram compliance, and better in-store customer assistance—moving beyond static image analytics to predictive, situational agents.
  • Manufacturing and industrial. High-frequency camera feeds from assembly lines are ideal for anomaly detection, predictive maintenance, and training digital twins that predict failure modes from visual sequences.
  • Transportation and fleets. Dashcams and telematics create temporal visual datasets that accelerate driver-assist features and fleet safety analytics for insurance and logistics optimization.
  • Enterprise learning and sales. Demo videos, recorded sales calls with screen capture, and product tutorials can train sales assistants, coach reps, and build automated training agents that understand multi-step visual procedures.

But the payoff isn’t automatic. Large-scale video training requires investments in storage, bandwidth, labeling (where necessary), and, critically, privacy, consent, and IP governance. The technical win in the lab translates into a commercial win only if the data is ethically and legally defensible.

How to start today: a practical roadmap

  1. Audit your video assets and consent scope. Map what you own (CCTV, training content, dashcam), what permissions you have, and where sensitive content exists.
  2. Prototype on a small scale. Start with 1–10 million frames (not hundreds of billions). Train a miniature multimodal model using a unified visual encoder to validate task-specific gains (e.g., anomaly detection, next-frame prediction).
  3. Compare architectures. Test a unified RAE-based encoder against your current dual-encoder stack and a dense vs. MoE configuration. Measure not only accuracy but compute cost, latency, and operational complexity.
  4. Build governance and privacy by design. Implement data minimization, opt-in/opt-out workflows, provenance tracking, and legal reviews for public vs. private content.
  5. Plan infrastructure and costs. Model your storage, transfer, and training costs; video drives bandwidth and storage needs up by orders of magnitude compared with text. Consider staged ingestion and on-device prefiltering to cut costs.

What this means for different stakeholders

  • C-suite: Treat video as a strategic asset. Build cross-functional teams (legal, ops, ML) to capture value while managing risk.
  • ML/Engineering leaders: Run small MoE pilots and benchmark unified visual encoders. Invest early in efficient storage and data pipelines for continuous ingestion.
  • Product teams: Revisit roadmaps for agents and automation; capabilities like next-step prediction and situational awareness become more feasible with video-augmented pretraining.

Risks, unknowns, and governance

There are several important caveats:

  • Legal and privacy risk. Training on video—especially public-facing or surveillance footage—raises consent and GDPR-style issues. Robust governance is non-negotiable.
  • Representativeness and bias. Video data can embed camera placement biases, demographic skews, and cultural blind spots that models will learn unless curated carefully.
  • Compute and energy costs. Video multiplies storage and processing needs. Even with MoE efficiency gains, organizations must budget for greater infrastructure spend.
  • Transfer to downstream products. The experiments focus on pretraining. Fine-tuning, RL, safety testing, and deployment behavior still require careful engineering and evaluation.

“World-modeling abilities arise largely from broad multimodal exposure, not from specialized navigation datasets.” (paraphrase)

Final read: priorities to act on now

  • Start with a focused pilot using ethically sourced video to validate a clear business use case.
  • Design governance and legal guardrails before scaling data collection.
  • Experiment with MoE and a single visual encoder—these give a practical path to scale capacity while keeping costs feasible.
  • Invest in storage and data pipelines early; the data plumbing is the limiting factor, not just model architecture.

For teams that can responsibly collect and manage large video corpora, unlabeled video opens a competitive route to richer multimodal AI and more capable AI agents. The technical work from Meta FAIR and NYU shows it’s both feasible and materially different from text-only scaling—now the question for businesses is who will own and govern that video data responsibly.

Further reading: the paper’s preprint and experimental details are available on arXiv. Explore prior multimodal projects like Janus and BAGEL for historical context, and consider piloting a short MoE experiment to see whether vision-heavy pretraining moves the needle on your product’s KPIs.

Want a short executive briefing or a one-page roadmap tailored to your industry? Reach out to run a data-audit and feasibility estimate.