Xiaomi MiMo V2: Full-Stack AI Agents, Multimodal Perception and Expressive TTS

TL;DR — Xiaomi launched three complementary models designed to work as an integrated platform for AI agents, browser automation, media workflows and future robotics control: MiMo‑V2‑Pro (a trillion‑parameter Mixture‑of‑Experts LLM), MiMo‑V2‑Omni (multimodal perception + native UI/action tooling), and MiMo‑V2‑TTS (fine‑grained expressive speech and singing). The Pro model posts competitive coding and agent scores while undercutting API pricing, Omni shows strong perception and web navigation but trails on long‑horizon agent reasoning, and TTS offers detailed, plain‑language control of emotion and paralinguistics. For businesses: run narrow, instrumented pilots (browser automation, contact‑center TTS, dashcam analytics) and prioritize safety, governance and auditability before production autonomy.

Quick primer: what the key terms mean

  • Mixture‑of‑Experts (MoE): think of a large team of specialist models where only a small subset are “called in” for each request to save compute.
  • Total vs active parameters: total parameters describe the full model size; active parameters are the subset actually used per inference (MiMo‑V2‑Pro reports ~42B active params per request).
  • Context window: how much prior content the model can keep “in mind” at once — Xiaomi advertises theoretical windows up to 1,000,000 tokens, with current pricing tiers covering up to 256,000 tokens.
  • ClawEval and SWE‑bench: public benchmarks that measure agent reasoning/workflow performance and coding ability respectively; useful for relative comparison across models.
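The MoE idea in the primer can be sketched in a few lines of Python. This is a toy illustration, not Xiaomi's architecture: real routers gate per token across many experts, but the core point is the same, only the selected experts' parameters do any work per request.

```python
import math

def moe_forward(x, experts, gate_scores, top_k=2):
    """Toy Mixture-of-Experts step: only the top_k experts run per request."""
    # Rank experts by their gate score and keep the top_k as "active".
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    active = ranked[:top_k]
    # Softmax-normalise the winning scores into mixing weights.
    exps = [math.exp(gate_scores[i]) for i in active]
    weights = [e / sum(exps) for e in exps]
    # Only the active experts' parameters are exercised; the rest sit idle.
    output = sum(w * experts[i](x) for w, i in zip(weights, active))
    return output, active

experts = [lambda x, k=k: x * (k + 1) for k in range(8)]   # 8 tiny "experts"
scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]          # per-request gate scores
out, active = moe_forward(3.0, experts, scores, top_k=2)
print(active)  # experts 1 and 3 handle this request; the other six do no work
```

This is why a "trillion total, ~42B active" model can price requests far below what the total parameter count would suggest.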

What Xiaomi released and why it matters for AI agents

Xiaomi’s three MiMo V2 models are designed as parts of a unified stack that sees, reasons, acts and speaks. That matters because real‑world automation requires more than just language understanding — it needs perception (video/audio/image), reliable interactions with software and devices, and human‑facing outputs that feel natural. Xiaomi’s strategy is to own each of those layers and glue them into agent frameworks that can automate actual workflows.

MiMo‑V2‑Pro: MoE scale, long‑context, and cost economics

MiMo‑V2‑Pro is the flagship LLM. Key facts:

  • Architecture: Mixture‑of‑Experts with more than one trillion total parameters and roughly 42 billion active parameters used per request.
  • Context & speed: Xiaomi claims hybrid attention and multi‑token generation techniques that enable very large context windows (the vendor references up to 1,000,000 tokens theoretically) while offering a current pricing tier for contexts up to 256,000 tokens.
  • Benchmarks: placed 7th on the Artificial Analysis Intelligence Index, scored 78 on SWE‑bench Verified for coding (Anthropic’s Claude Opus 4.6 ≈ 80.8) and posted a ClawEval agent score of 81 (Claude Opus 4.6 = 81.5; GPT‑5.2 = 77).
  • Pricing: launch API pricing is $1 per million input tokens and $3 per million output tokens, with Xiaomi temporarily waiving cache‑writing fees.

Why that matters: the MoE approach gives Xiaomi big scale without forcing the full parameter set into every request — a cost/performance tradeoff that is attractive for high‑volume agent workloads (automated coding, batch analysis, long‑context document reasoning). The benchmark results place Pro in the same neighborhood as top Western LLMs for coding and agent performance, but it’s the price point that threatens to accelerate adoption among cost‑sensitive teams.

Cost example (practical)

Scenario: a browser‑agent call with a 1,000‑token input and a 5,000‑token output.

  • Input cost: 1,000 tokens = 0.001 million → $0.001
  • Output cost: 5,000 tokens = 0.005 million → $0.015
  • Total per call ≈ $0.016

At scale, 100,000 such calls would cost roughly $1,600 — a useful baseline when sizing automation pilots. Real workloads will vary by token counts and frequency, but this highlights how lower token pricing changes the ROI on agent automation experiments.
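The arithmetic above generalises into a small cost helper. The defaults hard-code the launch list prices quoted earlier; adjust them for your actual tier and token mix.

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float = 1.00, out_price: float = 3.00) -> float:
    """USD cost of one API call at $-per-million-token rates (launch pricing)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

per_call = call_cost(1_000, 5_000)    # the browser-agent scenario above
pilot_budget = per_call * 100_000     # sizing a 100k-call pilot
print(round(per_call, 4), round(pilot_budget, 2))
```

Plugging a pilot's expected call volume and token profile into a helper like this gives a defensible budget line before any integration work starts.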

MiMo‑V2‑Omni: perception plus action

Omni is the multimodal arm: a shared backbone for images, video and audio that also supports native tool calls and UI actions. Xiaomi demoed several practical use cases:

  • Real‑time dashcam hazard detection (continuous audio/video ingestion).
  • End‑to‑end browser shopping agent that researches, compares prices, negotiates via chat and completes checkout.
  • Media pipelines that generate, debug and publish content using agent frameworks for clicks and file operations.

Benchmarks and limitations:

  • Xiaomi reports Omni outperforms Gemini 3 Pro on some audio metrics and edges out Anthropic on select image measures; it performs well on web navigation tasks (MM‑BrowserComp).
  • Omni’s ClawEval agent score is 54.8, noticeably below Claude Opus’s 66.3 and GPT‑5.2’s 59.6 — an indicator that perception plus action is not yet equivalent to robust, long‑horizon agent reasoning.

Bottom line: Omni is strong for perception, UI automation and constrained navigation tasks, but it needs further work on planning, safety filtering and durable multi‑step execution before replacing human operators in messy web or physical environments.

MiMo‑V2‑TTS: expressive, plain‑language voice control

TTS gets its own engine. Highlights:

  • Training data: over 100 million hours of speech.
  • Design: layered modeling of timbre, rhythm and emotion to provide fine control.
  • Control interface: users describe voice and feeling in plain language (e.g., “sleepy, slightly hoarse” or “restrained anger”) and the model renders nuanced prosody and paralinguistic sounds (coughs, sighs, hesitations, laughter). It also supports singing through the same API.

Business opportunities include more natural IVR and conversational agents, branded voice experiences for media and gaming, and faster production of audiobooks or audio ads where emotion matters. Legal and ethical checks are crucial when cloning or simulating voices.
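As a sketch, a request to such a plain-language interface might look like the payload below. Xiaomi has not published a public schema at the time of writing, so every field name here is a hypothetical illustration of the control style described above, not the actual API.

```python
# Hypothetical payload: field names are illustrative assumptions,
# not Xiaomi's documented TTS schema.
tts_request = {
    "model": "mimo-v2-tts",
    "text": "I was not expecting to see you here.",
    "voice_description": "middle-aged narrator, sleepy, slightly hoarse",
    "emotion": "restrained anger",               # plain-language emotion control
    "paralinguistics": ["sigh", "hesitation"],   # non-verbal sounds to render
    "mode": "speech",                            # singing reportedly uses the same API
}
print(sorted(tts_request))
```

The appeal for product teams is that these controls read like direction notes to a voice actor rather than SSML markup, which lowers the skill barrier for producing emotionally tuned audio.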

“The route to general intelligence is found in the real world — a model that only reads text lives in a library, while one that sees, hears, reasons and acts lives in the world.”

— MiMo team (paraphrased)

Access, the leak, and the developer ecosystem

Xiaomi opened public API access at launch, integrated with agent frameworks such as OpenClaw, OpenCode, KiloCode, Blackbox and Cline, and offered a week of free developer access. There was also a pre‑launch leak: MiMo‑V2‑Pro appeared on OpenRouter under the name “Hunter Alpha,” topping daily usage charts and accumulating over one trillion tokens, with coding being the most frequent use case. That episode underscores high community demand — and the reputational risk of unmanaged pre‑release distribution.

Enterprise playbook: how to pilot MiMo V2 safely and practically

Recommended first pilots (narrow, measurable, recoverable):

  • Browser automation for research and price comparison (use Omni + Pro): measure time‑to‑decision and conversion lift.
  • Contact center or IVR voices (TTS): run A/B tests on CSAT and average handle time, with human escalation for edge cases.
  • Media production pipeline: automated draft creation + human editing + TTS narration for faster content cycles.
  • Dashcam or fleet analytics (Omni): start with offline batch detection before any live‑action automation.

Pilot checklist (concrete)

  • Define success metrics up front (time saved, conversion lift, cost per interaction).
  • Limit scope: one workflow, one channel, explicit boundaries for UI actions or purchases.
  • Implement human‑in‑the‑loop gates for transactions; require confirmation for purchases or payments.
  • Run adversarial UI tests and recovery scenarios to exercise flaky web states and redirects.
  • Enable full action logging, audit trails and replayable traces for debugging and compliance.
  • Verify data residency and privacy for voice and log data; involve legal before voice cloning or payment automation.
  • Measure latency and cost for expected volumes; include token budget controls.
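The human-in-the-loop and audit-logging items in the checklist can be combined into one gate, sketched below. The function and action names are illustrative; in a real pilot, `approve` would be wired to a confirmation channel such as a console prompt or chat message.

```python
# Action types that must never execute without explicit human approval.
TRANSACTIONAL = {"purchase", "payment", "submit_form"}

def gated_execute(action, approve, audit_log):
    """Execute an agent action, requiring human approval for transactions.

    Every decision is appended to audit_log so runs stay replayable.
    """
    entry = {"action": action["type"], "params": action.get("params", {})}
    if action["type"] in TRANSACTIONAL and not approve(action):
        entry["status"] = "blocked"      # human said no, or did not answer
        audit_log.append(entry)
        return None
    entry["status"] = "executed"
    audit_log.append(entry)
    return entry

log = []
blocked = gated_execute({"type": "purchase", "params": {"amount": 49.99}},
                        approve=lambda a: False, audit_log=log)
allowed = gated_execute({"type": "scroll_page"},
                        approve=lambda a: False, audit_log=log)
print([e["status"] for e in log])  # the purchase is blocked; the read-only action runs
```

Defaulting to "blocked" when approval is absent keeps the failure mode safe: a missed confirmation costs a retry, not an unwanted payment.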

When not to use MiMo V2

These models are not yet a good fit for fully autonomous, unsupervised decision‑making in high‑stakes contexts (unsupervised payments, safety‑critical robotics, medical diagnosis without clinician oversight). Avoid granting broad web control or device actuation until you’ve proven robust planning, rollback mechanics and legal cover.

Competitive landscape and what’s next

China’s AI ecosystem is moving quickly. Notable peers include Zhipu AI (GLM‑5), Moonshot AI (Kimi K2.5 for agent swarms), and Alibaba’s Qwen 3.5 family; Western competitors are represented by Anthropic (Claude Opus) and OpenAI (GPT‑5.2). Xiaomi’s differentiator is an integrated multimodal + TTS stack plus aggressive token pricing — a combination that will likely accelerate experimentation and force faster iterations on safety and reliability.

Key business questions and short answers

  • What did Xiaomi actually release?

    Three models: MiMo‑V2‑Pro (MoE LLM for coding and agents), MiMo‑V2‑Omni (multimodal perception + action), and MiMo‑V2‑TTS (expressive speech and singing).

  • How does MiMo‑V2‑Pro perform and cost?

    It’s a trillion+ parameter MoE with ≈42B active params/request, competitive on SWE‑bench (78) and ClawEval (81), and priced at $1/million input tokens and $3/million output tokens for the 256k context tier (launch waiver on cache fees).

  • Can Omni already replace human‑driven web tasks or robots?

    Not completely: Omni is strong for perception and constrained web navigation, but it trails on long‑horizon agent reasoning (ClawEval ~54.8). Use it for focused pilots (dashcam analysis, guided browser agents) rather than open‑ended autonomy.

  • What makes the TTS model interesting for businesses?

    Plain‑language, fine‑grained control of emotion and paralinguistic elements, plus singing capability — valuable for IVR, branded voice, media and accessibility use cases.

Recommendation for leaders

Run focused pilots that are easy to measure and easy to roll back. Prioritize pilots that deliver clear ROI (time saved, conversion lift) and keep humans in the loop for any transaction or safety‑critical action. Treat the launch as an opportunity: lower token prices and integrated multimodal tooling make experiments cheaper and faster, but the risk surface grows as models act on external systems. Invest early in logging, adversarial testing, legal review and governance so you capture efficiency without getting surprised by edge‑case failures.

Xiaomi’s MiMo V2 is a clear signal that the next wave of competition will be full‑stack: models that see, hear, decide and speak. For businesses, the practical path is simple: experiment safely, measure carefully, and build governance before scaling agent autonomy into production.