Xiaomi Launches MiMo-Audio: Revolutionary 7B Speech AI Model
Xiaomi has unveiled MiMo-Audio, a powerful 7-billion-parameter speech language model that many researchers are calling a “GPT-3 moment” for the audio world.
Xiaomi presents this as the first time a speech model has shown human-like few-shot learning abilities, which would be a significant step for speech AI.
Learning Capabilities of MiMo-Audio
MiMo-Audio stands apart from earlier speech models because of its few-shot learning skills.
Unlike most speech AI systems, which need task-specific training for every new task, MiMo-Audio can adapt to brand-new audio tasks with only a handful of examples or even simple instructions, similar to how people learn.
Xiaomi trained the model on more than 100 million hours of audio data.
During this massive training process, the model showed “emergent behavior,” meaning it naturally developed the ability to generalize across many tasks without needing extra fine-tuning.
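To make the idea of few-shot adaptation concrete, here is a minimal sketch of how a handful of demonstration pairs and a new query could be arranged into a single prompt. It is a generic illustration of in-context prompting, not MiMo-Audio's actual interface: the `Segment` class, the role names, the file names, and `build_few_shot_prompt` are all hypothetical, and plain strings stand in for audio clips.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    role: str     # "example_input", "example_output", or "query" (hypothetical labels)
    content: str  # stand-in for an audio clip; a real system would hold audio tokens


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> List[Segment]:
    """Interleave (input, output) demonstrations, then append the new query."""
    prompt: List[Segment] = []
    for source, target in examples:
        prompt.append(Segment("example_input", source))
        prompt.append(Segment("example_output", target))
    prompt.append(Segment("query", query))
    return prompt


# Two demonstrations of a hypothetical voice-conversion task, then a new input.
examples = [
    ("speaker_a_clip_1.wav", "speaker_b_clip_1.wav"),
    ("speaker_a_clip_2.wav", "speaker_b_clip_2.wav"),
]
prompt = build_few_shot_prompt(examples, "speaker_a_clip_3.wav")
```

In this arrangement the task is never named; the model is expected to infer it from the demonstrations and continue the pattern for the final query, with no change to its weights.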
Advanced Technical Design
MiMo-Audio has two main components:
MiMo-Audio-Tokenizer: A 1.2-billion-parameter Transformer that breaks audio into discrete tokens at 25 Hz.
- It uses an eight-layer residual vector quantization (RVQ) stack, producing 200 tokens per second (25 frames per second × 8 layers).
- This tokenizer was trained on more than 10 million hours of speech and balances semantic meaning with high-quality sound reconstruction.
MiMo-Audio-7B Model: This larger model combines a patch encoder, a core language model, and a patch decoder.
- By grouping four timesteps into one patch, it reduces the processing rate from 25 Hz to 6.25 Hz, making it faster and more efficient without losing quality.
- Unlike previous models that rely on lossy tokens, MiMo-Audio uses high-fidelity, lossless tokens.
- This preserves important sound details, allowing the model to process speech alongside text seamlessly and produce more natural results.
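The rates quoted above follow from simple arithmetic, and the two mechanisms can be sketched in a few lines. The toy example below is an illustration under stated assumptions, not Xiaomi's implementation: the frame rate, the number of RVQ layers, and the patch size come from the article, while the codebook size, the frame dimension, the random codebooks, and all function names are made up for the demonstration.

```python
import random

FRAME_RATE_HZ = 25   # tokenizer frame rate, from the article
NUM_RVQ_LAYERS = 8   # RVQ depth, from the article
PATCH_SIZE = 4       # timesteps per patch, from the article
CODEBOOK_SIZE = 16   # toy value, assumed for illustration
FRAME_DIM = 4        # toy value, assumed for illustration


def nearest_index(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def distance(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vector))
    return min(range(len(codebook)), key=distance)


def rvq_encode(frame, codebooks):
    """Emit one token per layer; each layer quantizes the residual left by the previous one."""
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        index = nearest_index(codebook, residual)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Approximate the frame by summing the chosen entry from every layer."""
    frame = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        frame = [f + c for f, c in zip(frame, codebook[index])]
    return frame


def group_into_patches(frames, patch_size=PATCH_SIZE):
    """Group consecutive frames into patches, dropping any incomplete trailing patch."""
    usable = len(frames) - len(frames) % patch_size
    return [frames[i:i + patch_size] for i in range(0, usable, patch_size)]


rng = random.Random(0)
# Random toy codebooks; a trained tokenizer would learn these from data.
codebooks = [
    [[rng.gauss(0, 0.5 ** layer) for _ in range(FRAME_DIM)] for _ in range(CODEBOOK_SIZE)]
    for layer in range(NUM_RVQ_LAYERS)
]

# Four seconds of toy "audio features": 100 frames at 25 Hz.
frames = [[rng.gauss(0, 1) for _ in range(FRAME_DIM)] for _ in range(4 * FRAME_RATE_HZ)]
token_frames = [rvq_encode(frame, codebooks) for frame in frames]
reconstruction = rvq_decode(token_frames[0], codebooks)
patches = group_into_patches(token_frames)

tokens_per_second = FRAME_RATE_HZ * NUM_RVQ_LAYERS   # 25 * 8 = 200
patch_rate_hz = FRAME_RATE_HZ / PATCH_SIZE           # 25 / 4 = 6.25
print(tokens_per_second, patch_rate_hz, len(patches))  # prints: 200 6.25 25
```

Each frame yields eight tokens, one per RVQ layer, which is where the 200 tokens per second comes from. Grouping four frames into a patch means the core language model sees 6.25 positions per second of audio instead of 25, so four seconds of audio become 25 patches.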
Benchmark Performance
On published benchmarks, MiMo-Audio posts results at or near the top of its class:
MiMo-Audio-7B-Base sets new standards on speech intelligence and audio understanding benchmarks among open-source systems.
The instruction-tuned version (MiMo-Audio-7B-Instruct) beats Google’s Gemini-2.5-Flash on the MMAU audio benchmark, and even outperforms OpenAI’s GPT-4o-Audio-Preview on the Big Bench Audio S2T test.
Perhaps more impressively, MiMo-Audio can perform tasks it was not explicitly trained for, such as:
- Voice conversion.
- Emotional voice cloning.
- Dialect mimicry.
- Speech editing and style transfer.
- Generating natural continuations for live debates, recitations, and talk shows.
Real-World Uses
MiMo-Audio can handle a wide variety of tasks, making it useful for both consumers and businesses:
- Automatic speech recognition (ASR).
- Text-to-speech (TTS).
- Audio captioning.
- Speaker identification and language detection.
- Music and environmental sound analysis.
The model not only understands voices but also picks up ambient noise, background music, and environmental sounds, making it highly adaptable to real-world situations.
Availability
Xiaomi has fully open-sourced MiMo-Audio under the Apache 2.0 license, giving researchers and companies the freedom to use it for commercial applications.
The release includes:
- MiMo-Audio-7B-Base (general model).
- MiMo-Audio-7B-Instruct (optimized for instructions).
- The Tokenizer model.
- MiMo-Audio-Eval, a new evaluation toolkit covering more than 10 audio tasks.
- Detailed technical documentation.
A special feature is the “thinking mechanism” in the Instruct version. This allows the model to switch between standard mode and reasoning mode, enabling it to handle more complex problem-solving in audio tasks.
News Gist
Xiaomi has launched MiMo-Audio, a groundbreaking 7B speech AI model with human-like few-shot learning.
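Open-sourced under Apache 2.0, its instruction-tuned version outperforms Google’s Gemini-2.5-Flash on the MMAU audio benchmark and OpenAI’s GPT-4o-Audio-Preview on the Big Bench Audio S2T test, with potential applications in smart devices, accessibility, and global research.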
FAQs
Q1: What is Xiaomi MiMo-Audio?
MiMo-Audio is a 7B speech AI model that understands and generates audio with human-like few-shot learning.
Q2: When was it released?
Xiaomi released MiMo-Audio on September 19, 2025.
Q3: What makes MiMo-Audio unique?
It uses lossless tokens, enabling high-quality audio and natural adaptation to new tasks with minimal examples.
Q4: Is MiMo-Audio open source?
Yes. It’s available under the Apache 2.0 license on Hugging Face and GitHub.
Q5: What can MiMo-Audio do?
It supports speech recognition, text-to-speech, voice cloning, style transfer, speaker identification, music analysis, and more.