Xiaomi Launches MiMo-Audio: Revolutionary 7B Speech AI Model

September 20, 2025 Ai Binger News Desk

Xiaomi has unveiled MiMo-Audio, a powerful 7-billion-parameter speech language model that many researchers are calling a “GPT-3 moment” for the audio world.

The release is billed as the first time an AI system has shown human-like few-shot learning abilities in speech, which would be a major breakthrough for artificial intelligence.

Learning Capabilities of MiMo-Audio

MiMo-Audio stands apart from earlier speech models because of its few-shot learning skills.

Unlike most speech AI systems, which need dedicated training for every new task, MiMo-Audio can adapt to brand-new audio tasks with only a handful of examples or even simple instructions, much as people learn.

Xiaomi trained the model on more than 100 million hours of audio data.

During this massive training process, the model showed “emergent behavior,” meaning it naturally developed the ability to generalize across many tasks without needing extra fine-tuning.

Advanced Technical Design

MiMo-Audio has two main components:

MiMo-Audio-Tokenizer: A 1.2-billion-parameter Transformer that breaks audio into discrete tokens at 25 Hz.

It uses an eight-layer residual vector quantization (RVQ) stack, producing 200 tokens per second (25 frames per second × 8 layers).
This tokenizer was trained on more than 10 million hours of speech and balances semantic meaning with high-quality sound reconstruction.
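
As a rough illustration of how a residual vector quantization stack turns one audio frame into eight tokens, here is a minimal Python sketch. It is not Xiaomi's tokenizer: the random codebooks, the 4-dimensional frame, the codebook size of 16, and the function names are all assumptions chosen to keep the example small. Only the 25 Hz frame rate and the eight layers come from the article.

```python
import random

FRAME_RATE_HZ = 25      # from the article
NUM_RVQ_LAYERS = 8      # from the article
DIM = 4                 # toy frame dimension (assumption)
CODEBOOK_SIZE = 16      # toy codebook size (assumption)

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Each layer quantizes the residual left over by the layers before it."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct a frame by summing the chosen entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Random toy codebooks; a real tokenizer learns these during training.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 0.5 ** layer) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for layer in range(NUM_RVQ_LAYERS)
]

frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)      # one token per layer for this frame

# 25 frames per second, 8 tokens per frame
tokens_per_second = FRAME_RATE_HZ * NUM_RVQ_LAYERS  # 200
```

Each frame yields one code per layer, so the token rate is simply the frame rate multiplied by the number of layers.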

MiMo-Audio-7B Model: This larger model combines a patch encoder, a core language model, and a patch decoder.

By grouping four timesteps into one patch, it reduces the processing rate from 25 Hz to 6.25 Hz, making it faster and more efficient without losing quality.
Unlike previous models that rely on lossy tokens, MiMo-Audio uses high-fidelity, lossless tokens.
This preserves important sound details, allowing the model to process speech alongside text seamlessly and produce more natural results.
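
The patching step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the model's actual patch encoder: the function name, the integer stand-ins for tokens, and the choice to drop a trailing partial patch are assumptions.

```python
FRAME_RATE_HZ = 25   # tokenizer frame rate, from the article
PATCH_SIZE = 4       # timesteps per patch, from the article
NUM_RVQ_LAYERS = 8   # tokens per frame, from the article

def group_into_patches(frames, patch_size=PATCH_SIZE):
    """Group consecutive frames into patches.

    A trailing partial patch is dropped here for simplicity (assumption);
    a real system might pad it instead.
    """
    usable = len(frames) - len(frames) % patch_size
    return [frames[i:i + patch_size] for i in range(0, usable, patch_size)]

# Rate seen by the core language model after patching
patch_rate_hz = FRAME_RATE_HZ / PATCH_SIZE  # 6.25

# One second of placeholder tokens: 25 frames, each holding 8 codes
one_second = [
    [frame * NUM_RVQ_LAYERS + layer for layer in range(NUM_RVQ_LAYERS)]
    for frame in range(FRAME_RATE_HZ)
]
patches = group_into_patches(one_second)
# 25 frames give 6 full patches (24 frames); the last frame is left over
```

Over four seconds the numbers divide evenly: 100 frames become 25 patches, so the core model takes a quarter as many steps as there are tokenizer frames.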

Benchmark Performance

On industry tests, MiMo-Audio has quickly risen to the top:

MiMo-Audio-7B-Base sets new standards on speech intelligence and audio understanding benchmarks among open-source systems.

The instruction-tuned version (MiMo-Audio-7B-Instruct) beats Google’s Gemini-2.5-Flash on the MMAU audio benchmark and outperforms OpenAI’s GPT-4o-Audio-Preview on the Big Bench Audio S2T test.

Perhaps more impressively, MiMo-Audio can perform tasks it was never trained for, such as:

Voice conversion.
Emotional voice cloning.
Dialect mimicry.
Speech editing and style transfer.
Generating natural continuations for live debates, recitations, and talk shows.

Real-World Uses

MiMo-Audio can handle a wide variety of tasks, making it useful for both consumers and businesses:

Automatic Speech Recognition (ASR).
Text-to-Speech (TTS).
Audio captioning.
Speaker identification and language detection.
Music and environmental sound analysis.

The model not only understands voices but also picks up ambient noise, background music, and environmental sounds, making it highly adaptable to real-world situations.

Availability

Xiaomi has fully open-sourced MiMo-Audio under the Apache 2.0 license, giving researchers and companies the freedom to use it for commercial applications.

The release includes:

MiMo-Audio-7B-Base (general model).
MiMo-Audio-7B-Instruct (optimized for instructions).
The Tokenizer model.
MiMo-Audio-Eval, a new evaluation toolkit covering more than 10 audio tasks.
Detailed technical documentation.

A special feature is the “thinking mechanism” in the Instruct version. This allows the model to switch between standard mode and reasoning mode, enabling it to handle more complex problem-solving in audio tasks.

News Gist

Xiaomi has launched MiMo-Audio, a groundbreaking 7B speech AI model with human-like few-shot learning.

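Open-sourced under Apache 2.0, its instruction-tuned version outperforms Google’s Gemini-2.5-Flash on the MMAU benchmark and OpenAI’s GPT-4o-Audio-Preview on the Big Bench Audio S2T test, with potential applications in smart devices, accessibility, and global research.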

FAQs

Q1: What is Xiaomi MiMo-Audio?

MiMo-Audio is a 7B speech AI model that understands and generates audio with human-like few-shot learning.

Q2: When was it released?

Xiaomi released MiMo-Audio on September 19, 2025.

Q3: What makes MiMo-Audio unique?

It uses lossless tokens, enabling high-quality audio and natural adaptation to new tasks with minimal examples.

Q4: Is MiMo-Audio open source?

Yes. It’s available under the Apache 2.0 license on Hugging Face and GitHub.

Q5: What can MiMo-Audio do?

It supports speech recognition, text-to-speech, voice cloning, style transfer, speaker identification, music analysis, and more.
