AI News: Microsoft Introduces Magma, a Multimodal AI That Takes Action
Microsoft has unveiled Magma, a new AI model that integrates visual and language processing to control software interfaces and robotic systems.
Unlike traditional AI models that specialize in a single data type, Magma processes text, images, and video simultaneously.
Key Points
Microsoft states that Magma can formulate plans and execute the actions needed to carry them out.
Microsoft claims Magma is the first AI model that not only processes multimodal data (text, images, and video) but also takes actions, such as navigating user interfaces or manipulating physical objects.
According to Microsoft, Magma-8B performs competitively across benchmarks, scoring 80.0 on the VQAv2 benchmark and leading with an 87.4 POPE score. It also outperforms OpenVLA in multiple robot manipulation tasks.
Microsoft Magma researcher Jianwei Yang explained that “Magma” stands for “Multimodal Agentic Model at Microsoft Research.”
Next week, Microsoft will release Magma’s training and inference code on GitHub, enabling external researchers to build upon it.
Magma vs. Traditional AI: How Microsoft’s Model Stands Out
Microsoft’s Magma integrates perception and control into a single foundation model, setting it apart from previous multimodal systems such as Google’s PaLM-E and RT-2 or Microsoft’s ChatGPT for Robotics.
Built on Transformer-based LLM technology, Magma extends beyond “verbal intelligence” to include “spatial intelligence” for planning and action execution.
It is trained on images, videos, robotics data, and UI interactions, making it a true multimodal agent.
Key features include Set-of-Mark, which labels interactive objects in an image with numbered markers, and Trace-of-Mark, which learns movement patterns by tracking those marks across video frames; the sketch below illustrates the Set-of-Mark idea.
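Microsoft has not yet published a public API for these techniques, but the core Set-of-Mark idea is straightforward: overlay numbered markers on candidate interactive regions so a model can refer to actions by mark index rather than raw pixel coordinates. Below is a minimal, hypothetical Python sketch; the bounding boxes and file names are illustrative assumptions, not Magma's actual interface.

```python
# Minimal Set-of-Mark-style sketch: overlay numbered marks on candidate
# interactive regions so a model can cite actions by mark index.
# The boxes below are hypothetical detections, not output from Magma.
from PIL import Image, ImageDraw

# Hypothetical UI screenshot (a blank canvas stands in for a real capture).
screenshot = Image.new("RGB", (640, 480), "white")
draw = ImageDraw.Draw(screenshot)

# Assumed bounding boxes for interactive elements: (left, top, right, bottom).
candidate_boxes = [
    (40, 40, 200, 80),    # e.g., a "Submit" button
    (40, 120, 300, 160),  # e.g., a text input field
    (40, 200, 140, 240),  # e.g., a checkbox row
]

# Set-of-Mark: draw each box and stamp a numeric mark the model can cite.
for mark_id, box in enumerate(candidate_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0] + 4, box[1] + 4), str(mark_id), fill="red")

screenshot.save("som_annotated.png")

# The annotated image plus a mark legend becomes part of the model prompt;
# the model can then answer with a grounded action such as "click mark 1".
legend = "\n".join(f"Mark {i}: interactive region {box}"
                   for i, box in enumerate(candidate_boxes, start=1))
print(legend)
```

Trace-of-Mark, as described, extends the same idea over time: the marks are tracked across video frames so the model can learn trajectories and movement patterns rather than single clicks.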
News Gist
Microsoft introduced Magma, an AI model that integrates visual and language processing to control software and robotics.
It formulates plans, executes actions, and processes multimodal data simultaneously. Magma-8B performs strongly on benchmarks, outperforming OpenVLA in robot manipulation tasks.
Microsoft will release its training and inference code on GitHub next week for researchers.