ByteDance Launches BAGEL: A Powerful Open-Source Multimodal AI Model
ByteDance has introduced BAGEL, an open-source multimodal foundation model designed to understand and generate text, images, and video.
With 7 billion active parameters (14 billion total), BAGEL demonstrates advanced capabilities in image generation, editing, and complex reasoning tasks.
Key Features
Unified Multimodal Processing: BAGEL can understand and generate text, images, and video within a single model.
This makes it useful for tasks such as multi-turn conversation, image generation, and video understanding.
High-Quality Generation and Editing: It can produce sharp, realistic images and video frames, and it also handles advanced edits such as style changes and 3D effects.
World Modeling and Navigation: Trained on large-scale video and web data, BAGEL exhibits capabilities in multi-view synthesis and world navigation tasks, extending beyond traditional image-editing models.
Chain-of-Thought Reasoning: The model enables multi-turn multimodal dialogue and features Chain-of-Thought reasoning, allowing it to generate detailed and logically consistent outputs from short prompts.
Advanced Architecture: BAGEL uses a Mixture-of-Transformer-Experts (MoT) design with two visual encoders to capture detailed and meaningful image features.
It is trained on large-scale mixed multimodal data using a “next token group” prediction objective, which improves its ability to understand and generate text, images, and video.
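To make the "7 billion active, 14 billion total" figures concrete, here is a minimal toy sketch of how an expert-based design can hold more parameters than any single token uses. The two equally sized experts, their names, and the hard routing by token kind are illustrative assumptions chosen to match the article's numbers; they are not taken from BAGEL's published code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    params_billion: float  # hypothetical size, chosen to match the article's totals


# Assumption: two experts of equal size (7B each), giving 14B in total.
EXPERTS = {
    "understanding": Expert("understanding", 7.0),
    "generation": Expert("generation", 7.0),
}


def route(token_kind: str) -> Expert:
    """Pick one expert per token with a fixed rule (no learned gate).

    Assumption: tokens being generated as image content go to the
    generation expert; everything else goes to the understanding expert.
    """
    key = "generation" if token_kind == "image_gen" else "understanding"
    return EXPERTS[key]


def total_params_billion() -> float:
    """Parameters stored across all experts."""
    return sum(e.params_billion for e in EXPERTS.values())


def active_params_billion(token_kind: str) -> float:
    """Parameters actually used to process one token of the given kind."""
    return route(token_kind).params_billion


if __name__ == "__main__":
    print("total (B):", total_params_billion())
    for kind in ("text", "image_gen"):
        print(kind, "-> expert:", route(kind).name,
              "| active (B):", active_params_billion(kind))
```

The point of the sketch is only the bookkeeping: every token passes through one expert, so per-token compute tracks the active parameter count while model capacity tracks the total.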
Benchmark Performance
BAGEL outperforms current top-tier open-source vision-language models like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding benchmarks.
Its text-to-image generation quality is competitive with specialized generators such as Stable Diffusion 3.
Additionally, BAGEL demonstrates superior qualitative results in classical image-editing scenarios compared to leading open-source models.
Availability
The BAGEL model is open-sourced under the Apache 2.0 license and is available on GitHub and Hugging Face.
Developers can also experiment with the model via the Replicate platform.
Running the model on Replicate costs approximately $0.091 per run, or about 10 full runs per $1, though the actual cost varies with your inputs.
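The per-run figure above can be turned into a quick budget estimate. The sketch below is a rough calculation that assumes the approximate $0.091 price applies uniformly to every run, which the article itself says is not guaranteed; the function name is hypothetical.

```python
from decimal import Decimal

# Approximate per-run price quoted in the article; real cost varies with inputs.
COST_PER_RUN_USD = Decimal("0.091")


def full_runs_for_budget(budget_usd: str) -> int:
    """Return how many complete runs fit in a budget at the quoted price.

    Decimal arithmetic avoids binary floating-point rounding surprises
    when dividing currency amounts.
    """
    return int(Decimal(budget_usd) // COST_PER_RUN_USD)


if __name__ == "__main__":
    for budget in ("1", "5", "10"):
        print(f"${budget}: {full_runs_for_budget(budget)} full runs")
```

Dividing $1 by $0.091 gives roughly 10.99, so only 10 complete runs fit in a dollar, which matches the "10 runs per $1" figure.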
News Gist
ByteDance has launched BAGEL, a powerful open-source multimodal AI model capable of processing and generating text, images, and videos.
With 7 billion active parameters, it excels in editing, dialogue, and reasoning tasks.
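BAGEL outperforms top open-source vision-language models such as Qwen2.5-VL on standard multimodal understanding benchmarks. It is available under the Apache 2.0 license on GitHub and Hugging Face, and it can also be run on Replicate for a per-run fee.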