The Beijing Academy of Artificial Intelligence (BAAI) has introduced Emu3, a multimodal model that unifies text, image, and video processing through next-token prediction. The approach marks a notable shift away from combining separate specialized models, and BAAI reports strong performance across a range of multimodal tasks.
Emu3 works by tokenizing different data types—text, images, and videos—into a shared discrete space, allowing a single transformer to be trained from scratch on a mixed stream of multimodal sequences. Wang Zhongyuan, director of BAAI, highlighted in a press release that "By tokenizing images, text and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences," adding that Emu3 eliminates the need for diffusion or compositional approaches entirely.
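To make the idea concrete, the sketch below illustrates the general pattern the press release describes: different modalities are converted to discrete token IDs in one shared vocabulary, concatenated into a single sequence, and a single autoregressive transformer is trained with a next-token prediction loss. This is not Emu3's actual code; the tokenizers are faked with random IDs, and the vocabulary size, model dimensions, and other hyperparameters are placeholder assumptions.

```python
# Minimal sketch of unified next-token prediction over mixed-modality tokens.
# Assumptions: shared vocabulary of 1024 codes, toy random "tokenizers",
# a tiny 2-layer transformer. Real systems use a text tokenizer plus a
# visual VQ tokenizer and a much larger decoder-only model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1024   # shared discrete vocabulary (text + visual codes), assumed
SEQ_LEN = 64
D_MODEL = 128

class TinyMultimodalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        b, t = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)   # causal self-attention over the mixed sequence
        return self.head(x)             # logits over the shared vocabulary

# Toy "tokenizers": in practice a text tokenizer and a visual tokenizer
# would produce these IDs; here we just fabricate discrete codes.
text_tokens  = torch.randint(0, VOCAB_SIZE, (1, 16))
image_tokens = torch.randint(0, VOCAB_SIZE, (1, 32))
sequence = torch.cat([text_tokens, image_tokens], dim=1)  # one multimodal sequence

model = TinyMultimodalLM()
logits = model(sequence[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), sequence[:, 1:].reshape(-1))
loss.backward()
print(f"next-token prediction loss: {loss.item():.3f}")
```

Because every modality is reduced to tokens in the same vocabulary, generation and perception both become ordinary sequence prediction, which is what removes the need for separate diffusion or compositional pipelines.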
Performance-wise, Emu3 surpasses several established task-specific models in both generation and perception tasks, demonstrating the efficacy of next-token prediction as a unified paradigm for multimodal AI. In a move to foster global collaboration, BAAI has open-sourced the core technologies and models of Emu3, inviting the international tech community to explore and build upon this advancement.
Technology experts are excited about the potential of Emu3, noting that its unified architecture opens new avenues for multimodality without the need to combine disparate models. Looking ahead, Wang envisions Emu3 playing a pivotal role in applications such as robot cognition, autonomous driving, multimodal dialogue systems, and enhanced inference capabilities.
"In the future, the multimodal world model will promote scenario applications such as robot brains, autonomous driving, multimodal dialogue and inference," Wang said.
With Emu3, BAAI is not only pushing the boundaries of what AI can achieve but also taking significant strides toward making advanced AI accessible to developers and researchers worldwide.
Reference(s):
"Developer launches Emu3 multimodal model unifying video, image, text," CGTN (cgtn.com)