
At the 2025 World Computing Conference, Kunlun Yuan AI officially launched BaiZe-Omni-14b-a2b, a multimodal fusion model built on the Ascend platform, marking a new stage in multimodal AI technology. The model offers comprehensive understanding and generation capabilities across text, audio, images, and video, and supports complex application scenarios through modal decoupling encoding, unified cross-modal fusion, and a dual-branch functional design. Its architecture adopts an MoE+TransformerX framework that introduces multi-linear attention layers and a single-layer hybrid attention aggregation layer, significantly improving computational efficiency on large-scale multimodal tasks.
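The announcement does not detail how BaiZe-Omni's MoE routing works; for readers unfamiliar with the technique, the following is a minimal, illustrative sketch of a generic top-k Mixture-of-Experts layer. All names, shapes, and the top-k gating scheme here are assumptions for illustration, not the model's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Generic top-k MoE layer sketch (not BaiZe-Omni's actual routing)."""
    def __init__(self, d_model, n_experts, top_k=2):
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each "expert" is reduced to a single linear map for brevity.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, x):                       # x: (tokens, d_model)
        scores = softmax(x @ self.gate)          # (tokens, n_experts)
        top = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):              # route each token to its top-k experts
            w = scores[t, top[t]]
            w = w / w.sum()                      # renormalize the selected gate weights
            for k, e in enumerate(top[t]):
                out[t] += w[k] * (x[t] @ self.experts[e])
        return out

layer = MoELayer(d_model=16, n_experts=8, top_k=2)
y = layer(rng.standard_normal((4, 16)))
```

Because only `top_k` of the `n_experts` sub-networks run per token, active parameters stay far below total parameters, which is the usual reason MoE architectures (such as a 14b-total / 2b-active configuration) can improve computational efficiency.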
For training data, BaiZe-Omni-14b-a2b draws on over 3.57 trillion text tokens, 300,000 hours of audio, 400 million images, and 400,000 hours of video, with differentiated allocation used to optimize single-modal purity and cross-modal alignment quality. On performance, the model reports a text understanding accuracy of 89.3%, and on a 32,768-token long-sequence summarization task it achieves a ROUGE-L score of 0.521, surpassing GPT-4's 0.487. It also supports multilingual generation and cross-modal creation of images, audio, and video, covering 10 task categories and demonstrating industry-leading generalization potential.
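For context on the summarization figures above, ROUGE-L scores a candidate summary against a reference by the length of their longest common subsequence (LCS). A minimal sketch of the standard F-measure form follows; the `beta` weighting is a common convention, not something specified in the announcement.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure over whitespace tokens (beta weights recall)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

A score of 0.521 versus 0.487 therefore means the model's long-document summaries share a substantially longer in-order word overlap with reference summaries than the compared baseline's do.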
Kunlun Yuan AI stated that the dual-branch design of BaiZe-Omni-14b-a2b balances understanding and generation capabilities and will drive innovation in areas such as intelligent customer service and content creation. The release not only strengthens the technological competitiveness of the Ascend ecosystem but also sets a new benchmark for the large-scale deployment of multimodal AI.