1. 📘 Topic and Domain: The paper introduces ERNIE 5.0, a trillion-parameter autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio.
2. 💡 Previous Research and New Ideas: Building on previous large language and vision-language models such as ERNIE 4.5, Gemini, and GPT, the paper proposes training all modalities from scratch under a unified next-group-of-tokens prediction objective, combined with an ultra-sparse mixture-of-experts architecture and a novel elastic training paradigm.
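The "next-group-of-tokens" objective can be pictured as a generalization of next-token prediction in which the target is a fixed-size block of tokens rather than a single one. A minimal sketch of how such grouped targets might be constructed, assuming a simple fixed group size and dropping any ragged tail (the paper's actual grouping scheme is not specified here):

```python
import numpy as np

def make_group_targets(tokens, group_size):
    """Reshape a flat token sequence into fixed-size groups so a model
    predicts the next *group* of tokens instead of a single next token.
    Illustrative only: group_size and tail handling are assumptions."""
    n = len(tokens) - len(tokens) % group_size  # drop the ragged tail
    groups = np.asarray(tokens[:n]).reshape(-1, group_size)
    # inputs are groups[:-1]; targets are groups[1:] (next-group prediction)
    return groups[:-1], groups[1:]

inputs, targets = make_group_targets(list(range(10)), group_size=2)
# e.g. the group [0, 1] is trained to predict the group [2, 3]
```

Under this framing, ordinary next-token prediction is the special case `group_size=1`.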
3. ❓ Problem: The paper aims to solve the limitation of existing multimodal models that decouple generation from understanding and rely on modality-specific components, which hinders deep cross-modal integration and forces trade-offs between multimodal capabilities and core language performance.
4. 🛠️ Methods: The authors use an ultra-sparse MoE architecture with modality-agnostic expert routing, elastic training for flexible deployment, unified multimodal reinforcement learning with techniques like unbiased replay buffer and multi-granularity importance sampling, and specialized tokenization strategies for each modality.
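Modality-agnostic expert routing means every token, whether it encodes text, image, video, or audio, passes through the same router rather than modality-specific gates. A minimal top-k routing sketch, assuming a simple linear router with softmax-renormalized gates (the expert count, k, and gating details are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def topk_route(token_embs, router_w, k=2):
    """Route each token to its top-k experts with one shared router.
    token_embs: (n_tokens, d); router_w: (d, n_experts).
    Returns expert indices and renormalized gate weights per token."""
    logits = token_embs @ router_w                 # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]     # top-k expert indices
    sel = np.take_along_axis(logits, topk, axis=-1)
    # softmax over only the selected experts' logits
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates

rng = np.random.default_rng(0)
experts, gates = topk_route(rng.standard_normal((5, 8)),
                            rng.standard_normal((8, 16)), k=2)
```

Because the router is shared, any expert specialization by modality (as the paper reports) emerges from training rather than being hard-wired.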
5. 📊 Results and Evaluation: ERNIE 5.0 achieves competitive or state-of-the-art performance across text, vision, and audio benchmarks; its elastic variants maintain near-full performance while activating only 53.7% of the parameters, and the model exhibits clear expert specialization patterns despite modality-agnostic routing.