1. 📘 Topic and Domain: The paper addresses native multimodal modeling, proposing a unified framework that represents text, vision, and audio within a shared discrete token space for autoregressive generation.
2. 💡 Previous Research and New Ideas: Building on the Next-Token Prediction (NTP) paradigm, and in contrast to prior multimodal systems that treat non-linguistic modalities as external attachments, the paper introduces the Discrete Native Autoregression (DiNA) paradigm and dNaViT (Discrete Native Resolution Vision Transformer) for unified multimodal understanding and generation.
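The core idea of native autoregression over a shared discrete space can be sketched as follows. This is a toy illustration, not the paper's implementation: the vocabulary offsets, modality names, and helper function are hypothetical, chosen only to show how tokens from different modalities can live in one ID space and be trained with a single next-token objective.

```python
# Toy sketch of a unified discrete token space (hypothetical ranges):
# each modality's tokens occupy a disjoint ID range in one shared
# vocabulary, so a single autoregressive model predicts the next token
# regardless of which modality it belongs to.
TEXT_BASE, VISION_BASE, AUDIO_BASE = 0, 10_000, 20_000

def to_unified(modality, local_id):
    """Map a modality-local token ID into the shared vocabulary."""
    base = {"text": TEXT_BASE, "vision": VISION_BASE, "audio": AUDIO_BASE}[modality]
    return base + local_id

# A mixed sequence, e.g. a short caption followed by image tokens:
seq = [to_unified("text", t) for t in [5, 17, 3]] + \
      [to_unified("vision", v) for v in [42, 7]]

# Next-token prediction: training maximizes P(seq[i] | seq[:i]) at every
# position i, whatever modality seq[i] belongs to.
pairs = [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

Because every modality shares one vocabulary and one loss, understanding (predicting text tokens conditioned on vision/audio tokens) and generation (predicting vision/audio tokens conditioned on text) become the same training objective.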
3. ❓ Problem: The paper tackles the challenge of representing non-linguistic modalities (vision and audio) in a discrete token space, aiming to break the performance ceiling of discrete visual modeling and to reconcile the traditionally conflicting objectives of understanding and generation.
4. 🛠️ Methods: The paper employs Semantic-and-Aligned Encoders (SAE) for semantic completeness, Residual Vector Quantization (RVQ) for hierarchical discrete tokenization, a modality-agnostic MoE backbone, and internal linguistic guidance for audio generation, realized via parallel and serial decoding strategies.
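Residual Vector Quantization, the tokenization scheme named above, is a standard technique that can be sketched briefly: each stage quantizes the residual error left by the previous stage, so a vector is encoded as a short list of codebook indices of increasing refinement. This is a minimal generic sketch with made-up codebook sizes, not the paper's tokenizer.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k picks the code vector in
    codebooks[k] nearest to the residual left by stages 0..k-1."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in codebooks:  # cb: (K, D) array of code vectors
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vectors across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```

Each input vector thus becomes a hierarchy of discrete tokens (one index per stage), which is what lets a continuous visual or audio embedding enter the shared discrete token space described above.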
5. 📊 Results and Evaluation: LongCat-Next achieves competitive visual-understanding results (MMMU 70.6, MathVista 83.1), strong visual generation (GenEval 84.44), and state-of-the-art audio capabilities, while retaining robust text performance; it outperforms other unified multimodal models while reconciling the understanding and generation objectives.