2026-02-05 Papers


Paper 1

ERNIE 5.0 Technical Report

Published: 2026-02-04

Link: http://arxiv.org/pdf/2602.04705

1. 📘 Topic and Domain: The paper introduces ERNIE 5.0, a trillion-parameter autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio.
2. 💡 Previous Research and New Ideas: Building on previous large language and vision-language models like ERNIE 4.5, Gemini, and GPT, the paper proposes training all modalities from scratch under a unified next-group-of-tokens prediction objective, with an ultra-sparse mixture-of-experts architecture and a novel elastic training paradigm.
3. ❓ Problem: The paper aims to solve the limitation of existing multimodal models that decouple generation from understanding and rely on modality-specific components, which hinders deep cross-modal integration and forces trade-offs between multimodal capabilities and core language performance.
4. 🛠️ Methods: The authors use an ultra-sparse MoE architecture with modality-agnostic expert routing, elastic training for flexible deployment, unified multimodal reinforcement learning with techniques like unbiased replay buffer and multi-granularity importance sampling, and specialized tokenization strategies for each modality.
5. 📊 Results and Evaluation: ERNIE 5.0 achieves competitive or state-of-the-art performance across text, vision, and audio benchmarks; elastic variants maintain near-full performance with only 53.7% of the activated parameters, and experts develop clear specialization patterns despite modality-agnostic routing.
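The modality-agnostic expert routing in item 4 can be sketched as a standard top-k softmax router that scores every token the same way regardless of which modality produced it. This is an illustrative sketch: the shapes, the number of experts, and k=2 are assumptions, not ERNIE 5.0's actual configuration.

```python
import numpy as np

def topk_moe_route(tokens, w_gate, k=2):
    """Modality-agnostic top-k MoE routing sketch (illustrative, not
    ERNIE 5.0's real gate): every token is scored against all experts
    with one shared gate, so routing never sees modality boundaries."""
    logits = tokens @ w_gate                      # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of k best experts
    # softmax over only the selected experts' logits -> combination weights
    sel = np.take_along_axis(logits, topk, axis=-1)
    sel = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights = sel / sel.sum(axis=-1, keepdims=True)
    return topk, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))    # 6 tokens (any modality), hidden dim 16
w_gate = rng.normal(size=(16, 8))    # gate over 8 hypothetical experts
experts, weights = topk_moe_route(tokens, w_gate, k=2)
```

Because the gate is shared, any specialization that emerges (as in the routing analysis) comes from the data, not from hard-coded modality assignments.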


[Workflow diagram: ERNIE 5.0 technical workflow — Unified Architecture: ultra-sparse MoE with modality-agnostic routing, visual modeling via NFSP prediction, audio modeling via NCP prediction, next-group-of-tokens objective, elastic training over depth/width/sparsity. Pre-Training: multimodal data, 8K → 32K → 128K recipe, WSD schedule, once-for-all elastic sub-models; infrastructure with hybrid parallelism and FlashMask. Post-Training: SFT (instruction following, chain-of-thought), UMRL (U-RB buffer, MISC & WPSM, AHRL); RL infrastructure with disaggregated control, unified FP8 stack, elastic CPU pooling; verifier system with multimodal rewards. Evaluation & Analysis: language, vision, and audio tasks; expert routing analysis. Key features: unified autoregressive modeling, elastic training, stabilized RL, multimodal excellence.]
Q1. What is the key innovation of ERNIE 5.0's elastic training paradigm?
- It allows a single pre-training run to produce a family of sub-models with varying depths, expert capacities, and routing sparsity
- It requires separate training runs for each model size but shares the same training data
- It only adjusts the learning rate dynamically during training to improve convergence
Q2. How does ERNIE 5.0 handle the challenge of different attention patterns across modalities?
- It uses separate attention mechanisms for each modality with independent parameters
- It employs FlashMask to efficiently handle per-sample heterogeneous attention masks, where vision uses bidirectional attention while text/audio use unidirectional
- It converts all modalities to use the same causal attention pattern throughout the model
Q3. What surprising behavior emerges from ERNIE 5.0's modality-agnostic expert routing?
- All experts become equally activated across all modalities, showing no specialization
- The routing completely fails and requires manual intervention to assign experts to modalities
- Despite no explicit modality boundaries, experts naturally develop specialization patterns based on task requirements rather than modality types

Paper 2

WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

Published: 2026-02-04

Link: http://arxiv.org/pdf/2602.04634

1. 📘 Topic and Domain: The paper explores width scaling in Large Language Models through multi-agent systems for broad information-seeking tasks.
2. 💡 Previous Research and New Ideas: Building on depth scaling approaches (sequential multi-turn reasoning) and existing multi-agent frameworks with hand-crafted workflows, the paper proposes WIDESEEK-R1, a lead-agent–subagent system trained via multi-agent reinforcement learning (MARL) for synergized orchestration and parallel execution.
3. ❓ Problem: The paper addresses the bottleneck of single-agent systems in broad information-seeking tasks, where context pollution and sequential execution limit performance when gathering and synthesizing information about multiple entities.
4. 🛠️ Methods: The authors use a shared LLM with isolated contexts for lead agent and subagents, train via MARL with group-normalized advantages and dual-level reweighting, and construct a 20k broad information-seeking dataset for training.
5. 📊 Results and Evaluation: WIDESEEK-R1-4B achieves 40.0% item F1 score on WideSearch benchmark (comparable to single-agent DeepSeek-R1-671B with 170× fewer parameters) and shows consistent performance gains as parallel subagents increase, demonstrating effective width scaling.
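The group-normalized advantages in item 4 can be sketched as GRPO-style normalization: sample a group of rollouts for the same query, then center and scale each rollout's reward by the group's statistics. The group size and epsilon here are illustrative assumptions, not WideSeek-R1's settings.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization sketch (illustrative): each
    rollout's advantage is its reward centered by the group mean and
    scaled by the group standard deviation, so advantages are relative
    to other rollouts for the same query rather than absolute."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 4 hypothetical rollouts for one query: two succeeded, two failed
adv = group_normalized_advantages([0.0, 1.0, 1.0, 0.0])
```

Normalizing within the group removes per-query reward scale, which is what lets a single policy update serve both the lead agent and the subagents without one query's difficulty dominating the gradient.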


[Workflow diagram: WideSeek-R1 multi-agent RL workflow — a query goes to the lead agent for task decomposition and multi-turn orchestration via the call_subagent tool; subagents 1…N+1 execute in parallel, each using search/access tools; MARL training applies group normalization, advantage assignment, dual-level reweighting, and a GRPO loss update; a training-data construction pipeline (query generation → answer generation → QA-pair filtering) produces the 20k dataset; subagent results are synthesized into the final answer.]
Q1. What unique tool does the lead agent in WIDESEEK-R1 exclusively use to avoid context pollution?
- search and access tools for direct information retrieval
- call_subagent tool for delegating subtasks to parallel agents
- browser tool for comprehensive web navigation
Q2. How does WIDESEEK-R1's performance scale compared to traditional depth scaling when increasing computational resources?
- Both width and depth scaling show similar linear improvements
- Depth scaling quickly plateaus while width scaling shows consistent gains with more subagents
- Width scaling immediately outperforms depth scaling but then degrades rapidly
Q3. What key innovation in WIDESEEK-R1's training approach enables effective multi-agent coordination?
- Using separate specialized models for each agent role with different architectures
- Implementing turn-taking sequential interactions between agents
- Jointly optimizing lead and subagents via MARL with group-normalized advantages

Paper 3

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Published: 2026-02-04

Link: http://arxiv.org/pdf/2602.04804

1. 📘 Topic and Domain: The paper focuses on token compression for Omni-modal Large Language Models (Omni-LLMs) that process audio, video, and text simultaneously.
2. 💡 Previous Research and New Ideas: The paper builds on existing token compression methods like OmniZip and DyCoke but proposes a modality-asymmetric approach where video tokens guide audio token selection, unlike previous symmetric methods.
3. ❓ Problem: The paper aims to solve the computational overhead caused by long multimodal token sequences in Omni-LLMs, where a typical 20-second clip can generate over 20K tokens.
4. 🛠️ Methods: The authors use OmniSIFT, a two-stage framework with Spatio-Temporal Video Pruning (STVP) to remove video redundancy and Vision-Guided Audio Selector (VGAS) to filter audio tokens based on visual cues.
5. 📊 Results and Evaluation: OmniSIFT achieves superior performance across five benchmarks, maintaining or exceeding the full-token model's accuracy while using only 25% of the original tokens and reducing inference time by 40%.
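The vision-guided audio selection in item 4 can be sketched as scoring each audio token by the cross-attention mass it receives from the compressed visual tokens, then keeping the top fraction in temporal order. This is a minimal sketch: the 25% keep ratio follows the summary, but the scoring rule and all dimensions are illustrative assumptions, not the paper's exact VGAS module.

```python
import numpy as np

def vision_guided_audio_select(z_v, z_a, keep_ratio=0.25):
    """Sketch of vision-guided audio token selection (illustrative):
    score each audio token by the total softmax cross-attention mass it
    receives from visual tokens, keep the top `keep_ratio` fraction,
    and preserve the original temporal order of the kept tokens."""
    d = z_v.shape[-1]
    attn = z_v @ z_a.T / np.sqrt(d)                 # (n_vis, n_aud) logits
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over audio tokens
    saliency = attn.sum(axis=0)                     # attention mass per token
    n_keep = max(1, int(len(z_a) * keep_ratio))
    keep = np.sort(np.argsort(saliency)[-n_keep:])  # top tokens, time-ordered
    return z_a[keep], keep

rng = np.random.default_rng(1)
z_v = rng.normal(size=(10, 32))   # hypothetical compressed visual tokens
z_a = rng.normal(size=(40, 32))   # hypothetical audio tokens
z_a_kept, idx = vision_guided_audio_select(z_v, z_a, keep_ratio=0.25)
```

The asymmetry is the point: audio tokens are ranked by visual context, whereas the video side (STVP) prunes its own spatial and temporal redundancy independently.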


[Workflow diagram: OmniSIFT modality-asymmetric token compression — video input V and audio input A pass through vision encoder Φv and audio encoder Φa to produce visual tokens Zv and audio tokens Za; the STVP module (Spatio-Temporal Video Pruning) removes spatial and temporal redundancy to yield compressed visual tokens Ẑv; the VGAS module (Vision-Guided Audio Selector) uses cross-attention for visual-guided selection, yielding compressed audio tokens Ẑa; multimodal chunks Ct = [Ẑv(t); Ẑa(t)] are fed to the LLM backbone.]
Q1. What is the key insight behind OmniSIFT's modality-asymmetric compression approach?
- Audio and video tokens should be compressed equally because they carry the same semantic importance
- Visual redundancy can be resolved independently, while audio saliency depends on visual context
- Audio tokens should guide video compression since sound carries more temporal information
Q2. In the badminton game case study comparing OmniSIFT and OmniZip, what critical error did OmniZip make?
- It retained too many audio tokens, causing computational overhead
- It pruned the scoreboard visual patches, leading to incorrect score interpretation
- It failed to synchronize audio and video tokens properly
Q3. What surprising result did OmniSIFT achieve when retaining only 25% of the original tokens?
- It maintained 25% of the original model's accuracy across all benchmarks
- It required 75% more parameters than the full-token baseline
- It outperformed the full-token model on several benchmarks