1. 📘 Topic and Domain: The paper introduces ERNIE 5.0, a trillion-parameter autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio.
2. 💡 Previous Research and New Ideas: Building on previous large language and vision-language models such as ERNIE 4.5, Gemini, and GPT, the paper proposes training all modalities from scratch under a unified next-group-of-tokens prediction objective, combined with an ultra-sparse mixture-of-experts architecture and a novel elastic training paradigm.
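The "next-group-of-tokens" objective can be pictured as a generalization of next-token prediction in which the target is a fixed-size block of tokens rather than a single one. A minimal sketch of how such grouped targets might be constructed, assuming a simple fixed group size and dropping any ragged tail (the paper's actual grouping scheme is not specified here):

```python
import numpy as np

def make_group_targets(tokens, group_size):
    """Reshape a flat token sequence into fixed-size groups so a model
    predicts the next *group* of tokens instead of a single next token.
    Illustrative only: group_size and tail handling are assumptions."""
    n = len(tokens) - len(tokens) % group_size  # drop the ragged tail
    groups = np.asarray(tokens[:n]).reshape(-1, group_size)
    # inputs are groups[:-1]; targets are groups[1:] (next-group prediction)
    return groups[:-1], groups[1:]

inputs, targets = make_group_targets(list(range(10)), group_size=2)
# e.g. the group [0, 1] is trained to predict the group [2, 3]
```

Under this framing, ordinary next-token prediction is the special case `group_size=1`.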
3. ❓ Problem: The paper aims to solve the limitation of existing multimodal models that decouple generation from understanding and rely on modality-specific components, which hinders deep cross-modal integration and forces trade-offs between multimodal capabilities and core language performance.
4. 🛠️ Methods: The authors use an ultra-sparse MoE architecture with modality-agnostic expert routing, elastic training for flexible deployment, unified multimodal reinforcement learning with techniques like unbiased replay buffer and multi-granularity importance sampling, and specialized tokenization strategies for each modality.
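Modality-agnostic expert routing means every token, whether it encodes text, image, video, or audio, passes through the same router rather than modality-specific gates. A minimal top-k routing sketch, assuming a simple linear router with softmax-renormalized gates (the expert count, k, and gating details are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def topk_route(token_embs, router_w, k=2):
    """Route each token to its top-k experts with one shared router.
    token_embs: (n_tokens, d); router_w: (d, n_experts).
    Returns expert indices and renormalized gate weights per token."""
    logits = token_embs @ router_w                 # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]     # top-k expert indices
    sel = np.take_along_axis(logits, topk, axis=-1)
    # softmax over only the selected experts' logits
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates

rng = np.random.default_rng(0)
experts, gates = topk_route(rng.standard_normal((5, 8)),
                            rng.standard_normal((8, 16)), k=2)
```

Because the router is shared, any expert specialization by modality (as the paper reports) emerges from training rather than being hard-wired.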
5. 📊 Results and Evaluation: ERNIE 5.0 achieves competitive or state-of-the-art performance across text, vision, and audio benchmarks; its elastic variants maintain near-full performance while activating only 53.7% of the parameters, and the model exhibits clear expert specialization patterns despite modality-agnostic routing.