2025-10-23 Papers


Paper 1

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Published: 2025-10-22

Link: http://arxiv.org/pdf/2510.19338

1. 📘 Topic and Domain: The paper presents Ring-linear models (Ring-mini-linear-2.0 and Ring-flash-linear-2.0), which are hybrid architecture language models combining linear and softmax attention for efficient long-context reasoning.
2. 💡 Previous Research and New Ideas: Based on previous research in Linear Attention (Mamba, Gated Linear Attention) and hybrid architectures, the paper proposes a new hybrid architecture that effectively balances between linear and softmax attention with systematic training-inference alignment.
3. ❓ Problem: The paper addresses the challenge of efficiently processing long sequences in language models while maintaining performance: softmax attention has quadratic computational complexity, and its KV cache (and the associated I/O) grows linearly with sequence length.
4. 🛠️ Methods: The authors implement a hybrid architecture combining linear and softmax attention, optimize FP8 training with fused kernels (LingHe), and develop systematic training-inference alignment for stable reinforcement learning training.
5. 📊 Results and Evaluation: The models achieve comparable or better performance than larger counterparts across various reasoning benchmarks while reducing inference costs by 90% compared to dense models and 50% compared to the original Ring series, with Ring-flash-linear-2.0 scoring particularly well on mathematical reasoning tasks.
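The paper reports hybrid ratios of 1:4 (Ring-mini-linear-2.0) and 1:7 (Ring-flash-linear-2.0) between softmax- and linear-attention layers. A minimal sketch of how such a layer schedule might be laid out, assuming one full softmax-attention layer follows each run of linear-attention layers (the function name and layout are my own illustration, not the paper's implementation):

```python
# Illustrative only: interleave linear-attention layers with occasional
# softmax-attention layers, as in a 1:4 or 1:7 hybrid ratio.

def hybrid_schedule(num_layers: int, linear_per_softmax: int) -> list:
    """Return a layer-type list: `linear_per_softmax` linear-attention
    layers followed by one softmax-attention layer, repeated."""
    block = ["linear"] * linear_per_softmax + ["softmax"]
    schedule = (block * (num_layers // len(block) + 1))[:num_layers]
    return schedule

# A 20-layer stack at the 1:4 ratio reported for Ring-mini-linear-2.0:
sched = hybrid_schedule(20, 4)
print(sched.count("softmax"), sched.count("linear"))  # 4 softmax, 16 linear
```

Keeping a few softmax layers preserves exact all-pairs attention where it matters, while the linear layers keep the KV-cache footprint nearly constant.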

Overview (key points from the paper's architecture diagram):

- Architecture: hybrid Lightning (linear) + softmax attention, MoE with a 1/32 activation ratio, grouped RMSNorm, RoPE + QK Norm.
- Computation optimization: GPU kernel fusion, FP8 training via the self-developed LingHe library, state-aware recompute, speculative decoding.
- Training pipeline: continued pre-training initialized from Ling-base to restore capabilities, context extension 4K → 32K → 128K with a WSM scheduler, then SFT (128K context: comprehensive reasoning, function calling, de-noising and safety) and RL across multiple domains (64K context, PPO with rollout probabilities) for long-term stable training.
- Training-inference alignment for stable RL: matched KV-cache precision, FP32 LM head, aligned RMSNorm and RoPE, deterministic MoE execution order.
- Models: Ring-mini-linear-2.0 (16B params, 957M active, 1:4 hybrid ratio, 128K context) and Ring-flash-linear-2.0 (104B params, 6.1B active, 1:7 hybrid ratio, 128K context).
- Efficiency: linear O(nd²) vs. quadratic O(n²d) attention complexity and a near-constant KV cache vs. linear growth, yielding a 50% training-efficiency improvement, roughly 10× lower inference cost than a 32B dense model, and 50% lower cost than the original Ring series.
- Evaluation: mathematical reasoning, coding and agent tasks, general reasoning.
Q1. What is the main innovation that helps Ring-linear models reduce inference costs compared to dense models?
- Using only linear attention mechanisms
- Combining linear attention with softmax attention in a hybrid architecture
- Implementing larger model parameters

Q2. What was identified as the root cause of training collapse in reinforcement learning for these models?
- Insufficient training data
- Model size limitations
- Training-inference disparity

Q3. How much improvement in training efficiency was achieved through the self-developed FP8 operator library (LingHe)?
- 25% improvement
- 50% improvement
- 75% improvement

Paper 2

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Published: 2025-10-22

Link: http://arxiv.org/pdf/2510.19363

1. 📘 Topic and Domain: Long-context reasoning in large language models through reinforcement learning, focusing on enhancing models' ability to reason over extensive text contexts.
2. 💡 Previous Research and New Ideas: Building on previous research in short-context reasoning and chain-of-thought prompting, the paper proposes a new KeyChain data synthesis method to transform short multi-hop QA into challenging long-context tasks.
3. ❓ Problem: Addresses the challenge of improving LLMs' ability to reason over long contexts (up to 128K tokens) while maintaining short-context capabilities and avoiding prohibitive training costs.
4. 🛠️ Methods: Implements LoongRL with KeyChain data construction, which inserts UUID chains to hide questions in long contexts, uses Group Relative Policy Optimization for training, and employs a multi-stage curriculum approach.
5. 📊 Results and Evaluation: Achieved significant improvements in long-context reasoning (+23.5% for 7B and +21.1% for 14B models), with LoongRL-14B reaching 74.2 score rivaling larger models, while maintaining short-context capabilities and generalizing effectively to 128K contexts.
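The paper trains with Group Relative Policy Optimization, whose advantage normalizes each rollout's reward against its group: A_i,t = (r_i − mean(rewards)) / std(rewards). A minimal sketch of that computation (function and variable names are mine; the formula is from the paper):

```python
# Group-relative advantage as used by GRPO: each rollout's reward is
# standardized within its own sampling group.
import numpy as np

def grpo_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r), computed within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards an all-equal group

# A group of 8 rollouts (the paper's group size) with binary rule-based rewards:
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
```

Because the baseline is the group mean rather than a learned value function, no critic network is needed; correct rollouts get positive advantage, incorrect ones negative.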

Overview (key points from the paper's diagram):

- KeyChain data construction: curate seed multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA), fill long contexts with distracting documents (~16K tokens), and insert UUID chains that hide the true question, forcing step-by-step chain tracing and reasoning over retrieval.
- Training data mix: KeyChain data (hard), multi-hop QA (medium), needle retrieval (easy), and math data (to preserve short-context skills).
- Multi-stage RL: Stage 1 warm-up → Stage 2 KeyChain augmentation → Stage 3 difficulty-focused; GRPO with a rule-based, two-way substring exact-match reward; advantage A_i,t = (r_i − mean(rewards)) / std(rewards).
- Emergent reasoning pattern: plan → retrieve → reason → recheck, trained at ~16K tokens yet generalizing to 128K contexts.
- Technical setup: Qwen2.5-7B/14B-Instruct; group size 8, LR 1e-6, temperature 0.6, top-p 0.95, max output 4096 tokens.
- Key results: +23.5% (7B) and +21.1% (14B) on long-context reasoning; LoongRL-14B scores 74.2, rivaling o3-mini (74.5) and DeepSeek-R1 (74.9); perfect Needle-in-a-Haystack performance; short-context abilities preserved.
- Evaluation: long-context (LongBench v1/v2), short-context (MMLU, MATH, IFEval), retrieval (RULER, Needle-in-a-Haystack), multi-hop QA (HotpotQA, MuSiQue, NarrativeQA, QASPER); baselines include R1-Distill variants, QwenLong-L1-32B, GPT-4o, and QwQ-32B, showing frontier-level reasoning at smaller scale.
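The KeyChain idea described above can be sketched in a few lines: hide the true question behind a chain of UUID pointers scattered among distractor documents, so the model must trace the chain link by link before it can even read the question. The exact prompt format is not given in this summary, so everything below (function name, link wording) is an illustrative assumption:

```python
# Hypothetical sketch of KeyChain-style data construction, not the
# paper's actual pipeline.
import random
import uuid

def build_keychain(question: str, distractors: list, chain_len: int = 3):
    ids = [str(uuid.uuid4()) for _ in range(chain_len)]
    # Each link points to the next UUID; only the last link reveals the question.
    links = [f"Key {ids[i]}: see key {ids[i + 1]}." for i in range(chain_len - 1)]
    links.append(f"Key {ids[-1]}: the real question is: {question}")
    docs = distractors + links
    random.shuffle(docs)  # bury the chain among distracting documents
    context = "\n".join(docs)
    prompt = f"{context}\n\nStart from key {ids[0]} and answer the question."
    return prompt, ids

prompt, ids = build_keychain(
    "Who directed the film that won Best Picture in 1998?",
    [f"Distractor document {i}." for i in range(5)],
)
```

Because simple keyword retrieval cannot find the question, the model is pushed toward the plan-retrieve-reason-recheck behavior the paper reports.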
Q1. What is the key innovation of the KeyChain data synthesis method?
- It simply adds random text to make contexts longer
- It inserts UUID chains that hide the true question among distracting documents
- It combines multiple short questions into one long question

Q2. What unique reasoning pattern emerged from models trained with LoongRL?
- Quick answer generation without explanation
- Random trial-and-error approach
- Plan-retrieve-reason-recheck systematic thinking

Q3. How did LoongRL achieve efficient training for 128K-context tasks?
- By using massive computing resources to train directly on 128K contexts
- By training on 16K contexts and leveraging natural generalization of the learned reasoning patterns
- By gradually increasing context length during training

Paper 3

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Published: 2025-10-22

Link: http://arxiv.org/pdf/2510.19430

1. 📘 Topic and Domain: A Vision-Language-Action (VLA) model called GigaBrain-0 for robotic manipulation tasks, operating in the domain of robotics and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on previous VLA models and world model research, introduces a novel approach using world model-generated data (video generation, real2real transfer, human transfer, view transfer, sim2real transfer) instead of relying heavily on real robot data.
3. ❓ Problem: Addresses the challenge of collecting large-scale real-world robot data, which is expensive, time-consuming, and limited in diversity, hindering the development of robust, general-purpose robotic systems.
4. 🛠️ Methods: Employs a mixture-of-transformers architecture combining Vision-Language Model (VLM) and action Diffusion Transformer (DiT), enhanced with RGB-D input modeling, embodied Chain-of-Thought supervision, and Knowledge Insulation for better spatial reasoning and action generation.
5. 📊 Results and Evaluation: Achieved superior performance across various tasks (dexterous manipulation, long-horizon tasks, mobile manipulation), with significantly improved generalization in appearance, object placement, and camera viewpoint variations, while also offering a lightweight variant (GigaBrain-0-Small) for efficient on-device deployment.
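The action Diffusion Transformer is trained with flow matching over action chunks. As a minimal illustration of what a flow-matching regression target looks like (my own sketch under the standard linear-interpolation formulation, not the paper's implementation; shapes are assumptions):

```python
# Flow-matching training target for an action chunk: interpolate between a
# noise sample and the ground-truth chunk, and regress the constant
# velocity between them.
import numpy as np

rng = np.random.default_rng(0)
chunk_len, action_dim = 16, 7                    # e.g. 16 future steps, 7-DoF arm
a1 = rng.normal(size=(chunk_len, action_dim))    # ground-truth action chunk
a0 = rng.normal(size=(chunk_len, action_dim))    # noise sample
t = rng.uniform()                                # random interpolation time in [0, 1)

x_t = (1 - t) * a0 + t * a1        # point on the straight-line path at time t
target_velocity = a1 - a0          # what the network is trained to predict at (x_t, t)
```

At inference time, integrating the predicted velocity field from noise recovers a full action chunk, which is what makes chunked real-time control feasible.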

Overview (key points from the paper's diagram):

- Data sources: real-world robot data plus world-model-generated data (video generation, real2real transfer, view transfer, sim2real transfer, human transfer) and egocentric human data.
- Input processing: RGB-D observations and language instructions, handled by a PaliGemma2-based vision-language expert trained with Knowledge Insulation.
- Embodied Chain-of-Thought: intermediate supervision over manipulation trajectories, subgoal language, and discrete action tokens.
- Action expert: a Diffusion Transformer trained with flow matching for action-chunk prediction, with a GRU decoder producing trajectories for real-time control.
- Outputs and capabilities: dexterous manipulation, long-horizon tasks, mobile manipulation, and generalization; evaluated on the AgiBot G1 and PiPER robot platforms, alongside the lightweight GigaBrain-0-Small variant.
Q1. What is the primary innovation of GigaBrain-0 that addresses the challenge of collecting real-world robot data?
- Using multiple cameras to capture more training data
- Leveraging world model-generated data for training
- Implementing faster data collection robots

Q2. What unique architectural feature does GigaBrain-0 use to enhance spatial reasoning in robotic tasks?
- Embodied Chain-of-Thought supervision with intermediate reasoning tokens
- Standard transformer architecture with attention layers
- Simple convolutional neural networks

Q3. How does GigaBrain-0-Small achieve efficient on-device performance?
- By reducing the robot's physical components
- By using cloud computing for all processing
- By optimizing memory transfers and using mixed-precision inference