2026-01-29 Papers


Paper 1

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20614

1. 📘 Topic and Domain: The paper focuses on enhancing mathematical reasoning capabilities in large language models through reinforcement learning techniques.
2. 💡 Previous Research and New Ideas: The paper builds on Group Relative Policy Optimization (GRPO) but identifies its implicit bias against harder questions, proposing Difficulty-Aware Group Policy Optimization (DGPO) with balanced advantage estimation and Multi-Aspect Question Reformulation (MQR) for data augmentation.
3. ❓ Problem: The paper addresses the systematic under-emphasis of challenging questions in existing reinforcement learning methods: GRPO's update magnitudes peak at moderate difficulty and are suppressed for both easier and harder questions.
4. 🛠️ Methods: The authors propose the DGPO algorithm, which combines difficulty-balanced group advantage estimation (normalizing by mean absolute deviation) with difficulty-aware question-level weighting, together with the MQR strategy, which reformulates questions by adding story backgrounds, introducing abstract terminology, and nesting sub-problems.
5. 📊 Results and Evaluation: The combined framework, MathForge, achieves 42.17% average accuracy across six benchmarks with Qwen2.5-Math-7B, a 4.56-point improvement over the GRPO baseline, with consistent gains across model sizes and types, including multimodal domains.


MathForge: a dual-perspective framework.

- Algorithmic perspective (DGPO):
  - Difficulty-Balanced Group Advantage Estimation (DGAE): rectifies update-magnitude imbalance via MAD normalization
  - Difficulty-Aware Question-Level Weighting (DQW): prioritizes harder questions via exponential weighting with temperature control
- Data perspective (MQR):
  - Background reformulation: adds story context to increase complexity
  - Term reformulation: introduces abstract mathematical terminology
  - Sub-problem nesting: converts conditions into independent sub-problems
  - Key constraint: all reformulations preserve the original gold answer
- Synergistic loop: MQR expands the data frontier; DGPO learns from the resulting challenges
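The DGAE and DQW components can be sketched in a few lines. The pass-rate difficulty proxy, the exact exponential weighting form, and the `tau` default are illustrative assumptions here, not the paper's exact formulation:

```python
import math

def dgpo_advantages(rewards, tau=1.0):
    """Sketch of DGPO-style advantage estimation for one question's rollout
    group. Replaces GRPO's standard-deviation denominator with the mean
    absolute deviation (MAD) and up-weights harder questions with an
    exponential, temperature-controlled factor.

    Assumptions (not from the paper): binary rewards, pass rate as the
    difficulty proxy, weight = exp((1 - p) / tau).
    """
    g = len(rewards)
    mean_r = sum(rewards) / g
    # DGAE: normalize by mean absolute deviation instead of std dev
    mad = sum(abs(r - mean_r) for r in rewards) / g
    advantages = [(r - mean_r) / (mad + 1e-8) for r in rewards]
    # Empirical pass rate as a difficulty proxy (assumes 0/1 rewards)
    p = mean_r
    # DQW: harder questions (low pass rate) get a larger weight
    weight = math.exp((1.0 - p) / tau)
    return [weight * a for a in advantages], weight
```

A group where only one of four rollouts succeeds gets a larger question-level weight than a group where three of four succeed, which is the "harder is better" emphasis the paper argues GRPO lacks.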
Q1. What mathematical insight did the authors discover about GRPO's update magnitudes?
- Update magnitudes are highest for the easiest questions and decrease linearly with difficulty
- Update magnitudes peak at moderate difficulty (p=0.5) and are suppressed for both easier and harder questions
- Update magnitudes remain constant across all difficulty levels due to normalization
Q2. How does the Multi-Aspect Question Reformulation (MQR) strategy ensure the validity of augmented training data?
- It uses GPT-4 to regenerate new solutions for each reformulated question
- It constrains all reformulations to preserve the original gold answer while increasing difficulty
- It validates each question through multiple rounds of human expert review
Q3. What key mathematical change does DGPO make to GRPO's advantage estimation function?
- It replaces the standard deviation denominator with mean absolute deviation (MAD)
- It adds a logarithmic scaling factor to amplify harder questions
- It introduces a learnable neural network to estimate advantages dynamically

Paper 2

Advancing Open-source World Models

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20540

1. 📘 Topic and Domain: The paper presents LingBot-World, an open-source world model for interactive video generation that bridges video synthesis and actionable simulation in computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Building on video generation models and world simulators like Genie 3 and Wan2.2, the paper proposes a multi-stage evolution strategy (pre-training, middle-training, post-training) with hierarchical data captioning and mixture-of-experts architecture for long-term consistency.
3. ❓ Problem: The paper addresses the challenge of transitioning from passive video generation to interactive world simulation, tackling issues of scarce interactive data, maintaining long-term temporal coherence, and achieving real-time controllable generation.
4. 🛠️ Methods: The authors employ a scalable data engine with game/synthetic data acquisition, progressive curriculum training with MoE architecture, and causal architecture adaptation with few-step distillation for real-time inference.
5. 📊 Results and Evaluation: LingBot-World outperforms baselines on VBench (dynamic degree of 0.8857 vs. 0.7612 and 0.7217), maintains minute-level temporal consistency, supports real-time interaction at 16 FPS, and demonstrates emergent spatial memory and 3D-consistency capabilities.


LingBot-World pipeline.

- Data engine:
  - Data acquisition: general videos, game data, synthetic (Unreal Engine) data
  - Data profiling: basic filtering, semantic analysis, camera-pose labels
  - Data captioning: narrative, scene-static, and temporal captions
- Stage I (pre-training): general video prior; open-domain generation; spatiotemporal coherence
- Stage II (middle-training): world-knowledge injection; action control (MoE); long-term consistency
- Stage III (post-training): real-time interaction; causal attention; few-step distillation
- Architecture: DiT blocks with action injection plus a Plücker encoder; 28B parameters (MoE with high-noise and low-noise experts)
- Applications:
  - Promptable world events: global and local dynamics via text control
  - Action agent: learning autonomous exploration policies
  - 3D reconstruction: geometric-consistency validation
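Stage III's switch to causal attention can be made concrete with a mask sketch. The quiz mentions block causal attention, which combines local bidirectional dependencies with global causality; here is a minimal sketch of such a mask, with the block size and per-frame granularity as assumptions rather than the paper's actual configuration:

```python
def block_causal_mask(num_frames, block_size):
    """Sketch of a block causal attention mask: frames attend
    bidirectionally within their own temporal block but only causally to
    earlier blocks. This is one way to adapt a bidirectional video model
    for streaming, interactive generation.

    Returns mask[i][j] = True where frame i may attend to frame j.
    """
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(num_frames):
            # Allowed if j is in the same block as i, or in any earlier block
            mask[i][j] = (j // block_size) <= (i // block_size)
    return mask
```

Within a block the model keeps full bidirectional context (quality), while across blocks it never looks ahead (interactivity), which is the trade-off the post-training stage targets.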
Q1. What emergent capability does LingBot-World demonstrate without relying on explicit 3D representations?
- It can maintain spatial memory and preserve landmark integrity even after objects are out of view for 60 seconds
- It can generate photorealistic textures using only 2D convolutions
- It can automatically convert any video into a playable game without additional training
Q2. Which architectural innovation allows LingBot-World to transform from a bidirectional model to a real-time interactive system?
- Replacing all attention layers with convolutional layers for faster processing
- Using block causal attention that combines local bidirectional dependencies with global causality constraints
- Implementing a completely new transformer architecture from scratch
Q3. What unique data acquisition strategy does LingBot-World employ to overcome the scarcity of interactive training data?
- It only uses publicly available YouTube videos with manual annotations
- It relies exclusively on reinforcement learning in simulated environments
- It combines real-world footage, game engine recordings with paired control inputs, and synthetic data from Unreal Engine

Paper 3

DeepSeek-OCR 2: Visual Causal Flow

Published: 2026-01-28

Link: http://arxiv.org/pdf/2601.20552

1. 📘 Topic and Domain: The paper presents DeepSeek-OCR 2, a vision-language model for document reading and optical character recognition with a novel encoder that dynamically reorders visual tokens based on image semantics.
2. 💡 Previous Research and New Ideas: The paper builds on DeepSeek-OCR, DETR's parallelized queries, and BLIP-2's Q-former, proposing DeepEncoder V2 which replaces CLIP with an LLM-style architecture using causal attention to enable semantic-aware visual token reordering.
3. ❓ Problem: The paper addresses the limitation of conventional VLMs that process visual tokens in rigid raster-scan order, which contradicts human visual perception that follows flexible, semantically coherent scanning patterns driven by causal reasoning.
4. 🛠️ Methods: The authors use a vision tokenizer with SAM-base architecture, an LLM-style encoder (Qwen2-0.5B) with dual-stream attention (bidirectional for visual tokens, causal for learnable queries), and a DeepSeek-3B MoE decoder, trained in three stages.
5. 📊 Results and Evaluation: DeepSeek-OCR 2 achieves 91.09% overall performance on OmniDocBench v1.5, a 3.73-point improvement over the baseline, with a lower reading-order Edit Distance (0.057 vs. 0.085) while using fewer maximum visual tokens (1120 vs. 1156).


DeepSeek-OCR 2 workflow.

- Vision tokenizer: SAM-ViTDet (80M), 16x token compression
- DeepEncoder V2:
  - Visual tokens with non-causal (bidirectional) attention
  - LM vision encoder: Qwen2 (500M)
  - Causal-flow queries: learnable tokens with causal attention that reorder visual information
  - Attention mask design: non-causal and causal streams combined
- Decoder: DeepSeek-MoE, 3B parameters (500M active)
- Training pipeline:
  - Stage 1 (encoder pretraining): vision tokenizer + LM encoder, language-modeling objective, 40k iterations
  - Stage 2 (query enhancement): freeze vision tokenizer; optimize LM encoder + decoder; 15k iterations
  - Stage 3 (decoder specialization): freeze all encoder parameters; update only the decoder; 20k iterations
- Output: OCR text from reordered visual tokens
- Key innovation (causal-reasoning cascade): the encoder first reorders visual tokens, then the decoder performs causal reasoning
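The dual-stream attention (bidirectional among visual tokens, causal for the learnable queries) can be sketched as a single combined mask. Whether visual tokens can attend back to the queries is not stated here, so this sketch disallows it as an explicit assumption:

```python
def deepencoder_v2_mask(n_visual, n_query):
    """Sketch of a dual-stream attention mask in the style described for
    DeepEncoder V2: visual tokens attend bidirectionally to each other,
    while learnable causal-flow queries attend to all visual tokens plus
    earlier queries (causal among themselves).

    Assumption: visual tokens do not attend to query tokens.
    Token order: [visual tokens..., query tokens...].
    Returns mask[i][j] = True where token i may attend to token j.
    """
    n = n_visual + n_query
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_visual:
                # Visual token: bidirectional, but only over visual tokens
                mask[i][j] = j < n_visual
            else:
                # Query token: sees every visual token, causal over queries
                mask[i][j] = j < n_visual or j <= i
    return mask
```

Because each query sees all visual tokens but only earlier queries, the query sequence can emit visual information in a learned, semantically coherent order instead of the raster-scan order of the input patches.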
Q1. What is the key architectural innovation in DeepEncoder V2 that enables visual causal flow?
- Replacing CLIP with an LLM-style architecture that uses dual-stream attention mechanisms
- Increasing the number of visual tokens to capture more image details
- Using a larger vision transformer with more parameters than CLIP
Q2. How does DeepSeek-OCR 2's visual token processing differ from conventional vision-language models?
- It processes more visual tokens per image for better accuracy
- It dynamically reorders visual tokens based on semantic understanding rather than rigid raster-scan order
- It uses a faster tokenization algorithm to reduce computational cost
Q3. What performance improvement did DeepSeek-OCR 2 achieve on reading order (R-order) Edit Distance compared to the baseline?
- From 0.085 to 0.057, indicating better semantic ordering of visual content
- From 0.057 to 0.085, showing increased processing speed
- From 1156 to 1120, reducing the number of tokens needed