2025-09-30 Papers

Paper 1

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Published: 2025-09-28

Link: http://arxiv.org/pdf/2509.24006

1. 📘 Topic and Domain: Improving efficiency of attention mechanisms in Diffusion Transformers through a hybrid sparse-linear attention approach called SLA.
2. 💡 Previous Research and New Ideas: Builds on sparse attention and linear attention methods, and proposes a fusion of the two based on the observation that attention weights decompose into a small set of high-rank large weights plus low-rank remaining weights.
3. ❓ Problem: The quadratic computational complexity and high latency of attention in Diffusion Transformers, particularly for video generation with long sequences.
4. 🛠️ Methods: Classifies attention weights into critical (computed with O(N²) attention), marginal (computed with O(N) linear attention), and negligible (skipped), implemented in a single GPU kernel with fine-tuning.
5. 📊 Results and Evaluation: Achieved 20× reduction in attention computation, 13.7× speedup in attention kernel, and 2.2× end-to-end speedup in video generation on Wan2.1-1.3B model without quality degradation, evaluated using multiple video quality metrics.
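The classification step in item 4 can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's fused GPU kernel: the mean pooling, block size, and the exact way the top-5%/bottom-10% cutoffs (taken from the paper's workflow figure) are applied are simplifying assumptions.

```python
import numpy as np

def classify_blocks(Q, K, block=4, crit_frac=0.05, neg_frac=0.10):
    """Hypothetical sketch of SLA-style weight prediction: pool Q/K over
    token blocks, form a cheap pooled attention map, then label each block
    critical / marginal / negligible by its predicted attention mass."""
    d = Q.shape[-1]
    Qp = Q.reshape(-1, block, d).mean(axis=1)   # (Nq/block, d) pooled queries
    Kp = K.reshape(-1, block, d).mean(axis=1)   # (Nk/block, d) pooled keys
    S = Qp @ Kp.T / np.sqrt(d)                  # pooled attention logits
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # row-wise softmax
    flat = np.sort(P.ravel())
    lo = flat[int(neg_frac * flat.size)]        # bottom-10% cutoff (skipped)
    hi = flat[int((1 - crit_frac) * flat.size)] # top-5% cutoff (critical)
    # 2 = critical (O(N^2) sparse attention), 1 = marginal (O(N) linear), 0 = skip
    return np.where(P >= hi, 2, np.where(P < lo, 0, 1))

rng = np.random.default_rng(0)
labels = classify_blocks(rng.standard_normal((16, 8)), rng.standard_normal((16, 8)))
print(labels.shape)  # (4, 4) block-level label map
```

The block-level map is what makes the approach cheap: the expensive O(N²) path only ever runs on the few blocks labeled critical.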

Figure: SLA sparse-linear attention workflow
- Input: Q, K, V.
- Attention weight prediction: Softmax(pool(Q) pool(K)ᵀ) on pooled queries and keys.
- Weight classification:
  - Critical weights (top 5%): sparse FlashAttention, O(N²) complexity.
  - Marginal weights (middle 85%): linear attention, O(N) complexity.
  - Negligible weights (bottom 10%): computation skipped.
- Sparse computation (FlashAttention kernel): S_ij = Q_i K_jᵀ / √d, P_ij = OnlineSoftmax(S_ij), O_s = P_ij V_j.
- Linear computation: H = φ(K)ᵀ V, Z = rowsum(φ(K)ᵀ), O_l = φ(Q)H / φ(Q)Z, with φ = softmax.
- Output fusion: O = O_s + Proj(O_l), with a learnable projection.
- Fine-tuning: replace the original attention and train for 2,000 steps at batch size 64, at <0.1% of pretraining cost.
- GPU kernel: unified, efficient forward/backward implementation.
- Results: 95% sparsity (vs. an 85% baseline), 20× attention reduction, 13.7× kernel speedup, 2.2× end-to-end speedup, lossless video generation quality.
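The O(N) linear branch can be written almost directly from its formulas. A minimal NumPy sketch, assuming φ (softmax) is applied row-wise over the head dimension; the real version runs fused inside the GPU kernel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(Q, K, V):
    # H = phi(K)^T V, Z = rowsum(phi(K)^T), O_l = phi(Q)H / phi(Q)Z.
    # Cost is linear in sequence length N because keys/values are
    # accumulated once instead of forming an N x N attention matrix.
    phiQ, phiK = softmax(Q), softmax(K)
    H = phiK.T @ V                 # (d, d_v): one accumulation pass over keys
    Z = phiK.sum(axis=0)           # (d,): normalizer, accumulated the same way
    return (phiQ @ H) / (phiQ @ Z)[:, None]

rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((3, 6, 4))
O_l = linear_attention(Q, K, V)
print(O_l.shape)  # (6, 4)
```

In the full method this output does not stand alone: it is fused with the sparse branch as O = O_s + Proj(O_l), where the learnable projection is tuned during the brief fine-tuning phase.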
Q1. What key observation about attention weights led to the development of SLA?
A. Attention weights are always uniformly distributed
B. Attention weights can be decomposed into high-rank large weights and low-rank remaining weights
C. Attention weights follow a normal distribution

Q2. How does SLA achieve better efficiency compared to previous methods?
A. By completely replacing attention with linear operations
B. By using only sparse attention for all computations
C. By classifying weights into critical (O(N²)), marginal (O(N)), and negligible (skipped) computations

Q3. What was the impact of fine-tuning with SLA on the Wan2.1-1.3B model?
A. 95% sparsity with severe quality degradation
B. 95% sparsity with maintained quality and 13.7× kernel speedup
C. 50% sparsity with moderate speedup
Paper 2

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Published: 2025-09-29

Link: http://arxiv.org/pdf/2509.24897

1. 📘 Topic and Domain: The paper focuses on evaluating unified multimodal AI models that combine visual understanding and generation capabilities through a new comprehensive benchmark called RealUnify.
2. 💡 Previous Research and New Ideas: Previous work evaluated understanding and generation separately or through superficial combinations; this paper proposes the first benchmark specifically designed to test whether unified models achieve true synergy between the two capabilities.
3. ❓ Problem: The paper aims to determine whether unified multimodal models genuinely benefit from architectural unification and can effectively leverage synergy between understanding and generation capabilities.
4. 🛠️ Methods: The authors created RealUnify benchmark with 1,000 human-annotated instances across 10 categories and 32 subtasks, using dual-evaluation protocols (direct end-to-end and diagnostic stepwise) to test both Understanding Enhances Generation (UEG) and Generation Enhances Understanding (GEU).
5. 📊 Results and Evaluation: Through evaluation of 12 unified models and 6 specialized baselines, results showed current unified models struggle to achieve effective capability synergy, with best open-source models scoring only 37.5% on UEG tasks, indicating architectural unification alone is insufficient.

Figure: RealUnify methodology flow chart
- Understanding Enhances Generation (UEG) tasks: World Knowledge, Commonsense Reasoning, Mathematical Reasoning, Logical Reasoning, Scientific Reasoning, Code-to-Image.
- Generation Enhances Understanding (GEU) tasks: Mental Reconstruction, Mental Tracking, Attentional Focusing, Cognitive Navigation.
- Dual-evaluation protocol:
  - Direct evaluation: end-to-end task completion without decomposition.
  - Stepwise evaluation: task decomposed into understanding and generation stages.
- Evaluated models: 12 unified models (11 open-source + 1 proprietary; e.g. BAGEL, OmniGen2, Ovis-U1) and 6 specialized baselines (3 understanding + 3 generation; e.g. Gemini-2.5-Pro, GPT-Image-1).
- Scale: 1,000 instances across 10 categories and 32 subtasks.
- Key findings: current unified models struggle to achieve effective synergy; architectural unification alone is insufficient; stepwise evaluation reveals capability dissociation; advanced training strategies and inductive biases are needed.
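The dual-evaluation protocol can be made concrete with a small scoring sketch. This is a hypothetical aggregator (the field names and pass/fail encoding are assumptions, not the benchmark's actual API); it shows how a direct-vs-stepwise gap would surface the integration failures the paper reports:

```python
def evaluate_dual(results):
    """Hypothetical aggregator for a dual-evaluation protocol like
    RealUnify's. `results` maps task ids to pass/fail flags: 'direct'
    for end-to-end completion, 'step_understand' and 'step_generate'
    for the two decomposed stages. Stepwise credit requires both
    stages to pass."""
    n = len(results)
    direct = sum(r['direct'] for r in results.values()) / n
    stepwise = sum(r['step_understand'] and r['step_generate']
                   for r in results.values()) / n
    # A positive gap means the model holds the pieces of knowledge but
    # fails to integrate them end-to-end, the pattern the paper observes.
    return {'direct': direct, 'stepwise': stepwise, 'gap': stepwise - direct}

scores = evaluate_dual({
    't1': {'direct': 1, 'step_understand': 1, 'step_generate': 1},
    't2': {'direct': 0, 'step_understand': 1, 'step_generate': 1},
    't3': {'direct': 0, 'step_understand': 1, 'step_generate': 0},
    't4': {'direct': 0, 'step_understand': 0, 'step_generate': 1},
})
print(scores)  # {'direct': 0.25, 'stepwise': 0.5, 'gap': 0.25}
```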
Q1. What is the main innovation of RealUnify compared to previous benchmarks?
A. It tests more models than previous benchmarks
B. It evaluates whether understanding and generation capabilities truly enhance each other
C. It only focuses on image generation capabilities

Q2. In the stepwise evaluation of UEG tasks, what surprising pattern was observed?
A. Performance remained the same as direct evaluation
B. Performance significantly decreased
C. Performance improved significantly, showing models have knowledge but struggle to integrate it

Q3. What was the performance gap between the best open-source unified model and the 'oracle' model (a combination of specialist models) on UEG tasks?
A. About 35 percentage points (37.5% vs 72.7%)
B. About 10 percentage points
C. No significant gap
Paper 3

Visual Jigsaw Post-Training Improves MLLMs

Published: 2025-09-29

Link: http://arxiv.org/pdf/2509.25190

1. 📘 Topic and Domain: Visual Jigsaw is a self-supervised post-training framework for improving visual understanding capabilities in multimodal large language models (MLLMs) across image, video and 3D modalities.
2. 💡 Previous Research and New Ideas: Builds on prior work in self-supervised visual learning and MLLM post-training, and proposes a jigsaw-style ordering task that enhances visual perception without requiring changes to the model architecture.
3. ❓ Problem: Current MLLM post-training approaches are predominantly text-centric and undervalue deep visual understanding, while existing visual enhancement methods require architectural changes or additional generative components.
4. 🛠️ Methods: Implements visual jigsaw tasks where inputs are partitioned, shuffled, and the model must reconstruct the correct order using natural language, applied across three modalities: image patches, video clips, and 3D depth points.
5. 📊 Results and Evaluation: Achieved significant improvements across multiple benchmarks: enhanced fine-grained perception and spatial understanding in images, improved temporal reasoning in videos, and better 3D spatial comprehension, while maintaining the model's original reasoning capabilities.
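The partition-and-shuffle construction in item 4 is easy to sketch for the image case. A minimal NumPy version, assuming a 3×3 grid and integer patch indices; in the actual framework the model phrases the recovered ordering in natural language:

```python
import numpy as np

def make_image_jigsaw(img, grid=3, seed=0):
    # Split the image into grid x grid patches, shuffle them, and keep
    # the ground-truth ordering the model must recover. Here the answer
    # is simply the shuffled position of each original patch 0..grid^2-1.
    h, w = img.shape[0] // grid, img.shape[1] // grid
    patches = [img[i*h:(i+1)*h, j*w:(j+1)*w]
               for i in range(grid) for j in range(grid)]
    perm = np.random.default_rng(seed).permutation(len(patches))
    shuffled = [patches[p] for p in perm]
    answer = np.argsort(perm).tolist()   # answer[k] = slot holding patch k
    return shuffled, answer

img = np.arange(36).reshape(6, 6)        # toy 6x6 "image"
shuffled, answer = make_image_jigsaw(img)
print(len(shuffled), answer)
```

The same recipe transfers to the other modalities by swapping what gets shuffled: 6 temporal clips for video, 6 depth-ordered points for 3D.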

Figure: Visual Jigsaw post-training framework
- Training data: COCO images (118K samples), LLaVA videos (100K samples), ScanNet RGB-D (300K samples).
- Tasks: image jigsaw (3×3 patches, spatial ordering), video jigsaw (6 temporal clips, chronological order), 3D jigsaw (6 depth points, near-to-far order).
- Base model: Qwen2.5-VL-7B.
- Reward design: perfect match = 1.0; partially correct = γ × fraction correct; invalid = 0.0.
- Training: GRPO (Group Relative Policy Optimization), batch size 256/128, learning rate 1e-6.
- Results:
  - Fine-grained perception: MMVP +6.00, MMStar +6.06, HR-Bench +3.75.
  - Temporal understanding: AoTBench +5.23, TempCompass +0.76, CVBench +3.00.
  - 3D spatial reasoning: SAT-Real +15.34, DA-2K +17.11, ViewSpatial +2.10.
  - Compositional understanding: Winoground +2.00, SugarCrepe++ +1.43, VSR +2.68.
- Key insights: self-supervised ordering tasks enhance visual understanding without architectural changes; RL training generalizes better than SFT for visual jigsaw tasks; partial-accuracy rewards are crucial for learning complex jigsaw configurations; the framework generalizes effectively across image, video, and 3D modalities.
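The reward design described in the framework (perfect match 1.0, partial credit γ × fraction correct, invalid 0.0) translates into a few lines. The value γ = 0.5 below is an illustrative placeholder; the paper's actual discount factor is not given in this summary:

```python
def jigsaw_reward(pred, target, gamma=0.5):
    # Reward scheme from the framework figure: 1.0 for a perfect ordering,
    # gamma * (fraction of correctly placed elements) for a partial match,
    # and 0.0 for an invalid answer (wrong length, or not a permutation
    # of the target). gamma = 0.5 is an assumed, illustrative value.
    if len(pred) != len(target) or sorted(pred) != sorted(target):
        return 0.0
    correct = sum(p == t for p, t in zip(pred, target))
    if correct == len(target):
        return 1.0
    return gamma * correct / len(target)

print(jigsaw_reward([2, 0, 1, 3], [0, 1, 2, 3]))  # partial: 0.5 * 1/4 = 0.125
```

Gating partial credit behind a validity check keeps the policy from collecting reward for malformed answers, while the graded fraction still gives GRPO a learning signal on hard configurations where a perfect ordering is rare.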
Q1. What is the main innovation of Visual Jigsaw compared to previous MLLM enhancement approaches?
A. It requires complex architectural changes to the model
B. It uses text-based reasoning to improve visual understanding
C. It improves visual understanding without requiring model architecture changes

Q2. In the 3D jigsaw task, how does the model handle depth ordering?
A. By reconstructing 3D voxel representations
B. By ordering points from nearest to farthest in RGB-D images
C. By matching multiple camera viewpoints

Q3. Why did the authors choose to implement Visual Jigsaw as post-training rather than pre-training?
A. Because it requires less computational resources
B. Because it needs the model to already have basic visual understanding
C. Because it's easier to implement as post-training