1. 📘 Topic and Domain: Visual Jigsaw is a self-supervised post-training framework for improving the visual understanding capabilities of multimodal large language models (MLLMs) across image, video, and 3D modalities.
2. 💡 Previous Research and New Ideas: Building on prior work in self-supervised visual learning and MLLM post-training, the paper proposes a novel jigsaw-style task that enhances visual perception without requiring any changes to the model architecture.
3. ❓ Problem: Current MLLM post-training approaches are predominantly text-centric and undervalue deep visual understanding, while existing visual enhancement methods require architectural changes or additional generative components.
4. 🛠️ Methods: Implements visual jigsaw tasks in which an input is partitioned into pieces and shuffled, and the model must state the correct order in natural language; the task is instantiated across three modalities: image patches, video clips, and 3D depth points.
5. 📊 Results and Evaluation: Achieves significant improvements across multiple benchmarks: enhanced fine-grained perception and spatial understanding in images, improved temporal reasoning in videos, and better 3D spatial comprehension, while preserving the model's original reasoning capabilities.