1. 📘 Topic and Domain: A multimodal model (ARC-Hunyuan-Video-7B) for structured, comprehensive understanding and analysis of real-world short videos.
2. 💡 Previous Research and New Ideas: Built on the Hunyuan-7B vision-language model, it adds audio-visual synchronization and a timestamp overlay mechanism for fine-grained temporal awareness (see the overlay sketch after this list), moving beyond traditional vision-only or general-purpose multimodal models.
3. ❓ Problem: Understanding real-world short videos is hard because they pack dense visual elements, information-rich audio, and rapid pacing, and they center on emotional expression and viewpoint delivery.
4. 🛠️ Methods: Uses a multi-stage training recipe: pre-training on millions of videos annotated by an automated pipeline, instruction fine-tuning, cold-start initialization, reinforcement-learning post-training, and a final round of instruction fine-tuning on high-quality human-annotated data (a schematic of the stages follows this list).
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on ShortVid-Bench (74.3% accuracy), outperforms baselines in temporal video grounding tasks, and demonstrates strong versatility in downstream applications with significant improvements in user engagement metrics.
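To make the timestamp overlay idea concrete, here is a minimal sketch of how a wall-clock timestamp could be rendered onto sampled frames before they reach the vision encoder. It assumes Pillow is available; the overlay position, font, and label format are illustrative assumptions, not details taken from the paper.

```python
from PIL import Image, ImageDraw

def overlay_timestamp(frame: Image.Image, seconds: float) -> Image.Image:
    """Render a "MM:SS" timestamp onto a sampled frame (illustrative only)."""
    stamped = frame.copy()
    draw = ImageDraw.Draw(stamped)
    label = f"{int(seconds) // 60:02d}:{int(seconds) % 60:02d}"
    # Filled box behind the text keeps the timestamp legible on any background.
    draw.rectangle([0, 0, 70, 22], fill="black")
    draw.text((4, 4), label, fill="white")
    return stamped

# Usage sketch: stamp frames sampled at 1 fps before encoding.
# frames = [overlay_timestamp(f, t) for t, f in enumerate(sampled_frames)]
```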
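The multi-stage recipe can be pictured as an ordered sequence of stages, each resuming from the previous checkpoint. The sketch below is only a schematic: the stage names follow the summary above, while the dataset labels, objectives, and the train_stage helper are hypothetical placeholders.

```python
# Schematic of the multi-stage training recipe; values are placeholders.
TRAINING_STAGES = [
    {"name": "pretraining",      "data": "auto_annotated_videos", "objective": "next-token"},
    {"name": "instruction_sft",  "data": "instruction_mix",       "objective": "next-token"},
    {"name": "cold_start",       "data": "reasoning_seed_set",    "objective": "next-token"},
    {"name": "rl_post_training", "data": "verifiable_tasks",      "objective": "policy-gradient"},
    {"name": "final_sft",        "data": "human_annotated_sft",   "objective": "next-token"},
]

def run_pipeline(model, stages=TRAINING_STAGES):
    # train_stage is a hypothetical helper standing in for the real training loop;
    # each stage continues from the checkpoint produced by the previous one.
    for stage in stages:
        model = train_stage(model, **stage)
    return model
```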