2025-06-30 Papers

Paper 1

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Published: 2025-06-26

Link: http://arxiv.org/pdf/2506.21862

1. 📘 Topic and Domain: A token compression strategy called LLaVA-Scissor for video large language models (VLLMs) in the domain of computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous attention-based token compression methods, it proposes a novel Semantic Connected Components (SCC) approach that better preserves semantic regions without redundancy.
3. ❓ Problem: The problem of efficiently compressing video tokens while maintaining semantic information, as VLLMs generate many tokens when processing video frames sequentially.
4. 🛠️ Methods: Uses a two-step compression strategy: first applies SCC to identify unique semantic regions within each frame spatially, then applies SCC again temporally across frames to remove redundancy.
5. 📊 Results and Evaluation: Outperformed other token compression methods on video question-answering, long video understanding, and MVBench benchmarks, especially at low token retention ratios (achieving 95.7% of original performance at 10% retention).
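The two-step SCC procedure described in point 4 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the cosine-similarity normalization, the threshold value τ, and averaging each component into one representative token are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def scc_compress(tokens, tau=0.8):
    """Group tokens into semantic connected components and average each group.

    tokens: (n, d) array of token features.
    tau: similarity threshold (illustrative default; the paper's value may differ).
    Returns an (m, d) array with one representative token per component.
    """
    # Thresholded similarity matrix A: edge (i, j) exists when sim[i, j] > tau.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, 1e-8)
    sim = normed @ normed.T

    # Union-find with path compression over the thresholded similarity graph.
    parent = list(range(len(tokens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > tau:
                parent[find(i)] = find(j)

    # Aggregate: average the tokens of each connected component.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(tokens[i])
    return np.stack([np.mean(g, axis=0) for g in groups.values()])

def two_step_compress(frames, tau=0.8):
    """Step 1: spatial SCC within each frame; Step 2: temporal SCC across frames."""
    spatial = [scc_compress(f, tau) for f in frames]
    return scc_compress(np.concatenate(spatial, axis=0), tau)
```

Feeding in frames that repeat the same semantic token shows both steps collapsing duplicates, so the output token count is far below n×m.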

[Figure: LLaVA-Scissor token compression workflow. A video V = {v₁, ..., vₙ} passes through a visual encoder and projector to produce visual tokens T ∈ R^(n×m×d). Step 1 (spatial compression) applies SCC to each frame, t'ᵢ = SCC(tᵢ). Step 2 (temporal compression) applies SCC to the concatenated tokens T' = concat[t'₁ ... t'ₙ] to remove cross-frame redundancy, yielding representative tokens Tᵣ ∈ R^(M×d) with M ≪ n×m; similarity matching and averaging then merge tokens into the final T^fin ∈ R^(M×d). SCC itself builds a thresholded similarity matrix A = (K·K^T / ||K|| > τ), finds connected components via union-find with path compression, and aggregates each into a non-overlapping semantic region. Key benefits: training-free compression, comprehensive semantic coverage, two-step spatio-temporal design, and superior performance at low retention ratios.]
Q1
1. What is the main limitation of previous attention-based token compression methods that LLaVA-Scissor aims to address?
They are too computationally expensive
They tend to select redundant key regions while missing other semantic areas
They can only work with short videos
Q2
2. How does LLaVA-Scissor's two-step compression process work?
It first compresses temporally then spatially
It compresses audio and video separately
It first identifies semantic regions within frames spatially, then removes redundancy across frames temporally
Q3
3. What impressive performance did LLaVA-Scissor achieve at low token retention?
75% of original performance at 5% retention
95.7% of original performance at 10% retention
85% of original performance at 15% retention

Paper 2

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Published: 2025-06-24

Link: http://arxiv.org/pdf/2506.19697

1. 📘 Topic and Domain: This paper focuses on preventing outlier activation features during pre-training of Large Language Models (LLMs) for improved quantization performance.
2. 💡 Previous Research and New Ideas: Based on research about outlier formation in LLMs due to channel-wise operations and adaptive gradient scaling, it proposes a novel pre-training framework called OSP that prevents outliers proactively rather than mitigating them after training.
3. ❓ Problem: The paper addresses how extreme activation outliers in LLMs severely degrade quantization performance, making efficient deployment on resource-constrained devices difficult.
4. 🛠️ Methods: The authors developed Outlier-Safe Pre-Training (OSP) framework combining three components: Muon optimizer to eliminate privileged bases, Single-Scale RMSNorm to prevent channel-wise amplification, and learnable embedding projection to redistribute activation magnitudes.
5. 📊 Results and Evaluation: Testing on a 1.4B-parameter model trained on 1 trillion tokens, OSP achieved a 35.7 average score across 10 benchmarks under 4-bit quantization (compared to 26.5 for Adam-trained models), with near-zero excess kurtosis (0.04) and only 2% training overhead.
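The core problem and the paper's outlier metric can both be illustrated numerically: a single extreme activation stretches the absmax quantization scale, wrecking 4-bit precision for every normal value, and excess kurtosis flags exactly this heavy-tailedness. A minimal sketch using generic symmetric absmax quantization (an assumption for illustration, not the paper's exact PTQ pipeline):

```python
import numpy as np

def quantize_absmax_4bit(x):
    """Symmetric absmax 4-bit quantize/dequantize (generic sketch)."""
    scale = np.abs(x).max() / 7.0          # signed 4-bit range: [-8, 7]
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale                        # dequantized values

def excess_kurtosis(x):
    """Excess kurtosis: ~0 for Gaussian data; large values signal outliers."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
acts = rng.normal(size=10_000)             # well-behaved activations
spiky = acts.copy()
spiky[0] = 100.0                           # a single extreme outlier

err_clean = np.mean((acts - quantize_absmax_4bit(acts)) ** 2)
err_spiky = np.mean((spiky - quantize_absmax_4bit(spiky)) ** 2)
# The outlier inflates the scale, so most normal values round to zero and
# the mean squared quantization error explodes.
```

This is the effect OSP prevents at the source: by keeping excess kurtosis near zero during pre-training (0.04 vs. 1818.56 for Adam), the quantization scale stays matched to the bulk of the activations.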

[Figure: Outlier-Safe Pre-Training (OSP) framework. Problem: extreme activation outliers degrade quantization. Component 1: the Muon optimizer (Newton-Schulz orthogonalization) eliminates privileged bases. Component 2: Single-Scale RMSNorm prevents channel-wise amplification. Component 3: a learnable embedding projection redistributes activation magnitudes. Training: 1.4B parameters on 1T tokens with only 2% overhead. Outlier measurement: excess kurtosis of 0.04 for OSP vs. 1818.56 for Adam. Under 4-bit W4A4 quantization across 10 downstream benchmarks, OSP averages 35.7 vs. 26.5 for Adam, with complementary benefits from PTQ methods. Analysis insights: attention sinks persist even without outliers, attention logit distributions differ, and outliers are not inherent to LLMs.]
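The Single-Scale RMSNorm component is described only at a high level here; the following is a hedged sketch of the intuition, assuming the difference from standard RMSNorm is a single shared scalar gain in place of the per-channel gain vector (the paper's actual formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))                 # 32 tokens, 8 channels

def rms(x):
    return np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + 1e-6)

# Standard RMSNorm: a per-channel gain vector can selectively blow up
# one channel, creating exactly the kind of outlier that hurts W4A4.
g_vec = np.ones(8)
g_vec[3] = 50.0                              # one privileged channel (illustrative)
out_per_channel = (x / rms(x)) * g_vec

# Single-Scale RMSNorm (sketch): one shared scalar gain, so no channel
# can be amplified relative to the others.
g_scalar = 1.0
out_single = (x / rms(x)) * g_scalar
```

With the scalar gain, every channel of the normalized output is bounded by the token RMS, so channel-wise amplification cannot produce activation outliers.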
Q1
1. What is the main innovation of the OSP framework in handling outliers compared to previous approaches?
It uses post-training quantization to remove outliers
It prevents outlier formation during pre-training rather than mitigating after
It ignores outliers completely during model training
Q2
2. What surprising finding did the researchers make about attention sinks?
Attention sinks completely disappeared when outliers were eliminated
Attention sinks caused more outliers than previously thought
Attention sinks persisted even without outliers, suggesting they aren't inherently responsible for outlier formation
Q3
3. What was the training overhead cost of implementing the OSP framework?
It increased training time by 25%
It increased training time by 2%
It decreased training time by 10%

Paper 3

ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Published: 2025-06-22

Link: http://arxiv.org/pdf/2506.18095

1. 📘 Topic and Domain: The paper presents ShareGPT-4o-Image, a dataset and model for multimodal image generation in the domain of artificial intelligence and computer vision.
2. 💡 Previous Research and New Ideas: Based on previous research in multimodal generative models and large language models, the paper proposes a new dataset synthesized from GPT-4o's image generation capabilities to democratize advanced image generation abilities.
3. ❓ Problem: The paper aims to solve the problem of proprietary and inaccessible advanced image generation systems by creating an open-source alternative with comparable capabilities.
4. 🛠️ Methods: The authors created a 91K dataset (45K text-to-image and 46K text-and-image-to-image pairs) using GPT-4o's generation capabilities, then fine-tuned the Janus-Pro model on this dataset to create Janus-4o.
5. 📊 Results and Evaluation: Janus-4o achieved significant improvements over its predecessor, gaining 4 points on GenEval and 1.6 points on DPG-Bench, while also adding text-and-image-to-image generation capabilities after only 6 hours of training.

[Figure: ShareGPT-4o-Image workflow. Dataset construction: 45K text-to-image samples (prompt-first and image-first, covering six dimensions such as objects, background, and style) and 46K text-and-image-to-image samples spanning 14 editing tasks in 5 categories (object manipulation, style transfer, background change, etc.). GPT-4o synthesizes the 91K high-quality images; Gemini-Pro-2.5 handles text synthesis and prompt generation. Model development: Janus-Pro-7B is fine-tuned for 3 epochs (LR 5×10⁻⁶, batch size 128, 6 hours on 8×A800) with an image encoder E(Î) for semantic embeddings, 50% token masking, and joint training, yielding Janus-4o, a unified autoregressive model for both text-to-image and text-and-image-to-image generation. Results: +4 points on GenEval and +1.6 on DPG-Bench over Janus-Pro; 3.26 on ImgEdit-Bench, outperforming baselines with only 91K samples and particular strength in style transfer; human evaluation on 52 T2I and 35 TI2I examples prefers Janus-4o over Janus-Pro and UltraEdit. Key contributions: the first GPT-4o distillation dataset, a dual-capability unified MLLM architecture, an efficient training protocol, and an open-source release.]
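The reported fine-tuning recipe is compact enough to collect into a config sketch. The dictionary keys below are hypothetical names chosen for illustration; the values are the ones reported for Janus-4o in this summary.

```python
# Hypothetical config dict capturing the reported Janus-4o fine-tuning setup.
janus_4o_finetune = {
    "base_model": "Janus-Pro-7B",
    "dataset": "ShareGPT-4o-Image",   # 45K T2I + 46K TI2I pairs = 91K total
    "epochs": 3,
    "learning_rate": 5e-6,
    "batch_size": 128,
    "hardware": "8x A800",
    "wall_clock_hours": 6,
}
```

At these settings the whole run fits in a single afternoon of GPU time, which is the paper's efficiency claim in concrete terms.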
Q1
1. What was the main innovation of Janus-4o compared to its predecessor Janus-Pro?
It achieved faster training speed
It added text-and-image-to-image generation capability while improving text-to-image performance
It reduced the model size while maintaining performance
Q2
2. How long did it take to train Janus-4o on an 8×A800 GPU machine?
24 hours
12 hours
6 hours
Q3
3. What unique approach did the authors use to create their training dataset?
They collected real-world images from the internet
They generated synthetic images using multiple open-source models
They synthesized images using GPT-4o's image generation capabilities