2025-06-30 Papers

Paper 1

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Published: 2025-06-26

Link: http://arxiv.org/pdf/2506.21862

1. 📘 Topic and Domain: A token compression strategy called LLaVA-Scissor for video large language models (VLLMs) in the domain of computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous attention-based token compression methods, it proposes a novel Semantic Connected Components (SCC) approach that better preserves semantic regions without redundancy.
3. ❓ Problem: The problem of efficiently compressing video tokens while maintaining semantic information, as VLLMs generate many tokens when processing video frames sequentially.
4. 🛠️ Methods: Uses a two-step compression strategy: first applies SCC to identify unique semantic regions within each frame spatially, then applies SCC again temporally across frames to remove redundancy.
5. 📊 Results and Evaluation: Outperformed other token compression methods on video question-answering, long video understanding, and MVBench benchmarks, especially at low token retention ratios (achieving 95.7% of original performance at 10% retention).
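The two-step SCC procedure described in point 4 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the cosine-similarity normalization, the threshold value τ, and averaging each component into one representative token are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def scc_compress(tokens, tau=0.8):
    """Group tokens into semantic connected components and average each group.

    tokens: (n, d) array of token features.
    tau: similarity threshold (illustrative default; the paper's value may differ).
    Returns an (m, d) array with one representative token per component.
    """
    # Thresholded similarity matrix A: edge (i, j) exists when sim[i, j] > tau.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, 1e-8)
    sim = normed @ normed.T

    # Union-find with path compression over the thresholded similarity graph.
    parent = list(range(len(tokens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > tau:
                parent[find(i)] = find(j)

    # Aggregate: average the tokens of each connected component.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(tokens[i])
    return np.stack([np.mean(g, axis=0) for g in groups.values()])

def two_step_compress(frames, tau=0.8):
    """Step 1: spatial SCC within each frame; Step 2: temporal SCC across frames."""
    spatial = [scc_compress(f, tau) for f in frames]
    return scc_compress(np.concatenate(spatial, axis=0), tau)
```

Feeding in frames that repeat the same semantic token shows both steps collapsing duplicates, so the output token count is far below n×m.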

[Figure: LLaVA-Scissor token compression workflow. A video V = {v₁, ..., vₙ} passes through a visual encoder and projector to produce visual tokens T ∈ R^(n×m×d). Step 1 (spatial compression) applies SCC to each frame, t'ᵢ = SCC(tᵢ). Step 2 (temporal compression) applies SCC to the concatenated tokens T' = concat[t'₁ ... t'ₙ] to remove cross-frame redundancy, yielding representative tokens Tᵣ ∈ R^(M×d) with M ≪ n×m; similarity matching and averaging then merge tokens into the final T^fin ∈ R^(M×d). SCC itself builds a thresholded similarity matrix A = (K·K^T / ||K|| > τ), finds connected components via union-find with path compression, and aggregates each into a non-overlapping semantic region. Key benefits: training-free compression, comprehensive semantic coverage, two-step spatio-temporal design, and superior performance at low retention ratios.]
Q1
1. What is the main limitation of previous attention-based token compression methods that LLaVA-Scissor aims to address?
They are too computationally expensive
They tend to select redundant key regions while missing other semantic areas
They can only work with short videos
Q2
2. How does LLaVA-Scissor's two-step compression process work?
It first compresses temporally then spatially
It compresses audio and video separately
It first identifies semantic regions within frames spatially, then removes redundancy across frames temporally
Q3
3. What impressive performance did LLaVA-Scissor achieve at low token retention?
75% of original performance at 5% retention
95.7% of original performance at 10% retention
85% of original performance at 15% retention

Paper 2

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Published: 2025-06-24

Link: http://arxiv.org/pdf/2506.19697

1. 📘 Topic and Domain: This paper focuses on preventing outlier activation features during pre-training of Large Language Models (LLMs) for improved quantization performance.
2. 💡 Previous Research and New Ideas: Based on research about outlier formation in LLMs due to channel-wise operations and adaptive gradient scaling, it proposes a novel pre-training framework called OSP that prevents outliers proactively rather than mitigating them after training.
3. ❓ Problem: The paper addresses how extreme activation outliers in LLMs severely degrade quantization performance, making efficient deployment on resource-constrained devices difficult.
4. 🛠️ Methods: The authors developed Outlier-Safe Pre-Training (OSP) framework combining three components: Muon optimizer to eliminate privileged bases, Single-Scale RMSNorm to prevent channel-wise amplification, and learnable embedding projection to redistribute activation magnitudes.
5. 📊 Results and Evaluation: Testing on a 1.4B-parameter model trained on 1 trillion tokens, OSP achieved a 35.7 average score across 10 benchmarks under 4-bit quantization (compared to 26.5 for Adam-trained models), with near-zero excess kurtosis (0.04) and only 2% training overhead.
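The core problem and the paper's outlier metric can both be illustrated numerically: a single extreme activation stretches the absmax quantization scale, wrecking 4-bit precision for every normal value, and excess kurtosis flags exactly this heavy-tailedness. A minimal sketch using generic symmetric absmax quantization (an assumption for illustration, not the paper's exact PTQ pipeline):

```python
import numpy as np

def quantize_absmax_4bit(x):
    """Symmetric absmax 4-bit quantize/dequantize (generic sketch)."""
    scale = np.abs(x).max() / 7.0          # signed 4-bit range: [-8, 7]
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale                        # dequantized values

def excess_kurtosis(x):
    """Excess kurtosis: ~0 for Gaussian data; large values signal outliers."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
acts = rng.normal(size=10_000)             # well-behaved activations
spiky = acts.copy()
spiky[0] = 100.0                           # a single extreme outlier

err_clean = np.mean((acts - quantize_absmax_4bit(acts)) ** 2)
err_spiky = np.mean((spiky - quantize_absmax_4bit(spiky)) ** 2)
# The outlier inflates the scale, so most normal values round to zero and
# the mean squared quantization error explodes.
```

This is the effect OSP prevents at the source: by keeping excess kurtosis near zero during pre-training (0.04 vs. 1818.56 for Adam), the quantization scale stays matched to the bulk of the activations.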

[Figure: Outlier-Safe Pre-Training (OSP) framework. Problem: extreme activation outliers degrade quantization. Component 1: the Muon optimizer (Newton-Schulz orthogonalization) eliminates privileged bases. Component 2: Single-Scale RMSNorm prevents channel-wise amplification. Component 3: a learnable embedding projection redistributes activation magnitudes. Training: 1.4B parameters on 1T tokens with only 2% overhead. Outlier measurement: excess kurtosis of 0.04 for OSP vs. 1818.56 for Adam. Under 4-bit W4A4 quantization across 10 downstream benchmarks, OSP averages 35.7 vs. 26.5 for Adam, with complementary benefits from PTQ methods. Analysis insights: attention sinks persist even without outliers, attention logit distributions differ, and outliers are not inherent to LLMs.]
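The Single-Scale RMSNorm component is described only at a high level here; the following is a hedged sketch of the intuition, assuming the difference from standard RMSNorm is a single shared scalar gain in place of the per-channel gain vector (the paper's actual formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))                 # 32 tokens, 8 channels

def rms(x):
    return np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + 1e-6)

# Standard RMSNorm: a per-channel gain vector can selectively blow up
# one channel, creating exactly the kind of outlier that hurts W4A4.
g_vec = np.ones(8)
g_vec[3] = 50.0                              # one privileged channel (illustrative)
out_per_channel = (x / rms(x)) * g_vec

# Single-Scale RMSNorm (sketch): one shared scalar gain, so no channel
# can be amplified relative to the others.
g_scalar = 1.0
out_single = (x / rms(x)) * g_scalar
```

With the scalar gain, every channel of the normalized output is bounded by the token RMS, so channel-wise amplification cannot produce activation outliers.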
Q1
1. What is the main innovation of the OSP framework in handling outliers compared to previous approaches?
It uses post-training quantization to remove outliers
It prevents outlier formation during pre-training rather than mitigating after
It ignores outliers completely during model training
Q2
2. What surprising finding did the researchers make about attention sinks?
Attention sinks completely disappeared when outliers were eliminated
Attention sinks caused more outliers than previously thought
Attention sinks persisted even without outliers, suggesting they aren't inherently responsible for outlier formation
Q3
3. What was the training overhead cost of implementing the OSP framework?
It increased training time by 25%
It increased training time by 2%
It decreased training time by 10%

Paper 3

ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Published: 2025-06-22

Link: http://arxiv.org/pdf/2506.18095

1. 📘 Topic and Domain: The paper presents ShareGPT-4o-Image, a dataset and model for multimodal image generation in the domain of artificial intelligence and computer vision.
2. 💡 Previous Research and New Ideas: Based on previous research in multimodal generative models and large language models, the paper proposes a new dataset synthesized from GPT-4o's image generation capabilities to democratize advanced image generation abilities.
3. ❓ Problem: The paper aims to solve the problem of proprietary and inaccessible advanced image generation systems by creating an open-source alternative with comparable capabilities.
4. 🛠️ Methods: The authors created a 91K dataset (45K text-to-image and 46K text-and-image-to-image pairs) using GPT-4o's generation capabilities, then fine-tuned the Janus-Pro model on this dataset to create Janus-4o.
5. 📊 Results and Evaluation: Janus-4o achieved significant improvements over its predecessor, gaining 4 points on GenEval and 1.6 points on DPG-Bench, while also adding text-and-image-to-image generation capabilities after only 6 hours of training.

[Figure: ShareGPT-4o-Image workflow. Dataset construction: 45K text-to-image samples (prompt-first and image-first, covering six dimensions such as objects, background, and style) and 46K text-and-image-to-image samples spanning 14 editing tasks in 5 categories (object manipulation, style transfer, background change, etc.). GPT-4o synthesizes the 91K high-quality images; Gemini-Pro-2.5 handles text synthesis and prompt generation. Model development: Janus-Pro-7B is fine-tuned for 3 epochs (LR 5×10⁻⁶, batch size 128, 6 hours on 8×A800) with an image encoder E(Î) for semantic embeddings, 50% token masking, and joint training, yielding Janus-4o, a unified autoregressive model for both text-to-image and text-and-image-to-image generation. Results: +4 points on GenEval and +1.6 on DPG-Bench over Janus-Pro; 3.26 on ImgEdit-Bench, outperforming baselines with only 91K samples and particular strength in style transfer; human evaluation on 52 T2I and 35 TI2I examples prefers Janus-4o over Janus-Pro and UltraEdit. Key contributions: the first GPT-4o distillation dataset, a dual-capability unified MLLM architecture, an efficient training protocol, and an open-source release.]
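The reported fine-tuning recipe is compact enough to collect into a config sketch. The dictionary keys below are hypothetical names chosen for illustration; the values are the ones reported for Janus-4o in this summary.

```python
# Hypothetical config dict capturing the reported Janus-4o fine-tuning setup.
janus_4o_finetune = {
    "base_model": "Janus-Pro-7B",
    "dataset": "ShareGPT-4o-Image",   # 45K T2I + 46K TI2I pairs = 91K total
    "epochs": 3,
    "learning_rate": 5e-6,
    "batch_size": 128,
    "hardware": "8x A800",
    "wall_clock_hours": 6,
}
```

At these settings the whole run fits in a single afternoon of GPU time, which is the paper's efficiency claim in concrete terms.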
Q1
1. What was the main innovation of Janus-4o compared to its predecessor Janus-Pro?
It achieved faster training speed
It added text-and-image-to-image generation capability while improving text-to-image performance
It reduced the model size while maintaining performance
Q2
2. How long did it take to train Janus-4o on an 8×A800 GPU machine?
24 hours
12 hours
6 hours
Q3
3. What unique approach did the authors use to create their training dataset?
They collected real-world images from the internet
They generated synthetic images using multiple open-source models
They synthesized images using GPT-4o's image generation capabilities