2025-05-22 Papers


Paper 1

Scaling Law for Quantization-Aware Training

Published: 2025-05-20

Link: http://arxiv.org/pdf/2505.14302

1. 📘 Topic and Domain: The paper explores scaling laws for Quantization-Aware Training (QAT) in Large Language Models (LLMs), focusing on understanding how model quantization performance scales with different parameters.
2. 💡 Previous Research and New Ideas: Building on established scaling laws such as Kaplan and Chinchilla, the paper proposes a unified QAT scaling law that jointly models model size, training data volume, and quantization granularity, whereas prior quantization scaling work considered model size alone.
3. ❓ Problem: The paper addresses the lack of understanding of how QAT behaves at 4-bit precision (W4A4), particularly how quantization error relates to model size, training data, and quantization granularity.
4. 🛠️ Methods: The authors conducted 268 QAT experiments with various model sizes and training configurations, decomposed quantization error into weight and activation components, and developed a mathematical model to predict quantization error.
5. 📊 Results and Evaluation: The study found that quantization error decreases with larger models but increases with more training tokens and coarser quantization granularity, and identified that activation quantization in the FC2 layer is the primary bottleneck for W4A4 QAT performance.
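The three trends above can be captured in a simple power-law sketch. This is a minimal illustration, not the paper's fitted law: the functional form `k * D^beta * G^gamma / N^alpha` and the exponent values are assumptions chosen only to reproduce the reported directions of each trend (error falls with model size N, rises with training tokens D, rises with coarser group size G).

```python
# Illustrative sketch of a unified QAT scaling law delta(N, D, G).
# Exponents k, alpha, beta, gamma are hypothetical placeholders, not
# the values fitted from the paper's 268 experiments.

def qat_error(N, D, G, k=0.1, alpha=0.3, beta=0.07, gamma=0.2):
    """Predicted quantization error: decreases in N, increases in D and G."""
    return k * (D ** beta) * (G ** gamma) / (N ** alpha)

# Reproduce the paper's three qualitative findings:
small, large = qat_error(74e6, 10e9, 128), qat_error(973e6, 10e9, 128)
few, many = qat_error(500e6, 10e9, 128), qat_error(500e6, 200e9, 128)
fine, coarse = qat_error(500e6, 10e9, 32), qat_error(500e6, 10e9, 256)
```

The paper's decomposition into weight and activation error components would correspond to fitting two such terms separately and summing them.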

Flow (recovered from figure):
- Training setup: Llama3-style models, OLMo2-Mix-1124 dataset, quantization settings W4A4 / W4A16 / W16A4
- 268 QAT experiments: model sizes 74M to 973M parameters, training tokens 10B to 200B, quantization group sizes G ∈ {32, 64, 128, 256}
- Unified scaling law: error decreases with model size, increases with training tokens, decreases with smaller (finer) groups
Q1
1. What is the main innovation of the paper's scaling law compared to previous approaches?
It only considers model size and ignores other factors
It incorporates model size, training data volume, and quantization granularity together
It focuses exclusively on activation quantization error
Q2
2. According to the paper's findings, what happens to quantization error as the number of training tokens increases?
The error decreases linearly
The error remains constant
The error increases
Q3
3. What did the researchers identify as the primary bottleneck in W4A4 QAT performance?
The weight quantization in all layers
The activation quantization in the FC2 layer
The model size limitations

Paper 2

IA-T2I: Internet-Augmented Text-to-Image Generation

Published: 2025-05-21

Link: http://arxiv.org/pdf/2505.15779

1. 📘 Topic and Domain: Text-to-image generation with internet-augmented knowledge integration, in the domain of computer vision and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on existing text-to-image models like Stable Diffusion and ControlNet, proposes a novel framework to augment these models with real-time internet-retrieved reference images.
3. ❓ Problem: Addresses the challenge of T2I models failing to generate accurate images when text prompts contain uncertain knowledge (rare, unknown, or ambiguous concepts).
4. 🛠️ Methods: Implements an IA-T2I framework with active retrieval, query generation, hierarchical image selection, augmented generation, and self-reflection mechanisms to integrate internet-sourced reference images.
5. 📊 Results and Evaluation: Outperformed the GPT-4o baseline by approximately 30% in human evaluation on their Img-Ref-T2I dataset, and automated GPT-4o-based evaluation produced results comparable to human preference evaluation.
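The control flow of the framework described above can be sketched as follows. Every helper here (`needs_reference`, `generate_query`, `search_images`, `select_reference`, `generate_image`, `reflect`) is a hypothetical stub standing in for the paper's modules; the real system calls a search engine and a text-to-image model.

```python
# Minimal sketch of the IA-T2I pipeline: active retrieval -> query
# generation -> hierarchical selection -> augmented generation ->
# self-reflection. All component implementations are illustrative stubs.

def needs_reference(prompt):
    # Active retrieval module: decide if the prompt holds uncertain knowledge.
    return "uncertain" in prompt  # stub heuristic

def generate_query(prompt):
    return prompt.replace("uncertain ", "")  # stub query generator

def search_images(query):
    return [f"img_{i}:{query}" for i in range(5)]  # stub search results

def select_reference(candidates):
    # Hierarchical image selection: diversity filtering, then re-ranking.
    diverse = candidates[:3]
    return sorted(diverse)[0]

def generate_image(prompt, reference=None):
    return {"prompt": prompt, "reference": reference}  # stub T2I model

def reflect(image):
    # Self-reflection: text following + reference usage + quality check.
    return image["reference"] is not None

def ia_t2i(prompt, max_rounds=3):
    reference = None
    if needs_reference(prompt):
        reference = select_reference(search_images(generate_query(prompt)))
    for _ in range(max_rounds):
        image = generate_image(prompt, reference)
        if reflect(image) or reference is None:
            return image
    return image
```

The loop structure mirrors the self-reflection mechanism: generation is retried up to a bounded number of rounds until the reflection check passes.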

Flow (recovered from figure):
- Input: text prompt
- Active retrieval module: decides whether a reference image is needed
- Query generator → search engine
- Hierarchical image selection: diversity selection + re-ranking
- Augmented T2I generation
- Self-reflection mechanism: text following + reference usage + quality check
- Output: generated image
Q1
1. What is the main challenge that IA-T2I framework aims to solve?
Poor image quality in text-to-image generation
Inability to handle uncertain knowledge in text prompts
Slow processing speed of image generation
Q2
2. Which component of the IA-T2I framework determines if a reference image is needed?
Self-reflection mechanism
Query generator
Active retrieval module
Q3
3. In the Img-Ref-T2I dataset, what is NOT one of the three types of uncertain knowledge categories?
Known but rare
Unknown
Frequently used

Paper 3

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Published: 2025-05-20

Link: http://arxiv.org/pdf/2505.14231

1. 📘 Topic and Domain: Universal visual grounding with reinforcement learning, focusing on localizing objects in images based on complex textual instructions.
2. 💡 Previous Research and New Ideas: Building on traditional visual grounding methods and recent large language models, the paper proposes combining reasoning-guided multimodal language models with reinforcement learning for better cross-image understanding.
3. ❓ Problem: Addressing the limitation of current visual grounding methods that struggle with implicit and complex instructions across multiple images due to lack of advanced reasoning capabilities.
4. 🛠️ Methods: Two-stage approach: (1) Chain-of-Thought supervised fine-tuning using a high-quality annotated dataset, and (2) Group Relative Policy Optimization with a novel difficulty-aware weight adjustment strategy.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on MIG-Bench with 9.1% improvement over previous methods, and demonstrated strong zero-shot generalization with 23.4% average improvement across four image and video reasoning grounding benchmarks.
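The second training stage above pairs GRPO's group-relative advantages with a difficulty-aware weight. Here is a hedged sketch of that combination; the specific weighting rule (upweighting groups whose mean reward is low, i.e. harder samples) is an illustrative assumption and may differ from the paper's exact adjustment.

```python
# Sketch of GRPO group-relative advantages with a difficulty-aware
# gradient scale. The difficulty_weight formula is a hypothetical stand-in
# for the paper's adjustment strategy.
import statistics

def group_advantages(rewards):
    """Standard GRPO: normalize rewards within one group of rollouts."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sd for r in rewards]

def difficulty_weight(rewards):
    """Harder samples (low mean IoU-based reward) get a larger scale."""
    return 1.0 + (1.0 - statistics.mean(rewards))

def weighted_advantages(rewards):
    w = difficulty_weight(rewards)
    return [w * a for a in group_advantages(rewards)]
```

Because the advantages are mean-centered within each group, scaling them by a per-group weight changes gradient magnitude on hard samples without biasing the group's average update.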

Flow (recovered from figure):
- Stage 1 (cold start): build CoT data from the MGrounding-630k dataset, generate CoT with Qwen-VL-MAX (76k CoT samples), then CoT-SFT training
- Stage 2 (reinforcement learning): GRPO algorithm with an IoU-based reward and a format reward, plus difficulty-aware weight adjustment
- Output: final UniVG-R1 model
Q1
1. What was the key innovation in addressing the difficulty bias during GRPO training?
Increasing the dataset size
A difficulty-aware weight adjustment strategy that dynamically scales gradients based on sample difficulty
Using multiple language models for cross-validation
Q2
2. Why did the authors choose a two-stage training approach instead of pure reinforcement learning?
Because it was computationally cheaper
Because other papers recommended this approach
Because pure RL struggled with exploring the reasoning space due to the model's limited initial grounding ability
Q3
3. What is most impressive about the model's performance improvement on zero-shot tasks?
It achieved this with only 8.3% of the training data compared to previous methods
It only worked on image tasks but not video tasks
It required extensive task-specific fine-tuning