2025-04-03 Papers

Paper 1

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Published: 2025-04-01

Link: http://arxiv.org/pdf/2504.00999

1. 📘 Topic and Domain: A unified framework called MergeVQ for both visual generation and representation learning, combining token merging techniques with vector quantization in computer vision.
2. 💡 Previous Research and New Ideas: Builds on Vector Quantization (VQ) and Masked Image Modeling (MIM) research; proposes disentangled token merging and quantization to bridge the gap between generation and representation learning tasks.
3. ❓ Problem: Addresses the trade-off between generation quality and representation learning capability in a shared latent space, while improving efficiency in both tasks.
4. 🛠️ Methods: Uses token merging with Lookup-Free Quantization (LFQ) for compression, introduces Source Recovery to preserve spatial information, and employs MergeAR with KV cache compression for efficient generation.
5. 📊 Results and Evaluation: Achieves competitive performance in both representation learning (79.8% linear probe accuracy) and image generation (gFID of 2.24) on ImageNet-1K, while maintaining favorable token efficiency and inference speed.
Q1
1. What is the main novel contribution of MergeVQ that helps balance generation and representation learning?
Using larger model architectures
Disentangling semantics from latent space via token merging
Increasing the training dataset size
Q2
2. How does MergeVQ achieve efficient token recovery during reconstruction?
By simply discarding less important tokens
Through random token selection
Using source matrix to preserve positional information
Q3
3. What performance did MergeVQ achieve for linear probe accuracy on ImageNet-1K?
69.5%
79.8%
89.8%
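The token-merging and Source Recovery ideas summarized above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not MergeVQ's implementation: the nearest-seed assignment rule and the `merge_tokens` / `recover_tokens` helpers are illustrative, while the paper operates on attention-based merging and quantized codes. The key idea shown is that a binary source matrix records which original position fed each merged token, so positions can be recovered after compression.

```python
import numpy as np

def merge_tokens(tokens, num_merged):
    """Cluster tokens into `num_merged` slots and average within each slot.

    Records a binary source matrix S of shape (num_merged, N) with
    S[m, n] = 1 iff original token n was merged into slot m.
    """
    n_tokens, _ = tokens.shape
    seeds = tokens[:num_merged]                 # crude seeds: first M tokens
    assign = (tokens @ seeds.T).argmax(axis=1)  # nearest-seed assignment
    source = np.zeros((num_merged, n_tokens))
    source[assign, np.arange(n_tokens)] = 1.0
    counts = source.sum(axis=1, keepdims=True).clip(min=1)
    merged = (source @ tokens) / counts         # mean of each slot's tokens
    return merged, source

def recover_tokens(merged, source):
    """Source Recovery: broadcast merged tokens back to original positions."""
    return source.T @ merged

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                    # 16 tokens, 8-dim features
merged, source = merge_tokens(x, num_merged=4)
recovered = recover_tokens(merged, source)
print(merged.shape, recovered.shape)            # (4, 8) (16, 8)
```

Each column of the source matrix sums to one (every original token lands in exactly one slot), which is what makes the recovery step a simple transposed matrix product.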

Paper 2

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Published: 2025-04-02

Link: http://arxiv.org/pdf/2504.01724

1. 📘 Topic and Domain: Human image animation using diffusion transformers for generating realistic videos from single images, within the computer vision and deep learning domain.
2. 💡 Previous Research and New Ideas: Based on previous GAN and diffusion-based animation methods, proposing new hybrid guidance combining implicit facial representations, 3D head spheres, and body skeletons along with complementary appearance guidance.
3. ❓ Problem: Addressing limitations in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence in human image animation.
4. 🛠️ Methods: Uses a DiT-based framework with hybrid motion guidance, progressive training strategy, and complementary appearance guidance through multi-reference protocols and bone length adjustment.
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods across metrics (FID, SSIM, PSNR, LPIPS, FVD), demonstrating finer-grained motion, stronger identity preservation, better temporal consistency, and high fidelity in both portrait and full-body animation.
Q1
1. What is the main innovation in DreamActor-M1's approach to controlling facial expressions compared to traditional methods?
Using only facial landmarks for expression control
Combining implicit facial representations with 3D head spheres
Relying solely on 3D mesh models
Q2
2. How does DreamActor-M1 handle the challenge of long-term video generation consistency?
By using a single reference image throughout the generation
By generating complementary pseudo-references from multiple viewpoints
By limiting the video length to short segments
Q3
3. What unique training strategy does DreamActor-M1 employ to handle different image scales?
Single-stage training with fixed resolution
Dual-stage training with separate models
Progressive three-stage training with varying resolutions and scales
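The bone length adjustment mentioned in the methods above can be sketched as simple skeleton retargeting: keep each driving bone's direction but rescale it to the reference subject's bone length, walking the joint tree from the root. This is a hypothetical simplification, not DreamActor-M1's actual pipeline; the 4-joint chain in `PARENTS` and the `adjust_bone_lengths` helper are illustrative only.

```python
import numpy as np

# Hypothetical 4-joint chain (root -> spine -> neck -> head); joint i's
# parent is PARENTS[i], with -1 marking the root.
PARENTS = [-1, 0, 1, 2]

def adjust_bone_lengths(driving, reference, parents=PARENTS):
    """Rescale each driving bone to the reference subject's bone length.

    driving, reference: (J, 2) joint positions. Walks the tree from the
    root so every bone keeps the driving pose's direction but takes the
    reference skeleton's length.
    """
    out = driving.astype(float).copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue                      # root joint stays put
        direction = driving[j] - driving[p]
        norm = np.linalg.norm(direction)
        if norm < 1e-8:
            continue                      # degenerate bone, leave as-is
        ref_len = np.linalg.norm(reference[j] - reference[p])
        out[j] = out[p] + direction / norm * ref_len
    return out

drv = np.array([[0, 0], [1, 0], [2, 0], [3, 0]])  # unit-length bones
ref = np.array([[0, 0], [2, 0], [4, 0], [6, 0]])  # bones twice as long
print(adjust_bone_lengths(drv, ref))              # every bone rescaled to length 2
```

Because parents are processed before children, rescaling one bone correctly shifts the whole subtree below it, which is why proportions transfer without distorting the driving pose.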

Paper 3

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Published: 2025-04-01

Link: http://arxiv.org/pdf/2504.00883

1. 📘 Topic and Domain: Improving visual-spatial reasoning capabilities in multimodal large language models (MLLMs), specifically focusing on video-based visual intelligence.
2. 💡 Previous Research and New Ideas: Based on DeepSeek-R1-Zero's training approach, applies GRPO (Group Relative Policy Optimization) training specifically to visual-spatial reasoning tasks, together with a newly created VSI-100k dataset.
3. ❓ Problem: The inability of small- to medium-sized MLLMs to perform effective visual-spatial reasoning, even with Chain-of-Thought (CoT) prompting.
4. 🛠️ Methods: Implemented GRPO training using a custom VSI-100k dataset (created from ScanNet), with format and accuracy rewards, and compared performance using different prompting strategies (think-mode, observe-mode, and vanilla-mode).
5. 📊 Results and Evaluation: The vsGRPO-2B model outperformed its base model by 12.1% and surpassed GPT-4o, while vsGRPO-7B achieved performance comparable to LLaVA-NeXT-Video-72B; both outperformed supervised fine-tuning and direct preference optimization baselines.
Q1
1. What was the key finding regarding Chain of Thought (CoT) prompting in small to medium-sized Qwen2-VL models?
CoT prompting significantly improved visual-spatial reasoning
CoT prompting was ineffective and performed worse than vanilla prompting
CoT prompting only worked for numerical answer tasks
Q2
2. Why did the researchers leave out 'route planning' and 'appearance order' topics when creating the VSI-100k dataset?
These topics were too complex for the model to handle
They wanted to test the model's generalization ability to unseen tasks
These topics required expensive manual annotation and couldn't be constructed from static 3D information
Q3
3. What unexpected challenge did the researchers encounter during GRPO training regarding reward functions?
The model learned to exploit format rewards without meaningful thinking
The accuracy rewards were too low to be effective
The KL penalty prevented the model from learning
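The format and accuracy rewards and the group-relative normalization at the heart of GRPO can be sketched as follows. This is a hypothetical simplification: the `<answer>` tag convention and the `reward` / `group_advantages` helpers are illustrative, not the paper's exact reward functions. It also hints at the exploit noted in Q3: a response can collect the format reward with well-formed tags while contributing nothing correct.

```python
import re
from statistics import mean, pstdev

def reward(response, gold):
    """Format reward (answer wrapped in <answer> tags) plus accuracy reward
    (extracted answer matches the ground truth); each contributes 1.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    fmt = 1.0 if m else 0.0
    acc = 1.0 if m and m.group(1).strip() == gold else 0.0
    return fmt + acc

def group_advantages(responses, gold):
    """Group-relative advantage: each sampled response's reward is
    normalized against its own group's mean and standard deviation."""
    rs = [reward(r, gold) for r in responses]
    mu, sigma = mean(rs), pstdev(rs)
    return [(r - mu) / (sigma + 1e-6) for r in rs]

group = [
    "<answer>4.2</answer>",   # right format, right answer -> reward 2.0
    "<answer>3.0</answer>",   # right format, wrong answer -> reward 1.0
    "no tags at all",         # no format, no answer       -> reward 0.0
]
print([round(a, 2) for a in group_advantages(group, "4.2")])  # [1.22, 0.0, -1.22]
```

Normalizing within the sampled group rather than against a learned value function is what distinguishes GRPO from PPO-style training; only responses that beat their group-mates get a positive advantage.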