2025-04-03 Papers

Paper 1

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Published: 2025-04-01

Link: http://arxiv.org/pdf/2504.00999

1. 📘 Topic and Domain: A unified framework called MergeVQ for both visual generation and representation learning, combining token merging techniques with vector quantization in computer vision.
2. 💡 Previous Research and New Ideas: Builds on Vector Quantization (VQ) and Masked Image Modeling (MIM) research; proposes disentangled token merging and quantization to bridge the gap between generation and representation learning tasks.
3. ❓ Problem: Addresses the trade-off between generation quality and representation learning capability in a shared latent space, while improving efficiency in both tasks.
4. 🛠️ Methods: Uses token merging with Lookup-Free Quantization (LFQ) for compression, introduces Source Recovery to preserve spatial information, and employs MergeAR with KV cache compression for efficient generation.
5. 📊 Results and Evaluation: Achieves competitive performance in both representation learning (79.8% linear probe accuracy) and image generation (gFID of 2.24) on ImageNet-1K, while maintaining favorable token efficiency and inference speed.
Q1
1. What is the main novel contribution of MergeVQ that helps balance generation and representation learning?
Using larger model architectures
Disentangling semantics from latent space via token merging
Increasing the training dataset size
Q2
2. How does MergeVQ achieve efficient token recovery during reconstruction?
By simply discarding less important tokens
Through random token selection
Using source matrix to preserve positional information
Q3
3. What performance did MergeVQ achieve for linear probe accuracy on ImageNet-1K?
69.5%
79.8%
89.8%
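The token-merging and Source Recovery ideas summarized above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not MergeVQ's implementation: the nearest-seed assignment rule and the `merge_tokens` / `recover_tokens` helpers are illustrative, while the paper operates on attention-based merging and quantized codes. The key idea shown is that a binary source matrix records which original position fed each merged token, so positions can be recovered after compression.

```python
import numpy as np

def merge_tokens(tokens, num_merged):
    """Cluster tokens into `num_merged` slots and average within each slot.

    Records a binary source matrix S of shape (num_merged, N) with
    S[m, n] = 1 iff original token n was merged into slot m.
    """
    n_tokens, _ = tokens.shape
    seeds = tokens[:num_merged]                 # crude seeds: first M tokens
    assign = (tokens @ seeds.T).argmax(axis=1)  # nearest-seed assignment
    source = np.zeros((num_merged, n_tokens))
    source[assign, np.arange(n_tokens)] = 1.0
    counts = source.sum(axis=1, keepdims=True).clip(min=1)
    merged = (source @ tokens) / counts         # mean of each slot's tokens
    return merged, source

def recover_tokens(merged, source):
    """Source Recovery: broadcast merged tokens back to original positions."""
    return source.T @ merged

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                    # 16 tokens, 8-dim features
merged, source = merge_tokens(x, num_merged=4)
recovered = recover_tokens(merged, source)
print(merged.shape, recovered.shape)            # (4, 8) (16, 8)
```

Each column of the source matrix sums to one (every original token lands in exactly one slot), which is what makes the recovery step a simple transposed matrix product.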

Paper 2

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Published: 2025-04-02

Link: http://arxiv.org/pdf/2504.01724

1. 📘 Topic and Domain: Human image animation using diffusion transformers for generating realistic videos from single images, within the computer vision and deep learning domain.
2. 💡 Previous Research and New Ideas: Based on previous GAN and diffusion-based animation methods, proposing new hybrid guidance combining implicit facial representations, 3D head spheres, and body skeletons along with complementary appearance guidance.
3. ❓ Problem: Addressing limitations in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence in human image animation.
4. 🛠️ Methods: Uses a DiT-based framework with hybrid motion guidance, progressive training strategy, and complementary appearance guidance through multi-reference protocols and bone length adjustment.
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods across metrics (FID, SSIM, PSNR, LPIPS, FVD), demonstrating finer-grained motion, stronger identity preservation, better temporal consistency, and high fidelity in both portrait and full-body animation.
Q1
1. What is the main innovation in DreamActor-M1's approach to controlling facial expressions compared to traditional methods?
Using only facial landmarks for expression control
Combining implicit facial representations with 3D head spheres
Relying solely on 3D mesh models
Q2
2. How does DreamActor-M1 handle the challenge of long-term video generation consistency?
By using a single reference image throughout the generation
By generating complementary pseudo-references from multiple viewpoints
By limiting the video length to short segments
Q3
3. What unique training strategy does DreamActor-M1 employ to handle different image scales?
Single-stage training with fixed resolution
Dual-stage training with separate models
Progressive three-stage training with varying resolutions and scales
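The bone length adjustment mentioned in the methods above can be sketched as simple skeleton retargeting: keep each driving bone's direction but rescale it to the reference subject's bone length, walking the joint tree from the root. This is a hypothetical simplification, not DreamActor-M1's actual pipeline; the 4-joint chain in `PARENTS` and the `adjust_bone_lengths` helper are illustrative only.

```python
import numpy as np

# Hypothetical 4-joint chain (root -> spine -> neck -> head); joint i's
# parent is PARENTS[i], with -1 marking the root.
PARENTS = [-1, 0, 1, 2]

def adjust_bone_lengths(driving, reference, parents=PARENTS):
    """Rescale each driving bone to the reference subject's bone length.

    driving, reference: (J, 2) joint positions. Walks the tree from the
    root so every bone keeps the driving pose's direction but takes the
    reference skeleton's length.
    """
    out = driving.astype(float).copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue                      # root joint stays put
        direction = driving[j] - driving[p]
        norm = np.linalg.norm(direction)
        if norm < 1e-8:
            continue                      # degenerate bone, leave as-is
        ref_len = np.linalg.norm(reference[j] - reference[p])
        out[j] = out[p] + direction / norm * ref_len
    return out

drv = np.array([[0, 0], [1, 0], [2, 0], [3, 0]])  # unit-length bones
ref = np.array([[0, 0], [2, 0], [4, 0], [6, 0]])  # bones twice as long
print(adjust_bone_lengths(drv, ref))              # every bone rescaled to length 2
```

Because parents are processed before children, rescaling one bone correctly shifts the whole subtree below it, which is why proportions transfer without distorting the driving pose.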

Paper 3

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Published: 2025-04-01

Link: http://arxiv.org/pdf/2504.00883

1. 📘 Topic and Domain: Improving visual-spatial reasoning capabilities in multimodal large language models (MLLMs), specifically focusing on video-based visual intelligence.
2. 💡 Previous Research and New Ideas: Based on DeepSeek-R1-Zero's training approach, applies GRPO (Group Relative Policy Optimization) training specifically to visual-spatial reasoning tasks, together with a newly created VSI-100k dataset.
3. ❓ Problem: The inability of small- to medium-sized MLLMs to perform effective visual-spatial reasoning, even with Chain-of-Thought (CoT) prompting.
4. 🛠️ Methods: Implemented GRPO training using a custom VSI-100k dataset (created from ScanNet), with format and accuracy rewards, and compared performance using different prompting strategies (think-mode, observe-mode, and vanilla-mode).
5. 📊 Results and Evaluation: The vsGRPO-2B model outperformed its base model by 12.1% and surpassed GPT-4o, while vsGRPO-7B achieved performance comparable to LLaVA-NeXT-Video-72B; both outperformed supervised fine-tuning and direct preference optimization baselines.
Q1
1. What was the key finding regarding Chain of Thought (CoT) prompting in small to medium-sized Qwen2-VL models?
CoT prompting significantly improved visual-spatial reasoning
CoT prompting was ineffective and performed worse than vanilla prompting
CoT prompting only worked for numerical answer tasks
Q2
2. Why did the researchers leave out 'route planning' and 'appearance order' topics when creating the VSI-100k dataset?
These topics were too complex for the model to handle
They wanted to test the model's generalization ability to unseen tasks
These topics required expensive manual annotation and couldn't be constructed from static 3D information
Q3
3. What unexpected challenge did the researchers encounter during GRPO training regarding reward functions?
The model learned to exploit format rewards without meaningful thinking
The accuracy rewards were too low to be effective
The KL penalty prevented the model from learning
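The format and accuracy rewards and the group-relative normalization at the heart of GRPO can be sketched as follows. This is a hypothetical simplification: the `<answer>` tag convention and the `reward` / `group_advantages` helpers are illustrative, not the paper's exact reward functions. It also hints at the exploit noted in Q3: a response can collect the format reward with well-formed tags while contributing nothing correct.

```python
import re
from statistics import mean, pstdev

def reward(response, gold):
    """Format reward (answer wrapped in <answer> tags) plus accuracy reward
    (extracted answer matches the ground truth); each contributes 1.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    fmt = 1.0 if m else 0.0
    acc = 1.0 if m and m.group(1).strip() == gold else 0.0
    return fmt + acc

def group_advantages(responses, gold):
    """Group-relative advantage: each sampled response's reward is
    normalized against its own group's mean and standard deviation."""
    rs = [reward(r, gold) for r in responses]
    mu, sigma = mean(rs), pstdev(rs)
    return [(r - mu) / (sigma + 1e-6) for r in rs]

group = [
    "<answer>4.2</answer>",   # right format, right answer -> reward 2.0
    "<answer>3.0</answer>",   # right format, wrong answer -> reward 1.0
    "no tags at all",         # no format, no answer       -> reward 0.0
]
print([round(a, 2) for a in group_advantages(group, "4.2")])  # [1.22, 0.0, -1.22]
```

Normalizing within the sampled group rather than against a learned value function is what distinguishes GRPO from PPO-style training; only responses that beat their group-mates get a positive advantage.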