2025-11-26 Papers


Paper 1

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20635

1. 📘 Topic and Domain: A unified image generation model, iMontage, that handles variable numbers of input and output images while preserving cross-image consistency and generating highly dynamic content, in the domain of computer vision.
2. 💡 Previous Research and New Ideas: Builds on prior video diffusion models, but injects the diversity of image data into their temporal framework, repurposing video models into a unified system for flexible image generation.
3. ❓ Problem: How to generate multiple highly dynamic output images while maintaining both temporal and semantic consistency across the generated images, which existing models struggle with.
4. 🛠️ Methods: Developed a video-based framework with a novel rotary positional embedding strategy, created a data curation pipeline for motion diversity, and implemented a three-stage training scheme (pre-training, supervised fine-tuning, and high-quality annealing).
5. 📊 Results and Evaluation: Achieved state-of-the-art performance across various image generation tasks including one-to-one editing, many-to-one generation, and many-to-many generation, with strong quantitative metrics on benchmarks and convincing qualitative results in visualization tests.
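The head-tail temporal indexing behind the Marginal RoPE strategy can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the margin parameter are assumptions, while the example index sets ({0...7} for inputs, {24...31} for outputs) come from the paper's workflow figure.

```python
def head_tail_indices(n_inputs: int, n_outputs: int, margin: int = 16):
    """Assign early temporal positions to input images and late positions
    to output images, leaving a gap between the two blocks so input and
    output tokens do not interfere positionally (illustrative sketch of
    the Marginal RoPE head-tail indexing idea)."""
    inputs = list(range(n_inputs))                   # head: e.g. {0..7}
    start = n_inputs + margin                        # skip the margin
    outputs = list(range(start, start + n_outputs))  # tail: e.g. {24..31}
    return inputs, outputs

# 8 inputs and 8 outputs with a margin of 16 reproduce the figure's indices.
ins, outs = head_tail_indices(8, 8)
```

Keeping the same per-frame spatial RoPE while only shifting temporal indices is what lets the scheme preserve spatial geometry.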


[Figure] iMontage: Many-to-Many Image Generation Workflow
- Input processing: variable-length reference images + text prompts, encoded by a 3D VAE and a text tokenizer.
- Marginal RoPE strategy: head-tail temporal indexing (inputs at {0...7}, outputs at {24...31}) preserves spatial geometry and prevents positional interference.
- MMDiT architecture (from HunyuanVideo): dual-stream to single-stream blocks with variable-length attention maps.
- Dataset creation pipeline:
  - Pretraining data: 5M image-edit pairs, 15M video frame pairs, motion filtering.
  - Multi-task SFT data (VLM-assisted curation): Multi CRef (90k), SRef (35k), Multi-turn (100k), Conditioned CRef (50k), Multi-view (90k), Storyboard (29k).
  - HQ data: manual review, VLM scoring, high aesthetics.
- Three-stage training strategy: (1) pre-training with dynamic resolution bucketing and instruction following; (2) CocktailMix SFT with a difficulty-ordered curriculum and gradual task integration; (3) HQ annealing on a high-quality subset with learning-rate annealing.
- Applications: one-to-one editing (motion-aware, instruction-following), many-to-one generation (multi-reference fusion, content preservation), and many-to-many generation (storyboards with temporal consistency), all with high dynamics in a single inference pass.
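The CocktailMix SFT stage orders tasks by difficulty and introduces harder ones gradually. A toy sketch of such a schedule, assuming a linear unlocking rule (the task ordering and pacing below are illustrative, not the paper's exact curriculum):

```python
# Tasks ordered easiest -> hardest (the ordering here is an assumption).
CURRICULUM = ["one_to_one_edit", "multi_cref", "multi_view", "storyboard"]

def active_tasks(step: int, total_steps: int, curriculum=CURRICULUM):
    """Return the task pool to sample from at a given training step.
    Each additional fraction of training unlocks the next-harder task,
    mimicking a CocktailMix-style difficulty-ordered curriculum."""
    n_unlocked = 1 + int(len(curriculum) * step / total_steps)
    return curriculum[:min(n_unlocked, len(curriculum))]
```

Early steps train only the easiest task; by the end of the stage, all tasks are mixed together.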
Q1
1. What is the key innovation in iMontage's positional embedding strategy?
Using separate embeddings for spatial and temporal dimensions
Assigning early temporal positions to inputs and late positions to outputs with a margin between them
Randomly distributing temporal indices across all images
Q2
2. Which training strategy proved most effective for the multi-task supervised fine-tuning stage?
FlatMix - training all tasks together simultaneously
StageMix - grouping tasks by type and training in phases
CocktailMix - ordering tasks by difficulty and gradually introducing harder ones
Q3
3. What is the main advantage of building iMontage on top of a video model instead of a pure image model?
It runs faster during inference time
It inherits strong temporal coherence and motion priors from video training
It requires less training data overall

Paper 2

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20561

1. 📘 Topic and Domain: The paper investigates whether understanding capabilities truly inform generation in unified multimodal models (UMMs) through controlled evaluation frameworks.
2. 💡 Previous Research and New Ideas: The paper builds on existing UMM research but introduces a novel decoupled evaluation framework called UniSandbox using synthetic data to isolate and analyze specific capabilities.
3. ❓ Problem: The paper aims to systematically investigate and quantify the gap between understanding and generation capabilities in UMMs, particularly in reasoning and knowledge transfer.
4. 🛠️ Methods: The authors developed UniSandbox to evaluate models using controlled synthetic datasets, implemented Chain-of-Thought (CoT) prompting, and proposed a self-training framework called STARS to internalize reasoning capabilities.
5. 📊 Results and Evaluation: Results revealed significant understanding-generation gaps in current UMMs, but showed that CoT dramatically improved performance (e.g., BAGEL's score increased from 0.0283 to 0.5100), and the STARS framework successfully internalized reasoning capabilities through self-training.
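UniSandbox's reasoning-generation tasks pair an arithmetic prompt with a ground-truth object count (e.g., "3-2=?" should yield an image with exactly one object), so the data cannot leak from pre-training. A toy generator for such synthetic tasks; the prompt template and value ranges are assumptions:

```python
import random

def make_math_task(rng: random.Random, obj: str = "cat"):
    """Build one UniSandbox-style reasoning task: the model must solve
    the subtraction before it can render the right number of objects.
    Returns (prompt, ground-truth count) for automatic evaluation."""
    a = rng.randint(2, 9)
    b = rng.randint(1, a - 1)  # keep the result positive
    prompt = f"Generate an image with {a}-{b} {obj}s"
    return prompt, a - b

prompt, count = make_math_task(random.Random(0))
```

Seeding the generator makes each task reproducible, which is what lets the benchmark attribute failures to reasoning rather than to data noise.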


[Figure] UniSandbox: Understanding-Generation Gap Analysis
- Framework: decoupled evaluation with synthetic, leak-proof data; understanding is decomposed into knowledge + reasoning, enabling attribution analysis. Models covered: AR, AR+Diffusion, and query-based; generations are scored by a two-stage MLLM evaluation.
- Reasoning-generation task design: mathematical operations (e.g., "3-2=?" → generate 1 object) and symbolic mapping (A→1→cat, multi-step reasoning). Baseline open-source models achieve roughly 0% success.
- CoT activation: BAGEL + CoT improves from 0.028 to 0.510 (an 18x gain).
- STARS framework: (1) generate CoT data, (2) filter with rejection sampling, (3) fine-tune for implicit reasoning. Math tasks show cross-difficulty generalization; symbolic tasks require curriculum learning.
- Knowledge transfer: knowledge injection via virtual character profiles unseen during pre-training (forward: name → portrait; inverse: attributes → name). All models fail knowledge transfer; the query-based Blip3o is the best baseline (0.16). CoT acts as a knowledge activator: BAGEL + CoT raises the forward score from 0.10 to 0.63.
- Architecture insight: query-based architectures show implicit CoT-like properties, with queries progressively retrieving target knowledge.
- Key findings: CoT bridges the understanding-generation gap; query architectures show promise; self-training enables internalization.
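The STARS self-training loop (generate CoT data, filter by rejection sampling, fine-tune) can be sketched as one round of a generic pipeline. The function boundaries, threshold, and toy callables below are all assumptions; only the three-step structure comes from the paper:

```python
def stars_round(generate, score, finetune, prompts, threshold=0.5):
    """One STARS-style self-training round (sketch):
    1) the model generates chain-of-thought traces for each prompt,
    2) rejection sampling keeps only traces whose outputs score well,
    3) the surviving (prompt, trace) pairs become fine-tuning data,
       internalizing the reasoning so it no longer needs explicit CoT."""
    traces = [generate(p) for p in prompts]                       # step 1
    kept = [(p, t) for p, t in zip(prompts, traces)
            if score(p, t) >= threshold]                          # step 2
    finetune(kept)                                                # step 3
    return kept

# Toy run: the "scorer" accepts traces that restate the prompt.
kept = stars_round(
    generate=lambda p: f"think: {p}",
    score=lambda p, t: 1.0 if p in t else 0.0,
    finetune=lambda data: None,
    prompts=["3-2=?", "A->1->cat"],
)
```

Repeating such rounds is what the paper reports as internalizing reasoning through self-training.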
Q1
1. What was the main purpose of introducing Chain-of-Thought (CoT) in the study?
To speed up the model's processing time
To bridge the gap between understanding and generation capabilities
To reduce the model's memory requirements
Q2
2. What was unique about the UniSandbox evaluation framework?
It used only real-world data
It focused solely on image generation tasks
It used synthetic, leak-proof data to isolate specific capabilities
Q3
3. What was the key finding about query-based architectures in knowledge transfer tasks?
They performed worse than all other architectures
They exhibited implicit CoT-like properties
They required more computational resources

Paper 3

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20256

1. 📘 Topic and Domain: Reinforcement learning (RL) with adversarial rewards for text-to-image generation, focusing on improving image quality and human preference alignment.
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO) for language models, proposing a novel adversarial reward framework that uses reference images and visual foundation models instead of scalar rewards.
3. ❓ Problem: Addressing reward hacking in existing text-to-image generation systems where higher reward scores don't necessarily correspond to better image quality or human preferences.
4. 🛠️ Methods: Introduces Adv-GRPO framework that iteratively updates both reward model and generator using adversarial training, incorporating high-quality reference images as positive samples and leveraging visual foundation models like DINO for dense visual rewards.
5. 📊 Results and Evaluation: Achieved 70.0% and 72.4% win rates in image quality and aesthetics respectively compared to baselines in human evaluation, while maintaining comparable benchmark performance scores and enabling flexible style customization.
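GRPO normalizes each sample's reward against its sampling group (G = 16 in the paper's setup) to obtain the group advantage Â used in the objective. A minimal sketch of that standardization, assuming the usual mean/std normalization from the GRPO literature:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each reward against the
    mean and (population) std of its sampling group, as in GRPO.
    Samples above the group mean get positive advantage, below it
    negative, so the policy is pushed toward its own better samples."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_advantages([1.0, 2.0, 3.0, 4.0])
```

Because the normalization is per group, no learned value function is needed, which is what makes GRPO attractive for fine-tuning generators.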


[Figure] Adv-GRPO: Adversarial Reward Framework for Image Generation
- Setup: text prompts and reference images feed a generator G_θ (text-to-image, base model SD3) fine-tuned with LoRA via GRPO over groups of G = 16 samples; a reward model R_φ acts as a discriminator performing binary classification of reference vs. generated images.
- Reward types: (1) human preference (PickScore, HPS, Aesthetic); (2) rule-based (OCR, GenEval); (3) visual foundation models.
- Adversarial training loop: generator objective J_gen = E[f(r, Â, θ, ε, β)] with group advantage Â; reward-model loss J_reward = -E[log R(x_r)] - E[log(1 - R(x_g))], with reference images as positives and generated images as negatives.
- Foundation-model reward: DINO/SigLIP features combine global and local signals, R = λ_g·R_global + λ_l·R_local.
- Key innovations: (1) reward-hacking mitigation via reference-image supervision and dynamic reward-model updates; (2) dense visual rewards from foundation models instead of scalar scores; (3) RL-based, reference-guided style customization (distribution transfer).
- Results and evaluation: 70.0% human-evaluation win rate on quality; comparable PickScore, OCR, and GenEval benchmark scores; comprehensive gains in quality, aesthetics, and alignment; style transfer to anime and sci-fi styles.
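The foundation-model reward combines global and local similarity signals as R = λ_g·R_global + λ_l·R_local. A one-function sketch; the 7:3 default weighting follows the quiz below and should be treated as illustrative:

```python
def foundation_reward(r_global: float, r_local: float,
                      lam_g: float = 0.7, lam_l: float = 0.3) -> float:
    """Combine DINO-style global (whole-image) and local (patch-level)
    similarity scores into one dense visual reward:
        R = lam_g * R_global + lam_l * R_local
    The 7:3 split is an assumed default, not a verified constant."""
    return lam_g * r_global + lam_l * r_local

r = foundation_reward(0.8, 0.6)
```

Weighting the global term higher keeps overall composition dominant while the local term still penalizes patch-level artifacts.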
Q1
1. What is the main innovation of Adv-GRPO compared to traditional reward models in text-to-image generation?
It uses multiple scalar rewards simultaneously
It introduces an adversarial reward system with reference images
It removes the need for any reward function
Q2
2. When using DINO as a visual foundation model reward, how does the system process the images?
It only looks at global image features
It only analyzes local patch-level features
It combines both global and local features with a 7:3 weighting ratio
Q3
3. What practical problem in text-to-image generation does this paper primarily address?
Slow generation speed of images
Reward hacking where higher scores don't mean better images
High computational costs of training