2025-11-26 Papers


Paper 1

iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20635

1. 📘 Topic and Domain: A unified image generation model, iMontage, that handles variable numbers of input and output images while preserving cross-image consistency and generating highly dynamic content, in the domain of computer vision.
2. 💡 Previous Research and New Ideas: Builds on prior video diffusion models, but injects the diversity of image data into their temporal framework, repurposing video models into a unified system for flexible image generation.
3. ❓ Problem: How to generate multiple highly dynamic output images while maintaining both temporal and semantic consistency across the generated images, which existing models struggle with.
4. 🛠️ Methods: Developed a video-based framework with a novel rotary positional embedding strategy, created a data curation pipeline for motion diversity, and implemented a three-stage training scheme (pre-training, supervised fine-tuning, and high-quality annealing).
5. 📊 Results and Evaluation: Achieved state-of-the-art performance across various image generation tasks including one-to-one editing, many-to-one generation, and many-to-many generation, with strong quantitative metrics on benchmarks and convincing qualitative results in visualization tests.
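The head-tail temporal indexing behind the Marginal RoPE strategy can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the margin parameter are assumptions, while the example index sets ({0...7} for inputs, {24...31} for outputs) come from the paper's workflow figure.

```python
def head_tail_indices(n_inputs: int, n_outputs: int, margin: int = 16):
    """Assign early temporal positions to input images and late positions
    to output images, leaving a gap between the two blocks so input and
    output tokens do not interfere positionally (illustrative sketch of
    the Marginal RoPE head-tail indexing idea)."""
    inputs = list(range(n_inputs))                   # head: e.g. {0..7}
    start = n_inputs + margin                        # skip the margin
    outputs = list(range(start, start + n_outputs))  # tail: e.g. {24..31}
    return inputs, outputs

# 8 inputs and 8 outputs with a margin of 16 reproduce the figure's indices.
ins, outs = head_tail_indices(8, 8)
```

Keeping the same per-frame spatial RoPE while only shifting temporal indices is what lets the scheme preserve spatial geometry.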


[Figure] iMontage: Many-to-Many Image Generation Workflow
- Input processing: variable-length reference images + text prompts, encoded by a 3D VAE and a text tokenizer.
- Marginal RoPE strategy: head-tail temporal indexing (inputs at {0...7}, outputs at {24...31}) preserves spatial geometry and prevents positional interference.
- MMDiT architecture (from HunyuanVideo): dual-stream to single-stream blocks with variable-length attention maps.
- Dataset creation pipeline:
  - Pretraining data: 5M image-edit pairs, 15M video frame pairs, motion filtering.
  - Multi-task SFT data (VLM-assisted curation): Multi CRef (90k), SRef (35k), Multi-turn (100k), Conditioned CRef (50k), Multi-view (90k), Storyboard (29k).
  - HQ data: manual review, VLM scoring, high aesthetics.
- Three-stage training strategy: (1) pre-training with dynamic resolution bucketing and instruction following; (2) CocktailMix SFT with a difficulty-ordered curriculum and gradual task integration; (3) HQ annealing on a high-quality subset with learning-rate annealing.
- Applications: one-to-one editing (motion-aware, instruction-following), many-to-one generation (multi-reference fusion, content preservation), and many-to-many generation (storyboards with temporal consistency), all with high dynamics in a single inference pass.
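The CocktailMix SFT stage orders tasks by difficulty and introduces harder ones gradually. A toy sketch of such a schedule, assuming a linear unlocking rule (the task ordering and pacing below are illustrative, not the paper's exact curriculum):

```python
# Tasks ordered easiest -> hardest (the ordering here is an assumption).
CURRICULUM = ["one_to_one_edit", "multi_cref", "multi_view", "storyboard"]

def active_tasks(step: int, total_steps: int, curriculum=CURRICULUM):
    """Return the task pool to sample from at a given training step.
    Each additional fraction of training unlocks the next-harder task,
    mimicking a CocktailMix-style difficulty-ordered curriculum."""
    n_unlocked = 1 + int(len(curriculum) * step / total_steps)
    return curriculum[:min(n_unlocked, len(curriculum))]
```

Early steps train only the easiest task; by the end of the stage, all tasks are mixed together.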
Q1
1. What is the key innovation in iMontage's positional embedding strategy?
Using separate embeddings for spatial and temporal dimensions
Assigning early temporal positions to inputs and late positions to outputs with a margin between them
Randomly distributing temporal indices across all images
Q2
2. Which training strategy proved most effective for the multi-task supervised fine-tuning stage?
FlatMix - training all tasks together simultaneously
StageMix - grouping tasks by type and training in phases
CocktailMix - ordering tasks by difficulty and gradually introducing harder ones
Q3
3. What is the main advantage of building iMontage on top of a video model instead of a pure image model?
It runs faster during inference time
It inherits strong temporal coherence and motion priors from video training
It requires less training data overall

Paper 2

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20561

1. 📘 Topic and Domain: The paper investigates whether understanding capabilities truly inform generation in unified multimodal models (UMMs) through controlled evaluation frameworks.
2. 💡 Previous Research and New Ideas: The paper builds on existing UMM research but introduces a novel decoupled evaluation framework called UniSandbox using synthetic data to isolate and analyze specific capabilities.
3. ❓ Problem: The paper aims to systematically investigate and quantify the gap between understanding and generation capabilities in UMMs, particularly in reasoning and knowledge transfer.
4. 🛠️ Methods: The authors developed UniSandbox to evaluate models using controlled synthetic datasets, implemented Chain-of-Thought (CoT) prompting, and proposed a self-training framework called STARS to internalize reasoning capabilities.
5. 📊 Results and Evaluation: Results revealed significant understanding-generation gaps in current UMMs, but showed that CoT dramatically improved performance (e.g., BAGEL's score increased from 0.0283 to 0.5100), and the STARS framework successfully internalized reasoning capabilities through self-training.
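UniSandbox's reasoning-generation tasks pair an arithmetic prompt with a ground-truth object count (e.g., "3-2=?" should yield an image with exactly one object), so the data cannot leak from pre-training. A toy generator for such synthetic tasks; the prompt template and value ranges are assumptions:

```python
import random

def make_math_task(rng: random.Random, obj: str = "cat"):
    """Build one UniSandbox-style reasoning task: the model must solve
    the subtraction before it can render the right number of objects.
    Returns (prompt, ground-truth count) for automatic evaluation."""
    a = rng.randint(2, 9)
    b = rng.randint(1, a - 1)  # keep the result positive
    prompt = f"Generate an image with {a}-{b} {obj}s"
    return prompt, a - b

prompt, count = make_math_task(random.Random(0))
```

Seeding the generator makes each task reproducible, which is what lets the benchmark attribute failures to reasoning rather than to data noise.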


[Figure] UniSandbox: Understanding-Generation Gap Analysis
- Framework: decoupled evaluation with synthetic, leak-proof data; understanding is decomposed into knowledge + reasoning, enabling attribution analysis. Models covered: AR, AR+Diffusion, and query-based; generations are scored by a two-stage MLLM evaluation.
- Reasoning-generation task design: mathematical operations (e.g., "3-2=?" → generate 1 object) and symbolic mapping (A→1→cat, multi-step reasoning). Baseline open-source models achieve roughly 0% success.
- CoT activation: BAGEL + CoT improves from 0.028 to 0.510 (an 18x gain).
- STARS framework: (1) generate CoT data, (2) filter with rejection sampling, (3) fine-tune for implicit reasoning. Math tasks show cross-difficulty generalization; symbolic tasks require curriculum learning.
- Knowledge transfer: knowledge injection via virtual character profiles unseen during pre-training (forward: name → portrait; inverse: attributes → name). All models fail knowledge transfer; the query-based Blip3o is the best baseline (0.16). CoT acts as a knowledge activator: BAGEL + CoT raises the forward score from 0.10 to 0.63.
- Architecture insight: query-based architectures show implicit CoT-like properties, with queries progressively retrieving target knowledge.
- Key findings: CoT bridges the understanding-generation gap; query architectures show promise; self-training enables internalization.
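The STARS self-training loop (generate CoT data, filter by rejection sampling, fine-tune) can be sketched as one round of a generic pipeline. The function boundaries, threshold, and toy callables below are all assumptions; only the three-step structure comes from the paper:

```python
def stars_round(generate, score, finetune, prompts, threshold=0.5):
    """One STARS-style self-training round (sketch):
    1) the model generates chain-of-thought traces for each prompt,
    2) rejection sampling keeps only traces whose outputs score well,
    3) the surviving (prompt, trace) pairs become fine-tuning data,
       internalizing the reasoning so it no longer needs explicit CoT."""
    traces = [generate(p) for p in prompts]                       # step 1
    kept = [(p, t) for p, t in zip(prompts, traces)
            if score(p, t) >= threshold]                          # step 2
    finetune(kept)                                                # step 3
    return kept

# Toy run: the "scorer" accepts traces that restate the prompt.
kept = stars_round(
    generate=lambda p: f"think: {p}",
    score=lambda p, t: 1.0 if p in t else 0.0,
    finetune=lambda data: None,
    prompts=["3-2=?", "A->1->cat"],
)
```

Repeating such rounds is what the paper reports as internalizing reasoning through self-training.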
Q1
1. What was the main purpose of introducing Chain-of-Thought (CoT) in the study?
To speed up the model's processing time
To bridge the gap between understanding and generation capabilities
To reduce the model's memory requirements
Q2
2. What was unique about the UniSandbox evaluation framework?
It used only real-world data
It focused solely on image generation tasks
It used synthetic, leak-proof data to isolate specific capabilities
Q3
3. What was the key finding about query-based architectures in knowledge transfer tasks?
They performed worse than all other architectures
They exhibited implicit CoT-like properties
They required more computational resources

Paper 3

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Published: 2025-11-25

Link: http://arxiv.org/pdf/2511.20256

1. 📘 Topic and Domain: Reinforcement learning (RL) with adversarial rewards for text-to-image generation, focusing on improving image quality and human preference alignment.
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO) for language models, proposing a novel adversarial reward framework that uses reference images and visual foundation models instead of scalar rewards.
3. ❓ Problem: Addressing reward hacking in existing text-to-image generation systems where higher reward scores don't necessarily correspond to better image quality or human preferences.
4. 🛠️ Methods: Introduces Adv-GRPO framework that iteratively updates both reward model and generator using adversarial training, incorporating high-quality reference images as positive samples and leveraging visual foundation models like DINO for dense visual rewards.
5. 📊 Results and Evaluation: Achieved 70.0% and 72.4% win rates in image quality and aesthetics respectively compared to baselines in human evaluation, while maintaining comparable benchmark performance scores and enabling flexible style customization.
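GRPO normalizes each sample's reward against its sampling group (G = 16 in the paper's setup) to obtain the group advantage Â used in the objective. A minimal sketch of that standardization, assuming the usual mean/std normalization from the GRPO literature:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each reward against the
    mean and (population) std of its sampling group, as in GRPO.
    Samples above the group mean get positive advantage, below it
    negative, so the policy is pushed toward its own better samples."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

adv = group_advantages([1.0, 2.0, 3.0, 4.0])
```

Because the normalization is per group, no learned value function is needed, which is what makes GRPO attractive for fine-tuning generators.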


[Figure] Adv-GRPO: Adversarial Reward Framework for Image Generation
- Setup: text prompts and reference images feed a generator G_θ (text-to-image, base model SD3) fine-tuned with LoRA via GRPO over groups of G = 16 samples; a reward model R_φ acts as a discriminator performing binary classification of reference vs. generated images.
- Reward types: (1) human preference (PickScore, HPS, Aesthetic); (2) rule-based (OCR, GenEval); (3) visual foundation models.
- Adversarial training loop: generator objective J_gen = E[f(r, Â, θ, ε, β)] with group advantage Â; reward-model loss J_reward = -E[log R(x_r)] - E[log(1 - R(x_g))], with reference images as positives and generated images as negatives.
- Foundation-model reward: DINO/SigLIP features combine global and local signals, R = λ_g·R_global + λ_l·R_local.
- Key innovations: (1) reward-hacking mitigation via reference-image supervision and dynamic reward-model updates; (2) dense visual rewards from foundation models instead of scalar scores; (3) RL-based, reference-guided style customization (distribution transfer).
- Results and evaluation: 70.0% human-evaluation win rate on quality; comparable PickScore, OCR, and GenEval benchmark scores; comprehensive gains in quality, aesthetics, and alignment; style transfer to anime and sci-fi styles.
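The foundation-model reward combines global and local similarity signals as R = λ_g·R_global + λ_l·R_local. A one-function sketch; the 7:3 default weighting follows the quiz below and should be treated as illustrative:

```python
def foundation_reward(r_global: float, r_local: float,
                      lam_g: float = 0.7, lam_l: float = 0.3) -> float:
    """Combine DINO-style global (whole-image) and local (patch-level)
    similarity scores into one dense visual reward:
        R = lam_g * R_global + lam_l * R_local
    The 7:3 split is an assumed default, not a verified constant."""
    return lam_g * r_global + lam_l * r_local

r = foundation_reward(0.8, 0.6)
```

Weighting the global term higher keeps overall composition dominant while the local term still penalizes patch-level artifacts.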
Q1
1. What is the main innovation of Adv-GRPO compared to traditional reward models in text-to-image generation?
It uses multiple scalar rewards simultaneously
It introduces an adversarial reward system with reference images
It removes the need for any reward function
Q2
2. When using DINO as a visual foundation model reward, how does the system process the images?
It only looks at global image features
It only analyzes local patch-level features
It combines both global and local features with a 7:3 weighting ratio
Q3
3. What practical problem in text-to-image generation does this paper primarily address?
Slow generation speed of images
Reward hacking where higher scores don't mean better images
High computational costs of training