2025-12-08 Papers

Paper 1

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Published: 2025-12-05

Link: http://arxiv.org/pdf/2512.05965

1. 📘 Topic and Domain: Iterative image editing using multimodal language models to improve instruction-following capabilities through a "Think-while-Edit" framework.
2. 💡 Previous Research and New Ideas: Builds on prior work in instruction-based image editing and reinforcement learning; introduces a novel iterative reasoning approach in which an MLLM critiques results and refines editing instructions over multiple rounds.
3. ❓ Problem: Current single-turn image editing models have limited instruction-following capabilities due to lack of deliberation and inability to self-correct intermediate errors.
4. 🛠️ Methods: Implements a "Think-while-Edit" framework using an MLLM (EditThinker) trained via supervised learning and reinforcement learning to iteratively critique results, refine instructions, and repeat editing until satisfactory.
5. 📊 Results and Evaluation: Achieved significant performance improvements across multiple editing models and benchmarks, with EditThinker boosting scores by large margins (e.g., improving FLUX.1-Kontext from 3.44 to 3.98 on ImgEdit-Bench) through its iterative refinement approach.
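The critique-refine-repeat cycle described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: `critique` (standing in for the EditThinker MLLM) and `edit` (standing in for the underlying image editor) are hypothetical callables, and the threshold and round limit are assumed values.

```python
# Minimal sketch of the Think-while-Edit loop. `critique` and `edit` are
# hypothetical stand-ins for the EditThinker MLLM and the image editor;
# `threshold` and `max_rounds` are illustrative assumptions.

def think_while_edit(image, instruction, critique, edit,
                     threshold=4.0, max_rounds=5):
    """Critique-refine-repeat until the edit scores above the threshold."""
    refined = instruction
    edited = edit(image, refined)            # initial single-turn edit
    for _ in range(max_rounds):
        # EditThinker returns a critique score, its reasoning, and a
        # refined instruction for the next editing round.
        score, reasoning, refined = critique(image, edited, refined)
        if score >= threshold:               # satisfactory edit: stop
            break
        edited = edit(image, refined)        # re-edit with refined prompt
    return edited, score
```

Because the loop only touches the editor through the `edit` callable, any existing editor (FLUX, OmniGen, Qwen-Image-Edit) can be dropped in, which is the sense in which the framework is editor-agnostic.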

[Figure: EditThinker Think-while-Edit framework] THINKEDIT-140K dataset construction: trajectory generation → trajectory filter → step-wise filter, yielding a 140K SFT dataset and a 27K RL dataset. EditThinker training: supervised fine-tuning (SFT) followed by reinforcement learning (RL). Think-while-Edit iterative process: given a source image I_src and instruction T_s, EditThinker produces a critique score S_t, reasoning R_t, and a refined instruction T_t; the image editor (FLUX, OmniGen, or Qwen-Image-Edit) computes I_edit = Editor(I_src, T_t); if the score is below a threshold, the critique-refine-repeat loop continues, otherwise I_edit is returned as the final result. Key innovations: a dual-role MLLM for joint critique scoring and instruction refinement; RL alignment to bridge the thinking-editing gap with practical feedback; a universal framework that works with any existing image editor.
Q1
1. What is the main limitation of current single-turn image editing models that EditThinker addresses?
Poor image quality and resolution
Inability to self-correct intermediate errors and lack of deliberation
High computational cost and slow processing speed
Q2
2. How does EditThinker improve the editing process compared to traditional approaches?
By using larger and more complex neural networks
By applying more aggressive image filters and effects
By implementing an iterative critique-refine-repeat cycle using an MLLM
Q3
3. What type of training approach does EditThinker use to align its thinking with actual editing outcomes?
Supervised learning only
Reinforcement learning only
Both supervised and reinforcement learning

Paper 2

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04926

1. 📘 Topic and Domain: Image generation using latent diffusion models, specifically focusing on harmonizing semantic and texture modeling through asynchronous diffusion processes.
2. 💡 Previous Research and New Ideas: Builds on previous latent diffusion models and semantic-enhancement methods; proposes a novel asynchronous denoising approach in which semantic features are generated before texture details.
3. ❓ Problem: Traditional latent diffusion models denoise semantic and texture features simultaneously, leading to slow convergence and suboptimal generation quality.
4. 🛠️ Methods: Introduces Semantic-First Diffusion (SFD) with a dedicated Semantic VAE to compress high-level features, and implements a three-phase asynchronous denoising schedule where semantics lead texture generation.
5. 📊 Results and Evaluation: Achieves state-of-the-art FID scores (1.04-1.06) on ImageNet 256×256, with up to 100× faster convergence than original DiT models while maintaining high reconstruction fidelity.
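The "semantics lead" schedule above can be made concrete with a small timestep-mapping function. This is a sketch reconstructed from the stage boundaries reported for SFD (Δt = 0.3); the global progress variable `u` is an assumption introduced here for illustration, not notation from the paper.

```python
# Sketch of SFD's three-stage asynchronous timestep schedule (dt = 0.3).
# Global progress u runs over [0, 1 + dt]; the semantic timestep t_s
# finishes its [0, 1] sweep dt ahead of the texture timestep t_z.

DELTA_T = 0.3

def async_timesteps(u, dt=DELTA_T):
    """Map global progress u in [0, 1 + dt] to the pair (t_s, t_z)."""
    t_s = min(max(u, 0.0), 1.0)        # Stage III: t_s pinned at 1
    t_z = min(max(u - dt, 0.0), 1.0)   # Stage I: t_z pinned at 0
    return t_s, t_z
```

For u < dt only the semantic latent advances (Stage I); for dt ≤ u ≤ 1 both advance with a fixed offset (Stage II); for u > 1 only the texture latent keeps refining (Stage III), matching the three stages in the workflow.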

[Figure: Semantic-First Diffusion (SFD) workflow] From the input image x₁, a vision foundation model (DINOv2) extracts semantic features f_s, which the Semantic VAE (SemVAE, encoder E_s / decoder D_s) compresses into semantic latents s₁; the texture VAE (SD-VAE, encoder E_z) produces texture latents z₁; the two are concatenated into the composite latent c = [s₁, z₁]. Three-stage asynchronous denoising: Stage I, semantic initialization (t_s ∈ [0, Δt), t_z = 0): only semantic latents are denoised; Stage II, asynchronous generation (t_s ∈ [Δt, 1], t_z ∈ [0, 1−Δt)): both latents denoise asynchronously; Stage III, texture completion (t_s = 1, t_z ∈ [1−Δt, 1]): only texture latents continue refining. A diffusion transformer with dual timestep embeddings takes [s_ts, z_tz], [t_s, t_z], and condition y, and outputs [v̂_s, v̂_z]. Training objective: L_total = L_vel + λ·L_REPA, with L_pred = ||v̂_z − (z₁ − z₀)||² + β·||v̂_s − (s₁ − s₀)||², β = 2.0, Δt = 0.3. At inference, only z₁ is decoded; s₁ is discarded. Key benefits: ~100× faster convergence, FID 1.04 on ImageNet 256×256, semantic guidance for textures, preserved reconstruction quality, natural coarse-to-fine generation, compatibility with existing methods, minimal computational overhead.
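The velocity-prediction part of the objective can be written out directly from its two terms. This NumPy sketch assumes flow-matching velocity targets z₁ − z₀ and s₁ − s₀ and uses a mean over elements where the paper writes a squared norm; apart from β = 2.0, all shapes and names here are illustrative.

```python
import numpy as np

# Sketch of the SFD velocity-prediction loss
#   L_pred = ||v_hat_z - (z1 - z0)||^2 + beta * ||v_hat_s - (s1 - s0)||^2
# with beta = 2.0. A mean over elements replaces the raw squared norm.

def sfd_velocity_loss(v_hat_z, v_hat_s, z0, z1, s0, s1, beta=2.0):
    target_z = z1 - z0                          # texture velocity target
    target_s = s1 - s0                          # semantic velocity target
    loss_z = np.mean((v_hat_z - target_z) ** 2)
    loss_s = np.mean((v_hat_s - target_s) ** 2)
    return loss_z + beta * loss_s               # semantics up-weighted
```

Weighting the semantic term by β > 1 reflects the paper's premise that getting the semantics right first is what lets the texture branch converge quickly.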
Q1
1. What is the key innovation in the SFD's denoising process compared to traditional latent diffusion models?
It uses a completely sequential denoising process
It performs asynchronous denoising where semantics lead texture generation
It eliminates texture modeling entirely to focus on semantics
Q2
2. What improvement in training efficiency did SFD achieve compared to the original DiT model?
10x faster convergence
50x faster convergence
100x faster convergence
Q3
3. How does SFD handle the temporal offset between semantic and texture denoising?
Through a dynamic adaptive schedule
Using a fixed temporal offset Δt of 0.3
By randomly varying the offset during training

Paper 3

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04678

1. 📘 Topic and Domain: The paper focuses on efficient streaming video generation using rewarded distribution matching distillation, falling within the domain of computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Based on previous work in video diffusion models and distribution matching distillation, the paper introduces two new ideas: EMA-Sink for maintaining long-term context and Rewarded Distribution Matching Distillation (Re-DMD) for enhancing motion dynamics.
3. ❓ Problem: The paper addresses the challenge of generating high-quality streaming videos in real-time while maintaining both visual fidelity and dynamic motion, as current methods often result in diminished motion dynamics and over-dependence on initial frames.
4. 🛠️ Methods: The authors implement EMA-Sink to maintain compressed global states through exponential moving average updates, and Re-DMD which uses a vision-language model to rate and prioritize samples with greater dynamics during the distillation process.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance on standard benchmarks, enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU, with superior scores in both visual quality (4.82) and dynamic complexity (4.18) compared to baselines.
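The reward-weighting idea in Re-DMD can be sketched in a few lines: samples that the vision-language model rates as more dynamic receive exponentially larger weight in the distillation objective, following the exp(r/β) factor in the paper's loss. The reward values, β, and the per-sample loss interface below are illustrative assumptions, not the paper's API.

```python
import math

# Sketch of Re-DMD's reward weighting: each sample's distribution-matching
# loss is scaled by exp(r / beta), where r is a vision-language model's
# motion-quality rating. Rewards, beta, and the loss interface here are
# illustrative assumptions.

def rewarded_weights(rewards, beta=1.0):
    """Exponential reward weighting, normalized to sum to 1."""
    raw = [math.exp(r / beta) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]

def re_dmd_loss(per_sample_dmd_losses, rewards, beta=1.0):
    """Reward-weighted average of per-sample DMD losses."""
    weights = rewarded_weights(rewards, beta)
    return sum(w * l for w, l in zip(weights, per_sample_dmd_losses))
```

Because the weights only rescale gradients rather than replace the matching objective, high-dynamics samples are prioritized while the distilled distribution stays anchored to the data, which is how the method preserves fidelity.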

[Figure: Reward Forcing streaming video generation workflow] A bidirectional teacher model (Wan2.1-T2V-1.3B) is distilled into an autoregressive causal student DiT for real-time generation from a text prompt and initial frames. EMA-Sink packages global context into a compressed memory state via sliding-window attention and exponential-moving-average key-value cache updates, S^i_K = α·S^(i−1)_K + (1−α)·K_(i−w), with global keys K^i_global = [S^i_K; K_(i−w+1:i)]; this prevents initial-frame bias with O(1) per-step state updates and O(w²) attention cost versus O(n²). Re-DMD applies reward-weighted distribution matching, J = E[exp(r(x₀, c)/β) · log(p_fake/p_real)], using a vision-language reward to rate motion quality and bias gradient updates toward high-reward regions while preserving data fidelity. Training uses self-forcing autoregressive KV-cache simulation with denoising steps [1000, 750, 500, 250], 600 steps on 64 H200 GPUs, and a 9-frame window; the student plus VAE decoder streams 832×480, 5-second clips at 23.1 FPS on a single H100 GPU. Evaluation: VBench total 84.13 (quality 84.84, semantic 81.32); long-video score 81.41; dynamic degree 66.95; Qwen3-VL ratings of 4.82 (visual), 4.18 (dynamic), 4.04 (text); quality drift 2.505 (lower is better). Key innovations: ① EMA-Sink for efficient state packaging; ② Re-DMD for motion-aware distillation.
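The EMA-Sink cache update in the workflow above can be sketched directly from its equation: keys evicted from the sliding window are folded into a single sink state by an exponential moving average, and attention sees the sink prepended to the remaining window. The decay factor `alpha` is an assumption; the paper's window size is 9 frames.

```python
import numpy as np

# Sketch of the EMA-Sink key-cache update: when a key falls out of the
# sliding window of size w, it is folded into a compressed sink state,
#   S_K <- alpha * S_K + (1 - alpha) * K_evicted,
# and attention uses [S_K; last w keys]. `alpha` is an assumed value.

def ema_sink_step(sink, window, new_key, w=9, alpha=0.9):
    """Append a new key; fold any evicted key into the EMA sink (O(1))."""
    window = window + [new_key]
    if len(window) > w:
        evicted = window.pop(0)
        sink = alpha * sink + (1 - alpha) * evicted
    return sink, window

def global_keys(sink, window):
    """Keys visible to attention: sink state plus the current window."""
    return np.stack([sink] + window)   # [S_K; K_(i-w+1:i)]
```

Because the sink is a single state updated in place, memory stays constant however long the stream runs, while long-range context is still summarized rather than dropped.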
Q1
1. What is the main problem that the EMA-Sink mechanism aims to solve?
High computational costs during video generation
Over-dependence on initial frames and diminished motion dynamics
Poor video quality and resolution
Q2
2. What is the real-time frame rate achieved by the Reward Forcing method on a single H100 GPU?
17.0 FPS
20.7 FPS
23.1 FPS
Q3
3. How does Re-DMD improve upon vanilla distribution matching distillation?
By increasing the speed of video generation
By using multiple GPUs in parallel
By prioritizing samples with greater dynamics using a vision-language model