2025-12-08 Papers

Paper 1

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Published: 2025-12-05

Link: http://arxiv.org/pdf/2512.05965

1. 📘 Topic and Domain: Iterative image editing using multimodal language models to improve instruction-following capabilities through a "Think-while-Edit" framework.
2. 💡 Previous Research and New Ideas: Builds on prior work in instruction-based image editing and reinforcement learning; introduces a novel iterative reasoning approach in which an MLLM critiques results and refines editing instructions over multiple rounds.
3. ❓ Problem: Current single-turn image editing models have limited instruction-following capabilities due to lack of deliberation and inability to self-correct intermediate errors.
4. 🛠️ Methods: Implements a "Think-while-Edit" framework using an MLLM (EditThinker) trained via supervised learning and reinforcement learning to iteratively critique results, refine instructions, and repeat editing until satisfactory.
5. 📊 Results and Evaluation: Achieved significant performance improvements across multiple editing models and benchmarks, with EditThinker boosting scores by large margins (e.g., improving FLUX.1-Kontext from 3.44 to 3.98 on ImgEdit-Bench) through its iterative refinement approach.
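The critique-refine-repeat cycle described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: `critique` (standing in for the EditThinker MLLM) and `edit` (standing in for the underlying image editor) are hypothetical callables, and the threshold and round limit are assumed values.

```python
# Minimal sketch of the Think-while-Edit loop. `critique` and `edit` are
# hypothetical stand-ins for the EditThinker MLLM and the image editor;
# `threshold` and `max_rounds` are illustrative assumptions.

def think_while_edit(image, instruction, critique, edit,
                     threshold=4.0, max_rounds=5):
    """Critique-refine-repeat until the edit scores above the threshold."""
    refined = instruction
    edited = edit(image, refined)            # initial single-turn edit
    for _ in range(max_rounds):
        # EditThinker returns a critique score, its reasoning, and a
        # refined instruction for the next editing round.
        score, reasoning, refined = critique(image, edited, refined)
        if score >= threshold:               # satisfactory edit: stop
            break
        edited = edit(image, refined)        # re-edit with refined prompt
    return edited, score
```

Because the loop only touches the editor through the `edit` callable, any existing editor (FLUX, OmniGen, Qwen-Image-Edit) can be dropped in, which is the sense in which the framework is editor-agnostic.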

[Figure: EditThinker Think-while-Edit framework] THINKEDIT-140K dataset construction: trajectory generation → trajectory filter → step-wise filter, yielding a 140K SFT dataset and a 27K RL dataset. EditThinker training: supervised fine-tuning (SFT) followed by reinforcement learning (RL). Think-while-Edit iterative process: given a source image I_src and instruction T_s, EditThinker produces a critique score S_t, reasoning R_t, and a refined instruction T_t; the image editor (FLUX, OmniGen, or Qwen-Image-Edit) computes I_edit = Editor(I_src, T_t); if the score is below a threshold, the critique-refine-repeat loop continues, otherwise I_edit is returned as the final result. Key innovations: a dual-role MLLM for joint critique scoring and instruction refinement; RL alignment to bridge the thinking-editing gap with practical feedback; a universal framework that works with any existing image editor.
Q1
1. What is the main limitation of current single-turn image editing models that EditThinker addresses?
Poor image quality and resolution
Inability to self-correct intermediate errors and lack of deliberation
High computational cost and slow processing speed
Q2
2. How does EditThinker improve the editing process compared to traditional approaches?
By using larger and more complex neural networks
By applying more aggressive image filters and effects
By implementing an iterative critique-refine-repeat cycle using an MLLM
Q3
3. What type of training approach does EditThinker use to align its thinking with actual editing outcomes?
Supervised learning only
Reinforcement learning only
Both supervised and reinforcement learning

Paper 2

Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04926

1. 📘 Topic and Domain: Image generation using latent diffusion models, specifically focusing on harmonizing semantic and texture modeling through asynchronous diffusion processes.
2. 💡 Previous Research and New Ideas: Builds on previous latent diffusion models and semantic-enhancement methods; proposes a novel asynchronous denoising approach in which semantic features are generated before texture details.
3. ❓ Problem: Traditional latent diffusion models denoise semantic and texture features simultaneously, leading to slow convergence and suboptimal generation quality.
4. 🛠️ Methods: Introduces Semantic-First Diffusion (SFD) with a dedicated Semantic VAE to compress high-level features, and implements a three-phase asynchronous denoising schedule where semantics lead texture generation.
5. 📊 Results and Evaluation: Achieves state-of-the-art FID scores (1.04-1.06) on ImageNet 256×256, with up to 100× faster convergence than original DiT models while maintaining high reconstruction fidelity.
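The "semantics lead" schedule above can be made concrete with a small timestep-mapping function. This is a sketch reconstructed from the stage boundaries reported for SFD (Δt = 0.3); the global progress variable `u` is an assumption introduced here for illustration, not notation from the paper.

```python
# Sketch of SFD's three-stage asynchronous timestep schedule (dt = 0.3).
# Global progress u runs over [0, 1 + dt]; the semantic timestep t_s
# finishes its [0, 1] sweep dt ahead of the texture timestep t_z.

DELTA_T = 0.3

def async_timesteps(u, dt=DELTA_T):
    """Map global progress u in [0, 1 + dt] to the pair (t_s, t_z)."""
    t_s = min(max(u, 0.0), 1.0)        # Stage III: t_s pinned at 1
    t_z = min(max(u - dt, 0.0), 1.0)   # Stage I: t_z pinned at 0
    return t_s, t_z
```

For u < dt only the semantic latent advances (Stage I); for dt ≤ u ≤ 1 both advance with a fixed offset (Stage II); for u > 1 only the texture latent keeps refining (Stage III), matching the three stages in the workflow.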

[Figure: Semantic-First Diffusion (SFD) workflow] From the input image x₁, a vision foundation model (DINOv2) extracts semantic features f_s, which the Semantic VAE (SemVAE, encoder E_s / decoder D_s) compresses into semantic latents s₁; the texture VAE (SD-VAE, encoder E_z) produces texture latents z₁; the two are concatenated into the composite latent c = [s₁, z₁]. Three-stage asynchronous denoising: Stage I, semantic initialization (t_s ∈ [0, Δt), t_z = 0): only semantic latents are denoised; Stage II, asynchronous generation (t_s ∈ [Δt, 1], t_z ∈ [0, 1−Δt)): both latents denoise asynchronously; Stage III, texture completion (t_s = 1, t_z ∈ [1−Δt, 1]): only texture latents continue refining. A diffusion transformer with dual timestep embeddings takes [s_ts, z_tz], [t_s, t_z], and condition y, and outputs [v̂_s, v̂_z]. Training objective: L_total = L_vel + λ·L_REPA, with L_pred = ||v̂_z − (z₁ − z₀)||² + β·||v̂_s − (s₁ − s₀)||², β = 2.0, Δt = 0.3. At inference, only z₁ is decoded; s₁ is discarded. Key benefits: ~100× faster convergence, FID 1.04 on ImageNet 256×256, semantic guidance for textures, preserved reconstruction quality, natural coarse-to-fine generation, compatibility with existing methods, minimal computational overhead.
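The velocity-prediction part of the objective can be written out directly from its two terms. This NumPy sketch assumes flow-matching velocity targets z₁ − z₀ and s₁ − s₀ and uses a mean over elements where the paper writes a squared norm; apart from β = 2.0, all shapes and names here are illustrative.

```python
import numpy as np

# Sketch of the SFD velocity-prediction loss
#   L_pred = ||v_hat_z - (z1 - z0)||^2 + beta * ||v_hat_s - (s1 - s0)||^2
# with beta = 2.0. A mean over elements replaces the raw squared norm.

def sfd_velocity_loss(v_hat_z, v_hat_s, z0, z1, s0, s1, beta=2.0):
    target_z = z1 - z0                          # texture velocity target
    target_s = s1 - s0                          # semantic velocity target
    loss_z = np.mean((v_hat_z - target_z) ** 2)
    loss_s = np.mean((v_hat_s - target_s) ** 2)
    return loss_z + beta * loss_s               # semantics up-weighted
```

Weighting the semantic term by β > 1 reflects the paper's premise that getting the semantics right first is what lets the texture branch converge quickly.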
Q1
1. What is the key innovation in the SFD's denoising process compared to traditional latent diffusion models?
It uses a completely sequential denoising process
It performs asynchronous denoising where semantics lead texture generation
It eliminates texture modeling entirely to focus on semantics
Q2
2. What improvement in training efficiency did SFD achieve compared to the original DiT model?
10x faster convergence
50x faster convergence
100x faster convergence
Q3
3. How does SFD handle the temporal offset between semantic and texture denoising?
Through a dynamic adaptive schedule
Using a fixed temporal offset Δt of 0.3
By randomly varying the offset during training

Paper 3

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Published: 2025-12-04

Link: http://arxiv.org/pdf/2512.04678

1. 📘 Topic and Domain: The paper focuses on efficient streaming video generation using rewarded distribution matching distillation, falling within the domain of computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Based on previous work in video diffusion models and distribution matching distillation, the paper introduces two new ideas: EMA-Sink for maintaining long-term context and Rewarded Distribution Matching Distillation (Re-DMD) for enhancing motion dynamics.
3. ❓ Problem: The paper addresses the challenge of generating high-quality streaming videos in real-time while maintaining both visual fidelity and dynamic motion, as current methods often result in diminished motion dynamics and over-dependence on initial frames.
4. 🛠️ Methods: The authors implement EMA-Sink to maintain compressed global states through exponential moving average updates, and Re-DMD which uses a vision-language model to rate and prioritize samples with greater dynamics during the distillation process.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance on standard benchmarks, enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU, with superior scores in both visual quality (4.82) and dynamic complexity (4.18) compared to baselines.
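The reward-weighting idea in Re-DMD can be sketched in a few lines: samples that the vision-language model rates as more dynamic receive exponentially larger weight in the distillation objective, following the exp(r/β) factor in the paper's loss. The reward values, β, and the per-sample loss interface below are illustrative assumptions, not the paper's API.

```python
import math

# Sketch of Re-DMD's reward weighting: each sample's distribution-matching
# loss is scaled by exp(r / beta), where r is a vision-language model's
# motion-quality rating. Rewards, beta, and the loss interface here are
# illustrative assumptions.

def rewarded_weights(rewards, beta=1.0):
    """Exponential reward weighting, normalized to sum to 1."""
    raw = [math.exp(r / beta) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]

def re_dmd_loss(per_sample_dmd_losses, rewards, beta=1.0):
    """Reward-weighted average of per-sample DMD losses."""
    weights = rewarded_weights(rewards, beta)
    return sum(w * l for w, l in zip(weights, per_sample_dmd_losses))
```

Because the weights only rescale gradients rather than replace the matching objective, high-dynamics samples are prioritized while the distilled distribution stays anchored to the data, which is how the method preserves fidelity.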

[Figure: Reward Forcing streaming video generation workflow] A bidirectional teacher model (Wan2.1-T2V-1.3B) is distilled into an autoregressive causal student DiT for real-time generation from a text prompt and initial frames. EMA-Sink packages global context into a compressed memory state via sliding-window attention and exponential-moving-average key-value cache updates, S^i_K = α·S^(i−1)_K + (1−α)·K_(i−w), with global keys K^i_global = [S^i_K; K_(i−w+1:i)]; this prevents initial-frame bias with O(1) per-step state updates and O(w²) attention cost versus O(n²). Re-DMD applies reward-weighted distribution matching, J = E[exp(r(x₀, c)/β) · log(p_fake/p_real)], using a vision-language reward to rate motion quality and bias gradient updates toward high-reward regions while preserving data fidelity. Training uses self-forcing autoregressive KV-cache simulation with denoising steps [1000, 750, 500, 250], 600 steps on 64 H200 GPUs, and a 9-frame window; the student plus VAE decoder streams 832×480, 5-second clips at 23.1 FPS on a single H100 GPU. Evaluation: VBench total 84.13 (quality 84.84, semantic 81.32); long-video score 81.41; dynamic degree 66.95; Qwen3-VL ratings of 4.82 (visual), 4.18 (dynamic), 4.04 (text); quality drift 2.505 (lower is better). Key innovations: ① EMA-Sink for efficient state packaging; ② Re-DMD for motion-aware distillation.
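The EMA-Sink cache update in the workflow above can be sketched directly from its equation: keys evicted from the sliding window are folded into a single sink state by an exponential moving average, and attention sees the sink prepended to the remaining window. The decay factor `alpha` is an assumption; the paper's window size is 9 frames.

```python
import numpy as np

# Sketch of the EMA-Sink key-cache update: when a key falls out of the
# sliding window of size w, it is folded into a compressed sink state,
#   S_K <- alpha * S_K + (1 - alpha) * K_evicted,
# and attention uses [S_K; last w keys]. `alpha` is an assumed value.

def ema_sink_step(sink, window, new_key, w=9, alpha=0.9):
    """Append a new key; fold any evicted key into the EMA sink (O(1))."""
    window = window + [new_key]
    if len(window) > w:
        evicted = window.pop(0)
        sink = alpha * sink + (1 - alpha) * evicted
    return sink, window

def global_keys(sink, window):
    """Keys visible to attention: sink state plus the current window."""
    return np.stack([sink] + window)   # [S_K; K_(i-w+1:i)]
```

Because the sink is a single state updated in place, memory stays constant however long the stream runs, while long-range context is still summarized rather than dropped.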
Q1
1. What is the main problem that the EMA-Sink mechanism aims to solve?
High computational costs during video generation
Over-dependence on initial frames and diminished motion dynamics
Poor video quality and resolution
Q2
2. What is the real-time frame rate achieved by the Reward Forcing method on a single H100 GPU?
17.0 FPS
20.7 FPS
23.1 FPS
Q3
3. How does Re-DMD improve upon vanilla distribution matching distillation?
By increasing the speed of video generation
By using multiple GPUs in parallel
By prioritizing samples with greater dynamics using a vision-language model