2025-11-04 Papers


Paper 1

UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Published: 2025-11-03

Link: http://arxiv.org/pdf/2511.01678

1. 📘 Topic and Domain: The paper presents UniLumos, a unified framework for image and video relighting that produces physically plausible lighting effects with a flow-matching generative backbone.
2. 💡 Previous Research and New Ideas: Previous diffusion-based relighting models operate in a semantic latent space; this paper adds physics-plausible feedback by incorporating RGB-space geometry supervision into a flow-matching backbone.
3. ❓ Problem: The paper addresses the issue of unrealistic lighting effects in existing diffusion-based relighting methods, which often produce overexposed highlights, misaligned shadows, and incorrect occlusions due to lack of physical correctness.
4. 🛠️ Methods: The authors implement physics-plausible feedback using depth and normal maps extracted from the relit outputs by a frozen dense estimator, employ path consistency learning for efficient few-step inference, and develop a structured six-dimensional annotation protocol for illumination attributes.
5. 📊 Results and Evaluation: UniLumos achieved state-of-the-art relighting quality with improved physical consistency while delivering a 20x speedup for both image and video relighting, evaluated with standard metrics (PSNR, SSIM, LPIPS) and the new LumosBench controllability benchmark.
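The physics-guided loss from the paper's overview figure, L_phy = M ⊙ (‖D̂ − D‖₂/‖D‖₂ + ‖N̂ − N‖₂/‖N‖₂), can be sketched numerically. This is a minimal sketch, not the authors' implementation: the array shapes, the elementwise mask convention, and the epsilon guard are assumptions.

```python
import numpy as np

def physics_feedback_loss(d_hat, d_ref, n_hat, n_ref, mask):
    """Sketch of the physics feedback term: relative L2 error of depth
    and normal maps predicted from the relit output versus the reference,
    restricted by a foreground mask M. Geometry maps would come from a
    frozen dense estimator (e.g., Lotus) in the paper's setup."""
    eps = 1e-8  # guard against division by zero (an assumption)
    depth_term = (np.linalg.norm(mask * (d_hat - d_ref))
                  / (np.linalg.norm(d_ref) + eps))
    normal_term = (np.linalg.norm(mask * (n_hat - n_ref))
                   / (np.linalg.norm(n_ref) + eps))
    return depth_term + normal_term
```

Because the supervision lives in RGB-derived geometry maps rather than the latent space, the feedback can penalize misaligned shadows and occlusions directly, while inference itself stays geometry-free.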

Diagram summary (UniLumos workflow figure):

- LumosData construction pipeline: subject mask → Lumos augmentation → Gaussian background → caption augmentation, governed by a 6D lighting annotation protocol (direction, light source, intensity, color temperature, dynamics, optical phenomena).
- Architecture: input video V_real → frozen Wan-VAE encoder → flow-matching Wan2.1 backbone (N trainable DiT blocks) with umT5 text encoder → path consistency learning for few-step inference → frozen Wan-VAE decoder → relit output V_relit.
- Joint loss: L₀ (flow matching), L_fast (path consistency), L_phy (physics feedback). Training strategy: 80% of steps use L₀, 20% use L_fast, and 50% of the L₀ steps additionally apply L_phy.
- Physics-plausible feedback: a frozen dense estimator (e.g., Lotus) extracts geometry maps, depth D ∈ ℝ^[T,H,W] and normal N ∈ ℝ^[T,H,W]; the physics-guided loss is L_phy = M ⊙ (‖D̂ − D‖₂/‖D‖₂ + ‖N̂ − N‖₂/‖N‖₂), which aligns lighting with scene geometry and improves shadow alignment and spatial coherence. Key benefits: RGB-space supervision, geometry-free inference, physical plausibility.
- LumosBench evaluation: visual fidelity (PSNR, SSIM, LPIPS), temporal consistency (R-Motion), Lumos consistency (VLM-based attribute alignment), and dense L2 error (geometry alignment). Results: 20× speedup over baselines, state-of-the-art quality and consistency, enhanced controllability.
- Key contributions: a unified framework for image and video relighting; physics-plausible feedback with RGB-space geometry supervision; a structured 6D illumination annotation protocol; LumosBench for attribute-level controllability evaluation.
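The training strategy in the figure (80% of steps → L₀, 20% → L_fast, 50% of the L₀ steps also apply L_phy) can be sketched as a stochastic loss schedule. The percentages are from the figure; the sampling mechanics here are an assumption.

```python
import random

def pick_losses(rng):
    """Sketch of the stochastic loss schedule: each training step draws
    which objectives it optimizes. 80% of steps train flow matching
    (L0), the rest train path consistency (L_fast); half of the L0
    steps attach the physics feedback term (L_phy)."""
    losses = []
    if rng.random() < 0.8:
        losses.append("L0")
        if rng.random() < 0.5:
            losses.append("L_phy")
    else:
        losses.append("L_fast")
    return losses
```

Over many steps this yields roughly 80% L₀, 20% L_fast, and 40% L_phy participation, matching the figure's stated proportions.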
Q1
1. What is the main innovation of UniLumos compared to previous relighting methods?
It uses a completely new diffusion architecture
It incorporates RGB-space geometry feedback into flow-matching
It relies solely on latent space optimization
Q2
2. How many dimensions does UniLumos's illumination annotation protocol contain?
Four dimensions covering basic lighting attributes
Five dimensions including temporal dynamics
Six dimensions covering direction, source type, intensity, color temperature, dynamics and optical phenomena
Q3
3. What is the performance improvement in terms of speed that UniLumos achieves?
5x speedup compared to previous methods
20x speedup for both image and video relighting
50x speedup but only for image relighting

Paper 2

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Published: 2025-11-03

Link: http://arxiv.org/pdf/2511.01295

1. 📘 Topic and Domain: The paper presents UniREditBench, a comprehensive benchmark for evaluating reasoning-based image editing models across both real-world and game-world scenarios.
2. 💡 Previous Research and New Ideas: Previous benchmarks focused mainly on single-object attribute transformations in realistic scenarios; this paper introduces new dimensions including multi-object interactions and game-world scenarios with human-defined rules, plus a dual-reference evaluation system.
3. ❓ Problem: The paper addresses the lack of comprehensive benchmarks for evaluating complex reasoning-based image editing tasks and the limitations of text-only reference evaluation methods.
4. 🛠️ Methods: The authors developed a multi-scenario data synthesis pipeline to create 2,700 curated samples across 8 primary dimensions and 18 sub-dimensions, implemented dual-reference evaluation using both textual and ground-truth image references, and created UniREdit-Data-100K dataset with chain-of-thought reasoning annotations.
5. 📊 Results and Evaluation: The fine-tuned UniREdit-Bagel model showed substantial improvements over both open-source and closed-source models in handling complex reasoning-based image editing tasks, demonstrating the effectiveness of their benchmark and dataset.
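The dual-reference idea (judging an edit against both a textual reference and a ground-truth edited image) can be sketched as a weighted combination of two judge scores. This is a hypothetical sketch: the paper uses VLM-based judging for both signals, and the 0–1 score range and equal default weighting here are assumptions.

```python
def dual_reference_score(text_score, image_score, w_text=0.5):
    """Sketch of dual-reference evaluation: `text_score` measures how
    well the edited image satisfies the textual reference (instruction
    adherence), `image_score` measures agreement with the ground-truth
    edited image. Both are assumed normalized to [0, 1]."""
    for s in (text_score, image_score):
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores are expected in [0, 1]")
    return w_text * text_score + (1.0 - w_text) * image_score
```

The point of the second reference is that a text-only judge can be fooled by plausible-but-wrong edits; comparing against a known-correct target image catches those cases.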

Diagram summary (UniREditBench methodology figure):

- Real-world data synthesis: hand-crafted reference prompts (original image description, instruction, textual reference) → VLM scale-up with Gemini 2.5 Pro to generate diverse text prompts → generation of original/edited image pairs → VLM-based quality filtering and chain-of-thought (CoT) generation.
- Game-world data synthesis: game problem design (maze, Sokoban, Sudoku, etc.) → automatic image and instruction generation via Python programs → transformation of programmatic solutions into natural-language CoT → quality assurance validating logical and visual correctness.
- Unified data processing pipeline: instruction de-duplication; multi-dimensional filtering (text hallucination, instruction adherence, content preservation, visual quality, image hallucination, CoT quality); human inspection.
- Outputs: UniREditBench (2,700 samples, 8 dimensions, 18 sub-categories); UniREdit-Data-100K (100,421 training samples with high-quality CoT annotations); the fine-tuned UniREdit-Bagel model with enhanced performance, evaluated via dual-reference scoring on instruction following, visual consistency, and visual quality.
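The game-world synthesis step generates puzzle state and editing instruction together from a program, so the ground-truth edited image is known by construction. The toy sketch below only places a start and goal on an empty grid; everything beyond that idea (the real pipeline renders images, handles full game rules, and emits CoT annotations) is elided, and all names here are hypothetical.

```python
import random

def make_maze_sample(size=5, seed=0):
    """Toy sketch of programmatic game-world sample generation: a
    seeded program picks the puzzle configuration and derives the
    editing instruction from it, so instruction and expected result
    are consistent by construction."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    start, goal = rng.sample(cells, 2)  # two distinct cells
    instruction = (f"Move the player from {start} to {goal} "
                   f"and redraw the board after the move.")
    return {"grid": size, "start": start, "goal": goal,
            "instruction": instruction}
```

Because generation is deterministic given the seed, quality assurance can re-run the program to verify logical correctness of every sample.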
Q1
1. What key innovation does UniREditBench introduce in its evaluation methodology compared to previous benchmarks?
Using only textual references for evaluation
Using dual-reference evaluation with both text and ground-truth images
Using only numerical metrics for evaluation
Q2
2. How many total samples and dimensions does UniREditBench contain?
1,000 samples across 5 primary dimensions
2,700 samples across 8 primary dimensions and 18 sub-dimensions
5,000 samples across 10 primary dimensions
Q3
3. What unique aspect of game-world scenarios does UniREditBench evaluate that previous benchmarks didn't cover?
Only basic game graphics quality
Only game character animations
Logical and strategic reasoning governed by human-defined rules

Paper 3

TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Published: 2025-11-03

Link: http://arxiv.org/pdf/2511.01833

1. 📘 Topic and Domain: A comprehensive benchmark called TIR-Bench for evaluating agentic thinking-with-images reasoning capabilities in multimodal large language models.
2. 💡 Previous Research and New Ideas: Previous visual-search benchmarks test only basic operations such as localization and cropping; this paper proposes a more comprehensive benchmark testing complex tool-based image manipulation and reasoning.
3. ❓ Problem: Current benchmarks fail to fully evaluate advanced visual reasoning capabilities like intelligently creating and operating tools to transform images for problem-solving.
4. 🛠️ Methods: The authors created a 13-task benchmark requiring tool use for image processing and evaluated 22 multimodal large language models, including open-source and proprietary models both with and without tool-use capabilities.
5. 📊 Results and Evaluation: TIR-Bench proved challenging, with the best model reaching only 46%; models with tool-use capabilities significantly outperformed standard models, and agentic fine-tuning was shown to be more effective than direct fine-tuning.
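Benchmark scores of this kind are typically averaged per task and then across tasks. The sketch below shows one plausible aggregation (macro-averaging binary judge verdicts, e.g. from a GPT-4o judge); this aggregation rule is an assumption, not taken from the paper.

```python
def macro_accuracy(results):
    """Sketch of per-task score aggregation: `results` maps each task
    name to a list of judge verdicts (1 = correct, 0 = wrong). Verdicts
    are averaged within each task, then macro-averaged so every task
    counts equally regardless of its sample count."""
    per_task = {task: sum(v) / len(v) for task, v in results.items()}
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall
```

Macro-averaging matters for a 13-task suite with uneven sample counts: without it, large tasks would dominate the headline number.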

Diagram summary (TIR-Bench methodology flow chart):

- Task design: 13 diverse tasks requiring tool-based reasoning, multi-step manipulation, and dynamic visual processing.
- Data collection: 1,215 total examples from human annotation, synthetic generation, and web sourcing.
- Model evaluation: 22 MLLMs tested (open-source models, proprietary models, and tool-using agents) in a zero-shot setting, with a GPT-4o judge and accuracy/IoU analysis.
- Analysis: overall performance, function calling, a fine-tuning study, and qualitative analysis.
- 13 task categories: color VQA, low-light VQA, instrument reading, jigsaw puzzle, math VQA, maze, rotated OCR, proportion VQA, rotation game, spot-the-difference, symbolic reasoning, visual search, word search.
- Key findings: best performance is 46% (o3-TU); non-agentic models stay below 29%; tool-use capability is essential for success; agentic fine-tuning outperforms direct SFT; function calling improves with guidance; recent models handle iterative tool calling better; TIR-Bench proves universally challenging, underscoring that thinking-with-images capability is crucial.
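The "thinking-with-images" behavior the benchmark targets can be sketched as an agent loop: the model may transform the image with tools (rotate, crop, zoom, ...) before committing to an answer. All interfaces below are hypothetical; this is a sketch of the control flow, not the paper's harness.

```python
def run_agent(question, image, tools, answer_fn, max_steps=4):
    """Toy agent loop: `answer_fn(question, image)` returns either
    ("answer", value) to stop, or (tool_name, argument) to transform
    the image and look again. `tools` maps tool names to functions of
    the form f(image, argument) -> new image."""
    for _ in range(max_steps):
        action, arg = answer_fn(question, image)
        if action == "answer":
            return arg
        image = tools[action](image, arg)   # apply the chosen tool
    return answer_fn(question, image)[1]    # forced answer at budget
```

Tasks like rotated OCR illustrate why the loop helps: a model that can first rotate the image sees upright text, while a non-agentic model must read it as-is.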
Q1
1. What was the main limitation of previous visual reasoning benchmarks that TIR-Bench aimed to address?
They only tested basic operations like localization and cropping
They were too computationally expensive to run
They only worked with black and white images
Q2
2. When comparing agentic fine-tuning versus direct fine-tuning on the rotated OCR task, what was discovered?
Direct fine-tuning performed better with more data
Agentic fine-tuning showed significantly better performance that scaled with data size
Both methods performed equally well
Q3
3. What was the highest accuracy achieved by any model on the TIR-Bench benchmark?
28.9% by Gemini-2.5-Pro
46% by o3-TU
67% by GPT-4