2025-04-11 Papers


Paper 1

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

Published: 2025-04-10

Link: http://arxiv.org/pdf/2504.07960

1. 📘 Topic and Domain: Universal image generation framework called VisualCloze that leverages visual in-context learning to handle diverse image generation tasks within a single model.
2. 💡 Previous Research and New Ideas: Builds on diffusion models and task-specific image generation approaches; proposes visual in-context learning, where the model learns tasks from visual demonstrations rather than relying solely on language instructions.
3. ❓ Problem: Addressing limitations of current image generation approaches that either require task-specific models or face challenges with task ambiguity, sparse task distributions, and lack of generalization to unseen tasks.
4. 🛠️ Methods: Creating a graph-structured dataset (Graph200K) with interrelated tasks, formulating image generation as an image infilling problem, and fine-tuning FLUX.1-Fill-dev to support visual in-context learning where tasks are demonstrated through examples.
5. 📊 Results and Evaluation: The model successfully handles various in-domain tasks with reduced ambiguity, generalizes to unseen tasks, enables task unification, and supports reverse generation, outperforming comparable methods in conditional generation, style transfer, and subject-driven image generation tasks.
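The infilling formulation in item 4 can be sketched in code. This is a minimal toy illustration using nested lists as stand-in "images", not the paper's actual FLUX-based implementation; the function name and shapes are assumptions for illustration only.

```python
# Minimal sketch of VisualCloze's "generation as infilling" setup:
# a sample is a grid of C in-context rows (each with L images) plus a
# query row whose last (target) cell is left blank, and a mask marks
# the region the infilling model must generate.

def build_grid(demos, query_conditions, image_shape=(2, 2)):
    """demos: list of C rows, each a list of L images.
    query_conditions: the L-1 condition images of the query row."""
    h, w = image_shape
    blank = [[0.0] * w for _ in range(h)]          # placeholder target
    grid = [list(row) for row in demos]
    grid.append(list(query_conditions) + [blank])  # query row + blank target
    mask = [[0] * len(row) for row in grid]        # 1 = region to infill
    mask[-1][-1] = 1
    return grid, mask

img = [[1.0, 1.0], [1.0, 1.0]]
grid, mask = build_grid(demos=[[img, img]], query_conditions=[img])
print(len(grid), len(grid[0]))  # 2 rows (1 demo + query), 2 images per row
print(mask)                     # [[0, 0], [0, 1]]
```

A real system would concatenate pixel tensors and pass the grid plus mask to the pre-trained infilling model; here only the layout logic is shown.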

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

VisualCloze Methodology Flowchart

Problem: task-specific models lack efficiency, while universal models face instruction, distribution, and architecture issues. Solution: the VisualCloze framework.

1. Visual In-Context Learning (VICL). Input format: C in-context examples (demos), each consisting of L images (conditions + target), plus one query with L-1 condition images and one blank target. Goal: learn the task from visual examples, not just text.
2. Unified Task as Infilling. Process: concatenate all input images into a grid, mask the target image region (M), and use an infilling model to generate the masked region. Objective: `X_hat = f(X_grid | T_layout, M)`. Benefit: aligns with pre-trained infilling models.
3. Graph200K Dataset. Built on Subjects200K; images are nodes and annotations are edges; covers 5 meta-tasks (conditional generation, editing, restoration, style, IP). Increases task density and overlap. Benefit: promotes learning transferable knowledge.
4. Model and Training. Base model: FLUX.1-Fill-dev (infilling). Fine-tuning: LoRA (rank 256, minimal changes). Training data: Graph200K plus others (VITON, etc.). Positional embedding: 3D-RoPE for varying aspect ratios. Benefit: leverages strong priors at low cost.

Key capabilities enabled by VisualCloze:
- Improved seen tasks: reduced ambiguity, better performance.
- Unseen task generalization: adapts to new tasks via VICL examples.
- Task unification: combines multiple sub-tasks into a single step.
- Reverse generation: infers conditions from the target image.
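The graph structure behind Graph200K can be sketched as follows. The node and annotation names below are illustrative stand-ins, not the dataset's actual schema; the point is only how shared nodes make many condition-to-target tasks overlap.

```python
# Toy sketch of a Graph200K-style structure: each image is a node, and
# its annotations (depth map, edge map, stylized version, ...) hang off
# it as edges. Any ordered pair of annotations of one image yields a
# condition -> target task, so tasks densely overlap on shared images.

from collections import defaultdict

graph = defaultdict(dict)  # image node -> {annotation_type: annotation id}

def add_annotation(image_id, ann_type, ann_id):
    graph[image_id][ann_type] = ann_id

add_annotation("img_001", "depth", "img_001_depth")
add_annotation("img_001", "canny", "img_001_canny")
add_annotation("img_001", "style", "img_001_style")

def derivable_tasks(image_id):
    """Each ordered pair of annotations gives one condition->target task."""
    anns = list(graph[image_id])
    return [(src, dst) for src in anns for dst in anns if src != dst]

print(derivable_tasks("img_001"))  # 6 tasks from just 3 annotations
```

With k annotation types per image, k*(k-1) ordered task pairs share the same node, which is the "task density and overlap" the summary credits with promoting transferable knowledge.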
Q1
1. What is the main innovation of VisualCloze compared to previous universal image generation approaches?
Using a larger and more diverse training dataset
Visual in-context learning instead of relying on language instructions
Developing a completely new diffusion model architecture
Q2
2. What problem does the Graph200K dataset address in the context of visual tasks?
The lack of high-quality training images
The sparsity and isolation of visual tasks that limits knowledge transfer
The computational complexity of training large generative models
Q3
3. Which of the following capabilities was NOT demonstrated by VisualCloze?
Generating frontal faces from side-view images (unseen task)
Reverse generation (inferring conditions from target images)
Real-time video generation with temporal consistency

Paper 2

MM-IFEngine: Towards Multimodal Instruction Following

Published: 2025-04-10

Link: http://arxiv.org/pdf/2504.07957


MM-IFEngine Workflow

MM-IFEngine: image-instruction pair generation.
1. Diverse image sources (CC3M, ALLaVA, UI, geometry, charts).
2. Step 1: image filtering (resolution, semantics).
3. Step 2: task generation (GPT-4o, or refining existing tasks).
4. Step 3: constraint integration (an LLM generates and validates constraints drawn from a pool of 32 types in 6 categories), yielding high-quality image-instruction pairs.

Dataset generation:
- Generate responses with InternVL2.5-78B, then post-process (filter) to obtain MM-IFInstruct-23k (for SFT).
- Generate rejected responses with Qwen2-VL-7B under degraded settings (remove 33/66/100% of constraints, or remove the image) to obtain MM-IFDPO-23k (for DPO).

MM-IFEval benchmark creation: human annotation plus LLM conflict checking yields 400 questions (300 compose-level + 100 perception-level).

MM-IFEval hybrid evaluation method:
- Rule-based verification for objective constraints (e.g., word count, format, numbers).
- LLM-based direct judgment for clear constraints (e.g., keyword mention).
- LLM-based comparative judgment for subjective constraints (e.g., tone, style, role-play).

Model training and evaluation: base MLLMs (e.g., LLaVA, Qwen2) are trained with SFT (on MM-IFInstruct-23k) and DPO (on MM-IFDPO-23k), then the fine-tuned MLLMs are evaluated on benchmarks (MM-IFEval, MIA, IFEval, VQA).
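The rule-based side of the hybrid evaluation can be sketched as simple programmatic checks. The constraint schema (dict keys and type names) below is an assumption for illustration, not the benchmark's actual format.

```python
# Sketch of rule-based verification for objective constraints such as
# word count or a required numbered-list format, in the spirit of
# MM-IFEval's hybrid evaluation. Subjective constraints (tone, style)
# would instead go to an LLM judge.

import re

def check_constraint(response, constraint):
    kind = constraint["type"]
    if kind == "max_words":
        return len(response.split()) <= constraint["limit"]
    if kind == "must_mention":
        return constraint["keyword"].lower() in response.lower()
    if kind == "numbered_list":
        # require at least N lines starting with "1.", "2.", ...
        items = re.findall(r"^\d+\.", response, re.MULTILINE)
        return len(items) >= constraint["items"]
    raise ValueError(f"unknown constraint type: {kind}")

resp = "1. The chart shows revenue.\n2. Growth is steady."
print(check_constraint(resp, {"type": "max_words", "limit": 20}))     # True
print(check_constraint(resp, {"type": "numbered_list", "items": 2}))  # True
```

Checks like these are deterministic and cheap, which is why benchmarks reserve LLM judges for constraints that rules cannot verify.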
Q1
1. What is the primary innovation of MM-IFEngine compared to existing instruction following benchmarks?
It uses only proprietary models for evaluation
It focuses exclusively on text-based constraints
It incorporates both compose-level and perception-level constraints with strong visual correlations
Q2
2. How many distinct constraint categories are included in MM-IFEval?
8 categories with an average of 2.6 constraints per question
32 categories with an average of 5.1 constraints per question
16 categories with an average of 3.5 constraints per question
Q3
3. What evaluation strategy does MM-IFEval use that makes it more precise than previous benchmarks?
It relies exclusively on GPT-4o for all evaluations
A hybrid approach combining rule-based verification and judge models
It uses only human evaluators to ensure accuracy

Paper 3

HoloPart: Generative 3D Part Amodal Segmentation

Published: 2025-04-10

Link: http://arxiv.org/pdf/2504.07943

1. 📘 Topic and Domain: The paper introduces "3D part amodal segmentation," a novel task in 3D computer vision that decomposes 3D shapes into complete semantic parts, even when parts are occluded.
2. 💡 Previous Research and New Ideas: The paper builds on existing 3D part segmentation techniques but extends beyond them by proposing a diffusion-based model (HoloPart) that can complete partial segments into full 3D parts, similar to how 2D amodal segmentation has evolved for images.
3. ❓ Problem: The paper solves the challenge of generating complete 3D parts from incomplete surface segments, addressing key difficulties in inferring occluded geometry, maintaining global shape consistency, and handling diverse shapes with limited training data.
4. 🛠️ Methods: The authors use a two-stage approach: first applying existing 3D part segmentation to obtain initial surface patches, then using their novel HoloPart diffusion model with local attention and context-aware attention mechanisms to complete these segments into full 3D parts.
5. 📊 Results and Evaluation: HoloPart significantly outperforms state-of-the-art shape completion methods on new benchmarks based on ABO and PartObjaverse-Tiny datasets, demonstrating superior performance in Chamfer Distance, IoU, and F-Score metrics, while enabling applications in geometry editing, animation, and material assignment.
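The two-stage approach in item 4 can be outlined as a small pipeline skeleton. Every function here is a stand-in stub with hypothetical names; the real system uses a SAMPart3D-style segmenter and a fine-tuned latent diffusion model, neither of which is reproduced here.

```python
# High-level skeleton of HoloPart's two-stage pipeline: segment first,
# then complete each incomplete surface patch into a full 3D part.

def segment_parts(shape):
    """Stage 1 (stub): existing 3D part segmentation yields incomplete
    surface patches; the real paper uses an off-the-shelf segmenter."""
    return [{"id": i, "surface": f"patch_{i}"} for i in range(3)]

def complete_part(patch, whole_shape):
    """Stage 2 (stub): diffusion-based completion conditioned on local
    detail (local attention) and the whole shape (context-aware attention)."""
    return {"id": patch["id"], "mesh": f"complete_{patch['id']}"}

def holopart(shape):
    patches = segment_parts(shape)                     # Stage 1
    return [complete_part(p, shape) for p in patches]  # Stage 2, per patch

parts = holopart("input_mesh")
print([p["mesh"] for p in parts])  # ['complete_0', 'complete_1', 'complete_2']
```

The structural point is that completion runs independently per segment while still conditioning on the whole shape, which is how the method balances local detail against global consistency.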


HoloPart Methodology: 3D Part Amodal Segmentation

Input: a 3D shape (mesh or point cloud).

Stage 1: Initial part segmentation. Apply an existing method (e.g., SAMPart3D) to obtain incomplete segments {si}, the whole shape (X), and a mask (M).

Stage 2: HoloPart part completion (for each segment si):
1. Attention encoding: context-aware attention over (S0, X, M) yields co; local attention over (S0, S) yields cl.
2. Part diffusion model (vθ), pretrained on whole objects and fine-tuned on parts, takes noise (ε), timestep (t), co, and cl, and iteratively denoises (with CFG) to a complete part latent (z_part).
3. Decoding and mesh extraction: the VAE decoder (D) produces an occupancy field, and Marching Cubes extracts the complete part (pi).

Output: the set of complete parts {p1, ..., pn}, i.e., the 3D part amodal segmentation.

Pretraining: the VAE and diffusion model are trained on a large dataset of WHOLE shapes to learn general 3D priors.

Data curation: ABO and Objaverse (filtered) are used to create whole-part pairs for fine-tuning HoloPart.
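The final decoding step (occupancy field to mesh) can be illustrated with a toy thresholding pass. The paper extracts a mesh with Marching Cubes; the sketch below stops at collecting occupied voxels, since a full Marching Cubes implementation is out of scope here.

```python
# Toy sketch of the decoding stage: a decoded occupancy field is
# thresholded into occupied voxels, which a mesh-extraction algorithm
# (Marching Cubes in the paper) would then turn into a surface.

def occupied_voxels(occupancy, threshold=0.5):
    """occupancy: 3-D nested list of values in [0, 1]; returns the
    (x, y, z) indices whose occupancy exceeds the threshold."""
    return [(x, y, z)
            for x, plane in enumerate(occupancy)
            for y, row in enumerate(plane)
            for z, v in enumerate(row)
            if v > threshold]

field = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.0, 0.0], [0.7, 0.3]]]
print(occupied_voxels(field))  # [(0, 0, 0), (0, 1, 1), (1, 1, 0)]
```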
Q1
1. What is the key innovation that distinguishes HoloPart from traditional 3D part segmentation methods?
It uses a larger training dataset with more diverse 3D shapes
It completes the geometry of occluded parts rather than just identifying visible surface patches
It performs segmentation in a single end-to-end process instead of using a two-stage approach
Q2
2. Which two key attention mechanisms does HoloPart incorporate to balance local details and global context?
Temporal attention and spatial attention
Cross-modal attention and self-supervised attention
Local attention and shape context-aware attention
Q3
3. What practical downstream application is NOT mentioned as a benefit of 3D part amodal segmentation in the paper?
Geometry editing and material assignment
Animation of individual parts
Facial recognition and biometric authentication