2025-09-22 Papers

Paper 1

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Published: 2025-09-19

Link: http://arxiv.org/pdf/2509.16197

1. 📘 Topic and Domain: A unified multimodal large language model called MANZANO that can both understand and generate visual content, operating in the domain of computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous unified multimodal models that struggled with performance trade-offs between understanding and generation capabilities, this paper proposes a novel hybrid vision tokenizer that uses a single shared encoder with specialized adapters for both tasks.
3. ❓ Problem: The paper addresses the conflict between visual tokenization methods in existing unified models, where discrete tokens work better for generation but continuous embeddings are superior for understanding tasks.
4. 🛠️ Methods: Implements a three-component architecture: a hybrid vision tokenizer (producing both continuous and discrete tokens), a unified LLM decoder (for text/image token prediction), and a diffusion-based image decoder (for pixel generation), trained through a three-stage process of pre-training, continued pre-training, and supervised fine-tuning.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance among unified models, with their 3B model matching or exceeding larger models' performance on understanding tasks while maintaining strong generation capabilities, and shows consistent improvements when scaled up to 30B parameters.
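The hybrid tokenizer idea in point 4 can be sketched in a few lines: one shared encoder feeds two adapters, where the continuous path passes embeddings straight to the LLM and the discrete path quantizes them against a codebook. This is a minimal pure-Python toy under my own assumptions; the encoder math, adapter logic, and codebook here are illustrative stand-ins, not the paper's actual implementation.

```python
import random

random.seed(0)

def shared_encoder(image_patches):
    # Stand-in for the shared ViT backbone: map each patch to a 3-dim
    # embedding (toy summary statistics instead of real attention layers).
    return [[sum(p) / len(p), max(p), min(p)] for p in image_patches]

def continuous_adapter(embeddings):
    # Understanding path: hand continuous embeddings to the LLM as-is.
    return embeddings

def discrete_adapter(embeddings, codebook):
    # Generation path: quantize each embedding to its nearest codebook index,
    # yielding discrete tokens the LLM can predict autoregressively.
    def nearest(e):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(e, codebook[i])))
    return [nearest(e) for e in embeddings]

patches = [[random.random() for _ in range(16)] for _ in range(4)]
codebook = [[random.random() for _ in range(3)] for _ in range(8)]

emb = shared_encoder(patches)
cont_tokens = continuous_adapter(emb)           # understanding branch
disc_tokens = discrete_adapter(emb, codebook)   # generation branch
```

Because both adapters read the same encoder output, the two token types live in one semantic space, which is the property the paper credits with reducing the understanding-generation conflict.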

Workflow diagram (recovered from figure text):
- Stage 1 – Hybrid tokenizer training: a ViT vision encoder with a continuous adapter and a discrete adapter, trained against a randomly initialized 300M LLM decoder.
- Stage 2 – Unified LLM training: pre-training on 1.6T tokens of understanding, generation, and text-only data; continued pre-training on 83B tokens; supervised fine-tuning. The unified LLM decoder scales from 300M to 30B parameters.
- Stage 3 – Image decoder training: DiT-Air architecture (0.9B-3.5B) trained with flow matching at resolutions from 256x256 up to 2048x2048.
- Inference pipeline: understanding tasks route an image input through continuous tokens to a text answer; generation tasks route a text input through discrete tokens to an image output.
- Key design principles: a unified semantic space (both adapters share the same encoder backbone); simplicity and scalability (standard autoregressive objective, decoupled components); minimal task conflict (the hybrid tokenizer narrows the understanding-generation gap); progressive scaling (300M to 30B LLM, 0.9B to 3.5B decoder).
- Results: SOTA on understanding tasks among unified models, competitive generation performance, especially strong on text-rich benchmarks (DocVQA, ChartQA, OCRBench).
Q1. What is the key innovation in MANZANO's architecture that helps resolve the conflict between understanding and generation tasks?
a) Using two completely separate vision encoders for each task
b) A hybrid vision tokenizer with a shared encoder and specialized adapters
c) Removing the tokenizer entirely and using raw pixel values
Q2. When scaling MANZANO from 3B to 30B parameters, what was observed?
a) Performance decreased due to overfitting
b) Only generation capabilities improved while understanding stayed the same
c) Consistent improvements across both understanding and generation tasks
Q3. How does MANZANO's training process differ from conventional approaches?
a) It uses a single-stage training process focused only on generation
b) It requires no pre-training and starts directly with fine-tuning
c) It employs a three-stage process: pre-training, continued pre-training, and supervised fine-tuning
Paper 2

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15130

1. 📘 Topic and Domain: The paper focuses on trajectory-controlled video generation using pre-trained video diffusion models, specifically in the domain of 3D/4D computer vision and generative AI.
2. 💡 Previous Research and New Ideas: Based on previous work in video diffusion models and trajectory control methods that required model retraining, this paper proposes a novel training-free approach that leverages existing model knowledge.
3. ❓ Problem: The paper addresses the challenge of achieving precise camera trajectory control in video generation while maintaining high visual quality, without requiring expensive model retraining or fine-tuning.
4. 🛠️ Methods: The authors develop a three-part framework called WorldForge that includes: Intra-Step Recursive Refinement (IRR) for step-wise trajectory guidance, Flow-Gated Latent Fusion (FLF) for separating motion from appearance features, and Dual-Path Self-Corrective Guidance (DSG) for maintaining visual quality.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance in both static 3D scene generation and dynamic 4D trajectory control, outperforming existing approaches in terms of FID, CLIP similarity, and trajectory accuracy metrics while requiring no additional training.
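The three components in point 4 can be illustrated with a toy denoising loop: an IRR-style step that re-injects the warped trajectory signal at every denoising step, and a DSG-style blend of guided and unguided predictions to suppress artifacts. Everything below (the decay-based `denoise_step`, the blend weight, the mask handling) is a hypothetical sketch of the idea, not WorldForge's actual algorithm.

```python
def denoise_step(latent, t):
    # Stand-in for one step of a frozen video diffusion model
    # (toy behavior: gradually shrink the latent toward zero).
    return [x * (1 - 1.0 / t) for x in latent]

def guided_denoise(latent, warped, mask, steps=10, blend=0.5):
    """Training-free trajectory guidance, sketched:
    - IRR-style: re-inject the warped trajectory values at every step
      in the regions covered by the warping mask,
    - DSG-style: average the guided path with an unguided one so the
      injected signal does not overwhelm the model's own prediction.
    """
    for t in range(steps, 0, -1):
        unguided = denoise_step(latent, t + 1)
        guided = [w if m else x for x, w, m in zip(unguided, warped, mask)]
        latent = [(1 - blend) * u + blend * g
                  for u, g in zip(unguided, guided)]
    return latent

latent0 = [1.0, 1.0, 1.0, 1.0]
warped = [0.0, 0.5, 0.0, 0.5]       # target values along the new trajectory
mask = [True, False, True, False]   # True = region constrained by warping
out = guided_denoise(latent0, warped, mask, steps=5)
```

The important structural point survives the toy: the diffusion model's weights are never updated, so the guidance is plug-and-play across backbones.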

Framework diagram (recovered from figure text):
- Input: a single image or video frame is lifted by a 3D vision foundation model into depth and camera poses, yielding static or dynamic point clouds; trajectory warping then produces warped frames plus masks.
- A frozen video diffusion model (Wan 2.1 / SVD) is steered by three components: Intra-Step Recursive Refinement (IRR) injects the trajectory at each denoising step; Flow-Gated Latent Fusion (FLF) decouples motion from appearance; Dual-Path Self-Corrective Guidance (DSG) suppresses artifacts by fusing guided and unguided predictions within a single denoising step.
- Applications: 3D scene generation (novel view synthesis), 4D trajectory control (dynamic re-rendering), and video effects (stabilization, editing, object manipulation).
- Key features: training-free inference, plug-and-play framework, model-agnostic design, precise trajectory control.
- Performance highlights: superior FID scores on LLFF, MipNeRF-360, and Tanks-and-Temples; best trajectory accuracy (ATE, RPE-T, RPE-R metrics); 360° view synthesis from a single image; compatible with Wan 2.1, SVD, and other VDMs.
Q1. What is the main innovation of WorldForge compared to previous approaches in trajectory-controlled video generation?
a) It uses a completely new video diffusion architecture
b) It achieves control without requiring any model retraining or fine-tuning
c) It generates higher-resolution videos than other methods
Q2. Which component of WorldForge is responsible for separating motion-related features from appearance-related features in the latent space?
a) Dual-Path Self-Corrective Guidance (DSG)
b) Intra-Step Recursive Refinement (IRR)
c) Flow-Gated Latent Fusion (FLF)
Q3. Which practical application is NOT mentioned as a use case for WorldForge in the paper?
a) Video stabilization and camera path smoothing
b) Real-time face animation and lip syncing
c) Object removal and replacement in videos
Paper 3

BaseReward: A Strong Baseline for Multimodal Reward Model

Published: 2025-09-19

Link: http://arxiv.org/pdf/2509.16127

1. 📘 Topic and Domain: The paper focuses on developing high-performance Multimodal Reward Models (MRMs) for aligning Multimodal Large Language Models with human preferences in the domain of AI alignment and multimodal machine learning.
2. 💡 Previous Research and New Ideas: Based on existing reward modeling approaches for text-only LLMs, the paper proposes a systematic guide for building MRMs by investigating every crucial component in the development pipeline, introducing BaseReward as a simple yet effective architecture.
3. ❓ Problem: The paper addresses the lack of a systematic guide for building state-of-the-art Multimodal Reward Models, which are crucial for aligning MLLMs with human preferences.
4. 🛠️ Methods: The authors conducted comprehensive experimental analyses of reward modeling paradigms, reward head architecture, training strategies, data curation, backbone model selection, and ensemble methods, ultimately developing BaseReward using a Qwen2.5-VL backbone with an optimized two-layer reward head.
5. 📊 Results and Evaluation: BaseReward established new state-of-the-art performance on major benchmarks, including MM-RLHF-Reward Bench (11% improvement), VL-Reward Bench (18% improvement), and showed consistent performance gains when integrated into reinforcement learning pipelines across various tasks.
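The reward head described in point 4 (a two-layer MLP with SiLU activation on top of the backbone's final hidden state) is simple enough to sketch directly, together with the standard Bradley-Terry pairwise loss used to train such preference models. This is a pure-Python toy with hypothetical placeholder weights and dimensions; the real model sits on a Qwen2.5-VL-7B backbone.

```python
import math

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def linear(x, weights, biases):
    # Dense layer: one output per (weight-row, bias) pair.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def reward_head(hidden, w1, b1, w2, b2):
    # Two-layer MLP reward head with SiLU, mapping the backbone's final
    # hidden state to a scalar reward (toy 4-dim input, 2-dim hidden).
    h = [silu(v) for v in linear(hidden, w1, b1)]
    return linear(h, w2, b2)[0]

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected),
    # written in a numerically direct form.
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# Hypothetical placeholder weights and hidden states.
w1 = [[0.1, -0.2, 0.3, 0.05], [0.0, 0.4, -0.1, 0.2]]
b1 = [0.0, 0.1]
w2 = [[0.5, -0.3]]
b2 = [0.0]

r_good = reward_head([1.0, 0.5, -0.2, 0.3], w1, b1, w2, b2)
r_bad = reward_head([-0.5, 0.2, 0.1, -0.4], w1, b1, w2, b2)
loss = preference_loss(r_good, r_bad)  # minimized when chosen > rejected
```

Training on preference pairs pushes the scalar reward of the chosen response above that of the rejected one, which is all the downstream RL pipeline (GRPO in the paper) needs from the model.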

Development pipeline diagram (recovered from figure text):
- Phase 1 – Systematic experimental analysis: reward modeling paradigms (Naive-RM, Critic-RM, Generative-RM); reward head architecture (2-layer MLP, SiLU activation); training strategy (standard loss, no regularization, 3e-6 learning rate); data (2.8M multimodal plus text-only pairs); backbone selection (Qwen2.5-VL, 7B parameters, scale analysis); ensembling (multiple models, simple averaging, diversity boost). Key finding: text-only data significantly enhances multimodal reward modeling performance.
- Phase 2 – BaseReward implementation: Qwen2.5-VL-7B backbone; 2-layer MLP reward head with SiLU; 7 curated datasets totaling 2.8M preference pairs (multimodal plus text-only); learning rate 3e-6, batch size 128, trained on 64 H100 GPUs; ensemble of Qwen2.5-VL and Qwen2-VL by simple averaging.
- Phase 3 – Evaluation and results: +11.9% on MM-RLHF-Reward Bench and +18% on VL-Reward Bench; outperforms Claude 3.7 Sonnet; best open-source MRM; consistent improvements validated with the GRPO algorithm.
- Phase 4 – Practical application and deployment: GRPO optimization with 8 rollouts per prompt; a hybrid reward (rule-based checks plus BaseReward) gives the best performance; gains on perception, reasoning, and conversational tasks; fast inference with low overhead.
Q1. What surprising finding did the researchers discover about text-only data in multimodal reward modeling?
a) Text-only data was completely ineffective for multimodal tasks
b) Text-only data significantly enhanced multimodal judgment, especially in safety and mathematics
c) Text-only data only worked when combined with video data
Q2. Which reward head configuration proved most effective in the BaseReward model?
a) A single-layer linear head with ReLU activation
b) A five-layer deep network with Tanh activation
c) A two-layer MLP with SiLU activation
Q3. When integrating BaseReward into reinforcement learning, which reward scheme showed the best overall performance?
a) A pure rule-based reward system
b) BaseReward-only scoring
c) A hybrid approach combining rule-based checks with BaseReward scoring