2025-09-22 Papers

Paper 1

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Published: 2025-09-19

Link: http://arxiv.org/pdf/2509.16197

1. 📘 Topic and Domain: A unified multimodal large language model called MANZANO that can both understand and generate visual content, operating in the domain of computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous unified multimodal models that struggled with performance trade-offs between understanding and generation capabilities, this paper proposes a novel hybrid vision tokenizer that uses a single shared encoder with specialized adapters for both tasks.
3. ❓ Problem: The paper addresses the conflict between visual tokenization methods in existing unified models, where discrete tokens work better for generation but continuous embeddings are superior for understanding tasks.
4. 🛠️ Methods: Implements a three-component architecture: a hybrid vision tokenizer (producing both continuous and discrete tokens), a unified LLM decoder (for text/image token prediction), and a diffusion-based image decoder (for pixel generation), trained through a three-stage process of pre-training, continued pre-training, and supervised fine-tuning.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance among unified models, with their 3B model matching or exceeding larger models' performance on understanding tasks while maintaining strong generation capabilities, and shows consistent improvements when scaled up to 30B parameters.
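The hybrid tokenizer idea in point 4 can be sketched in a few lines: one shared encoder feeds two adapters, where the continuous path passes embeddings straight to the LLM and the discrete path quantizes them against a codebook. This is a minimal pure-Python toy under my own assumptions; the encoder math, adapter logic, and codebook here are illustrative stand-ins, not the paper's actual implementation.

```python
import random

random.seed(0)

def shared_encoder(image_patches):
    # Stand-in for the shared ViT backbone: map each patch to a 3-dim
    # embedding (toy summary statistics instead of real attention layers).
    return [[sum(p) / len(p), max(p), min(p)] for p in image_patches]

def continuous_adapter(embeddings):
    # Understanding path: hand continuous embeddings to the LLM as-is.
    return embeddings

def discrete_adapter(embeddings, codebook):
    # Generation path: quantize each embedding to its nearest codebook index,
    # yielding discrete tokens the LLM can predict autoregressively.
    def nearest(e):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(e, codebook[i])))
    return [nearest(e) for e in embeddings]

patches = [[random.random() for _ in range(16)] for _ in range(4)]
codebook = [[random.random() for _ in range(3)] for _ in range(8)]

emb = shared_encoder(patches)
cont_tokens = continuous_adapter(emb)           # understanding branch
disc_tokens = discrete_adapter(emb, codebook)   # generation branch
```

Because both adapters read the same encoder output, the two token types live in one semantic space, which is the property the paper credits with reducing the understanding-generation conflict.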

Workflow diagram (recovered from figure text):
- Stage 1 – Hybrid tokenizer training: a ViT vision encoder with a continuous adapter and a discrete adapter, trained against a randomly initialized 300M LLM decoder.
- Stage 2 – Unified LLM training: pre-training on 1.6T tokens of understanding, generation, and text-only data; continued pre-training on 83B tokens; supervised fine-tuning. The unified LLM decoder scales from 300M to 30B parameters.
- Stage 3 – Image decoder training: DiT-Air architecture (0.9B-3.5B) trained with flow matching at resolutions from 256x256 up to 2048x2048.
- Inference pipeline: understanding tasks route an image input through continuous tokens to a text answer; generation tasks route a text input through discrete tokens to an image output.
- Key design principles: a unified semantic space (both adapters share the same encoder backbone); simplicity and scalability (standard autoregressive objective, decoupled components); minimal task conflict (the hybrid tokenizer narrows the understanding-generation gap); progressive scaling (300M to 30B LLM, 0.9B to 3.5B decoder).
- Results: SOTA on understanding tasks among unified models, competitive generation performance, especially strong on text-rich benchmarks (DocVQA, ChartQA, OCRBench).
Q1. What is the key innovation in MANZANO's architecture that helps resolve the conflict between understanding and generation tasks?
a) Using two completely separate vision encoders for each task
b) A hybrid vision tokenizer with a shared encoder and specialized adapters
c) Removing the tokenizer entirely and using raw pixel values
Q2. When scaling MANZANO from 3B to 30B parameters, what was observed?
a) Performance decreased due to overfitting
b) Only generation capabilities improved while understanding stayed the same
c) Consistent improvements across both understanding and generation tasks
Q3. How does MANZANO's training process differ from conventional approaches?
a) It uses a single-stage training process focused only on generation
b) It requires no pre-training and starts directly with fine-tuning
c) It employs a three-stage process: pre-training, continued pre-training, and supervised fine-tuning
Paper 2

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15130

1. 📘 Topic and Domain: The paper focuses on trajectory-controlled video generation using pre-trained video diffusion models, specifically in the domain of 3D/4D computer vision and generative AI.
2. 💡 Previous Research and New Ideas: Based on previous work in video diffusion models and trajectory control methods that required model retraining, this paper proposes a novel training-free approach that leverages existing model knowledge.
3. ❓ Problem: The paper addresses the challenge of achieving precise camera trajectory control in video generation while maintaining high visual quality, without requiring expensive model retraining or fine-tuning.
4. 🛠️ Methods: The authors develop a three-part framework called WorldForge that includes: Intra-Step Recursive Refinement (IRR) for step-wise trajectory guidance, Flow-Gated Latent Fusion (FLF) for separating motion from appearance features, and Dual-Path Self-Corrective Guidance (DSG) for maintaining visual quality.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance in both static 3D scene generation and dynamic 4D trajectory control, outperforming existing approaches in terms of FID, CLIP similarity, and trajectory accuracy metrics while requiring no additional training.
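The three components in point 4 can be illustrated with a toy denoising loop: an IRR-style step that re-injects the warped trajectory signal at every denoising step, and a DSG-style blend of guided and unguided predictions to suppress artifacts. Everything below (the decay-based `denoise_step`, the blend weight, the mask handling) is a hypothetical sketch of the idea, not WorldForge's actual algorithm.

```python
def denoise_step(latent, t):
    # Stand-in for one step of a frozen video diffusion model
    # (toy behavior: gradually shrink the latent toward zero).
    return [x * (1 - 1.0 / t) for x in latent]

def guided_denoise(latent, warped, mask, steps=10, blend=0.5):
    """Training-free trajectory guidance, sketched:
    - IRR-style: re-inject the warped trajectory values at every step
      in the regions covered by the warping mask,
    - DSG-style: average the guided path with an unguided one so the
      injected signal does not overwhelm the model's own prediction.
    """
    for t in range(steps, 0, -1):
        unguided = denoise_step(latent, t + 1)
        guided = [w if m else x for x, w, m in zip(unguided, warped, mask)]
        latent = [(1 - blend) * u + blend * g
                  for u, g in zip(unguided, guided)]
    return latent

latent0 = [1.0, 1.0, 1.0, 1.0]
warped = [0.0, 0.5, 0.0, 0.5]       # target values along the new trajectory
mask = [True, False, True, False]   # True = region constrained by warping
out = guided_denoise(latent0, warped, mask, steps=5)
```

The important structural point survives the toy: the diffusion model's weights are never updated, so the guidance is plug-and-play across backbones.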

Framework diagram (recovered from figure text):
- Input: a single image or video frame is lifted by a 3D vision foundation model into depth and camera poses, yielding static or dynamic point clouds; trajectory warping then produces warped frames plus masks.
- A frozen video diffusion model (Wan 2.1 / SVD) is steered by three components: Intra-Step Recursive Refinement (IRR) injects the trajectory at each denoising step; Flow-Gated Latent Fusion (FLF) decouples motion from appearance; Dual-Path Self-Corrective Guidance (DSG) suppresses artifacts by fusing guided and unguided predictions within a single denoising step.
- Applications: 3D scene generation (novel view synthesis), 4D trajectory control (dynamic re-rendering), and video effects (stabilization, editing, object manipulation).
- Key features: training-free inference, plug-and-play framework, model-agnostic design, precise trajectory control.
- Performance highlights: superior FID scores on LLFF, MipNeRF-360, and Tanks-and-Temples; best trajectory accuracy (ATE, RPE-T, RPE-R metrics); 360° view synthesis from a single image; compatible with Wan 2.1, SVD, and other VDMs.
Q1. What is the main innovation of WorldForge compared to previous approaches in trajectory-controlled video generation?
a) It uses a completely new video diffusion architecture
b) It achieves control without requiring any model retraining or fine-tuning
c) It generates higher-resolution videos than other methods
Q2. Which component of WorldForge is responsible for separating motion-related features from appearance-related features in the latent space?
a) Dual-Path Self-Corrective Guidance (DSG)
b) Intra-Step Recursive Refinement (IRR)
c) Flow-Gated Latent Fusion (FLF)
Q3. Which practical application is NOT mentioned as a use case for WorldForge in the paper?
a) Video stabilization and camera path smoothing
b) Real-time face animation and lip syncing
c) Object removal and replacement in videos
Paper 3

BaseReward: A Strong Baseline for Multimodal Reward Model

Published: 2025-09-19

Link: http://arxiv.org/pdf/2509.16127

1. 📘 Topic and Domain: The paper focuses on developing high-performance Multimodal Reward Models (MRMs) for aligning Multimodal Large Language Models with human preferences in the domain of AI alignment and multimodal machine learning.
2. 💡 Previous Research and New Ideas: Based on existing reward modeling approaches for text-only LLMs, the paper proposes a systematic guide for building MRMs by investigating every crucial component in the development pipeline, introducing BaseReward as a simple yet effective architecture.
3. ❓ Problem: The paper addresses the lack of a systematic guide for building state-of-the-art Multimodal Reward Models, which are crucial for aligning MLLMs with human preferences.
4. 🛠️ Methods: The authors conducted comprehensive experimental analyses of reward modeling paradigms, reward head architecture, training strategies, data curation, backbone model selection, and ensemble methods, ultimately developing BaseReward using a Qwen2.5-VL backbone with an optimized two-layer reward head.
5. 📊 Results and Evaluation: BaseReward established new state-of-the-art performance on major benchmarks, including MM-RLHF-Reward Bench (11% improvement), VL-Reward Bench (18% improvement), and showed consistent performance gains when integrated into reinforcement learning pipelines across various tasks.
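The reward head described in point 4 (a two-layer MLP with SiLU activation on top of the backbone's final hidden state) is simple enough to sketch directly, together with the standard Bradley-Terry pairwise loss used to train such preference models. This is a pure-Python toy with hypothetical placeholder weights and dimensions; the real model sits on a Qwen2.5-VL-7B backbone.

```python
import math

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def linear(x, weights, biases):
    # Dense layer: one output per (weight-row, bias) pair.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def reward_head(hidden, w1, b1, w2, b2):
    # Two-layer MLP reward head with SiLU, mapping the backbone's final
    # hidden state to a scalar reward (toy 4-dim input, 2-dim hidden).
    h = [silu(v) for v in linear(hidden, w1, b1)]
    return linear(h, w2, b2)[0]

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected),
    # written in a numerically direct form.
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# Hypothetical placeholder weights and hidden states.
w1 = [[0.1, -0.2, 0.3, 0.05], [0.0, 0.4, -0.1, 0.2]]
b1 = [0.0, 0.1]
w2 = [[0.5, -0.3]]
b2 = [0.0]

r_good = reward_head([1.0, 0.5, -0.2, 0.3], w1, b1, w2, b2)
r_bad = reward_head([-0.5, 0.2, 0.1, -0.4], w1, b1, w2, b2)
loss = preference_loss(r_good, r_bad)  # minimized when chosen > rejected
```

Training on preference pairs pushes the scalar reward of the chosen response above that of the rejected one, which is all the downstream RL pipeline (GRPO in the paper) needs from the model.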

Development pipeline diagram (recovered from figure text):
- Phase 1 – Systematic experimental analysis: reward modeling paradigms (Naive-RM, Critic-RM, Generative-RM); reward head architecture (2-layer MLP, SiLU activation); training strategy (standard loss, no regularization, 3e-6 learning rate); data (2.8M multimodal plus text-only pairs); backbone selection (Qwen2.5-VL, 7B parameters, scale analysis); ensembling (multiple models, simple averaging, diversity boost). Key finding: text-only data significantly enhances multimodal reward modeling performance.
- Phase 2 – BaseReward implementation: Qwen2.5-VL-7B backbone; 2-layer MLP reward head with SiLU; 7 curated datasets totaling 2.8M preference pairs (multimodal plus text-only); learning rate 3e-6, batch size 128, trained on 64 H100 GPUs; ensemble of Qwen2.5-VL and Qwen2-VL by simple averaging.
- Phase 3 – Evaluation and results: +11.9% on MM-RLHF-Reward Bench and +18% on VL-Reward Bench; outperforms Claude 3.7 Sonnet; best open-source MRM; consistent improvements validated with the GRPO algorithm.
- Phase 4 – Practical application and deployment: GRPO optimization with 8 rollouts per prompt; a hybrid reward (rule-based checks plus BaseReward) gives the best performance; gains on perception, reasoning, and conversational tasks; fast inference with low overhead.
Q1. What surprising finding did the researchers discover about text-only data in multimodal reward modeling?
a) Text-only data was completely ineffective for multimodal tasks
b) Text-only data significantly enhanced multimodal judgment, especially in safety and mathematics
c) Text-only data only worked when combined with video data
Q2. Which reward head configuration proved most effective in the BaseReward model?
a) A single-layer linear head with ReLU activation
b) A five-layer deep network with Tanh activation
c) A two-layer MLP with SiLU activation
Q3. When integrating BaseReward into reinforcement learning, which reward scheme showed the best overall performance?
a) A pure rule-based reward system
b) BaseReward-only scoring
c) A hybrid approach combining rule-based checks with BaseReward scoring