2025-11-20 Papers

Paper 1

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.14993

1. 📘 Topic and Domain: The paper introduces Kandinsky 5.0, a family of foundation models for high-resolution image and video generation, consisting of three core models: Image Lite (6B parameters), Video Lite (2B parameters), and Video Pro (19B parameters).
2. 💡 Previous Research and New Ideas: Building on prior diffusion models and flow-matching approaches, the paper proposes new architectural optimizations, including the CrossDiT (Cross-Attention Diffusion Transformer) backbone and the NABLA (Neighborhood Adaptive Block-Level Attention) mechanism for efficient video generation.
3. ❓ Problem: The paper addresses the challenges of creating high-quality, consistent, and controllable video generation while maintaining computational efficiency and reducing the complexity of attention mechanisms for long video sequences.
4. 🛠️ Methods: The paper implements a multi-stage training pipeline including pre-training, supervised fine-tuning, distillation, and RL-based post-training, along with comprehensive data processing and curation methods. It also introduces optimizations for VAE encoding, memory efficiency, and inference speed.
5. 📊 Results and Evaluation: Through human side-by-side evaluations, the models demonstrated superior or competitive performance against leading models like Sora, Veo, and Wan across key metrics including visual quality, motion dynamics, and prompt adherence. The NABLA mechanism achieved 2.7× reduction in training and inference time while maintaining quality.
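
The NABLA mechanism above is described as dynamically constructing content-aware sparse attention masks at the block level. A minimal NumPy sketch of that idea, assuming mean-pooled block scores and a top-fraction threshold (the paper's actual selection rule may differ):

```python
import numpy as np

def block_sparse_mask(q, k, block=4, keep_ratio=0.1):
    """Illustrative block-level sparse attention mask: pool tokens into
    blocks, score block pairs coarsely, and keep only the top fraction,
    so full attention runs on ~keep_ratio of all block pairs."""
    nq, d = q.shape
    nk, _ = k.shape
    qb = q.reshape(nq // block, block, d).mean(axis=1)  # pooled queries
    kb = k.reshape(nk // block, block, d).mean(axis=1)  # pooled keys
    scores = qb @ kb.T / np.sqrt(d)                     # coarse block scores
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    thresh = np.sort(scores, axis=None)[-n_keep]        # top-n_keep cutoff
    return scores >= thresh                             # boolean block mask

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 8))
k = rng.normal(size=(32, 8))
mask = block_sparse_mask(q, k, block=4, keep_ratio=0.1)
sparsity = 1.0 - mask.mean()
print(f"block-mask sparsity: {sparsity:.2f}")
```

With keep_ratio near 0.1 this matches the ~90% sparsity figure quoted for NABLA; in the real model, dense attention would then be evaluated only inside the surviving blocks.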

Pipeline overview (recovered from the paper's infographic):

• Data processing: T2I dataset (500M images), T2V dataset (250M videos), I2I editing dataset, SFT quality dataset, Russian cultural dataset
• Architecture: CrossDiT backbone, NABLA attention, flow matching, Qwen2.5-VL text encoder, HunyuanVideo VAE
• Training stages: 1) multi-resolution pre-training; 2) domain-specific SFT + model souping; 3) distillation (CFG → TSCD → adversarial); 4) RL-based post-training (images)
• Optimizations: VAE acceleration (2.5×), NABLA sparse attention, memory optimization, torch.compile, MagCache
• Model family:
  • Image Lite (6B): text-to-image, image editing, high resolution, RL post-training
  • Video Lite (2B): text-to-video, image-to-video, 10-second clips, Flash variant
  • Video Pro (19B): high-quality T2V, superior dynamics, up to 1408p, Flash variant
• Human evaluation: superior visual quality and motion dynamics vs. Sora, Veo, and Wan models; competitive prompt following; state-of-the-art open-source performance
• Key innovations: NABLA (2.7× speedup at 90% sparsity), flow matching + CrossDiT architecture, model souping for domain expertise, multi-stage distillation (100 → 16 NFEs), reward-based RL fine-tuning, comprehensive data curation pipeline
• Open-source release: MIT license; code, weights, and training checkpoints
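
Kandinsky 5.0 trains its diffusion transformer with flow matching. In the rectified-flow form, the model regresses a velocity field onto x1 − x0 along the straight path between noise and data. A toy 1-D NumPy sketch, with a closed-form linear regressor standing in for the neural velocity network (the distributions and the linear "model" are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D flow matching: transport noise x0 ~ N(0, 1) to "data" x1 ~ N(3, 0.1).
n = 4096
x0 = rng.normal(0.0, 1.0, size=n)
x1 = rng.normal(3.0, 0.1, size=n)
t = rng.uniform(0.0, 1.0, size=n)

# Straight interpolation path and its velocity target (rectified flow):
#   x_t = (1 - t) * x0 + t * x1,   target v = x1 - x0
xt = (1.0 - t) * x0 + t * x1
target = x1 - x0

# "Model": v(x, t) = w0*x + w1*t + w2, fitted in closed form as a
# stand-in for the neural velocity network a real system would train.
feats = np.stack([xt, t, np.ones(n)], axis=1)
w, *_ = np.linalg.lstsq(feats, target, rcond=None)

mse = float(np.mean((feats @ w - target) ** 2))
baseline = float(np.mean(target ** 2))  # loss of the all-zero model
print(f"flow-matching MSE: {mse:.3f} (vs {baseline:.3f} for v = 0)")
```

At inference, samples are generated by integrating dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data); the distillation stages in the pipeline above reduce how many such integration steps (NFEs) are needed.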
Q1
1. What is the main innovation of the NABLA mechanism introduced in Kandinsky 5.0?
It reduces training time by compressing video data before processing
It dynamically constructs content-aware sparse attention masks for efficient video processing
It eliminates the need for attention mechanisms entirely in video generation
Q2
2. How did the researchers handle the challenge of evaluating video generation quality?
They relied solely on automated metrics like FID scores
They used only internal testing by the development team
They conducted human side-by-side evaluations comparing with other models like Sora and Veo
Q3
3. What unique approach did Kandinsky 5.0 take in its training pipeline?
It used only pre-training on a single large dataset
It employed a multi-stage approach including pre-training, SFT, distillation, and RL-based post-training
It focused exclusively on adversarial training methods

Paper 2

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.15065

1. 📘 Topic and Domain: Evaluation of video models' spatial reasoning abilities through maze-solving tasks in the computer vision and artificial intelligence domain.
2. 💡 Previous Research and New Ideas: Building on prior research in text-based reasoning (Chain-of-Thought) and video generation models, the paper proposes a novel "reasoning via video" paradigm in which reasoning emerges through video frame generation rather than text generation.
3. ❓ Problem: Addresses the lack of comprehensive benchmarks for evaluating video models' reasoning capabilities and investigates whether video models can perform complex spatial reasoning tasks.
4. 🛠️ Methods: Created VR-Bench, a benchmark with 7,920 procedurally generated maze videos across five maze types, evaluated models through path matching metrics and rule compliance, and used supervised fine-tuning on video models.
5. 📊 Results and Evaluation: Fine-tuned video models outperformed vision-language models, showing 10-20% performance improvement through test-time scaling, strong generalization across different maze types and textures, and superior spatial reasoning capabilities.
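
The test-time-scaling gains above are analyzed via Pass@K. The summary does not give the paper's exact formula, so here is the standard unbiased pass@k estimator popularized by code-generation benchmarks, which measures the chance that at least one of k samples drawn from n generations is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given that c of
    the n generations are correct."""
    if n - c < k:
        return 1.0  # too few incorrect generations to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 correct generations out of 10, more sampled attempts help a lot:
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

Diverse sampling at inference improves the chance that some generated video solves the maze, which is consistent with the 10-20% improvement reported above.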

VR-Bench overview (recovered from the paper's workflow infographic):

• Dataset construction: 7,920 generated maze videos across five maze types (regular, irregular, 3D maze, Trapfield, Sokoban), with varied difficulty levels, textures, and visual styles
• Model training: supervised fine-tuning (SFT) of Wan2.2-TI2V-5B into Wan-R1
• Evaluation framework: path-matching metrics (EM, SR, PR, SD) and rule-compliance metrics (VLM-Score, MF); "chain of frame" temporal reasoning
• Models evaluated: 3 VLMs (Gemini, GPT-5, Qwen), 6 closed video models (Veo, Sora-2, Kling, etc.), 2 open video models (Wan2.5, Wan2.2)
• Experimental analysis: baseline comparison, test-time scaling (Pass@K analysis), and generalization tests across difficulty, texture, and maze type
• Key findings: video reasoning beats text-based reasoning on spatial perception; test-time scaling yields a 10-20% improvement; generalization across difficulty, texture, and maze type is strong; SFT elicits reasoning ability; Wan-R1 sets state-of-the-art results
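
The path-matching metrics (EM, PR, and the rest) are only named in the summary, not defined, so the sketch below uses plausible definitions that are assumptions: exact path equality for EM, and cell-level precision of the predicted path for PR.

```python
def exact_match(pred, gold):
    """EM (assumed definition): 1 if the predicted path exactly equals
    the ground-truth path, else 0."""
    return int(pred == gold)

def path_precision(pred, gold):
    """PR (assumed definition): fraction of predicted cells that lie on
    the ground-truth path."""
    if not pred:
        return 0.0
    gold_set = set(gold)
    return sum(cell in gold_set for cell in pred) / len(pred)

# Paths as sequences of (row, col) maze cells:
gold = [(0, 0), (0, 1), (1, 1), (1, 2)]
pred = [(0, 0), (1, 0), (1, 1), (1, 2)]
print(exact_match(pred, gold))     # 0
print(path_precision(pred, gold))  # 0.75
```

In VR-Bench these path scores are complemented by rule-compliance checks (VLM-Score, MF) that verify the generated video obeys the maze's movement rules, not just its endpoint.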
Q1
1. What is the key innovation in the paper's reasoning approach compared to traditional methods?
Using text-based chain-of-thought reasoning
Implementing reasoning through sequential video frame generation
Applying reinforcement learning to maze solving
Q2
2. What unique phenomenon did researchers discover about video models' performance during testing?
Models performed better in 3D mazes than 2D mazes
Performance decreased with larger sampling sizes
Diverse sampling during inference improved reasoning reliability by 10-20%
Q3
3. Which of the following best describes VR-Bench's evaluation approach?
Only evaluates maze completion success rates
Focuses exclusively on visual quality metrics
Uses both path matching metrics and rule compliance evaluation

Paper 3

VisPlay: Self-Evolving Vision-Language Models from Images

Published: 2025-11-19

Link: http://arxiv.org/pdf/2511.15661

1. 📘 Topic and Domain: The paper presents VisPlay, a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that improves visual reasoning capabilities using unlabeled images.
2. 💡 Previous Research and New Ideas: Based on previous research in self-evolving language models and reinforcement learning for VLMs, the paper proposes a novel framework that enables VLMs to autonomously improve without human-annotated data through self-play between two roles.
3. ❓ Problem: The paper addresses the limitation of current VLM training approaches that rely heavily on costly human-annotated labels and task-specific heuristics for defining rewards.
4. 🛠️ Methods: The paper implements a self-play framework where a single base VLM alternates between two roles - an Image-Conditioned Questioner that generates challenging questions and a Multimodal Reasoner that produces answers, jointly trained using Group Relative Policy Optimization (GRPO).
5. 📊 Results and Evaluation: The framework achieved consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks when tested on three models from the Qwen2.5-VL and MiMo-VL families, showing its effectiveness in enhancing VLM capabilities without human supervision.
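
Both roles are trained with GRPO, whose defining step is to standardize each sampled response's reward against its own group, removing the need for a learned value baseline. A minimal sketch (the reward values are illustrative):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: standardize each reward in
    a group of sampled responses against the group's own mean and std,
    so no learned value function is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One question, four sampled responses, binary correctness rewards:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # above-mean responses get positive advantage
```

These advantages then weight the policy-gradient update for whichever role (Questioner or Reasoner) produced the group.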

Framework overview (recovered from the paper's diagram):

• Input: raw, unlabeled images fed to a single base VLM that alternates between two roles
• Image-Conditioned Questioner: generates challenging questions, rewarded via an uncertainty reward, diversity regularization, and a format constraint
• Multimodal Reasoner: produces "silver" responses; pseudo-labels are derived by majority voting, confidence scoring, and informative filtering
• Joint training: both roles are updated with Group Relative Policy Optimization (GRPO) in an iterative co-evolution loop
• Algorithm steps: 1) sample question groups; 2) compute confidence scores; 3) calculate rewards; 4) update via GRPO; 5) generate a curated dataset
• Reported gains: visual reasoning, compositional generalization, hallucination reduction, cross-domain adaptation
• Key innovation: self-evolution from raw images alone, without human supervision, through autonomous question generation and reasoning improvement
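
The pseudo-labeling and uncertainty-reward steps described above can be sketched as follows: majority voting over the Reasoner's sampled answers yields both a silver label and a confidence score, and the Questioner is rewarded for questions of intermediate difficulty. The exact reward shape below is an assumption, not the paper's formula:

```python
from collections import Counter

def majority_label(answers):
    """Pseudo-label by majority vote over sampled Reasoner answers,
    returning the winning answer and its vote share as confidence."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def uncertainty_reward(confidence):
    """Assumed reward shape: favor questions the Reasoner finds neither
    trivial (confidence near 1) nor hopeless (confidence near chance),
    peaking at confidence = 0.5."""
    return 1.0 - abs(confidence - 0.5) * 2.0

# Eight sampled answers to one generated question:
answers = ["cat", "cat", "dog", "cat", "bird", "dog", "cat", "dog"]
label, conf = majority_label(answers)
print(label, conf)               # cat 0.5
print(uncertainty_reward(conf))  # 1.0
```

Informative filtering would then keep only question-answer pairs whose confidence clears a threshold, producing the curated dataset for the next self-evolution round.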
Q1
1. What is the main innovation of VisPlay compared to traditional VLM training approaches?
It uses pre-trained language models to generate questions
It enables self-improvement without human-annotated data through role-based self-play
It combines multiple existing VLMs to enhance performance
Q2
2. How does the Image-Conditioned Questioner evaluate the difficulty of generated questions?
By comparing with human-rated difficulty scores
By measuring the length and complexity of questions
By analyzing the uncertainty in the Multimodal Reasoner's responses
Q3
3. Which of the following best describes the co-evolution process in VisPlay?
The model alternates between generating easier and harder questions randomly
Two separate models compete against each other to improve performance
A single base model switches between questioning and reasoning roles while progressively improving both capabilities