2025-11-20 Papers

Paper 1

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.14993

1. 📘 Topic and Domain: The paper introduces Kandinsky 5.0, a family of foundation models for high-resolution image and video generation, consisting of three core models: Image Lite (6B parameters), Video Lite (2B parameters), and Video Pro (19B parameters).
2. 💡 Previous Research and New Ideas: Building on prior diffusion models and flow-matching approaches, the paper proposes new architectural optimizations, including the CrossDiT (Cross-Attention Diffusion Transformer) backbone and the NABLA (Neighborhood Adaptive Block-Level Attention) mechanism for efficient video generation.
3. ❓ Problem: The paper addresses the challenges of creating high-quality, consistent, and controllable video generation while maintaining computational efficiency and reducing the complexity of attention mechanisms for long video sequences.
4. 🛠️ Methods: The paper implements a multi-stage training pipeline including pre-training, supervised fine-tuning, distillation, and RL-based post-training, along with comprehensive data processing and curation methods. It also introduces optimizations for VAE encoding, memory efficiency, and inference speed.
5. 📊 Results and Evaluation: Through human side-by-side evaluations, the models demonstrated superior or competitive performance against leading models like Sora, Veo, and Wan across key metrics including visual quality, motion dynamics, and prompt adherence. The NABLA mechanism achieved 2.7× reduction in training and inference time while maintaining quality.
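
The NABLA mechanism above is described as dynamically constructing content-aware sparse attention masks at the block level. A minimal NumPy sketch of that idea, assuming mean-pooled block scores and a top-fraction threshold (the paper's actual selection rule may differ):

```python
import numpy as np

def block_sparse_mask(q, k, block=4, keep_ratio=0.1):
    """Illustrative block-level sparse attention mask: pool tokens into
    blocks, score block pairs coarsely, and keep only the top fraction,
    so full attention runs on ~keep_ratio of all block pairs."""
    nq, d = q.shape
    nk, _ = k.shape
    qb = q.reshape(nq // block, block, d).mean(axis=1)  # pooled queries
    kb = k.reshape(nk // block, block, d).mean(axis=1)  # pooled keys
    scores = qb @ kb.T / np.sqrt(d)                     # coarse block scores
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    thresh = np.sort(scores, axis=None)[-n_keep]        # top-n_keep cutoff
    return scores >= thresh                             # boolean block mask

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 8))
k = rng.normal(size=(32, 8))
mask = block_sparse_mask(q, k, block=4, keep_ratio=0.1)
sparsity = 1.0 - mask.mean()
print(f"block-mask sparsity: {sparsity:.2f}")
```

With keep_ratio near 0.1 this matches the ~90% sparsity figure quoted for NABLA; in the real model, dense attention would then be evaluated only inside the surviving blocks.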

Pipeline overview (recovered from the paper's infographic):

• Data processing: T2I dataset (500M images), T2V dataset (250M videos), I2I editing dataset, SFT quality dataset, Russian cultural dataset
• Architecture: CrossDiT backbone, NABLA attention, flow matching, Qwen2.5-VL text encoder, HunyuanVideo VAE
• Training stages: 1) multi-resolution pre-training; 2) domain-specific SFT + model souping; 3) distillation (CFG → TSCD → adversarial); 4) RL-based post-training (images)
• Optimizations: VAE acceleration (2.5×), NABLA sparse attention, memory optimization, torch.compile, MagCache
• Model family:
  • Image Lite (6B): text-to-image, image editing, high resolution, RL post-training
  • Video Lite (2B): text-to-video, image-to-video, 10-second clips, Flash variant
  • Video Pro (19B): high-quality T2V, superior dynamics, up to 1408p, Flash variant
• Human evaluation: superior visual quality and motion dynamics vs. Sora, Veo, and Wan models; competitive prompt following; state-of-the-art open-source performance
• Key innovations: NABLA (2.7× speedup at 90% sparsity), flow matching + CrossDiT architecture, model souping for domain expertise, multi-stage distillation (100 → 16 NFEs), reward-based RL fine-tuning, comprehensive data curation pipeline
• Open-source release: MIT license; code, weights, and training checkpoints
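
Kandinsky 5.0 trains its diffusion transformer with flow matching. In the rectified-flow form, the model regresses a velocity field onto x1 − x0 along the straight path between noise and data. A toy 1-D NumPy sketch, with a closed-form linear regressor standing in for the neural velocity network (the distributions and the linear "model" are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D flow matching: transport noise x0 ~ N(0, 1) to "data" x1 ~ N(3, 0.1).
n = 4096
x0 = rng.normal(0.0, 1.0, size=n)
x1 = rng.normal(3.0, 0.1, size=n)
t = rng.uniform(0.0, 1.0, size=n)

# Straight interpolation path and its velocity target (rectified flow):
#   x_t = (1 - t) * x0 + t * x1,   target v = x1 - x0
xt = (1.0 - t) * x0 + t * x1
target = x1 - x0

# "Model": v(x, t) = w0*x + w1*t + w2, fitted in closed form as a
# stand-in for the neural velocity network a real system would train.
feats = np.stack([xt, t, np.ones(n)], axis=1)
w, *_ = np.linalg.lstsq(feats, target, rcond=None)

mse = float(np.mean((feats @ w - target) ** 2))
baseline = float(np.mean(target ** 2))  # loss of the all-zero model
print(f"flow-matching MSE: {mse:.3f} (vs {baseline:.3f} for v = 0)")
```

At inference, samples are generated by integrating dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data); the distillation stages in the pipeline above reduce how many such integration steps (NFEs) are needed.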
Q1
1. What is the main innovation of the NABLA mechanism introduced in Kandinsky 5.0?
It reduces training time by compressing video data before processing
It dynamically constructs content-aware sparse attention masks for efficient video processing
It eliminates the need for attention mechanisms entirely in video generation
Q2
2. How did the researchers handle the challenge of evaluating video generation quality?
They relied solely on automated metrics like FID scores
They used only internal testing by the development team
They conducted human side-by-side evaluations comparing with other models like Sora and Veo
Q3
3. What unique approach did Kandinsky 5.0 take in its training pipeline?
It used only pre-training on a single large dataset
It employed a multi-stage approach including pre-training, SFT, distillation, and RL-based post-training
It focused exclusively on adversarial training methods

Paper 2

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.15065

1. 📘 Topic and Domain: Evaluation of video models' spatial reasoning abilities through maze-solving tasks in the computer vision and artificial intelligence domain.
2. 💡 Previous Research and New Ideas: Building on prior research in text-based reasoning (Chain-of-Thought) and video generation models, the paper proposes a novel "reasoning via video" paradigm in which reasoning emerges through video frame generation rather than text generation.
3. ❓ Problem: Addresses the lack of comprehensive benchmarks for evaluating video models' reasoning capabilities and investigates whether video models can perform complex spatial reasoning tasks.
4. 🛠️ Methods: Created VR-Bench, a benchmark with 7,920 procedurally generated maze videos across five maze types, evaluated models through path matching metrics and rule compliance, and used supervised fine-tuning on video models.
5. 📊 Results and Evaluation: Fine-tuned video models outperformed vision-language models, showing 10-20% performance improvement through test-time scaling, strong generalization across different maze types and textures, and superior spatial reasoning capabilities.
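
The test-time-scaling gains above are analyzed via Pass@K. The summary does not give the paper's exact formula, so here is the standard unbiased pass@k estimator popularized by code-generation benchmarks, which measures the chance that at least one of k samples drawn from n generations is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given that c of
    the n generations are correct."""
    if n - c < k:
        return 1.0  # too few incorrect generations to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 correct generations out of 10, more sampled attempts help a lot:
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

Diverse sampling at inference improves the chance that some generated video solves the maze, which is consistent with the 10-20% improvement reported above.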

VR-Bench overview (recovered from the paper's workflow infographic):

• Dataset construction: 7,920 generated maze videos across five maze types (regular, irregular, 3D maze, Trapfield, Sokoban), with varied difficulty levels, textures, and visual styles
• Model training: supervised fine-tuning (SFT) of Wan2.2-TI2V-5B into Wan-R1
• Evaluation framework: path-matching metrics (EM, SR, PR, SD) and rule-compliance metrics (VLM-Score, MF); "chain of frame" temporal reasoning
• Models evaluated: 3 VLMs (Gemini, GPT-5, Qwen), 6 closed video models (Veo, Sora-2, Kling, etc.), 2 open video models (Wan2.5, Wan2.2)
• Experimental analysis: baseline comparison, test-time scaling (Pass@K analysis), and generalization tests across difficulty, texture, and maze type
• Key findings: video reasoning beats text-based reasoning on spatial perception; test-time scaling yields a 10-20% improvement; generalization across difficulty, texture, and maze type is strong; SFT elicits reasoning ability; Wan-R1 sets state-of-the-art results
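
The path-matching metrics (EM, PR, and the rest) are only named in the summary, not defined, so the sketch below uses plausible definitions that are assumptions: exact path equality for EM, and cell-level precision of the predicted path for PR.

```python
def exact_match(pred, gold):
    """EM (assumed definition): 1 if the predicted path exactly equals
    the ground-truth path, else 0."""
    return int(pred == gold)

def path_precision(pred, gold):
    """PR (assumed definition): fraction of predicted cells that lie on
    the ground-truth path."""
    if not pred:
        return 0.0
    gold_set = set(gold)
    return sum(cell in gold_set for cell in pred) / len(pred)

# Paths as sequences of (row, col) maze cells:
gold = [(0, 0), (0, 1), (1, 1), (1, 2)]
pred = [(0, 0), (1, 0), (1, 1), (1, 2)]
print(exact_match(pred, gold))     # 0
print(path_precision(pred, gold))  # 0.75
```

In VR-Bench these path scores are complemented by rule-compliance checks (VLM-Score, MF) that verify the generated video obeys the maze's movement rules, not just its endpoint.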
Q1
1. What is the key innovation in the paper's reasoning approach compared to traditional methods?
Using text-based chain-of-thought reasoning
Implementing reasoning through sequential video frame generation
Applying reinforcement learning to maze solving
Q2
2. What unique phenomenon did researchers discover about video models' performance during testing?
Models performed better in 3D mazes than 2D mazes
Performance decreased with larger sampling sizes
Diverse sampling during inference improved reasoning reliability by 10-20%
Q3
3. Which of the following best describes VR-Bench's evaluation approach?
Only evaluates maze completion success rates
Focuses exclusively on visual quality metrics
Uses both path matching metrics and rule compliance evaluation

Paper 3

VisPlay: Self-Evolving Vision-Language Models from Images

Published: 2025-11-19

Link: http://arxiv.org/pdf/2511.15661

1. 📘 Topic and Domain: The paper presents VisPlay, a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that improves visual reasoning capabilities using unlabeled images.
2. 💡 Previous Research and New Ideas: Based on previous research in self-evolving language models and reinforcement learning for VLMs, the paper proposes a novel framework that enables VLMs to autonomously improve without human-annotated data through self-play between two roles.
3. ❓ Problem: The paper addresses the limitation of current VLM training approaches that rely heavily on costly human-annotated labels and task-specific heuristics for defining rewards.
4. 🛠️ Methods: The paper implements a self-play framework where a single base VLM alternates between two roles - an Image-Conditioned Questioner that generates challenging questions and a Multimodal Reasoner that produces answers, jointly trained using Group Relative Policy Optimization (GRPO).
5. 📊 Results and Evaluation: The framework achieved consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks when tested on three models from the Qwen2.5-VL and MiMo-VL families, showing its effectiveness in enhancing VLM capabilities without human supervision.
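
Both roles are trained with GRPO, whose defining step is to standardize each sampled response's reward against its own group, removing the need for a learned value baseline. A minimal sketch (the reward values are illustrative):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: standardize each reward in
    a group of sampled responses against the group's own mean and std,
    so no learned value function is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One question, four sampled responses, binary correctness rewards:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # above-mean responses get positive advantage
```

These advantages then weight the policy-gradient update for whichever role (Questioner or Reasoner) produced the group.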

Framework overview (recovered from the paper's diagram):

• Input: raw, unlabeled images fed to a single base VLM that alternates between two roles
• Image-Conditioned Questioner: generates challenging questions, rewarded via an uncertainty reward, diversity regularization, and a format constraint
• Multimodal Reasoner: produces "silver" responses; pseudo-labels are derived by majority voting, confidence scoring, and informative filtering
• Joint training: both roles are updated with Group Relative Policy Optimization (GRPO) in an iterative co-evolution loop
• Algorithm steps: 1) sample question groups; 2) compute confidence scores; 3) calculate rewards; 4) update via GRPO; 5) generate a curated dataset
• Reported gains: visual reasoning, compositional generalization, hallucination reduction, cross-domain adaptation
• Key innovation: self-evolution from raw images alone, without human supervision, through autonomous question generation and reasoning improvement
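
The pseudo-labeling and uncertainty-reward steps described above can be sketched as follows: majority voting over the Reasoner's sampled answers yields both a silver label and a confidence score, and the Questioner is rewarded for questions of intermediate difficulty. The exact reward shape below is an assumption, not the paper's formula:

```python
from collections import Counter

def majority_label(answers):
    """Pseudo-label by majority vote over sampled Reasoner answers,
    returning the winning answer and its vote share as confidence."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def uncertainty_reward(confidence):
    """Assumed reward shape: favor questions the Reasoner finds neither
    trivial (confidence near 1) nor hopeless (confidence near chance),
    peaking at confidence = 0.5."""
    return 1.0 - abs(confidence - 0.5) * 2.0

# Eight sampled answers to one generated question:
answers = ["cat", "cat", "dog", "cat", "bird", "dog", "cat", "dog"]
label, conf = majority_label(answers)
print(label, conf)               # cat 0.5
print(uncertainty_reward(conf))  # 1.0
```

Informative filtering would then keep only question-answer pairs whose confidence clears a threshold, producing the curated dataset for the next self-evolution round.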
Q1
1. What is the main innovation of VisPlay compared to traditional VLM training approaches?
It uses pre-trained language models to generate questions
It enables self-improvement without human-annotated data through role-based self-play
It combines multiple existing VLMs to enhance performance
Q2
2. How does the Image-Conditioned Questioner evaluate the difficulty of generated questions?
By comparing with human-rated difficulty scores
By measuring the length and complexity of questions
By analyzing the uncertainty in the Multimodal Reasoner's responses
Q3
3. Which of the following best describes the co-evolution process in VisPlay?
The model alternates between generating easier and harder questions randomly
Two separate models compete against each other to improve performance
A single base model switches between questioning and reasoning roles while progressively improving both capabilities