2026-01-06 Papers


Paper 1

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Published: 2026-01-05

Link: http://arxiv.org/pdf/2601.02256

1. 📘 Topic and Domain: Visual autoregressive (VAR) image generation and reinforcement learning, focusing on improving text-to-image generation models.
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO) and VAR models; proposes novel techniques to handle asynchronous policy conflicts in VAR generation.
3. ❓ Problem: Addresses the challenge of asynchronous policy conflicts in VAR models where the number of query tokens fluctuates significantly across generation steps, leading to unstable training.
4. 🛠️ Methods: Introduces three components: Value as Middle Return (VMR) for intermediate rewards, Per-Action Normalization Weighting (PANW) for balancing timestep contributions, and Mask Propagation (MP) for focused credit assignment.
5. 📊 Results and Evaluation: Achieved significant improvements in sample quality and objective alignment over vanilla GRPO baseline, with increased Word Accuracy from 0.5536 to 0.7841 and improved NED from 0.7816 to 0.9081 on the CVTG-2K dataset.
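The PANW weighting in point 4 can be sketched directly from its formula kₜ = 1/(hₜ×wₜ)^α: later VAR steps emit much larger token grids, so their gradients would otherwise dominate. A minimal illustration, assuming a toy resolution schedule and α = 0.5 (the paper's actual schedule and α are not given here):

```python
# Hypothetical sketch of Per-Action Normalization Weighting (PANW).
# Each VAR step t emits an h_t x w_t grid of tokens; the weight
# k_t = 1 / (h_t * w_t) ** alpha down-weights later, larger grids so
# per-step gradient magnitudes stay comparable during RL training.

def panw_weights(grid_sizes, alpha=0.5):
    """Return one weight per generation step from its (h, w) grid size."""
    return [1.0 / (h * w) ** alpha for h, w in grid_sizes]

# Example: a toy VAR resolution schedule from 1x1 up to 16x16 tokens.
grids = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
weights = panw_weights(grids, alpha=0.5)
# With alpha = 0.5 the weight is 1/sqrt(h*w): 1.0, 0.5, 0.25, 0.125, 0.0625
```

With α = 0 every step contributes equally per token (so large grids dominate); α = 1 fully normalizes by token count.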

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Infographic summary (methodology flow):
- Problem identification: asynchronous policy conflicts in VAR, formulated as an MDP with state (r₁, r₂, ..., rₜ) and action = the next resolution grid.
- Value as Middle Return (VMR): two-stage optimization with prefix/suffix decomposition and a structure-preserving reward; key formulation V*ₘ(sₘ) = η log E[exp(R(sₜ)/η)].
- Per-Action Normalization Weighting (PANW): per-step weight kₜ = 1/(hₜ × wₜ)^α to balance gradient scales and normalize each step's contribution.
- Mask Propagation (MP): spatiotemporal masking propagated backward so credit assignment focuses on the relevant tokens.
- The three components integrate into an enhanced GRPO framework with conflict resolution; evaluated on text rendering (OCR-based rewards, CVTG-2K) and human preference (HPSv3 rewards, multi-domain), reporting state-of-the-art performance, stable training dynamics, and what is presented as the first RL framework for VAR text-to-image generation.
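The VMR value formula V*ₘ(sₘ) = η log E[exp(R(sₜ)/η)] is a soft (exponentially reweighted) value, which can be estimated from sampled suffix returns. A minimal Monte-Carlo sketch, with the usual max-subtraction trick for numerical stability (the estimator shape is standard; its exact use inside VMR's two-stage optimization is the paper's):

```python
import math

# Hypothetical sketch of the VMR value formula from the figure:
#   V*_m(s_m) = eta * log E[exp(R / eta)],
# estimated from sampled returns. Subtracting max(R) before
# exponentiating keeps the computation numerically stable.

def soft_value(returns, eta):
    """Monte-Carlo estimate of eta * log mean(exp(R / eta))."""
    m = max(returns)
    mean_exp = sum(math.exp((r - m) / eta) for r in returns) / len(returns)
    return m + eta * math.log(mean_exp)

rewards = [0.2, 0.5, 0.9]        # toy suffix returns for one prefix
v = soft_value(rewards, eta=1.0)
# As eta -> infinity the estimate approaches the plain mean of the returns;
# as eta -> 0 it approaches the max return (optimistic/risk-seeking).
```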
Q1
1. What is the main challenge addressed by this paper in Visual Autoregressive (VAR) models?
High computational costs during training
Asynchronous policy conflicts due to varying token numbers
Limited vocabulary in text-to-image generation
Q2
2. Which component of the proposed framework is responsible for providing dense feedback to early steps while preserving family-optimality?
Mask Propagation (MP)
Per-Action Normalization Weighting (PANW)
Value as Middle Return (VMR)
Q3
3. What improvement did the proposed method achieve in Word Accuracy on the CVTG-2K dataset?
From 0.5536 to 0.6841 (+23%)
From 0.5536 to 0.7841 (+41.6%)
From 0.5536 to 0.8841 (+59.7%)

Paper 2

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Published: 2026-01-05

Link: http://arxiv.org/pdf/2601.02358

1. 📘 Topic and Domain: The paper presents VINO, a unified visual generator for image and video generation and editing within a single framework, in the domain of computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Based on previous diffusion models and multimodal assistants, it proposes a new unified framework that combines a vision-language model with a Multimodal Diffusion Transformer, using interleaved conditioning tokens.
3. ❓ Problem: The paper addresses the fragmentation of visual generation pipelines, where text-to-image, text-to-video, and visual editing models are developed separately, lacking a unified framework for handling multiple tasks.
4. 🛠️ Methods: VINO couples a vision-language model with a Multimodal Diffusion Transformer, using learnable query tokens and a token-boundary mechanism, trained through a progressive three-stage pipeline that gradually expands capabilities.
5. 📊 Results and Evaluation: The model demonstrates strong performance across diverse generation and editing benchmarks, showing improved identity preservation, faithful instruction following, and better controllability in multi-identity edits compared to existing task-specific models.
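The interleaved conditioning in point 4 amounts to splicing visual-latent spans, delimited by boundary tokens, into the token stream the diffusion transformer attends over. A toy sketch, where the token names and the builder function are illustrative stand-ins rather than VINO's actual interfaces (the figure does show `<vision_start>`/`<vision_end>` special tokens):

```python
# Hypothetical sketch of interleaved omnimodal conditioning: VLM text
# tokens and VAE visual latents are spliced into one context sequence,
# with boundary tokens marking where each visual reference starts/ends
# so the MMDiT can ground edits to the right reference.

VISION_START, VISION_END = "<vision_start>", "<vision_end>"

def build_interleaved_context(segments):
    """segments: list of ('text', tokens) or ('image', latent_tokens)."""
    context = []
    for kind, tokens in segments:
        if kind == "image":
            context.append(VISION_START)
            context.extend(tokens)
            context.append(VISION_END)
        else:
            context.extend(tokens)
    return context

ctx = build_interleaved_context([
    ("text", ["replace", "the", "cat", "in"]),
    ("image", ["lat_0", "lat_1", "lat_2"]),   # VAE latents of a reference image
    ("text", ["with", "a", "corgi"]),
])
```

Because every visual span is explicitly bracketed, multiple references can coexist in one context without ambiguity, which is what enables the multi-identity control mentioned above.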

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Infographic summary (workflow):
- Input processing: text instructions, reference images, reference videos, system prompts, and learnable tokens.
- Vision-Language Model (Qwen3VL-4B) produces multimodal semantic embeddings, with visual spans delimited by special <vision_start>/<vision_end> tokens; a VAE encoder supplies visual latents carrying fine-grained detail and spatial information, with boundary marking via 3D RoPE.
- Multimodal Diffusion Transformer (MMDiT) consumes the interleaved context (VLM features/semantic embeddings via cross-modal attention, learnable tokens, VAE latents with visual details, boundary tokens) and performs the denoising diffusion process with noise prediction under full attention.
- Progressive training strategy: Stage 1 alignment (VLM-MMDiT connection, long T2V/T2I captions, 20k steps); Stage 2 adaptation (mixed long/short prompts, MMDiT training begins, 4k steps); Stage 3 multi-task (full multimodal training, generation + editing, 16k steps).
- Unified outputs: text-to-image/video generation, instruction-based image/video editing, reference-guided generation, multi-identity control.
- Key innovations: learnable query tokens, token-boundary mechanism, interleaved context, progressive training, unified architecture, cross-modal grounding, identity preservation.
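The three-stage schedule above (20k / 4k / 16k steps) can be written down as a small config for bookkeeping; the step counts come from the figure, while the focus descriptions are paraphrased and the structure itself is illustrative:

```python
# Hypothetical config sketch of VINO's progressive training schedule.
# Step counts are taken from the figure; field names are illustrative.

STAGES = [
    {"name": "alignment",  "steps": 20_000,
     "focus": "VLM-MMDiT connection, long T2V/T2I captions"},
    {"name": "adaptation", "steps": 4_000,
     "focus": "mixed long/short prompts, MMDiT training begins"},
    {"name": "multi-task", "steps": 16_000,
     "focus": "full multimodal training, generation + editing"},
]

total_steps = sum(s["steps"] for s in STAGES)   # 40,000 steps overall
```

Note the adaptation stage is deliberately short: it only has to bridge the already-aligned VLM features into MMDiT training before the long multi-task stage.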
Q1
1. What is the main innovation VINO introduces to solve the fragmentation problem in visual generation?
Using multiple specialized models working in parallel
Implementing a shared diffusion backbone with interleaved conditioning tokens
Creating separate pipelines for each visual task
Q2
2. How does VINO's progressive training strategy work?
Trains all capabilities simultaneously from scratch
Starts with video generation and gradually adds image tasks
Begins with image generation and slowly incorporates video capabilities
Q3
3. What unique architectural component does VINO use to maintain consistent grounding across different visual references?
Token-boundary mechanism that reuses VLM tokens within MMDiT
Multiple parallel attention heads
Separate encoders for each modality

Paper 3

K-EXAONE Technical Report

Published: 2026-01-04

Link: http://arxiv.org/pdf/2601.01739

1. 📘 Topic and Domain: Development of K-EXAONE, a large-scale multilingual language model with 236B parameters, in the domain of natural language processing and artificial intelligence.
2. 💡 Previous Research and New Ideas: Builds on EXAONE 4.0's hybrid architecture, introducing a new Mixture-of-Experts (MoE) architecture with expanded language support and a longer context window.
3. ❓ Problem: Addressing South Korea's infrastructure gap in AI development by creating a globally competitive language model despite limited AI-specialized resources.
4. 🛠️ Methods: Implemented a three-stage training approach (pre-training, context length extension, post-training) using an MoE architecture with 128 experts, of which only 23B of the 236B parameters are activated during inference.
5. 📊 Results and Evaluation: K-EXAONE demonstrated competitive performance across multiple benchmarks including reasoning (MMLU-PRO: 83.8%), math (AIME: 92.8%), coding (LiveCodeBench: 80.7%), and achieved strong multilingual capabilities in six languages.
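The MoE mechanism in point 4 works by routing each token to a few experts: a router scores all 128 experts, the top scorers run, and their outputs are mixed by softmax-renormalized gates, so only ~23B of the 236B parameters are active per token. A stdlib-only sketch (top-8 matches the figure; the router and expert functions here are toy stand-ins):

```python
import math
import random

# Hypothetical sketch of MoE token routing as summarized above: of 128
# experts, the router picks the top-8 per token (a shared expert, per the
# figure, would additionally always run), so only a small fraction of
# all parameters is active for any given token.

def top_k_route(router_logits, k=8):
    """Return indices of the k highest-scoring experts and their gate
    weights, softmax-renormalized over just the selected k."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)            # stability shift
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]     # one token's router scores
experts, gates = top_k_route(logits, k=8)
# Token output = shared_expert(x) + sum of gate * expert(x) over the top-8.
```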

K-EXAONE Technical Report

Infographic summary (training workflow):
- MoE architecture: 236B parameters, 23B active; 128 experts with top-8 routing plus a shared expert; dropless routing.
- Enhanced tokenizer: vocabulary expanded from 100K to 150K using SuperBPE, covering six languages.
- Pre-training: 11T tokens in a three-stage curriculum with multilingual "thinking" (reasoning) data; FP8 training.
- Context extension: 8K → 32K → 256K using rehearsal, reasoning, and long-document data.
- Post-training: SFT with tool use and agentic web search; reinforcement learning via the AGAPO algorithm (math, code, STEM); preference learning via the GROUPER method (chat, safety, creative).
- Comprehensive evaluation: reasoning, agentic, Korean, safety, and multilingual benchmarks.
- Safety: K-AUT framework (universal human values, social safety, Korean sensitivity, future risk); KGC-Safety benchmark with 2,260 test instances across 226 risk categories; 96.1% safety rate.
- Key technical features: hybrid attention, MTP self-drafting, QK-Norm + RoPE, dropless routing, 256K context, Muon optimizer.
- Performance highlights: MMLU-Pro 83.8, AIME 2025 92.8, LiveCodeBench 80.7, τ²-Bench 73.2, IFBench 67.3, KOBALT 61.8, MMMLU 85.7, KGC-Safety 96.1, WildJailbreak 89.9.
- Positioning: K-EXAONE 236B-A23B, a frontier-level multilingual foundation model (Korean, English, Spanish, German, Japanese, Vietnamese).
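The MTP self-drafting feature listed above is a form of speculative decoding: a cheap multi-token-prediction head proposes several next tokens at once, the full model verifies them in a single pass, and the longest agreeing prefix is accepted, which is how multiple tokens can land per expensive forward pass. A toy sketch; `draft_fn`/`verify_fn` are stand-ins, not K-EXAONE's actual interfaces:

```python
# Hypothetical sketch of self-drafting with Multi-Token Prediction (MTP).
# The drafter cheaply guesses n_draft tokens; the verifier (full model)
# reports its own pick at each drafted slot; we keep drafted tokens while
# they match, and on the first mismatch keep the verifier's correction.

def speculative_step(prefix, draft_fn, verify_fn, n_draft=4):
    """One decode step: accepts between 1 and n_draft tokens."""
    draft = draft_fn(prefix, n_draft)        # MTP head: n_draft cheap guesses
    checked = verify_fn(prefix, draft)       # full model's token at each slot
    out = list(prefix)
    for d, v in zip(draft, checked):
        if d == v:
            out.append(d)                    # draft confirmed, keep going
        else:
            out.append(v)                    # take the correction, stop here
            break
    return out

# Toy deterministic models: drafter always guesses "a"; the verifier
# agrees for two slots and then wants "b".
drafter = lambda p, n: ["a"] * n
verifier = lambda p, d: ["a", "a", "b", "a"][: len(d)]
result = speculative_step(["<s>"], drafter, verifier, n_draft=4)
# -> ["<s>", "a", "a", "b"]: three tokens accepted from one verifier pass.
```

When draft acceptance is high, each verifier pass yields several tokens, which is consistent with the roughly 1.5x decoding-throughput improvement referenced in the quiz below.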
Q1
1. What unique architectural innovation does K-EXAONE use to achieve efficient scaling while maintaining strong performance?
Dense transformer architecture with parallel processing
Mixture-of-Experts (MoE) with 128 experts activating only 23B parameters
Hybrid attention mechanism with transformer blocks
Q2
2. What was the main challenge that motivated the development of K-EXAONE in South Korea?
Lack of multilingual training data
Competition with other Asian AI models
Shortage of AI-specialized data centers and AI chips
Q3
3. During inference, how does K-EXAONE improve its decoding throughput?
By using Multi-Token Prediction for self-drafting achieving 1.5x improvement
By reducing the number of active experts to minimum
By implementing parallel processing across all experts