1. 📘 Topic and Domain: The paper presents VINO, a visual generator that unifies image and video generation and editing within a single framework, in the domain of computer vision and deep learning.
2. 💡 Previous Research and New Ideas: Building on prior diffusion models and multimodal assistants, the paper proposes a unified framework that couples a vision-language model with a Multimodal Diffusion Transformer via interleaved conditioning tokens.
3. ❓ Problem: The paper addresses the fragmentation of visual generation pipelines, in which text-to-image, text-to-video, and visual editing models are developed separately, with no unified framework that handles all of these tasks.
4. 🛠️ Methods: VINO couples a vision-language model with a Multimodal Diffusion Transformer, using learnable query tokens and a token-boundary mechanism, and is trained through a progressive three-stage pipeline that gradually expands its capabilities.
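To make the conditioning design above concrete, here is a minimal, purely illustrative sketch of how interleaved conditioning might be assembled: modality segments separated by token-boundary markers, followed by learnable query slots. All names (`interleave_condition`, the marker strings, `num_queries`) are assumptions for illustration, not VINO's actual API or token vocabulary.

```python
# Hypothetical sketch of interleaved conditioning (names are assumptions,
# not VINO's actual implementation). Each (modality, tokens) segment from
# the vision-language model is wrapped in boundary markers, and learnable
# query slots are appended for the diffusion transformer to attend to.

def interleave_condition(segments, num_queries=4):
    """Build an interleaved conditioning sequence with boundary markers
    and trailing query slots."""
    seq = []
    for modality, tokens in segments:
        seq.append(f"<{modality}_start>")  # token-boundary marker
        seq.extend(tokens)
        seq.append(f"<{modality}_end>")
    # learnable query tokens act as slots summarizing the VLM context
    seq.extend(f"<query_{i}>" for i in range(num_queries))
    return seq

cond = interleave_condition(
    [("text", ["a", "red", "car"]), ("image", ["img_0", "img_1"])],
    num_queries=2,
)
print(cond)
# ['<text_start>', 'a', 'red', 'car', '<text_end>',
#  '<image_start>', 'img_0', 'img_1', '<image_end>',
#  '<query_0>', '<query_1>']
```

The boundary markers let the downstream transformer distinguish where one modality's tokens end and another's begin, which is the role the summary attributes to the token-boundary mechanism.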
5. 📊 Results and Evaluation: The model demonstrates strong performance across diverse generation and editing benchmarks, showing improved identity preservation, faithful instruction following, and better controllability in multi-identity edits compared to existing task-specific models.