2026-02-11 Papers

Paper 1

Code2World: A GUI World Model via Renderable Code Generation

Published: 2026-02-10

Link: http://arxiv.org/pdf/2602.09856

1. 📘 Topic and Domain: This paper introduces Code2World, a GUI world model that predicts next user interface states through renderable HTML code generation for autonomous GUI agents.
2. 💡 Previous Research and New Ideas: The paper builds on existing text-based and pixel-based GUI world models but proposes a novel "renderable code generation" paradigm that uses structured HTML code as an intermediate representation to achieve both high visual fidelity and fine-grained structural controllability.
3. ❓ Problem: The paper aims to solve the limitation of existing GUI agents that lack predictive foresight, operating without simulating action consequences, which leads to costly corrections and potential failures in high-risk scenarios.
4. 🛠️ Methods: The authors use a two-stage training approach: supervised fine-tuning on the synthesized AndroidCode dataset (80K+ samples), followed by Render-Aware Reinforcement Learning with dual rewards for visual semantic fidelity and action consistency.
5. 📊 Results and Evaluation: Code2World-8B achieves state-of-the-art performance in next-UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image, and significantly enhances downstream GUI agents, yielding a +9.5% improvement on AndroidWorld navigation tasks when used as a plug-and-play simulator.
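
The RARL stage combines its two reward signals and optimizes with GRPO, which normalizes each sampled rollout's reward against its group. A minimal sketch of the dual-reward combination and the group-relative advantage step; the function names and the 50/50 weighting are illustrative assumptions, not details from the paper:

```python
import statistics

def rarl_reward(visual_score: float, action_score: float,
                w_visual: float = 0.5, w_action: float = 0.5) -> float:
    """Combine the two RARL reward components.

    visual_score: VLM-as-Judge rating of how well the rendered
        prediction matches the ground-truth next screen.
    action_score: whether the predicted UI reflects the consequence
        of the executed action. The 50/50 weighting is an assumption.
    """
    return w_visual * visual_score + w_action * action_score

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward
    by the mean/std of its sampled group (the core of GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled HTML predictions for the same (state, action) pair
rewards = [rarl_reward(v, a) for v, a in [(0.9, 1.0), (0.7, 1.0), (0.4, 0.0), (0.8, 1.0)]]
advs = grpo_advantages(rewards)
```

The normalization means only relative quality within the group matters, so no learned value model is needed.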

[Figure: Code2World method overview]
• Data construction: AndroidCode dataset of 80K+ screen-action pairs, synthesized with GPT-5 and revised via visual feedback until SigLIP score > 0.9.
• Stage 1 (SFT): supervised fine-tuning of a Qwen3-VL-8B backbone to learn HTML syntax, UI layout logic, and code generation.
• Stage 2 (RARL): Render-Aware RL with a visual semantic reward and an action consistency reward, optimized with GRPO and scored by a VLM-as-Judge.
• Input processing: current GUI I_t + action a_t + goal G, with visual prompting (red circles/arrows) and semantic action expansion.
• Code generation: predicted HTML code Ĉ_{t+1}, using semantic placeholders for images and inline SVG for icons.
• Browser rendering: R(Ĉ_{t+1}) → Î_{t+1}, a deterministic, high-fidelity rendering of the predicted GUI state.
• Evaluation: functional logic (S_ad, S_id) and visual quality (S_ele, S_lay) under a VLM-as-Judge framework.
• GUI agent enhancement: Propose-Simulate-Select pipeline, plug-and-play integration, +9.5% improvement on AndroidWorld.
• Key results: rivals GPT-5 and Gemini-3-Pro-Image in next-UI prediction; improves both offline and online navigation; generalizes to unseen applications; the lightweight 8B model outperforms 100B+ baselines.
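
The plug-and-play use of the world model follows the Propose-Simulate-Select idea: the agent proposes candidate actions, Code2World simulates the next screen for each, and the action with the best simulated outcome is executed. A schematic sketch, with toy stand-ins for the world model and scorer (the real components are the trained models, not these lambdas):

```python
from typing import Callable

def propose_simulate_select(
    state: str,
    goal: str,
    candidates: list[str],
    world_model: Callable[[str, str], str],  # (state, action) -> predicted next screen
    scorer: Callable[[str, str], float],     # (predicted screen, goal) -> progress score
) -> str:
    """Pick the candidate action whose simulated next state scores best."""
    def simulated_score(action: str) -> float:
        predicted = world_model(state, action)
        return scorer(predicted, goal)
    return max(candidates, key=simulated_score)

# Toy stand-ins: the "world model" appends the action to the state,
# and the scorer counts goal keywords present in the predicted screen.
wm = lambda s, a: f"{s} -> {a}"
score = lambda screen, goal: sum(w in screen for w in goal.split())
best = propose_simulate_select("home_screen", "open settings wifi",
                               ["tap(mail)", "tap(settings)", "scroll()"], wm, score)
```

Because simulation happens before execution, mistakes are caught in the predicted screen rather than on the real device, which is the paper's argument for foresight in high-risk scenarios.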
Q1. What is the key innovation of Code2World compared to existing GUI world models?
a) Using renderable HTML code generation as an intermediate representation
b) Training on larger datasets with more GPU resources
c) Implementing faster pixel-based diffusion models

Q2. What are the two reward components used in Code2World's Render-Aware Reinforcement Learning strategy?
a) Speed reward and accuracy reward
b) Visual semantic reward and action consistency reward
c) Memory efficiency reward and computational cost reward

Q3. How much improvement does Code2World provide when enhancing Gemini-2.5-Flash on AndroidWorld navigation tasks?
a) +15.2% improvement in success rate
b) +6.8% improvement in success rate
c) +9.5% improvement in success rate

Paper 2

Chain of Mindset: Reasoning with Adaptive Cognitive Modes

Published: 2026-02-10

Link: http://arxiv.org/pdf/2602.10063

1. 📘 Topic and Domain: The paper introduces Chain of Mindset (CoM), a framework for large language model reasoning that enables dynamic switching between different cognitive modes during problem-solving across mathematics, coding, and multimodal reasoning tasks.
2. 💡 Previous Research and New Ideas: The paper builds on cognitive science research identifying distinct reasoning modes (spatial, convergent, divergent thinking) and existing LLM reasoning methods like Chain-of-Thought, proposing the novel idea of step-level adaptive mindset orchestration where models can dynamically switch between four heterogeneous cognitive modes within a single reasoning process.
3. ❓ Problem: The paper addresses the limitation that existing LLM reasoning methods apply a single fixed mindset throughout problem-solving, which prevents models from adapting their cognitive approach when different stages of the same problem require fundamentally different reasoning strategies.
4. 🛠️ Methods: The authors developed a three-layer architecture with a Meta-Agent that orchestrates four specialized mindsets (Spatial, Convergent, Divergent, Algorithmic), combined with a bidirectional Context Gate mechanism that filters information flow between components to prevent interference during mindset transitions.
5. 📊 Results and Evaluation: CoM achieved state-of-the-art performance across six challenging benchmarks, outperforming the strongest baseline by 4.96% with Qwen3-VL-32B-Instruct and by 4.72% with Gemini-2.0-Flash as backbones; results were evaluated with pass@1 accuracy while maintaining computational efficiency.
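
The plan-call-internalize loop can be sketched as a dispatch loop: the Meta-Agent picks a mindset per step, the input gate filters what that mindset sees, and the output gate distills its result back into the running state. Everything below is a toy stand-in for LLM calls; the gating heuristics are illustrative only:

```python
from typing import Callable

# Four mindset modules (stand-in callables; the real CoM modules are LLM calls)
MINDSETS: dict[str, Callable[[str], str]] = {
    "spatial":     lambda ctx: f"diagram({ctx})",
    "convergent":  lambda ctx: f"deduce({ctx})",
    "divergent":   lambda ctx: f"branch({ctx})",
    "algorithmic": lambda ctx: f"compute({ctx})",
}

def input_gate(state: str) -> str:
    """Filter the running context to what the selected mindset needs
    (crude truncation here; CoM uses semantic filtering)."""
    return state[-200:]

def output_gate(result: str) -> str:
    """Distill a module's raw output into a compact insight."""
    return result.strip()

def chain_of_mindset(problem: str, policy, max_steps: int = 4) -> str:
    """Plan-call-internalize loop: the Meta-Agent makes the cognitive
    decision pi(s_t) each step; gated context flows into the chosen
    mindset and the distilled insight flows back into the state."""
    state = problem
    for _ in range(max_steps):
        choice = policy(state)
        if choice == "answer":
            break
        insight = output_gate(MINDSETS[choice](input_gate(state)))
        state = f"{state} | {insight}"  # internalize the insight
    return state

# Toy policy: use the algorithmic mindset once, then answer
policy = lambda s: "algorithmic" if "|" not in s else "answer"
trace = chain_of_mindset("2+2", policy)
```

The two gates are what keep mindset transitions clean: a module never sees the full raw context, and the loop never absorbs a module's full raw output.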

[Figure: Chain of Mindset (CoM) framework]
• Meta-Agent: makes the cognitive decision π(s_t) and drives a plan-call-internalize loop.
• Context Gate: an input gate filters context into the selected mindset; an output gate distills its results back.
• Four mindsets: Spatial (visual imagination, text→image, structure grounding); Convergent (logical synthesis, focused analysis, deep reasoning); Divergent (branch generation, parallel exploration of multiple paths); Algorithmic (code generation, precise calculation, iterative verification).
• Iterative flow with feedback: cognitive decision → mindset selection → context filtering → mindset execution → insight integration → final answer.
• Key features: training-free, dynamic mindset switching, bidirectional context filtering, state-dependent adaptation.
Q1. What is the fundamental limitation of existing LLM reasoning methods that Chain of Mindset (CoM) aims to address?
a) They require extensive training data and computational resources
b) They apply a single fixed mindset throughout problem-solving instead of adapting cognitive approaches
c) They cannot handle multimodal inputs like images and text simultaneously

Q2. Which component in the CoM framework is responsible for preventing cross-module information interference during mindset transitions?
a) The Meta-Agent that orchestrates reasoning decisions
b) The Context Gate with bidirectional semantic filtering
c) The four specialized mindset modules themselves

Q3. In the ablation study, removing which component caused the largest overall performance drop of 8.24%?
a) The Divergent mindset module
b) The Spatial mindset module
c) The Context Gate mechanism

Paper 3

UI-Venus-1.5 Technical Report

Published: 2026-02-09

Link: http://arxiv.org/pdf/2602.09082

1. 📘 Topic and Domain: This paper presents UI-Venus-1.5, a unified end-to-end GUI (Graphical User Interface) agent designed for automating interactions in digital environments across mobile and web platforms.
2. 💡 Previous Research and New Ideas: The paper builds on the authors' previous UI-Venus-1.0 model and proposes three key advances: comprehensive Mid-Training with 10 billion tokens across 30+ datasets, Online Reinforcement Learning with full-trajectory rollouts, and a unified single GUI agent constructed via model merging of domain-specific models.
3. ❓ Problem: The paper aims to solve the challenge of achieving both broad generality and consistently strong task performance in GUI agents, addressing the gap between step-level and trace-level accuracy during training and the need for practical deployment-ready agents.
4. 🛠️ Methods: The authors used a four-stage training pipeline including Mid-Training for GUI knowledge injection, Offline-RL for task-specific optimization, Online-RL using Group Relative Policy Optimization (GRPO) for complex navigation, and model merging (specifically TIES-Merge) to unify specialized models into a single agent.
5. 📊 Results and Evaluation: UI-Venus-1.5 achieved state-of-the-art performance on multiple benchmarks including ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous baselines, and demonstrated robust navigation capabilities across Chinese mobile apps in real-world scenarios.
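
TIES-Merge's three steps (trim each task vector, elect a per-parameter sign, then merge only agreeing deltas) can be sketched over flat parameter lists; this is a simplified illustration of the general procedure, not the paper's implementation:

```python
def sign(x: float) -> int:
    return (x > 0) - (x < 0)

def ties_merge(base: list[float], models: list[list[float]],
               keep_frac: float = 0.5, lam: float = 1.0) -> list[float]:
    """TIES-Merge sketch: trim each specialist's task vector to its
    largest-magnitude entries, elect a sign per parameter, and average
    only the deltas that agree with the elected sign."""
    # 1. Task vectors: each specialist's delta from the shared base
    taus = [[m - b for m, b in zip(model, base)] for model in models]
    # 2. Trim: zero out all but the top keep_frac entries by magnitude
    trimmed = []
    for tau in taus:
        k = max(1, int(len(tau) * keep_frac))
        cutoff = sorted(map(abs, tau), reverse=True)[k - 1]
        trimmed.append([t if abs(t) >= cutoff else 0.0 for t in tau])
    # 3. Elect sign: sign of the summed trimmed mass per parameter
    elected = [sign(sum(t[i] for t in trimmed)) for i in range(len(base))]
    # 4. Disjoint merge: mean of deltas agreeing with the elected sign
    deltas = []
    for i, s in enumerate(elected):
        agree = [t[i] for t in trimmed if t[i] != 0.0 and sign(t[i]) == s]
        deltas.append(sum(agree) / len(agree) if agree else 0.0)
    return [b + lam * d for b, d in zip(base, deltas)]

# Example: merge two specialists (say, grounding and web) over 4 toy parameters
merged = ties_merge([0.0] * 4, [[1.0, 0.2, 0.0, 0.0], [0.8, -0.1, 0.6, 0.0]])
```

The sign election is what prevents a grounding-specific update and a navigation-specific update from cancelling each other out, which is why merging loses little per-domain performance.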

[Figure: UI-Venus-1.5 training pipeline]
• Base model: Qwen3-VL.
• Stage 1 (Mid-Training): 10B tokens across 30+ datasets; GUI knowledge injection via a data refinement pipeline and real-device data generation; establishes foundational GUI semantics.
• Stage 2 (Offline-RL): GRPO for grounding (format + point rewards) and for web/mobile navigation; reward components: format reward + action reward + coordinate reward.
• Stage 3 (Online-RL): Device-as-a-Service (DaaS) infrastructure, full-trajectory GRPO rollouts, dynamic + static task generation, completion + penalty reward design.
• Stage 4 (Model merge): TIES-Merge consolidates the domain-specific models into a single unified agent (2B | 8B | 30B-A3B) via parameter interpolation with minimal performance loss.
• State-of-the-art results: ScreenSpot-Pro 69.6%, VenusBench-GD 75.0%, AndroidWorld 77.6%, OSWorld-G-R 76.4%, UI-Vision 54.7%, WebVoyager 76.0%; robust navigation across 40+ Chinese mobile apps.
• Flow: foundation GUI knowledge → enhanced training → specialized models → unified agent.
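
The Offline-RL reward composition (format + action + coordinate) can be illustrated with a toy grounding reward; the output format, distance tolerance, and equal weighting below are assumptions for the sketch, not the paper's exact design:

```python
import re
import math

def grounding_reward(response: str, gt_action: str,
                     gt_point: tuple[float, float],
                     tol: float = 0.05) -> float:
    """Illustrative composite reward for the Offline-RL grounding stage:
    format reward (parseable output) + action reward (correct action
    type) + coordinate reward (predicted point near the target)."""
    m = re.match(r"(\w+)\(([\d.]+),\s*([\d.]+)\)", response.strip())
    if m is None:
        return 0.0  # unparseable output: format reward fails, nothing else is scored
    r_format = 1.0
    action, x, y = m.group(1), float(m.group(2)), float(m.group(3))
    r_action = 1.0 if action == gt_action else 0.0
    dist = math.hypot(x - gt_point[0], y - gt_point[1])
    r_coord = 1.0 if dist <= tol else 0.0
    return (r_format + r_action + r_coord) / 3.0
```

Gating everything behind the format check keeps GRPO from rewarding outputs the agent runtime could not even execute.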
Q1. What is the primary architectural innovation that allows UI-Venus-1.5 to handle both mobile and web environments in a single model?
a) Using separate neural networks for each platform and switching between them
b) Model merging strategy that unifies domain-specific models (grounding, web, and mobile) into one cohesive checkpoint
c) Training multiple models simultaneously and selecting the best one at runtime

Q2. According to the paper, what was the key limitation observed during Offline-RL training that motivated the addition of Online-RL?
a) The model was overfitting to the training data and couldn't generalize
b) Step-level success rates increased steadily while trace-level success rates eventually peaked and declined
c) The computational cost was too high for practical deployment

Q3. What is the scale of the Mid-Training corpus used in UI-Venus-1.5, and what is its primary purpose?
a) 5 billion tokens from 15+ datasets to improve general language understanding
b) 10 billion tokens from 30+ datasets to establish foundational GUI semantics and bridge the gap between general vision-language models and GUI-specific understanding
c) 20 billion tokens from 50+ datasets to enable multilingual GUI interaction