2026-02-12 Papers


Paper 1

GENIUS: Generative Fluid Intelligence Evaluation Suite

Published: 2026-02-11

Link: http://arxiv.org/pdf/2602.11144

1. 📘 Topic and Domain: This paper introduces GENIUS, a benchmark for evaluating Generative Fluid Intelligence (GFI) in unified multimodal models, focusing on their ability to perform dynamic reasoning and adaptation in visual generation tasks rather than just retrieving pre-trained knowledge.
2. 💡 Previous Research and New Ideas: The paper builds on the Cattell-Horn-Carroll theory of intelligence that distinguishes between Crystallized Intelligence (knowledge retrieval) and Fluid Intelligence (novel problem solving), proposing the first formal definition and benchmark for Generative Fluid Intelligence with three core dimensions: Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation.
3. ❓ Problem: The paper addresses the gap in evaluating whether current unified multimodal models possess true general intelligence for visual generation, as existing benchmarks primarily assess memorized knowledge rather than the ability to reason, adapt, and solve novel visual generation problems on the fly.
4. 🛠️ Methods: The authors created a manually curated benchmark with 510 expert-designed samples across 5 tasks and 20 sub-tasks, employed hybrid evaluation using LMM-as-a-judge with three metrics (Rule Compliance, Visual Consistency, Aesthetic Quality), and proposed a training-free attention adjustment mechanism based on theoretical analysis of in-context learning as implicit fine-tuning.
5. 📊 Results and Evaluation: The systematic evaluation of 12 models revealed significant performance deficits: even the best proprietary model (Nano Banana Pro) achieved only a 57.19% overall score, showing that current models struggle with fluid-intelligence tasks and often prioritize aesthetic quality over logical rule compliance, while the proposed attention mechanism yielded consistent improvements across all tasks.
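The hybrid evaluation above can be sketched as a simple scoring routine. The three metrics and the 3-point (0/1/2) judge scale come from the paper; the equal-weight averaging and normalization to a percentage are illustrative assumptions, since the digest does not specify how per-sample scores are aggregated.

```python
# Hypothetical aggregation of GENIUS's three judge metrics (RC, VC, AQ).
# The 0-2 scale per metric is from the paper; the equal-weight mean and
# normalization to a percentage are assumptions for illustration.

def sample_score(rc: int, vc: int, aq: int) -> float:
    """Normalize one sample's judge scores (each in {0, 1, 2}) to [0, 100]."""
    for s in (rc, vc, aq):
        if s not in (0, 1, 2):
            raise ValueError("judge scores must be on the 3-point scale 0/1/2")
    return (rc + vc + aq) / 6 * 100  # 6 = maximum attainable raw score

def benchmark_score(samples: list[tuple[int, int, int]]) -> float:
    """Average the per-sample scores across the benchmark."""
    return sum(sample_score(*s) for s in samples) / len(samples)

# Three toy samples: perfect, mixed, and weak judge verdicts.
print(benchmark_score([(2, 2, 2), (1, 2, 0), (0, 1, 1)]))
```

Under this (assumed) scheme, a model that always earns full marks on aesthetics but fails rule compliance is capped well below 100, which matches the paper's observation that models over-index on aesthetics.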

GENIUS: Methodology Flow

- Theoretical foundation (CHC theory): Crystallized Intelligence (CI) vs. Generative Fluid Intelligence (GFI). Three core primitives (Inductive Inference, Abstract Dynamic Reasoning, Adaptive Inhibition), formalized into Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation.
- Implicit Pattern Induction: Implicit Pattern Generation task (86 samples); sub-tasks: Overall Style, Visual Feature, Spatial Relationship, Palette, Entity.
- Ad-hoc Constraint Execution: Symbolic (153) and Visual (60) Constraint tasks; sub-tasks: Operation Implementation, Visual Metaphor, Layout, Features, Binding.
- Contextual Knowledge Adaptation: Prior-Conflicting (101) and Multi-Semantic (110) tasks; sub-tasks: Biological Growth, Gravity, Animal Behavior, Time Reversal, Weather.
- Hybrid evaluation framework: Rule Compliance (RC), Visual Consistency (VC), Aesthetic Quality (AQ); LMM-as-judge (Gemini-3-Pro) with manual hints on a 3-point scale (0/1/2).
- Systematic evaluation: 12 representative models. Proprietary: Nano Banana Pro/Base, GPT-Image, SeeDream 4.0/4.5. Open-source: Qwen-Image, GLM-Image, FLUX.2, NextStep-1, Emu3.5, Bagel. Key finding: even SOTA models fall short (best: 57.19).
- Failure analysis: attention visualization reveals irregular noise and spikes; framing in-context learning as implicit fine-tuning traces the root cause to imbalanced attention producing noisy implicit gradients. Solution: training-free attention adjustment.
- Three-stage attention adjustment pipeline: (1) Keyword Distillation extracts task-critical cues; (2) Relevance Mapping computes semantic alignment; (3) Bias Injection modulates attention logits. Result: +6.18% overall-score improvement on Bagel.
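A minimal sketch of the three-stage, training-free attention adjustment might look like the following. The stage names follow the pipeline above; all tensor shapes, the cosine-similarity relevance measure, and the bias scale `alpha` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of the three-stage attention adjustment:
# keyword distillation -> relevance mapping -> bias injection.
# Shapes, the cosine-similarity relevance, and `alpha` are assumed.

def relevance_map(token_emb: np.ndarray, keyword_emb: np.ndarray) -> np.ndarray:
    """Stage 2: max cosine similarity of each prompt token to any keyword."""
    t = token_emb / np.linalg.norm(token_emb, axis=-1, keepdims=True)
    k = keyword_emb / np.linalg.norm(keyword_emb, axis=-1, keepdims=True)
    return (t @ k.T).max(axis=-1)  # shape: (num_tokens,)

def biased_attention(logits: np.ndarray, relevance: np.ndarray,
                     alpha: float = 2.0) -> np.ndarray:
    """Stage 3: add a relevance-proportional bias to the attention logits,
    then renormalize with softmax so task-critical tokens gain weight."""
    biased = logits + alpha * relevance        # broadcast over query rows
    biased -= biased.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(biased)
    return w / w.sum(axis=-1, keepdims=True)

# Stage 1 (keyword distillation) would select keyword embeddings from the
# prompt, e.g. via an LLM call; here we simply pass them in directly.
rng = np.random.default_rng(0)
tokens, keywords = rng.normal(size=(8, 16)), rng.normal(size=(2, 16))
attn = biased_attention(rng.normal(size=(4, 8)), relevance_map(tokens, keywords))
print(attn.sum(axis=-1))  # each attention row remains a valid distribution
```

The design point matching the paper's root-cause analysis: biasing the logits (rather than post-hoc reweighting the softmax output) reshapes the whole distribution toward task-critical tokens, countering the imbalanced attention blamed for noisy implicit gradients.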
Q1
1. What are the three core primitives that define Generative Fluid Intelligence (GFI) according to the GENIUS framework?
Visual Understanding, Text Generation, and Image Synthesis
Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation
Rule Compliance, Visual Consistency, and Aesthetic Quality
Q2
2. What was the highest overall score achieved by any model on the GENIUS benchmark, and what does this reveal about current AI capabilities?
85.3% by GPT-Image, showing models are close to human-level fluid intelligence
72.8% by Bagel, indicating moderate success in generative reasoning tasks
57.19% by Nano Banana Pro, demonstrating significant deficits in fluid intelligence even for state-of-the-art models
Q3
3. According to the paper's theoretical analysis, what is the primary cause of models' failure in Generative Fluid Intelligence tasks?
Insufficient training data containing novel scenarios and constraints
Imbalanced attention distribution that results in noisy implicit gradients, preventing models from overcoming pre-trained priors
Limited computational resources during inference that restrict complex reasoning capabilities

Paper 2

PhyCritic: Multimodal Critic Models for Physical AI

Published: 2026-02-11

Link: http://arxiv.org/pdf/2602.11124

1. 📘 Topic and Domain: This paper focuses on developing multimodal critic models specifically designed for evaluating physical AI tasks involving perception, causal reasoning, and planning in embodied environments.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal reward models and reinforcement learning techniques for vision-language models, proposing a novel self-referential critic finetuning approach where the critic first generates its own prediction before evaluating candidate responses.
3. ❓ Problem: The paper aims to solve the lack of physics-aware multimodal critics that can reliably evaluate responses involving physical perception, causal reasoning, and action planning, as existing critics focus mainly on general visual domains.
4. 🛠️ Methods: The authors use a two-stage RLVR pipeline with GRPO optimization: Stage 1 involves physical skill warmup on question-answer pairs, followed by Stage 2 self-referential critic finetuning where the model generates its own prediction before judging candidate responses.
5. 📊 Results and Evaluation: PhyCritic achieved the best performance among open-source 7B/8B models on PhyCritic-Bench (68.0% accuracy), outperformed baselines on physical reasoning benchmarks (CosmosReason1-Bench, CV-Bench, EgoPlan-Bench2), and demonstrated strong generalization to general multimodal reward benchmarks.
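Both training stages above rely on GRPO, whose core is a group-relative advantage: several responses are sampled per prompt and each reward is normalized against its own group. A minimal sketch follows; the group size and epsilon are illustrative, and the full objective's clipped policy ratio and optional KL penalty are omitted.

```python
# Minimal sketch of the group-relative advantage at the heart of GRPO.
# A_i = (r_i - mean(r)) / (std(r) + eps), computed within one sampled
# group of rollouts for the same prompt. eps is an assumed constant.

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward against its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 rollouts for one physical-QA prompt, scored by the
# verifiable accuracy reward R = 1[prediction == answer]:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note the edge case: if every rollout in a group earns the same reward, all advantages collapse to zero and the group contributes no gradient, which is why verifiable rewards with mixed outcomes per group matter for this style of RLVR training.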

PhyCritic: Two-Stage RLVR Training Pipeline

- Training data sources: RoboVQA, BridgeData V2, HoloAssist, AgiBot World, Cosmos-Reason1. Base model: Qwen2.5-VL-7B.
- Stage 1 (Physical Skill Warmup): vanilla GRPO training for 80 steps on physical QA pairs, with accuracy reward R = I(Â_pred(Q) = A^Q).
- Stage 2 (Self-Referential Critic Finetuning): the model first produces its own prediction (<pred_think>, <pred>), then judges the candidate responses (<think>, \boxed{} preference evaluation). Reward components: R_total = R_acc + α_form × R_form, with R_acc = α_sp × R_sp + α_crit × R_crit (self-prediction + critic + format rewards); GRPO with multi-reward for 300 steps on preference data.
- Resulting model: PhyCritic (7B).
- Evaluation results: PhyCritic-Bench 68.0% accuracy (best open-source 7B); CosmosReason1 63.9% accuracy (physical reasoning); VL-RewardBench 57.3% accuracy (general domains); Best-of-N +6.5% improvement (test-time scaling).
- Key innovation: self-referential critic finetuning.
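The Stage-2 reward composition from the diagram text can be written out directly. The formulas follow the flow above (R_total = R_acc + α_form × R_form; R_acc = α_sp × R_sp + α_crit × R_crit); the concrete coefficient values are assumptions, as the digest names the terms but not the weights.

```python
# Sketch of PhyCritic's Stage-2 multi-reward, per the formulas above:
#   R_total = R_acc + alpha_form * R_form
#   R_acc   = alpha_sp * R_sp + alpha_crit * R_crit
# The coefficient defaults below are illustrative assumptions.

def total_reward(r_sp: float, r_crit: float, r_form: float,
                 alpha_sp: float = 0.5, alpha_crit: float = 0.5,
                 alpha_form: float = 0.1) -> float:
    """r_sp: self-prediction correctness, r_crit: critic preference
    correctness, r_form: output-format compliance (each in [0, 1])."""
    r_acc = alpha_sp * r_sp + alpha_crit * r_crit
    return r_acc + alpha_form * r_form

# A rollout whose self-prediction and critic judgment are both correct
# and whose <pred_think>/<pred>/<think>/\boxed{} format is well-formed:
print(total_reward(1.0, 1.0, 1.0))
```

Coupling the critic reward to a self-prediction reward is the self-referential idea: the model is paid for getting the answer right itself before it is paid for judging others, which discourages judging by surface cues alone.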
Q1
1. What is the core innovation of PhyCritic's self-referential critic finetuning approach?
The model first generates its own prediction for the question before evaluating candidate responses
The model uses multiple vision encoders to process physical scenes from different angles
The model employs adversarial training to distinguish between correct and incorrect physical reasoning
Q2
2. Which datasets were NOT used in constructing the PhyCritic training dataset?
RoboVQA, BridgeData V2, and HoloAssist
ImageNet, COCO, and Visual Genome
AgiBot World and Cosmos-Reason1
Q3
3. How many training samples and RL steps does PhyCritic require compared to approaches using millions of supervised traces?
10,000+ samples with 1,000+ RL steps for comprehensive coverage
4,058 samples with 380 RL steps (80 + 300) for data-efficient training
50,000 samples with 2,000 RL steps for robust performance

Paper 3

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Published: 2026-02-10

Link: http://arxiv.org/pdf/2602.10090

1. 📘 Topic and Domain: This paper presents Agent World Model (AWM), a synthetic environment generation pipeline for training tool-use agents in reinforcement learning within the domain of autonomous AI agents and multi-turn tool interactions.
2. 💡 Previous Research and New Ideas: The paper builds on existing work in LLM-based agents and synthetic data generation, proposing a novel code-driven pipeline that systematically generates executable environments with database-backed state consistency rather than relying on LLM simulation or limited hand-crafted environments.
3. ❓ Problem: The paper aims to solve the scalability limitations in training tool-use agents caused by the lack of diverse, reliable, and executable environments that can support large-scale reinforcement learning.
4. 🛠️ Methods: The authors use a five-step synthesis pipeline (scenario generation, task synthesis, database design, interface creation, and verification) combined with Group Relative Policy Optimization (GRPO) for reinforcement learning training on 1,000 generated environments.
5. 📊 Results and Evaluation: The results show that agents trained exclusively on synthetic environments achieve strong out-of-distribution generalization across three benchmarks (BFCLv3, τ²-bench, MCP-Universe), consistently outperforming baseline methods and demonstrating the effectiveness of the synthetic training approach.

Agent World Model (AWM) Workflow

- Step 1 (Scenario Synthesis): 100 seed domains expanded to 1,000 scenarios via LLM expansion and filtering.
- Step 2 (Task Generation): 10 tasks per scenario; API-solvable, post-authentication.
- Step 3 (Database Design): SQLite schema generation, sample data synthesis, and a self-correction mechanism.
- Step 4 (Interface Generation): MCP toolset design and Python code generation; 35 tools per environment.
- Step 5 (Verification): a code-augmented judge compares database states to generate reward signals.
- Agentic reinforcement learning: GRPO with history-aware training across 1,024 parallel instances, yielding trained tool-use agents.
- Pipeline results: 1,000 environments (diverse scenarios), 35,062 tools (MCP interfaces), 10,000 tasks with verification, 88% success rate via self-correction; averages of 18.5 tables, 2K lines of code, 8.5 steps, and 1.13 trials per environment.
- Evaluation on out-of-distribution benchmarks: BFCLv3 +12.11 improvement; τ²-bench competitive results; MCP-Universe best overall; strong OOD generalization.
- Key technical innovations: database-backed state consistency (code), unified MCP interface protocol, automatic self-correction, and parallel RL training at 1K-instance scale.
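The database-state verification idea in Step 5 can be sketched with SQLite. The table, task, and exact-match binary reward here are hypothetical; per the flow above, the actual judge is code-augmented, pairing this kind of structured state check with an LLM reading the trajectory.

```python
import sqlite3

# Illustrative sketch of AWM's database-backed verification: after an
# agent rollout, compare the environment's SQLite state with the
# expected state for the task. Table/column names and the exact-match
# reward granularity are assumptions for this toy example.

def snapshot(conn: sqlite3.Connection, table: str) -> set[tuple]:
    """Capture a table's rows as an order-independent set."""
    return set(conn.execute(f"SELECT * FROM {table}"))

def reward_from_state(actual: set[tuple], expected: set[tuple]) -> float:
    """Binary reward: 1.0 iff the final database state matches exactly."""
    return 1.0 if actual == expected else 0.0

# Toy environment: the task is to insert order #2 for user 'bob'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'alice')")
expected = snapshot(conn, "orders") | {(2, "bob")}

conn.execute("INSERT INTO orders VALUES (2, 'bob')")  # the agent's tool call
print(reward_from_state(snapshot(conn, "orders"), expected))  # 1.0
```

Because the reward is read from executed database state rather than from an LLM's opinion of the transcript, it stays reliable at the 1,024-instance parallel scale the pipeline trains at, which is the advantage over LLM-simulated environments noted in Q1.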
Q1
1. What is the key advantage of AWM's code-driven environments over LLM-simulated environments for agent training?
They provide more reliable state transitions and are more cost-effective for large-scale RL training
They generate more creative and diverse scenarios than LLM simulation
They require less computational resources during environment synthesis
Q2
2. How many environments and tools did the AWM pipeline successfully generate at scale?
500 environments with 20,000 tools
1,000 environments with 35,062 tools
2,000 environments with 50,000 tools
Q3
3. What verification approach does AWM use to provide robust reward signals for RL training?
Pure code-based verification that strictly checks database state changes
LLM-only verification that judges based solely on agent trajectories
Code-augmented LLM-as-a-Judge that combines structured database verification with trajectory context