2026-02-20 Papers

Paper 1

Unified Latents (UL): How to train your latents

Published: 2026-02-19

Link: http://arxiv.org/pdf/2602.17270

1. 📘 Topic and Domain: The paper presents Unified Latents (UL), a framework for learning latent representations in generative modeling, specifically for image and video generation using diffusion models.
2. 💡 Previous Research and New Ideas: The paper builds on Latent Diffusion Models and VAE frameworks, proposing to jointly train an encoder, a diffusion prior, and a diffusion decoder, linking the encoder's noise to the prior's minimum noise level to provide interpretable control over latent information content.
3. ❓ Problem: The paper addresses the challenge of how to optimally regularize latent representations when they will be subsequently modeled by diffusion models, balancing reconstruction quality against generation performance.
4. 🛠️ Methods: The authors use a deterministic encoder with fixed Gaussian noise tied to the diffusion prior's precision, train a diffusion decoder conditioned on the noisy latents, and optimize a reweighted ELBO loss with sigmoid weighting.
5. 📊 Results and Evaluation: On ImageNet-512, UL achieves FID of 1.4 with high reconstruction quality while requiring fewer training FLOPs than Stable Diffusion latents; on Kinetics-600, it achieves state-of-the-art FVD of 1.3.
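The encoding and weighting steps from point 4 can be sketched in a few lines. A minimal NumPy sketch: the names (`encode_noisy`, `sigmoid_weight`) and the variance-preserving choice of α₀ are illustrative assumptions, not the authors' API; only the fixed σ₀ ≈ 0.08 comes from the paper's figure.

```python
import numpy as np

SIGMA0 = 0.08                       # fixed encoder noise level (from the paper's figure)
ALPHA0 = np.sqrt(1.0 - SIGMA0**2)   # variance-preserving scale (an assumption here)

def encode_noisy(z_clean: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Deterministic encoder output plus fixed Gaussian noise: z_0 = a0*z + s0*eps."""
    eps = rng.standard_normal(z_clean.shape)
    return ALPHA0 * z_clean + SIGMA0 * eps

def sigmoid_weight(log_snr, bias: float = 0.0):
    """Sigmoid weighting over log-SNR, as used for the reweighted ELBO."""
    return 1.0 / (1.0 + np.exp(-(log_snr + bias)))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16))    # toy "clean" latents from the encoder
z0 = encode_noisy(z, rng)           # noisy latents fed to both prior and decoder
print(z0.shape)
```

Because σ₀ is fixed rather than learned, the latent bitrate is pinned to the prior's minimum noise level, which is the interpretability claim in point 2.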

Unified Latents (UL): How to train your latents

[Figure: UL workflow]
Stage 1 (joint training): the encoder E(x, ε) produces z_clean; noise is added as z_0 = α₀·z_clean + σ₀·ε with fixed σ₀ ≈ 0.08; a diffusion prior maps z_1 → z_0 under KL regularization; a diffusion decoder D(y_t, z_0, t) reconstructs x from z_0 with sigmoid weighting. Loss: L = L_p (ELBO-weighted prior loss) + L_d (sigmoid-weighted decoder loss), loss factor 1.3-1.7.
Stage 2 (base model training, frozen encoder): a larger base model is trained on z_clean with a sigmoid-weighted ELBO.
Sampling: (1) sample z_1 ~ N(0, I); (2) base model maps z_1 → z_0; (3) sample y_1 ~ N(0, I); (4) decoder maps (y_1, z_0) → x.
Key innovation: fixing σ₀ ≈ 0.08 links the encoder noise to the diffusion precision and provides interpretable bitrate control.
Results: ImageNet-512 FID 1.4; Kinetics-600 FVD 1.3 (SOTA).
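The two-stage sampling procedure in the figure can also be sketched. A toy Python sketch in which the lambdas stand in for the trained base model and decoder, which in the paper are full diffusion networks:

```python
import numpy as np

def sample(base_model, decoder, latent_shape, image_shape, rng):
    """Two-stage UL sampling: the base model produces z_0, the decoder produces x."""
    z1 = rng.standard_normal(latent_shape)   # 1. sample z_1 ~ N(0, I)
    z0 = base_model(z1)                      # 2. base model denoises z_1 -> z_0
    y1 = rng.standard_normal(image_shape)    # 3. sample y_1 ~ N(0, I)
    x = decoder(y1, z0)                      # 4. decoder denoises y_1 -> x given z_0
    return x

# toy stand-ins for the trained networks (not actual diffusion samplers)
rng = np.random.default_rng(1)
x = sample(lambda z: z * 0.5, lambda y, z: y * 0.1, (8,), (32,), rng)
print(x.shape)
```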
Q1
1. What is the key innovation in Unified Latents that differentiates it from standard VAE approaches?
Using GAN-based discriminators for better mode coverage
Linking the encoder's output noise to the diffusion prior's minimum noise level
Employing discrete tokens instead of continuous latents
Q2
2. According to the paper, what happens when the loss factor is increased in Unified Latents?
Lower reconstruction quality but easier-to-model latents
Better reconstruction quality but higher latent bitrate
Reduced training FLOPs with unchanged generation quality
Q3
3. Why do the authors prefer training a separate base model in stage 2 rather than using the prior directly for generation?
The prior trained with ELBO loss weights all frequencies equally, leading to poor sample quality
Stage 2 training requires less GPU memory than joint training
The decoder cannot handle samples from the prior distribution

Paper 2

CADEvolve: Creating Realistic CAD via Program Evolution

Published: 2026-02-18

Link: http://arxiv.org/pdf/2602.16317

1. 📘 Topic and Domain: The paper focuses on generating complex Computer-Aided Design (CAD) programs using evolutionary methods and vision-language models for 3D parametric modeling.
2. 💡 Previous Research and New Ideas: The paper builds on existing CAD sequence datasets (DeepCAD, Fusion360, CAD-Recode), which are limited to sketch-extrude operations, and proposes CADEvolve, an evolutionary pipeline that uses VLMs to iteratively grow CAD programs from simple primitives to industrial-grade complexity.
3. ❓ Problem: The paper aims to solve the bottleneck of limited public CAD datasets that lack complex operations, multi-operation composition, and design intent, which hinders effective AI model training for CAD automation.
4. 🛠️ Methods: The authors use an evolutionary propose-execute-filter pipeline with GPT-4o-mini that iteratively edits parent programs, validates them through staged checks (execution, geometry, visual-text agreement), and creates a three-tier dataset (generators, programs, canonicalized scripts).
5. 📊 Results and Evaluation: CADEvolve generated ~8k parametric generators and ~1.3M executable scripts; a VLM fine-tuned on this dataset achieved state-of-the-art Image2CAD performance on DeepCAD, Fusion360, and MCB benchmarks with improved CD/IoU metrics compared to the cadrille baseline.
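The propose-execute-filter loop from point 4 can be sketched abstractly. A toy Python sketch: `propose` and `validate` stand in for the VLM proposal step and the staged checks (execution, geometry, visual-text agreement), and the string "programs" are placeholders, not CadQuery code.

```python
import random

def evolve(seed_programs, propose, validate, iterations=3, k_children=4):
    """Propose-execute-filter loop: edits to sampled parents are proposed,
    then staged validation filters the children before they join the pool."""
    pool = list(seed_programs)
    for _ in range(iterations):
        parents = random.sample(pool, min(len(pool), 8))
        children = [propose(p) for p in parents for _ in range(k_children)]
        pool.extend(c for c in children if validate(c))  # keep only survivors
    return pool

# toy stand-ins: proposals append an operation; validation rejects
# overly long programs (mimicking the strict staged checks)
random.seed(0)
pool = evolve(["box()"], lambda p: p + "+op", lambda p: p.count("+") < 5)
print(len(pool))
```

Because every accepted child can later serve as a parent, program complexity compounds across iterations, which is how the pipeline grows simple primitives into complex multi-operation designs.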

CADEvolve: Creating Realistic CAD via Program Evolution

[Figure: CADEvolve pipeline]
Phase 1, evolutionary synthesis (CADEvolve-G): a seed pool of 46 hand-written generators; a VLM (GPT-5-mini) proposes k children; code synthesis with retrieval (param2cq); 3-stage validation (execution, geometry, visual); accept/reject selection; the evolution loop yields 7,945 generators.
Phase 2, sampling and processing (CADEvolve-P): CMA-ES sampling (N = 15 per generator); code augmentation (10× rewrites); Image2CAD with Qwen2-VL-2B; mesh distillation from ABC/ShapeNet; ~1.74M scripts.
Phase 3, canonicalization (CADEvolve-C): unification into a flat sequence; centering at (0, 0, 0); size normalization to 200; binarization to an integer grid; CADRecode mix for sketch diversity.
Final dataset: ~2.7M scripts, ready for SFT + RL.
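The Phase 3 canonicalization steps (centering at the origin, normalizing size to 200, snapping to an integer grid) can be sketched on raw point coordinates. A minimal NumPy sketch under those assumptions, not the authors' pipeline:

```python
import numpy as np

def canonicalize(points: np.ndarray, size: int = 200) -> np.ndarray:
    """Center a point set at (0, 0, 0), scale its bounding box to `size`,
    and snap coordinates to an integer grid (binarization)."""
    centered = points - points.mean(axis=0)     # center at the origin
    extent = np.abs(centered).max()             # half-width of the bounding box
    scaled = centered / extent * (size / 2)     # normalize to size 200
    return np.round(scaled).astype(np.int64)    # integer grid

pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 4.0]])
grid = canonicalize(pts)
print(grid)
```

Normalizing away position and scale in this way supports the motivation probed in Q2: the learner sees construction logic rather than incidental coordinate or scale variation.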
Q1
1. What is the key innovation that CADEvolve introduces to overcome the limitations of existing CAD datasets?
Using evolutionary methods to grow complex CAD programs from simple primitives through VLM-guided iterations
Training larger vision-language models with more parameters to better understand 3D geometry
Manually annotating thousands of industrial CAD designs with multi-operation sequences
Q2
2. Why did the authors apply canonicalization (centering, scaling, and binarization) to their CAD scripts?
To compress the dataset size and reduce storage requirements
To make the learner focus on construction logic rather than incidental syntax or scale variations
To convert CadQuery code into other CAD software formats like Fusion360
Q3
3. What percentage of proposed CAD programs were rejected during the evolutionary synthesis process in later iterations?
Around 15-20% due to minor syntax errors
Approximately 40-50% because of duplicate designs
Up to 85% due to strict validation rules and increasing complexity

Paper 3

Towards a Science of AI Agent Reliability

Published: 2026-02-18

Link: http://arxiv.org/pdf/2602.16666

1. 📘 Topic and Domain: The paper addresses AI agent reliability evaluation, proposing a multi-dimensional framework for measuring how consistently, robustly, predictably, and safely AI agents perform beyond simple accuracy metrics.
2. 💡 Previous Research and New Ideas: Building on safety-critical engineering practices from aviation, nuclear power, and automotive domains, the paper introduces a novel decomposition of agent reliability into four dimensions with 12 concrete metrics, moving beyond traditional single-score accuracy evaluations.
3. ❓ Problem: Current AI agent evaluations rely primarily on mean task success rates, which obscure critical operational flaws like inconsistent behavior across runs, sensitivity to input variations, unpredictable failures, and unbounded error severity.
4. 🛠️ Methods: The authors evaluate 14 agentic models across two benchmarks (GAIA and τ-bench) using multi-run protocols, prompt perturbations, fault injection, environment modifications, and LLM-based safety analysis to compute metrics across consistency, robustness, predictability, and safety dimensions.
5. 📊 Results and Evaluation: Across 18 months of model releases, accuracy has improved steadily while reliability improvements lag significantly behind; consistency and discrimination emerge as the weakest dimensions and the most urgent targets for research.
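The multi-run protocol and aggregation from point 4 can be sketched numerically. A toy Python sketch: `outcome_consistency` is an illustrative pairwise-agreement metric (the paper defines its consistency metrics formally), and the predictability and robustness values below are made up; only the aggregation R = (R_Con + R_Pred + R_Rob) / 3 comes from the paper's figure.

```python
from statistics import mean

def outcome_consistency(outcomes: list) -> float:
    """Fraction of run pairs that agree (1.0 = fully consistent across K runs)."""
    pairs = [(a, b) for i, a in enumerate(outcomes) for b in outcomes[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs)

def reliability(consistency: float, predictability: float, robustness: float) -> float:
    """Aggregate score from the paper's figure: R = (R_Con + R_Pred + R_Rob) / 3."""
    return mean([consistency, predictability, robustness])

runs = [True, True, False, True, True]   # K = 5 runs of the same task
c = outcome_consistency(runs)            # one failure out of five drags this down
print(c, reliability(c, 0.8, 0.7))       # predictability/robustness are toy values
```

Note how a single divergent run among K = 5 already produces a visibly degraded consistency score, which is why mean success rate alone obscures this failure mode.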

Towards a Science of AI Agent Reliability

[Figure: research workflow]
1. Motivation and problem identification: real-world failures; the gap between accuracy and reliability; practices from safety-critical domains; the need for multi-dimensional evaluation.
2. Framework development: consistency (C_out, C_traj, C_res), robustness (R_fault, R_env, R_prompt), predictability (P_cal, P_AUROC, P_Brier), and safety (S_comp, S_harm); 12 concrete metrics with mathematical formulations, normalized and disentangled from capability; aggregation R = (R_Con + R_Pred + R_Rob) / 3.
3. Experimental evaluation: 14 models (OpenAI, Google, Anthropic); 2 benchmarks (GAIA, τ-bench); protocols with K = 5 runs and perturbations; reliability profiling across consistency, robustness, predictability, and safety.
4. Key findings and recommendations: reliability lags capability; dynamic benchmarks are needed; impact on reliability-aware deployment.
Q1
1. What real-world incident involving AI agents did the paper use to illustrate the gap between benchmark performance and reliable operation?
Microsoft's Bing Chat developing multiple personalities and refusing to answer questions about competitors
Replit's AI coding assistant deleting an entire production database despite explicit instructions forbidding such changes
Google's Bard hallucinating historical facts during a public demonstration to investors
Q2
2. According to the paper's findings, which reliability dimension showed the most promising improvement trends in recent AI models?
Consistency - models became more deterministic in their outputs across multiple runs
Robustness - models handled environmental perturbations with increasing stability
Predictability (specifically calibration) - models' confidence estimates became better aligned with actual success rates
Q3
3. What unusual pattern did the authors discover regarding trajectory consistency in AI agents?
Agents achieved high distributional consistency but low sequence consistency, meaning they selected similar actions but in different orders
Smaller models consistently outperformed larger models in maintaining identical action sequences
Trajectory consistency improved linearly with task difficulty, contradicting theoretical predictions