2025-10-20 Papers

Paper 1

Agentic Entropy-Balanced Policy Optimization

Published: 2025-10-16

Link: http://arxiv.org/pdf/2510.14545

1. 📘 Topic and Domain: Agentic Entropy-Balanced Policy Optimization (AEPO) for reinforcement learning in large language models (LLMs), specifically focusing on web agent training and tool use capabilities.
2. 💡 Previous Research and New Ideas: Based on previous agentic RL methods that use entropy signals for tool exploration, but introduces novel entropy balancing in both rollout and policy update phases to address limitations of excessive entropy reliance.
3. ❓ Problem: Addresses two key challenges in entropy-based RL: "High-Entropy Rollout Collapse" where excessive branching occurs along specific paths, and "High-Entropy Token Gradient Clipping" where valuable exploratory behaviors are lost during training.
4. 🛠️ Methods: Implements two core components: (1) Dynamic entropy-balanced rollout mechanism that adaptively allocates sampling budgets and penalizes consecutive high-entropy steps, and (2) Entropy-balanced policy optimization that preserves high-entropy token gradients through stop-gradient operations.
5. 📊 Results and Evaluation: Outperforms 7 mainstream RL algorithms across 14 datasets; with Qwen3-14B it achieves 47.6% on GAIA, 11.2% on HLE, and 43.0% on WebWalkerQA at Pass@1, and 65.0%, 26.0%, and 70.0% respectively at Pass@5.
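The adaptive budget-allocation rule from the rollout mechanism, m = k · σ(β(H_root − H_avg_tool)), can be sketched as follows. This is a minimal illustration; the function name, the β default, and the rounding/clamping behavior are assumptions, not the paper's exact implementation:

```python
import math

def allocate_rollout_budget(h_root, h_tool_avg, k, beta=1.0):
    """Split a total sampling budget k between global sampling (m) and
    branch sampling (k - m) via m = k * sigmoid(beta * (H_root - H_avg_tool)).

    h_root: entropy of the initial (root) sampling step.
    h_tool_avg: average entropy across tool-call steps.
    beta: temperature controlling how sharply the split reacts (assumed).
    """
    sigma = 1.0 / (1.0 + math.exp(-beta * (h_root - h_tool_avg)))
    m = round(k * sigma)
    # Keep at least one sample on each side so neither phase starves (assumed).
    m = max(1, min(k - 1, m))
    return m, k - m

# When root entropy exceeds tool entropy, more budget goes to global sampling.
m, branch = allocate_rollout_budget(h_root=2.0, h_tool_avg=0.5, k=8)
```

When the two entropies are equal the sigmoid yields 0.5 and the budget splits evenly, which matches the intuition that neither global nor branch sampling should dominate.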

Agentic Entropy-Balanced Policy Optimization (AEPO) Workflow

Two entropy-driven challenges in agentic RL:
1. High-entropy rollout collapse
2. High-entropy token gradient clipping

Component 1: Dynamic entropy-balanced rollout
- Entropy pre-monitoring: compute the root entropy H_root and the tool-call entropy H_tool.
- Adaptive budget allocation: split the rollout budget into m global samples and k − m branch samples, with m = k · σ(β(H_root − H_avg_tool)) and information gain I_Gain = m · H_root + (k − m) · H_tool.
- Entropy-balanced beaming: branch with probability P_t, with a consecutive-branch penalty to prevent over-branching along a single path.
- Tree-structured rollout: global sampling (m) plus branch sampling (k − m).

Component 2: Entropy-balanced policy optimization
- Stop-gradient operation preserves high-entropy token gradients under clipping:
  F(θ) = 1 + ε_h if δ > 1 + ε_h and A > 0; 0 if δ < 1 − ε_l and A < 0; δ otherwise.
- Entropy-aware advantage estimation prioritizes high-uncertainty tokens:
  Ã(t) = Ã_Acc(t) × (1 + a · Ã_ΔH(t)), integrating accuracy-based and entropy-based advantages.

Experimental results
- GAIA: 47.6% (Pass@1), 65.0% (Pass@5)
- HLE: 11.2% (Pass@1), 26.0% (Pass@5)
- WebWalkerQA: 43.0% (Pass@1), 70.0% (Pass@5)
- Consistent outperformance across 14 datasets with only 1K RL samples.

Key benefits
- Improved rollout sampling diversity while maintaining stable policy entropy.
- Balanced exploration across the tree-structured rollout via adaptive budget allocation.
- Enhanced tool-call efficiency and reduced financial cost in web agent training.
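The two update-side formulas in the workflow, the clipping factor F(θ) and the entropy-aware advantage Ã(t), can be sketched directly. The ε_h, ε_l, and α defaults below are illustrative hyperparameters, not values taken from the paper:

```python
def entropy_aware_advantage(a_acc, a_dh, alpha=0.1):
    """Entropy-aware advantage estimation: Ã(t) = Ã_Acc(t) * (1 + alpha * Ã_ΔH(t)).
    Rescales the accuracy-based advantage by an entropy-change term so
    high-uncertainty tokens receive more weight (alpha is assumed)."""
    return a_acc * (1.0 + alpha * a_dh)

def clip_factor(delta, adv, eps_h=0.28, eps_l=0.2):
    """Piecewise clipping factor F from the workflow: clip the probability
    ratio delta asymmetrically, and zero the update only when delta is far
    below 1 with a negative advantage."""
    if delta > 1.0 + eps_h and adv > 0:
        return 1.0 + eps_h
    if delta < 1.0 - eps_l and adv < 0:
        return 0.0
    return delta
```

Note the asymmetry: large ratios with positive advantage are capped rather than discarded, while the zero branch suppresses destabilizing updates, which is how high-entropy token gradients survive that standard clipping would drop.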
Q1. What is the main innovation of AEPO compared to previous agentic RL methods?
- It completely eliminates the use of entropy signals in training
- It balances entropy in both rollout and policy update phases
- It only focuses on policy update optimization

Q2. Which of the following best describes the "High-Entropy Rollout Collapse" problem that AEPO addresses?
- The complete failure of all rollout attempts during training
- Excessive branching occurring along specific paths while neglecting other potential paths
- The inability to generate any high-entropy states during rollout

Q3. What performance did AEPO achieve with Qwen3-14B on the GAIA benchmark at Pass@5?
- 47.6%
- 65.0%
- 70.0%

Paper 2

WithAnyone: Towards Controllable and ID Consistent Image Generation

Published: 2025-10-16

Link: http://arxiv.org/pdf/2510.14975

1. 📘 Topic and Domain: Identity-consistent image generation with a focus on controllable multi-person portrait synthesis in computer vision and AI image generation.
2. 💡 Previous Research and New Ideas: Based on previous identity-consistent generation and customization models, it proposes a novel contrastive training approach using paired reference images rather than just reconstruction.
3. ❓ Problem: The paper aims to solve the "copy-paste" artifact problem where models directly replicate reference faces instead of preserving identity across natural variations in pose, expression and lighting.
4. 🛠️ Methods: Introduces WithAnyone model built on FLUX architecture using: (1) MultiID-2M dataset with paired references, (2) GT-aligned ID loss, (3) ID contrastive loss with extended negatives, and (4) 4-phase training pipeline.
5. 📊 Results and Evaluation: WithAnyone achieves state-of-the-art identity similarity while significantly reducing copy-paste artifacts, evaluated through both quantitative metrics and user studies showing improved controllability and visual quality.
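The training objective L = L_diff + λ_ID · L_ID + λ_CL · L_CL combines a diffusion term with two identity terms. The ID contrastive piece over the extended negative pool can be sketched InfoNCE-style; the cosine/softmax form, temperature, and λ weights below are assumptions, not the paper's exact formulation:

```python
import math

def _cos(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def id_contrastive_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss over face embeddings: pull the anchor
    toward a paired reference of the same identity, push it away from an
    extended pool of other-identity negatives. tau is assumed."""
    logits = [_cos(anchor, positive) / tau] + [_cos(anchor, n) / tau for n in negatives]
    m = max(logits)                              # numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))        # positive sits at index 0

def total_loss(l_diff, l_id, l_cl, lam_id=0.1, lam_cl=0.05):
    """L = L_diff + lam_id * L_ID + lam_cl * L_CL (weights are illustrative)."""
    return l_diff + lam_id * l_id + lam_cl * l_cl
```

Using paired references (a different photo of the same identity as the positive) rather than the reconstruction target itself is what pushes the model away from copy-paste behavior.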

WithAnyone: Controllable and ID Consistent Image Generation Workflow

Data construction (MultiID-2M dataset)
1. Single-ID collection and clustering
2. Multi-ID collection (groups of 2/3/4 people)
3. ID image pairing via embedding retrieval
4. Post-processing: filtering and labelling

Training pipeline
- Phase 1: Reconstruction pretraining with a fixed prompt
- Phase 2: Reconstruction pretraining with full captions
- Phase 3: Paired tuning
- Phase 4: Quality tuning

Training objectives
- Total loss: L = L_diff + λ_ID · L_ID + λ_CL · L_CL
- Diffusion loss; GT-aligned ID loss with ground-truth landmark alignment; ID contrastive loss with an extended negative pool.

Model architecture
- FLUX DiT backbone
- Face embedding (ArcFace) and image embedding (SigLIP)
- Cross-attention with location control; attention mask over face regions

MultiID-Bench metrics
- Sim(GT): ground-truth similarity
- Copy-paste metric
- Identity blending
- Generation quality

Key innovation: breaking the copy-paste vs. identity-fidelity trade-off
- Paired training data: reference and target are different images of the same identity
- GT-aligned loss: accurate identity measurement
- Extended negatives: richer contrastive learning
- Copy-paste metric: quantifies artifacts vs. natural variation

Results and achievements
- Higher GT similarity (better identity match), lower copy-paste (natural variation), better controllability of pose and expression; state-of-the-art performance.

Dataset and benchmark statistics
- MultiID-2M: 500K paired multi-ID images, 1.5M additional unpaired images, 3K identities with 400+ references each
- MultiID-Bench: 435 test cases (1–4 people), long-tail identities, comprehensive metrics
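MultiID-Bench's copy-paste metric quantifies replication vs. natural variation. One simple way to capture the intuition (this formula is an illustrative assumption, not the benchmark's actual definition) is to measure how much more the generated face resembles the pasted reference than the ground-truth target:

```python
def copy_paste_indicator(sim_to_ref, sim_to_gt):
    """If the generated face is more similar to the raw reference image than
    to the ground-truth target of the same identity, the excess similarity
    suggests copy-paste replication rather than identity-preserving
    variation. Clamped at zero so natural variation scores 0."""
    return max(0.0, sim_to_ref - sim_to_gt)

# A model that tracks the reference more than the target gets a positive score.
score = copy_paste_indicator(sim_to_ref=0.9, sim_to_gt=0.7)
```

Under this reading, a low score together with high Sim(GT) is exactly the regime the paper targets: strong identity fidelity without verbatim copying.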
Q1. What is the main problem that WithAnyone aims to solve in identity-consistent image generation?
- Low resolution output images
- The 'copy-paste' artifact where models directly replicate reference faces
- Slow generation speed and high computational costs

Q2. How many phases are there in WithAnyone's training pipeline?
- 2 phases: reconstruction and fine-tuning
- 3 phases: pretraining, paired tuning, and quality tuning
- 4 phases: reconstruction pretraining with a fixed prompt, reconstruction pretraining with captions, paired tuning, and quality tuning

Q3. What is unique about the MultiID-2M dataset introduced in this paper?
- It only contains single-person portrait images
- It focuses exclusively on synthetic faces
- It provides multiple paired reference images for each identity in group photos

Paper 3

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Published: 2025-10-16

Link: http://arxiv.org/pdf/2510.14979

1. 📘 Topic and Domain: Development of native vision-language models (VLMs) that integrate vision and language processing in a unified architecture, in the domain of multimodal AI.
2. 💡 Previous Research and New Ideas: Based on modular VLMs that combine separate visual encoders and language models; proposes a novel unified architecture called NEO with native primitives that process vision and language jointly from the start.
3. ❓ Problem: Addresses limitations of modular VLMs including complex multi-stage training, rigid visual biases, and inefficient vision-language alignment by developing a more integrated approach.
4. 🛠️ Methods: Implements a unified architecture with native primitives including flexible position encoding, multi-head native attention, and native rotary position embeddings, trained end-to-end on 390M image-text pairs.
5. 📊 Results and Evaluation: NEO achieves competitive performance compared to top modular VLMs across diverse benchmarks despite using less training data, particularly strong in visual-centric tasks while showing some limitations in knowledge-intensive and OCR tasks.
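The modality-specific rotary embedding mentioned above can be sketched as (T, H, W) position ids: image patches indexed spatially at a fixed time step, text tokens advancing along the time axis. The exact coordinate layout in NEO is an assumption here, not taken from the paper:

```python
def native_positions(patches_h, patches_w, num_text_tokens):
    """Assign each token a (T, H, W) coordinate for a Native-RoPE-style
    embedding: image patches share T = 0 and vary over (H, W); text tokens
    follow with T = 1, 2, ... and zeroed spatial axes. A rotary embedding
    can then be applied independently per axis."""
    pos = [(0, h, w) for h in range(patches_h) for w in range(patches_w)]
    pos += [(t, 0, 0) for t in range(1, num_text_tokens + 1)]
    return pos

# A 2x2 patch grid followed by 3 text tokens yields 7 positions.
positions = native_positions(2, 2, 3)
```

Separating the temporal axis from the spatial axes is what lets a single sequence carry both modalities without forcing patches into a 1-D text-style ordering.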

NEO: Native Vision-Language Model Workflow

Inputs and embeddings
- Image input: 32×32 patches → patch embedding layer (PEL)
- Text input: tokenized → word embedding layer (WEL)

Native VLM primitives
- Multi-head native attention with Native-RoPE over (T, H, W) axes
- Pre-buffer: L1 primitive layers for pixel-word alignment and visual learning
- Post-LLM: L2 primitive layers for reasoning, generation, and LLM capabilities

Three-stage training
- Stage 1 pre-training: 345M image-text pairs (web-scale)
- Stage 2 mid-training: 40M multi-task samples (synthetic)
- Stage 3 SFT: 4M instructions (high-quality)

Key features: unified architecture, end-to-end training, mixed attention, modality-specific RoPE, scalable design.

Model variants
- NEO-2.2B: Qwen3-1.7B + 12 pre-buffer layers
- NEO-9B: Qwen3-8B + 6 pre-buffer layers

Performance: competitive with modular VLMs, superior to prior native VLMs, efficient training, reusable components, cost-effective.

Applications: visual QA, OCR and document understanding, chart and diagram analysis, multimodal reasoning, vision-language generation.

Output: unified text generation with visual understanding.
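"Mixed attention" in a unified vision-language sequence is commonly implemented as bidirectional attention within the image patches and causal attention over text, with text tokens also attending to all patches. Whether NEO uses exactly this scheme is an assumption; the sketch shows the common pattern:

```python
def mixed_attention_mask(n_img, n_txt):
    """Build an (n, n) boolean attention mask for a sequence of n_img image
    patches followed by n_txt text tokens. mask[i][j] is True if token i may
    attend to token j: image patches attend bidirectionally within the image
    block; text tokens attend causally to earlier text and to all patches."""
    n = n_img + n_txt
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_img:
                mask[i][j] = j < n_img            # image: full within image block
            else:
                mask[i][j] = j < n_img or j <= i  # text: all patches + causal text
    return mask
```

This keeps autoregressive text generation intact while letting every patch see the whole image, which is the usual motivation for mixing the two attention patterns.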
Q1. What is the main innovation of NEO compared to traditional modular VLMs?
- It uses much more training data than other models
- It integrates vision and language processing in a unified architecture from the start
- It completely eliminates the need for visual encoding

Q2. How many image-text pairs were used to train NEO?
- 390 million
- 3.9 billion
- 39 million

Q3. What is a limitation of NEO identified in the paper?
- It performs poorly on basic visual recognition tasks
- It requires more computational resources than modular VLMs
- It shows weaker performance on knowledge-intensive and OCR tasks