2026-03-19 Papers


Paper 1

MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

Published: 2026-03-16

Link: http://arxiv.org/pdf/2603.15726

1. 📘 Topic and Domain: The paper presents MiroThinker-1.7 and MiroThinker-H1, research agents designed for complex long-horizon reasoning tasks in the domain of agentic AI systems.
2. 💡 Previous Research and New Ideas: The paper builds on the ReAct paradigm and agentic LLMs (e.g., GPT-5.4, Claude-4.6), proposing agentic mid-training for atomic capabilities and verification-centric reasoning with local and global verifiers for reliable multi-step problem solving.
3. ❓ Problem: The paper targets the observation that simply scaling the interaction length of agent trajectories accumulates noise and errors rather than improving reasoning quality on complex real-world tasks.
4. 🛠️ Methods: The authors use a four-stage training pipeline (mid-training, supervised fine-tuning, preference optimization, reinforcement learning) with dual-loop agent architecture, sliding-window context management, and verification mechanisms at both local and global levels.
5. 📊 Results and Evaluation: MiroThinker-H1 achieves state-of-the-art performance, scoring 88.2 on BrowseComp and 88.5 on GAIA, outperforming commercial agents while requiring 43% fewer interaction rounds than previous versions; evaluation uses LLM-as-Judge across multiple benchmarks.
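The sliding-window context management described in the methods (keep only the most recent K=5 tool observations and truncate long results) could look roughly like the sketch below; the class name and character budget are illustrative, not from the paper.

```python
from collections import deque

class SlidingWindowContext:
    """Keep only the most recent K tool observations in the agent's context,
    truncating each observation to a fixed character budget (illustrative)."""

    def __init__(self, k=5, max_chars=2000):
        self.max_chars = max_chars
        self.window = deque(maxlen=k)  # older observations fall off automatically

    def add(self, observation: str) -> None:
        # Truncate oversized tool outputs before adding them to the window.
        if len(observation) > self.max_chars:
            observation = observation[: self.max_chars] + "[truncated]"
        self.window.append(observation)

    def render(self) -> str:
        # Concatenate the surviving observations for the next model call.
        return "\n".join(self.window)
```

The point of the deque with `maxlen` is that eviction is automatic: the agent never carries more than K observations forward, which is one way to keep long trajectories from drowning in accumulated tool output.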

Figure: MiroThinker-1.7 & H1 method workflow

- High-quality QA construction: a corpus-based pipeline (subgraph sampling, broad coverage) and a WebHop pipeline (web expansion, difficulty control)
- 4-stage training pipeline:
  1. Agentic mid-training: planning boosting, reasoning sculpting, summarization
  2. Supervised fine-tuning: expert trajectories, multi-turn interaction, tool execution
  3. Preference optimization: DPO training, correctness ranking, quality filtering
  4. Reinforcement learning: GRPO optimization, entropy control, creative exploration
- Model variants: MiroThinker-1.7-mini (3B activated params, preference distillation, efficient performance); MiroThinker-1.7 (full-scale model, strong atomic abilities, effective interaction); MiroThinker-H1 (heavy-duty reasoning with local and global verification)
- Agentic workflow: ReAct interaction loop (Thought → Action → Observation); context management via sliding window (K=5) with result truncation
- Tool interface: information retrieval (google_search, scrape_extract); code execution (run_python_code, run_command); file transfer
- H1 verification: Local Verifier (step-level audit, error correction); Global Verifier (trajectory audit, solution selection)
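The H1 verification scheme (a Local Verifier auditing individual steps, a Global Verifier auditing complete trajectories and selecting a solution) can be sketched as a generic control loop. All callables below are hypothetical stand-ins; the paper's actual LLM-based verifier prompts and interfaces are not reproduced here.

```python
def run_with_verification(solve_step, local_verify, global_verify,
                          candidates=3, max_steps=10):
    """Sketch of verification-centric reasoning: a local verifier audits
    each step before it is committed to the trajectory, and a global
    verifier scores complete trajectories to select the final solution."""
    trajectories = []
    for _ in range(candidates):
        traj = []
        for _ in range(max_steps):
            action = solve_step(traj)
            if not local_verify(traj, action):
                continue  # step-level audit failed: discard this step and retry
            traj.append(action)
            if action.get("final"):
                break
        trajectories.append(traj)
    # Trajectory-level audit and solution selection.
    return max(trajectories, key=global_verify)
```

The two verification levels play different roles: the local check prevents bad steps from poisoning the rest of the rollout, while the global check arbitrates between whole candidate solutions.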
Q1
1. What is the key insight behind MiroThinker's approach to improving long-horizon reasoning?
Increasing the context window size to 512K tokens for better memory retention
Scaling effective interaction quality rather than simply increasing interaction length
Using multiple parallel agents to explore different solution paths simultaneously
Q2
2. How does MiroThinker-H1's verification-centric reasoning mode work?
It uses blockchain technology to verify each reasoning step cryptographically
It employs human experts to validate intermediate results during inference
It integrates Local and Global Verifiers to audit step-level and complete reasoning processes
Q3
3. What surprising result did MiroThinker-1.7-mini achieve compared to MiroThinker-1.5?
16.7% better performance with 43% fewer interaction rounds on average
50% reduction in training time with identical performance metrics
Triple the context length capacity while maintaining the same parameter count

Paper 2

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.17187

1. 📘 Topic and Domain: The paper presents MetaClaw, a continual meta-learning framework for deployed LLM agents that enables them to evolve and adapt in real-world usage through skill synthesis and policy optimization.
2. 💡 Previous Research and New Ideas: The paper builds on memory-based methods (Reflexion), skill-based approaches (Voyager, ExpeL), and RL-based LLM training (RLHF, GRPO), proposing a novel dual-mechanism approach that combines gradient-free skill evolution with opportunistic gradient-based policy optimization while maintaining strict support-query data separation.
3. ❓ Problem: The paper addresses the fundamental tension that deployed LLM agents remain static after training while user needs and task distributions evolve continuously; performance degrades, yet conventional retraining would require interrupting the service.
4. 🛠️ Methods: MetaClaw employs two complementary mechanisms: skill-driven fast adaptation that analyzes failures to synthesize reusable behavioral instructions with zero downtime, and opportunistic policy optimization that performs RL-based LoRA fine-tuning during user-inactive windows detected by monitoring sleep schedules, system inactivity, and calendar events.
5. 📊 Results and Evaluation: On MetaClaw-Bench (934 questions), skill adaptation improved accuracy by up to 32% in relative terms; the full pipeline advanced Kimi-K2.5 from 21.4% to 40.6% accuracy with an 8.25× gain in task completion; and on AutoResearchClaw, skill injection alone improved composite robustness by 18.3%.
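The gradient-free skill update (S_{g+1} = S_g ∪ ΔS) with version-based support-query separation can be sketched as below; the class and method names are illustrative, not the paper's API.

```python
class SkillStore:
    """Illustrative sketch of versioned skill evolution: each skill update
    bumps a generation counter g, and trajectories collected under an older
    generation are flushed so they never serve as RL query data."""

    def __init__(self):
        self.generation = 0
        self.skills = []        # S_g: reusable behavioral instructions
        self.trajectories = []  # (generation, trajectory) pairs

    def record(self, trajectory):
        # Every trajectory is tagged with the skill generation it ran under.
        self.trajectories.append((self.generation, trajectory))

    def evolve(self, new_skills):
        # S_{g+1} = S_g ∪ ΔS: skills are injected via the prompt, zero downtime.
        self.skills.extend(new_skills)
        self.generation += 1
        # Support-query separation: flush samples with version <= g, so stale
        # pre-adaptation failures cannot contaminate later gradient updates.
        self.trajectories = [(g, t) for g, t in self.trajectories
                             if g >= self.generation]
```

The flush on `evolve` is what enforces the strict separation: only trajectories produced *after* a skill update survive as query data for policy optimization.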

Figure: MetaClaw continual meta-learning framework

- A task stream τ₁, τ₂, τ₃, ... feeds the meta-model M(θ, S); the agent executes π_θ(·|τ, Retrieve(S, τ)), producing a trajectory a₁, a₂, ..., aₙ
- Skill-driven fast adaptation: failure analysis over support data D_sup; the skill evolver E runs LLM analysis and updates the skill set, S_{g+1} = S_g ∪ ΔS, with zero downtime
- Opportunistic policy optimization: post-adaptation query data D_qry; the OMLS scheduler detects idle periods and triggers cloud RL training that updates θ via LoRA
- Support-query separation: skill-generation versioning flushes samples with version ≤ g
- Three idle signals: sleep, inactivity, calendar
- Key features: dual-timescale adaptation, fast (skills) and slow (weights); zero-downtime skill injection via prompt modification; opportunistic training during user-idle periods; strict support-query data separation via versioning
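The OMLS idle check combines the three signals named in the paper (sleep schedule, system inactivity, calendar events). A minimal sketch, where the thresholds and the function signature are assumed defaults rather than values from the paper:

```python
from datetime import time

def is_idle(now, last_input_minutes, calendar_busy,
            sleep_start=time(23, 0), sleep_end=time(7, 0),
            inactivity_threshold=30):
    """Hypothetical OMLS-style idle check: train only when the user is
    likely asleep or has been inactive, and never during a calendar event.
    `now` is a datetime.time; `last_input_minutes` is minutes since the
    last keyboard/mouse input."""
    asleep = now >= sleep_start or now < sleep_end  # window wraps midnight
    inactive = last_input_minutes >= inactivity_threshold
    return (asleep or inactive) and not calendar_busy
```

In a real deployment these inputs would come from OS idle-time APIs and a calendar integration; the point of the sketch is the policy shape: either idleness signal can open a training window, but a scheduled event always closes it.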
Q1
1. What unique scheduling mechanism does MetaClaw use to avoid disrupting user experience during policy optimization?
A distributed computing cluster that runs training in parallel without affecting the main agent
An Opportunistic Meta-Learning Scheduler (OMLS) that monitors sleep hours, keyboard inactivity, and Google Calendar events
A predictive algorithm that forecasts when users will need the agent and schedules training accordingly
Q2
2. Why does MetaClaw maintain strict separation between 'support data' and 'query data' through its skill generation versioning mechanism?
To prevent stale rewards from pre-adaptation failures contaminating gradient updates for policy optimization
To comply with data privacy regulations by keeping user interactions separate from training data
To reduce memory consumption by archiving old trajectories that are no longer needed
Q3
3. When MetaClaw was applied to AutoResearchClaw's 23-stage research pipeline, what was the primary improvement mechanism and its impact?
Full RL training reduced pipeline execution time by 23% through optimized stage transitions
Memory augmentation allowed the system to cache previous research papers for faster retrieval
Skill injection alone improved composite robustness by 18.3% without any gradient updates

Paper 3

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Published: 2026-03-17

Link: http://arxiv.org/pdf/2603.16859

1. 📘 Topic and Domain: The paper introduces SocialOmni, a benchmark for evaluating audio-visual social interactivity in omni-modal large language models (OLMs) during multi-party conversations.
2. 💡 Previous Research and New Ideas: The paper builds on existing OLM benchmarks that focus on static accuracy-centric tasks, proposing a new evaluation framework that operationalizes social interactivity across three dimensions: who (speaker identification), when (interruption timing), and how (natural response generation).
3. ❓ Problem: The paper addresses the gap in evaluating OLMs' conversational social competence, as current benchmarks fail to assess models' ability to navigate dynamic dialogue cues, determine appropriate turn-taking timing, and generate socially coherent responses in real-time multi-party settings.
4. 🛠️ Methods: The authors created a benchmark with 2,000 perception samples and 209 interaction-generation instances across 15 dialogue domains, using multiple-choice questions for speaker identification and LLM-as-judge protocols for evaluating turn-timing decisions and response quality.
5. 📊 Results and Evaluation: Testing 12 OLMs revealed that no single model dominates all three axes, with a pronounced decoupling between perceptual accuracy and generation quality: models that excel at speaker identification often fail at natural interruption generation, confirming that understanding-centric metrics alone cannot characterize conversational competence.
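The "when" axis buckets a model's interruption time into Early/On-time/Late categories relative to a reference turn boundary. A minimal sketch of such a categorization; the tolerance window is an assumed value, not taken from the paper:

```python
def timing_category(predicted_t, reference_t, tolerance=0.5):
    """Classify a turn-taking decision as Early (E), On-time (O), or
    Late (L) relative to the reference turn boundary, in seconds.
    The tolerance is an illustrative default."""
    delta = predicted_t - reference_t
    if delta < -tolerance:
        return "E"  # interrupted before the speaker finished
    if delta > tolerance:
        return "L"  # missed the conversational window
    return "O"      # within the acceptable window
```

This three-way bucketing directly mirrors the two dominant failure modes the benchmark reports: aggressive models pile up in E, conservative ones in L.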

Figure: SocialOmni benchmark workflow

- Data collection: 3,000+ raw videos spanning 15 dialogue subcategories and 4 domains, all under CC-BY licenses
- Preprocessing: extract 10-30 s clips, filter for audio clarity and face visibility, run ASR transcription
- Task I (Who, perception): speaker identification at timestamp t; 4-way multiple choice; 2,000 perception samples; consistent/inconsistent splits testing audio-visual alignment
- Task II (When & How, generation): binary turn-taking decision plus a context-appropriate utterance; 209 generation instances; multi-reference continuations; temporal-constraint evaluation
- Evaluation metrics: Who (Top-1 accuracy, Macro-F1), When (timing categories E/O/L), How (LLM-as-judge scoring), plus consistency-gap analysis and response-coverage metrics
- Model testing: 12 OLMs, including GPT-4o, the Gemini series, the Qwen series, OmniVinci, VITA, Baichuan-Omni, and MiniOmni2
- Key findings: no single model dominates all axes; perception-generation decoupling; open-source models lag commercial systems; cross-modal temporal incoherence
- Common failure patterns: in perception, cross-modal temporal incoherence and correct transcription with the wrong speaker; in generation, premature interruption and contextually incoherent continuation
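The Macro-F1 metric listed for the "who" task averages per-speaker F1 scores, so rarely-speaking participants count as much as dominant ones. A stdlib-only sketch of the standard definition:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over speaker labels: compute F1 per class
    independently, then take the unweighted mean across classes."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Reporting Macro-F1 alongside Top-1 accuracy guards against a model that scores well simply by always predicting the most frequent speaker in a clip.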
Q1
1. What surprising pattern did SocialOmni reveal about the relationship between a model's perceptual accuracy and its conversational abilities?
Models with high speaker identification accuracy consistently generate the most natural interruptions
There is a pronounced decoupling - models excelling at speaker identification often fail at generating appropriate interruptions
Perceptual accuracy perfectly predicts a model's ability to time conversational turns correctly
Q2
2. Which model achieved the best performance on each of SocialOmni's three evaluation axes?
GPT-4o dominated all three axes: who, when, and how
Gemini 3 Pro Preview excelled at all tasks due to its unified architecture
Different models led each axis: Qwen3-Omni (who), Gemini 3 Pro Preview (when), and Gemini 2.5 Flash (how)
Q3
3. What are the two dominant failure patterns SocialOmni identified in the 'when' (turn-taking timing) task?
Aggressive models frequently interrupt too early, while conservative models miss conversational windows entirely
All models consistently respond exactly on time but with inappropriate content
Models only fail when audio and visual cues are perfectly synchronized