2025-08-01 Papers

Paper 1

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Published: 2025-07-31

Link: http://arxiv.org/pdf/2507.23726

1. 📘 Topic and Domain: Automated theorem proving using large language models, focusing on solving complex mathematical problems including International Mathematical Olympiad (IMO) challenges.
2. 💡 Previous Research and New Ideas: Based on previous LLM theorem provers and AlphaGeometry, introduces a new "lemma-style" proving approach and integrates formal verification through the Lean proof assistant, moving beyond natural-language proofs.
3. ❓ Problem: Addresses the challenge of automated mathematical reasoning and theorem proving, particularly for complex IMO-level problems, which current LLMs struggle with due to lack of clear supervision signals in natural language proofs.
4. 🛠️ Methods: Implements two systems: Seed-Prover (lemma-style whole-proof reasoning with three-tiered test-time inference strategies) and Seed-Geometry (a specialized geometry engine), both leveraging reinforcement learning and formal verification through Lean.
5. 📊 Results and Evaluation: Proved 5 of 6 IMO 2025 problems, 78.1% of past IMO problems, and 99.6% of the MiniF2F test set, outperforming previous state-of-the-art systems across multiple benchmarks.
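The "lemma-style" approach in point 4 can be illustrated with a toy Lean 4 snippet (the statements below are illustrative, not from the paper): an intermediate lemma is stated and verified first, then reused in the main theorem, which is what gives the compiler-checked supervision signal that natural-language proofs lack.

```lean
-- Illustrative lemma-style decomposition (toy example, not from the paper):
-- prove a small intermediate lemma first, then reuse it in the main result.
theorem add_self_eq_two_mul (n : Nat) : n + n = 2 * n := by
  omega

theorem double_le_two_mul_succ (n : Nat) : n + n ≤ 2 * (n + 1) := by
  rw [add_self_eq_two_mul]  -- rewrite using the verified lemma
  omega
```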

Figure: Seed-Prover and Seed-Geometry workflow (overview)
- Seed-Geometry: an extended ruler-compass DSL, a C++ reasoning engine (100x faster), a Seed LLM policy model, distributed beam search, and 230M generated geometry problems.
- Seed-Prover core: lemma-style proving (generate lemmas before the main theorem), conjecture proposing, iterative refinement, and multi-stage RL training (VAPO) with diverse prompting and Lean feedback.
- Three-tier test-time scaling: Light (8-16 refinements, pass@64-256, 1-2 hours); Medium (outer + inner refinement for complex proofs); Heavy (5,000 conjectures feeding a lemma pool, days of thinking).
- Supporting loop: Lean compiler feedback and verification, a shared lemma pool, and self-summarization; the iterative proof process runs problem → generate Lean → check → feedback → refine → success.
- Performance results: IMO 2025 5/6 problems solved; past IMO 78.1% (121/155); MiniF2F 99.6% (saturated); PutnamBench 331/657 (50%+); IMO-AG-50 43/50 (vs. AG2: 42); CombiBench 30% (vs. 10%); MiniCTX-v2 81.8% (vs. 44.3%).
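The iterative proof process in the figure (problem → generate Lean → check → feedback → refine) can be sketched as a simple loop. Here `generate_proof` and `compile_lean` are hypothetical stand-ins for the LLM call and the Lean checker, and the default budget mirrors the light tier's 8-16 refinements:

```python
def refine_proof(problem, generate_proof, compile_lean, max_refinements=16):
    """Iteratively refine a candidate Lean proof using compiler feedback.

    generate_proof(problem, feedback) -> candidate proof string
    compile_lean(candidate) -> (ok: bool, errors: str)
    Both callables are hypothetical stand-ins for an LLM and a Lean checker.
    """
    feedback = None
    for _ in range(max_refinements):
        candidate = generate_proof(problem, feedback)
        ok, errors = compile_lean(candidate)
        if ok:
            return candidate  # verified proof found
        feedback = errors     # feed compiler errors back for the next attempt
    return None               # refinement budget exhausted
```

The medium and heavy tiers would wrap this inner loop with outer refinement over proof sketches and a pool of proved lemmas.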
Q1
1. What is the key innovation in Seed-Prover's approach to theorem proving compared to previous methods?
Using natural language processing exclusively
Implementing lemma-style whole-proof reasoning with iterative refinement
Focusing only on geometry problems
Q2
2. In the IMO 2025 competition, how did Seed-Prover and Seed-Geometry collectively perform?
Proved 3 out of 6 problems during competition
Proved 5 out of 6 problems during competition
Proved 4 out of 6 during competition, reaching 5/6 post-competition
Q3
3. What unique feature does Seed-Geometry implement to handle geometry problems more efficiently?
It uses only natural language processing
It relies solely on manual proof verification
It implements a C++ backend that's 100x faster than previous Python implementations

Paper 2

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Published: 2025-07-31

Link: http://arxiv.org/pdf/2507.23682

1. 📘 Topic and Domain: Vision-Language-Action (VLA) models for robotic manipulation, specifically focusing on enhancing latent action modeling to improve robot control policies.
2. 💡 Previous Research and New Ideas: Based on previous VLA models and latent action learning approaches (LAPA, GR00T, IGOR), introduces a novel framework called villa-X that improves both latent action learning and its integration into VLA pre-training through proprioceptive supervision and joint diffusion modeling.
3. ❓ Problem: The challenge of developing generalizable robot manipulation policies that can effectively learn from both robot data and human videos while bridging the gap between high-level visual-language instructions and low-level robot controls.
4. 🛠️ Methods: Implements a two-component system: (1) a Latent Action Model with proprioceptive supervision for better action representation, and (2) an Actor module that jointly models latent and robot actions through diffusion processes, using a pre-trained vision-language model.
5. 📊 Results and Evaluation: Achieves superior performance compared to existing baselines on both simulated environments (SIMPLER and LIBERO benchmarks) and real-world robotic tasks, demonstrating improved success rates across various manipulation tasks and better generalization capabilities.
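The two-component design in point 4 (latent actions as a mid-level plan, robot actions generated conditioned on them) can be sketched as a two-stage pipeline. Both functions below are illustrative stand-ins: the paper uses a VLM-based planner and a diffusion head, not these toy formulas.

```python
import math

def plan_latent_actions(instruction_embedding, horizon=4, dim=8):
    """Hypothetical mid-level planner: map an instruction embedding to a
    short sequence of abstract latent-action vectors (a stand-in for the
    paper's latent-action head)."""
    scale = sum(instruction_embedding)
    return [[math.sin(t + i * scale) for i in range(dim)]
            for t in range(horizon)]

def decode_robot_actions(latent_actions, proprio_state):
    """Hypothetical low-level decoder: condition each robot action on the
    current latent action AND proprioceptive state, mirroring the idea that
    robot actions are generated conditioned on latent actions (a stand-in
    for the paper's joint-diffusion robot-action head)."""
    return [[math.tanh(sum(z) + s) for s in proprio_state]  # one value per joint
            for z in latent_actions]
```

The hierarchical split is the point of the sketch: the planner can be pre-trained on action-free human video, while only the decoder needs embodiment-specific robot data.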

Figure: villa-X vision-language-latent-action framework (overview)
- Stage 1, LAM training: video frames and robot data feed an inverse dynamics model (IDM) with vector quantization (VQ), a forward dynamics model (FDM), and a proprioceptive FDM.
- Stage 2, ACT training: a VLM backbone drives ACT-latent and ACT-robot heads with joint diffusion and cross-attention.
- Stage 3, fine-tuning: target data plus an embodiment context yields the robot policy.
- Key innovations: the proprio FDM predicts future robot states, aligning latent actions with physical dynamics and grounding them in robot behavior; joint diffusion models latent and robot actions together, conditioning robot actions on latent actions for structured transfer; the hierarchical design uses latent actions as a mid-level bridge, with sequence planning at both latent and robot levels.
- Training data: OpenX robot data, Ego4D human videos, Something-Something V2, Bridge V2, EPIC-KITCHENS.
- Evaluation: SIMPLER and LIBERO benchmarks in simulation (Google Robot and WidowX platforms); real-world Realman robot arm and Xarm + Xhand (gripper manipulation and dexterous hand tasks). Results show superior performance, better generalization, and cross-embodiment transfer.
- Technical details: PaliGemma VLM backbone, flow matching for diffusion, attention masking strategy, embodiment context embedding.
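Among the technical details above is an attention-masking strategy over latent-action tokens, intended to keep the robot-action head from relying on latent actions exclusively. A minimal sketch of random token masking (the function name is hypothetical and the 0.5 default rate is illustrative):

```python
import random

def mask_latent_tokens(latent_tokens, mask_token, p=0.5, rng=None):
    """Randomly replace a fraction `p` of latent-action tokens with a mask
    token during training, so downstream robot-action prediction cannot
    depend on the latent plan exclusively (a sketch of the masking idea)."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < p else t for t in latent_tokens]
```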
Q1
1. What is the key innovation in villa-X's Latent Action Model compared to previous approaches?
Using larger neural networks for action prediction
Adding proprioceptive supervision through a Forward Dynamics Model
Increasing the size of the training dataset
Q2
2. How does villa-X handle the transfer between latent actions and robot actions?
Through simple weight initialization
By treating them as independent processes
Through joint diffusion modeling with conditional attention
Q3
3. What unique training strategy does villa-X use to prevent over-reliance on latent actions?
Randomly masking 50% of latent action tokens and applying attention dropout
Using a larger batch size during training
Adding more regularization layers

Paper 3

C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Published: 2025-07-30

Link: http://arxiv.org/pdf/2507.22968

1. 📘 Topic and Domain: The paper presents C3, a bilingual benchmark dataset for evaluating Spoken Dialogue Models' (SDMs) ability to handle complex conversations in both English and Chinese.
2. 💡 Previous Research and New Ideas: Based on previous research on SDM benchmarks that focused mainly on single-language evaluation, this paper proposes a new comprehensive benchmark that includes phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction phenomena.
3. ❓ Problem: The paper addresses the lack of comprehensive evaluation methods for understanding SDMs' effectiveness in handling complex conversational challenges, particularly in bilingual contexts.
4. 🛠️ Methods: The authors created a dataset of 1,079 instances covering five phenomenon categories, developed an LLM-based evaluation method, and tested six popular SDMs across both languages and varying conversational complexity.
5. 📊 Results and Evaluation: The evaluation revealed that SDMs perform differently across languages and phenomena, with English generally being easier than Chinese, semantic ambiguity being particularly challenging in Chinese, and omission being the most difficult context-dependency phenomenon to handle.
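The LLM-based evaluation in point 4 follows the common LLM-as-judge pattern: prompt a strong model (here GPT-4o or DeepSeek-R1) to grade each response against a reference. A generic sketch, with a hypothetical `llm` callable and an illustrative rubric rather than the paper's exact prompt:

```python
def judge_response(llm, question, reference, response):
    """Ask a judge LLM whether `response` correctly handles the phenomenon
    tested by `question`. `llm` is a hypothetical callable prompt -> text;
    the rubric wording is illustrative, not the paper's actual prompt."""
    prompt = (
        "You are grading a spoken-dialogue model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    return llm(prompt).strip().upper().startswith("CORRECT")

def accuracy(llm, instances):
    """Fraction of (question, reference, response) instances judged correct."""
    hits = sum(judge_response(llm, q, ref, resp) for q, ref, resp in instances)
    return hits / len(instances)
```

Per-phenomenon and per-language scores like those reported in point 5 would come from running `accuracy` over each subset of the benchmark separately.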

Figure: C3 benchmark methodology workflow
- Pipeline: literature review and statistical analysis (CABank, MagicData-RAMC) → identify five complex phenomena → collect real-world data (web sources and existing datasets) → construct the dataset with quality control (1,079 instances).
- Phenomena: phonological ambiguity (heterograph, heteronym, stress/intonation), semantic ambiguity (lexical, syntactic), omission (detection, completion), coreference (detection, resolution), multi-turn interaction (history tracking, context memory).
- Two subsets: an ambiguity-evaluation subset (phonological + semantic) and a context-dependency subset (omission + coreference + multi-turn).
- Speech processing: TTS generation with manual checking; bilingual support for English and Chinese.
- Evaluation: LLM-based judging with GPT-4o and DeepSeek-R1 over six end-to-end SDMs (GPT-4o-Audio, Qwen2.5-Omni, MooER-Omni, etc.).
- Findings: (1) ambiguity is the most difficult challenge, especially semantic ambiguity in Chinese; (2) omission is the hardest context-dependency phenomenon; (3) Chinese dialogues are more challenging than English ones.
Q1
1. What is the most challenging type of context-dependency phenomenon for SDMs according to the paper's findings?
Coreference resolution
Omission handling
Multi-turn interaction
Q2
2. How many total instances are included in the C3 benchmark dataset?
1,079 instances
1,586 instances
2,000 instances
Q3
3. Which phenomenon showed the most significant performance gap between Chinese and English language processing?
Phonological ambiguity
Semantic ambiguity
Multi-turn interaction