1. 📘 Topic and Domain: The paper presents C3, a bilingual benchmark dataset for evaluating Spoken Dialogue Models' (SDMs) ability to handle complex conversations in both English and Chinese.
2. 💡 Previous Research and New Ideas: Building on prior SDM benchmarks, which focused mainly on single-language evaluation, the paper proposes a new comprehensive benchmark covering five phenomena: phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction.
3. ❓ Problem: The paper addresses the lack of comprehensive methods for evaluating how effectively SDMs handle complex conversational challenges, particularly in bilingual contexts.
4. 🛠️ Methods: The authors constructed a dataset of 1,079 instances spanning the five phenomenon categories, developed an LLM-based evaluation method, and tested six popular SDMs across the two languages and varying conversational complexities.
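The methods above describe scoring SDM responses with an LLM-based judge and aggregating results per language and phenomenon category. A minimal sketch of that kind of pipeline is below; it is not the authors' actual code, and `query_judge_llm` is a hypothetical stand-in for a real LLM call, stubbed here as an exact-match check so the example runs end-to-end:

```python
from collections import defaultdict

def query_judge_llm(question: str, reference: str, response: str) -> bool:
    """Hypothetical judge: in a real pipeline this would prompt an LLM to
    decide whether `response` correctly resolves the phenomenon in
    `question`. Stubbed here as a case-insensitive exact match."""
    return response.strip().lower() == reference.strip().lower()

def evaluate(instances):
    """Aggregate judge verdicts into accuracy per (language, phenomenon)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for inst in instances:
        key = (inst["language"], inst["phenomenon"])
        totals[key] += 1
        if query_judge_llm(inst["question"], inst["reference"], inst["response"]):
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}

# Toy instances mirroring the benchmark's (language, phenomenon) structure.
instances = [
    {"language": "en", "phenomenon": "omission",
     "question": "Want coffee?", "reference": "yes", "response": "yes"},
    {"language": "zh", "phenomenon": "semantic ambiguity",
     "question": "toy ambiguous query",
     "reference": "reading A", "response": "reading B"},
]
scores = evaluate(instances)
# scores → {("en", "omission"): 1.0, ("zh", "semantic ambiguity"): 0.0}
```

Grouping verdicts by `(language, phenomenon)` is what makes per-category comparisons like those in the results section possible.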
5. 📊 Results and Evaluation: The evaluation shows that SDM performance varies by language and phenomenon: English is generally easier than Chinese, semantic ambiguity is especially challenging in Chinese, and omission is the hardest context-dependency phenomenon to handle.