2025-07-15 Papers


Paper 1

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10532

1. 📘 Topic and Domain: The paper examines the reliability of reinforcement learning results in large language models (LLMs), specifically focusing on mathematical reasoning capabilities and data contamination issues.
2. 💡 Previous Research and New Ideas: Building on recent work showing improved mathematical reasoning in LLMs through reinforcement learning, the paper proposes that some reported performance gains stem from data contamination rather than actual learning.
3. ❓ Problem: The paper investigates why Qwen2.5 models show improved performance with random/incorrect reward signals while other models like Llama do not, suggesting potential data contamination in evaluation benchmarks.
4. 🛠️ Methods: The authors create a new synthetic arithmetic dataset called RandomCalculation for clean evaluation, analyze memorization capabilities through partial-prompt completion tests, and conduct controlled RLVR (Reinforcement Learning with Verifiable Rewards) experiments.
5. 📊 Results and Evaluation: Results show that Qwen2.5's performance gains on standard benchmarks like MATH-500 likely stem from data contamination, as only correct reward signals yield stable improvements on the clean RandomCalculation dataset, while random/incorrect rewards provide no benefit.
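The paper's Algorithm 1 is not reproduced in this summary, so the following is only a minimal Python sketch of the idea behind RandomCalculation: generate fresh multi-step arithmetic problems whose exact operands cannot have appeared in any pretraining corpus. Function names and value ranges here are assumptions, not the authors' code.

```python
import random

def random_expression(rng: random.Random, steps: int) -> str:
    """Build a random arithmetic expression with `steps` binary
    operations, using only basic operators on small integers."""
    expr = str(rng.randint(1, 100))
    for _ in range(steps):
        op = rng.choice(["+", "-", "*"])
        # Parenthesize the running expression so evaluation order is explicit.
        expr = f"({expr} {op} {rng.randint(1, 100)})"
    return expr

def make_dataset(n_problems: int, max_steps: int = 20, seed: int = 0) -> list:
    """Generate (question, answer) pairs with 1..max_steps operations,
    mirroring the 1-20 step range described for RandomCalculation."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_problems):
        expr = random_expression(rng, rng.randint(1, max_steps))
        # eval is safe here: expr contains only digits, operators, parentheses.
        data.append({"question": f"Compute: {expr}", "answer": eval(expr)})
    return data
```

Because problems are sampled on the fly from a seeded generator, the benchmark can be regenerated at will, which is what makes contamination implausible by construction.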

Research Workflow: Data Contamination in RL for Mathematical Reasoning

- Problem identification: Qwen2.5 shows unexpected gains with spurious rewards.
- Hypothesis formation: data contamination vs. genuinely strong baseline skills.
- Memorization testing: partial-prompt completion and answer accuracy, with prompts truncated at 40%, 60%, and 80% and scored via ROUGE-L and Exact Match across multiple benchmarks.
- Model comparison: Qwen2.5 series vs. Llama3.1 series, base vs. instruct variants, with template analysis.
- Clean dataset creation: RandomCalculation (Algorithm 1: random arithmetic expressions with basic operations, variable complexity, 1-20 step problems).
- Controlled RLVR experiments using GRPO (Group Relative Policy Optimization) with binary and continuous reward functions:
  - On MATH-500: correct, random, inverted, and Mv-incorrect rewards.
  - On RandomCalculation: continuous rewards for clean evaluation.
- Key findings:
  - Qwen shows high memorization on MATH-500 (54.6% partial-prompt completion), while Llama shows minimal memorization (3.8%); performance drops on fresh data.
  - On clean data, only correct rewards improve performance; random/inverted rewards fail, confirming the contamination hypothesis and the importance of reward quality.
- Recommendations: use clean benchmarks, test multiple model families, verify reward quality, and check for contamination.
- Conclusion: data contamination, not superior reasoning ability, explains Qwen2.5's performance gains under spurious rewards on MATH-500.
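The partial-prompt completion test can be sketched in a few lines of Python. Here `complete_fn` is a hypothetical stand-in for the model under test, and ROUGE-L is computed as the standard LCS-based F1; the paper's exact scoring details are not reproduced here.

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based ROUGE-L F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_len(ref, cand)
    p, r = lcs / len(cand), lcs / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def partial_prompt_test(problem: str, complete_fn, fraction: float = 0.6) -> float:
    """Show the model the first `fraction` of a benchmark problem and score
    its completion against the held-out remainder; a high ROUGE-L suggests
    the problem was memorized during pretraining."""
    words = problem.split()
    cut = int(len(words) * fraction)
    prefix, remainder = " ".join(words[:cut]), " ".join(words[cut:])
    completion = complete_fn(prefix)  # complete_fn wraps the LLM (assumption)
    return rouge_l_f1(remainder, completion)
```

A model that reproduces the held-out remainder verbatim scores 1.0; a model that has never seen the problem should score far lower.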
Q1. What is the main reason behind Qwen2.5's apparent improvement in mathematical reasoning with random rewards, according to the paper?
a) Superior mathematical capabilities compared to other models
b) Data contamination in evaluation benchmarks
c) Advanced reinforcement learning algorithms
Q2. How did the researchers create a clean evaluation benchmark to test true mathematical reasoning abilities?
a) They used existing benchmarks but filtered out contaminated data
b) They generated synthetic arithmetic problems with random operands
c) They collected new math problems from human experts
Q3. What happened when Qwen2.5 was tested on the new RandomCalculation dataset with random rewards?
a) It showed steady improvement in performance
b) It maintained its baseline performance
c) Training became unstable, with no reliable improvement

Paper 2

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10548

1. 📘 Topic and Domain: The paper focuses on embodied AI and vision-language models, specifically developing a dataset and benchmark for training AI agents to perform physical tasks in simulated 3D environments.
2. 💡 Previous Research and New Ideas: Building on existing vision-language models (VLMs) such as GPT-4o and Gemini, the paper introduces EmbRACE-3K, a novel dataset that adds step-by-step reasoning annotations and closed-loop interaction capabilities.
3. ❓ Problem: The paper addresses the limitation of current VLMs in embodied settings, where agents struggle with spatial reasoning, long-horizon planning, and maintaining goal awareness in interactive environments.
4. 🛠️ Methods: The authors created a dataset of 3,000 language-guided tasks in photorealistic environments using Unreal Engine, including step-by-step reasoning annotations, and developed a two-stage training approach combining supervised fine-tuning and reinforcement learning.
5. 📊 Results and Evaluation: The fine-tuned models achieved significant improvements over zero-shot baselines, with success rates improving from below 20% to over 80% on some tasks, though generalization to out-of-domain scenarios remained challenging.
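To make the closed-loop setting concrete, here is a minimal Python sketch of one rollout: at each step the policy sees the instruction, the interaction history, and a fresh observation, then emits a natural-language rationale plus an action. `env` and `policy_fn` are hypothetical interfaces for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str   # egocentric view (placeholder for an image)
    reasoning: str     # natural-language rationale (the CoT annotation)
    action: str        # environment action, e.g. "MoveForward"

@dataclass
class Episode:
    instruction: str
    steps: list = field(default_factory=list)

def run_closed_loop(env, policy_fn, instruction: str, max_steps: int = 50) -> Episode:
    """Closed-loop rollout: each action changes the environment, and the
    next decision is conditioned on the resulting new observation."""
    episode = Episode(instruction)
    obs = env.reset(instruction)
    for _ in range(max_steps):
        reasoning, action = policy_fn(instruction, episode.steps, obs)
        episode.steps.append(Step(obs, reasoning, action))
        obs, done = env.step(action)
        if done:
            break
    return episode
```

The stored per-step (observation, reasoning, action) triples are exactly the shape of supervision the dataset's step-wise annotations provide for fine-tuning.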

EmbRACE-3K Method Flow

- Stage 1, environment sampling and pose selection: 6-DoF coordinates, UnrealCV-Zoo.
- Stage 2, task instruction generation: Gemini 2.5 Pro, 5 task categories.
- Stage 3, human demonstration and trajectory capture: real-time control, action sequences.
- Stage 4, step-wise reasoning annotation: CoT explanations, intent grounding.
- Resulting dataset: 3,000+ tasks and 26,000+ decision steps in photorealistic environments with multimodal annotations. Features: closed-loop interaction, step-wise annotations, multimodal grounding, spatio-temporal awareness, long-horizon planning, natural-language rationales, diverse task types.
- Training pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (GRPO).
- Evaluation framework: Exploration, Dynamic Spatial-Semantic, Multi-stage Goal Execution, and Interaction tasks, scored with Success Rate, GDE, and SSPL metrics.
- Models evaluated: GPT-4o, Gemini 2.5 Pro, Qwen2.5-VL-7B; zero-shot vs. fine-tuned, in-domain vs. out-of-domain.
- Key results: zero-shot models stay below 20% success across all tasks; fine-tuned Qwen2.5-VL improves substantially with SFT+RL; step-wise reasoning annotations enhance decision quality; generalization to out-of-domain scenarios remains challenging.
Q1. What is the main limitation of current Vision-Language Models (VLMs) that EmbRACE-3K aims to address?
a) Poor performance in image classification tasks
b) Inability to handle real-time embodied interactions and spatial reasoning
c) Limited vocabulary in natural language processing
Q2. How many stages does the training pipeline in EmbRACE-3K consist of?
a) One stage using only reinforcement learning
b) Two stages combining supervised fine-tuning and reinforcement learning
c) Three stages including pre-training, fine-tuning, and testing
Q3. What unique feature of EmbRACE-3K's dataset annotations sets it apart from previous embodied AI datasets?
a) High-resolution 4K images of environments
b) Real-world robot demonstrations
c) Step-by-step natural language reasoning explanations for each action

Paper 3

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10541

1. 📘 Topic and Domain: The paper focuses on evaluating large reasoning models (LRMs) through simultaneous multi-problem testing, in the domain of artificial intelligence and language model evaluation.
2. 💡 Previous Research and New Ideas: Previous research relied on single-question evaluation benchmarks which have become saturated; this paper introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel framework that tests models by presenting multiple problems simultaneously.
3. ❓ Problem: The paper addresses two key limitations of current evaluation methods: the saturation of existing benchmarks requiring constant creation of new ones, and the failure to assess models' performance under multi-context pressure that better reflects real-world scenarios.
4. 🛠️ Methods: The authors evaluated 34 advanced reasoning models across 7 reasoning benchmarks by concatenating multiple questions into a single prompt and measuring performance across different stress levels (number of simultaneous questions).
5. 📊 Results and Evaluation: The results showed significant performance degradation even in state-of-the-art models under stress testing (e.g., DeepSeek-R1 dropped 29.1 percentage points on AIME24), revealed stronger discriminative power than single-question evaluation, and identified that models trained with the "long2short" technique performed better under REST.
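The REST construction and scoring can be sketched in a few lines of Python; the prompt wording below is an assumption, but the cyclic (i+s-1) mod N indexing and the mean exact-match accuracy follow the Compose and Acc definitions described in the paper.

```python
def compose_prompts(questions: list, s: int) -> list:
    """Build one prompt per question by cyclically concatenating s
    consecutive questions, so every question appears in exactly s prompts."""
    n = len(questions)
    prompts = []
    for i in range(n):
        bundle = [questions[(i + k) % n] for k in range(s)]
        body = "\n\n".join(f"Question {k+1}: {q}" for k, q in enumerate(bundle))
        prompts.append(body + "\n\nAnswer every question above.")
    return prompts

def rest_accuracy(pred_answers: list, gold_answers: list) -> float:
    """Mean over prompts of the per-prompt fraction of exactly matched
    answers (the Kronecker-delta term in the paper's accuracy formula)."""
    per_prompt = [
        sum(p == g for p, g in zip(preds, golds)) / len(golds)
        for preds, golds in zip(pred_answers, gold_answers)
    ]
    return sum(per_prompt) / len(per_prompt)
```

At stress level s = 1 this reduces to ordinary single-question evaluation, which is why REST can reuse existing benchmarks unchanged.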

REST Method Flow (Reasoning Evaluation through Simultaneous Testing)

- Problem identification: benchmark saturation; the single-question limitation.
- REST framework: concatenate multiple questions into a single prompt.
- Benchmark reconstruction: p_{s,i} = Compose(q_i, q_{i+1}, ..., q_{(i+s-1) mod N}), where the stress level s is a positive integer and cyclic indexing ensures full coverage of the N questions.
- Evaluation setup: 34 LRMs (1.5B-671B parameters), 7 benchmarks, variable stress levels (e.g., GSM8K: s ∈ {1, 3, 6, 9, 12}; MATH500: s ∈ {1, 3, 5, 7, 9}; AIME24: s ∈ {1, 2, 3, 4, 5}).
- Answer extraction: rule-based via \boxed{}, plus an LLM-based Extract() function using Gemma-3-27B.
- Accuracy: Acc(P_s) = (1/N) Σ_i Acc(p_{s,i}) = (1/N) Σ_i (1/s) Σ_j δ(â_{i,j}, a_{i,j}), where δ is the Kronecker delta.
- Key findings: SOTA models show significant degradation (e.g., DeepSeek-R1 on AIME24: 81.66% → 52.49%, a drop of 29.1 points that reveals hidden weaknesses); enhanced discriminative power; position-bias effects; Long2Short training benefits; post-training limitations.
- Error analysis: Question Omission (QO), Reasoning Error (RE), Output Truncation (OT), Endless Repetition (ER), Summary Error (SE).
- Mechanistic insights: an "overthinking trap" phenomenon; adaptive allocation of reasoning effort; question-order effects (easy-first ordering works better).
- Applications and impact: a cost-efficient evaluation paradigm that revitalizes existing benchmarks and enables real-world multi-context assessment.
- Future directions: robust reasoning system development; enhanced Long2Short training techniques.
Q1. What was the primary motivation behind developing the REST evaluation framework?
a) To reduce the cost of creating new benchmarks
b) To test models' performance in multi-context scenarios
c) To improve model training efficiency
Q2. Which of the following findings about model performance under REST was most surprising?
a) Models trained with the "long2short" technique performed better
b) All models showed consistent performance across stress levels
c) Even state-of-the-art models like DeepSeek-R1 showed significant performance drops
Q3. What unique advantage does REST provide over traditional single-question evaluations?
a) It requires less computational resources
b) It reveals performance differences between models that appear similar in single-question tests
c) It improves model training speed