2025-07-15 Papers


Paper 1

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10532

1. 📘 Topic and Domain: The paper examines the reliability of reinforcement learning results in large language models (LLMs), specifically focusing on mathematical reasoning capabilities and data contamination issues.
2. 💡 Previous Research and New Ideas: Building on recent work showing improved mathematical reasoning in LLMs through reinforcement learning, the paper proposes that some reported performance gains stem from data contamination rather than actual learning.
3. ❓ Problem: The paper investigates why Qwen2.5 models show improved performance with random/incorrect reward signals while other models like Llama do not, suggesting potential data contamination in evaluation benchmarks.
4. 🛠️ Methods: The authors create a new synthetic arithmetic dataset called RandomCalculation for clean evaluation, analyze memorization capabilities through partial-prompt completion tests, and conduct controlled RLVR (Reinforcement Learning with Verifiable Rewards) experiments.
5. 📊 Results and Evaluation: Results show that Qwen2.5's performance gains on standard benchmarks like MATH-500 likely stem from data contamination, as only correct reward signals yield stable improvements on the clean RandomCalculation dataset, while random/incorrect rewards provide no benefit.
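The paper's Algorithm 1 is not reproduced in this summary, so the following is only a minimal Python sketch of the idea behind RandomCalculation: generate fresh multi-step arithmetic problems whose exact operands cannot have appeared in any pretraining corpus. Function names and value ranges here are assumptions, not the authors' code.

```python
import random

def random_expression(rng: random.Random, steps: int) -> str:
    """Build a random arithmetic expression with `steps` binary
    operations, using only basic operators on small integers."""
    expr = str(rng.randint(1, 100))
    for _ in range(steps):
        op = rng.choice(["+", "-", "*"])
        # Parenthesize the running expression so evaluation order is explicit.
        expr = f"({expr} {op} {rng.randint(1, 100)})"
    return expr

def make_dataset(n_problems: int, max_steps: int = 20, seed: int = 0) -> list:
    """Generate (question, answer) pairs with 1..max_steps operations,
    mirroring the 1-20 step range described for RandomCalculation."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_problems):
        expr = random_expression(rng, rng.randint(1, max_steps))
        # eval is safe here: expr contains only digits, operators, parentheses.
        data.append({"question": f"Compute: {expr}", "answer": eval(expr)})
    return data
```

Because problems are sampled on the fly from a seeded generator, the benchmark can be regenerated at will, which is what makes contamination implausible by construction.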

Research Workflow: Data Contamination in RL for Mathematical Reasoning

- Problem identification: Qwen2.5 shows unexpected gains with spurious rewards.
- Hypothesis formation: data contamination vs. genuinely strong baseline skills.
- Memorization testing: partial-prompt completion and answer accuracy, with prompts truncated at 40%, 60%, and 80% and scored via ROUGE-L and Exact Match across multiple benchmarks.
- Model comparison: Qwen2.5 series vs. Llama3.1 series, base vs. instruct variants, with template analysis.
- Clean dataset creation: RandomCalculation (Algorithm 1: random arithmetic expressions with basic operations, variable complexity, 1-20 step problems).
- Controlled RLVR experiments using GRPO (Group Relative Policy Optimization) with binary and continuous reward functions:
  - On MATH-500: correct, random, inverted, and Mv-incorrect rewards.
  - On RandomCalculation: continuous rewards for clean evaluation.
- Key findings:
  - Qwen shows high memorization on MATH-500 (54.6% partial-prompt completion), while Llama shows minimal memorization (3.8%); performance drops on fresh data.
  - On clean data, only correct rewards improve performance; random/inverted rewards fail, confirming the contamination hypothesis and the importance of reward quality.
- Recommendations: use clean benchmarks, test multiple model families, verify reward quality, and check for contamination.
- Conclusion: data contamination, not superior reasoning ability, explains Qwen2.5's performance gains under spurious rewards on MATH-500.
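The partial-prompt completion test can be sketched in a few lines of Python. Here `complete_fn` is a hypothetical stand-in for the model under test, and ROUGE-L is computed as the standard LCS-based F1; the paper's exact scoring details are not reproduced here.

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based ROUGE-L F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_len(ref, cand)
    p, r = lcs / len(cand), lcs / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def partial_prompt_test(problem: str, complete_fn, fraction: float = 0.6) -> float:
    """Show the model the first `fraction` of a benchmark problem and score
    its completion against the held-out remainder; a high ROUGE-L suggests
    the problem was memorized during pretraining."""
    words = problem.split()
    cut = int(len(words) * fraction)
    prefix, remainder = " ".join(words[:cut]), " ".join(words[cut:])
    completion = complete_fn(prefix)  # complete_fn wraps the LLM (assumption)
    return rouge_l_f1(remainder, completion)
```

A model that reproduces the held-out remainder verbatim scores 1.0; a model that has never seen the problem should score far lower.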
Q1. What is the main reason behind Qwen2.5's apparent improvement in mathematical reasoning with random rewards, according to the paper?
a) Superior mathematical capabilities compared to other models
b) Data contamination in evaluation benchmarks
c) Advanced reinforcement learning algorithms
Q2. How did the researchers create a clean evaluation benchmark to test true mathematical reasoning abilities?
a) They used existing benchmarks but filtered out contaminated data
b) They generated synthetic arithmetic problems with random operands
c) They collected new math problems from human experts
Q3. What happened when Qwen2.5 was tested on the new RandomCalculation dataset with random rewards?
a) It showed steady improvement in performance
b) It maintained its baseline performance
c) Training became unstable, with no reliable improvement

Paper 2

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10548

1. 📘 Topic and Domain: The paper focuses on embodied AI and vision-language models, specifically developing a dataset and benchmark for training AI agents to perform physical tasks in simulated 3D environments.
2. 💡 Previous Research and New Ideas: Building on existing vision-language models (VLMs) such as GPT-4o and Gemini, the paper introduces EmbRACE-3K, a novel dataset that adds step-by-step reasoning annotations and closed-loop interaction capabilities.
3. ❓ Problem: The paper addresses the limitation of current VLMs in embodied settings, where agents struggle with spatial reasoning, long-horizon planning, and maintaining goal awareness in interactive environments.
4. 🛠️ Methods: The authors created a dataset of 3,000 language-guided tasks in photorealistic environments using Unreal Engine, including step-by-step reasoning annotations, and developed a two-stage training approach combining supervised fine-tuning and reinforcement learning.
5. 📊 Results and Evaluation: The fine-tuned models achieved significant improvements over zero-shot baselines, with success rates improving from below 20% to over 80% on some tasks, though generalization to out-of-domain scenarios remained challenging.
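To make the closed-loop setting concrete, here is a minimal Python sketch of one rollout: at each step the policy sees the instruction, the interaction history, and a fresh observation, then emits a natural-language rationale plus an action. `env` and `policy_fn` are hypothetical interfaces for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str   # egocentric view (placeholder for an image)
    reasoning: str     # natural-language rationale (the CoT annotation)
    action: str        # environment action, e.g. "MoveForward"

@dataclass
class Episode:
    instruction: str
    steps: list = field(default_factory=list)

def run_closed_loop(env, policy_fn, instruction: str, max_steps: int = 50) -> Episode:
    """Closed-loop rollout: each action changes the environment, and the
    next decision is conditioned on the resulting new observation."""
    episode = Episode(instruction)
    obs = env.reset(instruction)
    for _ in range(max_steps):
        reasoning, action = policy_fn(instruction, episode.steps, obs)
        episode.steps.append(Step(obs, reasoning, action))
        obs, done = env.step(action)
        if done:
            break
    return episode
```

The stored per-step (observation, reasoning, action) triples are exactly the shape of supervision the dataset's step-wise annotations provide for fine-tuning.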

EmbRACE-3K Method Flow

- Stage 1, environment sampling and pose selection: 6-DoF coordinates, UnrealCV-Zoo.
- Stage 2, task instruction generation: Gemini 2.5 Pro, 5 task categories.
- Stage 3, human demonstration and trajectory capture: real-time control, action sequences.
- Stage 4, step-wise reasoning annotation: CoT explanations, intent grounding.
- Resulting dataset: 3,000+ tasks and 26,000+ decision steps in photorealistic environments with multimodal annotations. Features: closed-loop interaction, step-wise annotations, multimodal grounding, spatio-temporal awareness, long-horizon planning, natural-language rationales, diverse task types.
- Training pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (GRPO).
- Evaluation framework: Exploration, Dynamic Spatial-Semantic, Multi-stage Goal Execution, and Interaction tasks, scored with Success Rate, GDE, and SSPL metrics.
- Models evaluated: GPT-4o, Gemini 2.5 Pro, Qwen2.5-VL-7B; zero-shot vs. fine-tuned, in-domain vs. out-of-domain.
- Key results: zero-shot models stay below 20% success across all tasks; fine-tuned Qwen2.5-VL improves substantially with SFT+RL; step-wise reasoning annotations enhance decision quality; generalization to out-of-domain scenarios remains challenging.
Q1. What is the main limitation of current Vision-Language Models (VLMs) that EmbRACE-3K aims to address?
a) Poor performance in image classification tasks
b) Inability to handle real-time embodied interactions and spatial reasoning
c) Limited vocabulary in natural language processing
Q2. How many stages does the training pipeline in EmbRACE-3K consist of?
a) One stage using only reinforcement learning
b) Two stages combining supervised fine-tuning and reinforcement learning
c) Three stages including pre-training, fine-tuning, and testing
Q3. What unique feature of EmbRACE-3K's dataset annotations sets it apart from previous embodied AI datasets?
a) High-resolution 4K images of environments
b) Real-world robot demonstrations
c) Step-by-step natural language reasoning explanations for each action

Paper 3

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10541

1. 📘 Topic and Domain: The paper focuses on evaluating large reasoning models (LRMs) through simultaneous multi-problem testing, in the domain of artificial intelligence and language model evaluation.
2. 💡 Previous Research and New Ideas: Previous research relied on single-question evaluation benchmarks which have become saturated; this paper introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel framework that tests models by presenting multiple problems simultaneously.
3. ❓ Problem: The paper addresses two key limitations of current evaluation methods: the saturation of existing benchmarks requiring constant creation of new ones, and the failure to assess models' performance under multi-context pressure that better reflects real-world scenarios.
4. 🛠️ Methods: The authors evaluated 34 advanced reasoning models across 7 reasoning benchmarks by concatenating multiple questions into a single prompt and measuring performance across different stress levels (number of simultaneous questions).
5. 📊 Results and Evaluation: The results showed significant performance degradation even in state-of-the-art models under stress testing (e.g., DeepSeek-R1 dropped 29.1 percentage points on AIME24), revealed stronger discriminative power than single-question evaluation, and identified that models trained with the "long2short" technique performed better under REST.
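The REST construction and scoring can be sketched in a few lines of Python; the prompt wording below is an assumption, but the cyclic (i+s-1) mod N indexing and the mean exact-match accuracy follow the Compose and Acc definitions described in the paper.

```python
def compose_prompts(questions: list, s: int) -> list:
    """Build one prompt per question by cyclically concatenating s
    consecutive questions, so every question appears in exactly s prompts."""
    n = len(questions)
    prompts = []
    for i in range(n):
        bundle = [questions[(i + k) % n] for k in range(s)]
        body = "\n\n".join(f"Question {k+1}: {q}" for k, q in enumerate(bundle))
        prompts.append(body + "\n\nAnswer every question above.")
    return prompts

def rest_accuracy(pred_answers: list, gold_answers: list) -> float:
    """Mean over prompts of the per-prompt fraction of exactly matched
    answers (the Kronecker-delta term in the paper's accuracy formula)."""
    per_prompt = [
        sum(p == g for p, g in zip(preds, golds)) / len(golds)
        for preds, golds in zip(pred_answers, gold_answers)
    ]
    return sum(per_prompt) / len(per_prompt)
```

At stress level s = 1 this reduces to ordinary single-question evaluation, which is why REST can reuse existing benchmarks unchanged.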

REST Method Flow (Reasoning Evaluation through Simultaneous Testing)

- Problem identification: benchmark saturation; the single-question limitation.
- REST framework: concatenate multiple questions into a single prompt.
- Benchmark reconstruction: p_{s,i} = Compose(q_i, q_{i+1}, ..., q_{(i+s-1) mod N}), where the stress level s is a positive integer and cyclic indexing ensures full coverage of the N questions.
- Evaluation setup: 34 LRMs (1.5B-671B parameters), 7 benchmarks, variable stress levels (e.g., GSM8K: s ∈ {1, 3, 6, 9, 12}; MATH500: s ∈ {1, 3, 5, 7, 9}; AIME24: s ∈ {1, 2, 3, 4, 5}).
- Answer extraction: rule-based via \boxed{}, plus an LLM-based Extract() function using Gemma-3-27B.
- Accuracy: Acc(P_s) = (1/N) Σ_i Acc(p_{s,i}) = (1/N) Σ_i (1/s) Σ_j δ(â_{i,j}, a_{i,j}), where δ is the Kronecker delta.
- Key findings: SOTA models show significant degradation (e.g., DeepSeek-R1 on AIME24: 81.66% → 52.49%, a drop of 29.1 points that reveals hidden weaknesses); enhanced discriminative power; position-bias effects; Long2Short training benefits; post-training limitations.
- Error analysis: Question Omission (QO), Reasoning Error (RE), Output Truncation (OT), Endless Repetition (ER), Summary Error (SE).
- Mechanistic insights: an "overthinking trap" phenomenon; adaptive allocation of reasoning effort; question-order effects (easy-first ordering works better).
- Applications and impact: a cost-efficient evaluation paradigm that revitalizes existing benchmarks and enables real-world multi-context assessment.
- Future directions: robust reasoning system development; enhanced Long2Short training techniques.
Q1. What was the primary motivation behind developing the REST evaluation framework?
a) To reduce the cost of creating new benchmarks
b) To test models' performance in multi-context scenarios
c) To improve model training efficiency
Q2. Which of the following findings about model performance under REST was most surprising?
a) Models trained with the "long2short" technique performed better
b) All models showed consistent performance across stress levels
c) Even state-of-the-art models like DeepSeek-R1 showed significant performance drops
Q3. What unique advantage does REST provide over traditional single-question evaluations?
a) It requires less computational resources
b) It reveals performance differences between models that appear similar in single-question tests
c) It improves model training speed