2025-04-15 Papers

Paper 1

S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Published: 2025-04-14

Link: http://arxiv.org/pdf/2504.10368

1. 📘 Topic and Domain: The paper introduces S1-Bench, a benchmark for evaluating Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning.
2. 💡 Previous Research and New Ideas: The paper builds on research about LRMs' chain-of-thought capabilities but identifies a gap in evaluating their performance on simple tasks; it proposes a novel benchmark specifically designed to assess system 1 thinking capabilities.
3. ❓ Problem: The paper aims to solve the problem of LRMs' over-reliance on system 2 thinking (deliberative reasoning) when confronting extremely simple questions better suited for intuition-driven system 1 processing.
4. 🛠️ Methods: The authors constructed S1-Bench by creating simple, diverse, and naturally clear questions across multiple domains and languages, validated by both human annotators and smaller LLMs, then evaluated 22 different LRMs on this benchmark.
5. 📊 Results and Evaluation: The results revealed significant inefficiency on simple tasks: LRM outputs averaged 15.5 times the length of those from traditional small LLMs, and LRMs often identified the correct answer early yet continued deliberating unnecessarily, sometimes accumulating numerous errors along the way.
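The headline efficiency comparison reduces to simple aggregate metrics. Below is a minimal sketch (not the paper's code; the function names and toy token counts are our own assumptions) of how Pass@1 and the average-response-tokens (ART) verbosity ratio could be computed from graded outputs:

```python
from typing import List

def pass_at_1(graded: List[bool]) -> float:
    """Fraction of questions whose first (greedy) sample is correct."""
    return sum(graded) / len(graded)

def avg_response_tokens(token_counts: List[int]) -> float:
    """ART: mean number of tokens per model response."""
    return sum(token_counts) / len(token_counts)

def verbosity_ratio(lrm_tokens: List[int], llm_tokens: List[int]) -> float:
    """How many times longer the reasoning model's outputs are on average."""
    return avg_response_tokens(lrm_tokens) / avg_response_tokens(llm_tokens)

# Toy numbers chosen to echo the paper's 15.5x finding.
print(verbosity_ratio([300, 320], [20, 20]))  # 15.5
```

Accuracy and verbosity are deliberately separate metrics here, since the paper's point is that LRMs can be correct early yet still pay a large token cost.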

S1-Bench methodology flowchart (summarized):

S1-Bench construction:
1. Input: benchmark surveys; define simple subcategories.
2. Initial Q&A generation (generator models with a priori simplicity constraints).
3. Quality assessment (human annotators plus discriminator models): check clarity and uniqueness, then retain, modify, or discard each item.
4. A posteriori verification (validators: small LLMs with multi-temperature sampling); correctness judged by an evaluator (GPT-4o).
5. If any sample is incorrect or non-robust: iterative difficulty reduction (modify the Q&A), looping at most 3 times.
6. Output: the S1-Bench dataset.

LRM evaluation on S1-Bench:
1. Input: S1-Bench and 22 LRMs, configured with greedy and top-p decoding.
2. Generate responses and score them on format (S/L-Corr), efficiency (average response tokens, ART), and accuracy (Pass@1, Acc@k).
3. Main results analysis (overthinking, under-accuracy), plus in-depth analyses: efficiency (ART by question type; solution segmentation into initial vs. additional cost; redundancy via similarity), errors (thinking process vs. final answer), and simplicity prejudgement (identify prejudgements, count instances, compare ART).
4. Output: key findings on LRM inefficiency, under-accuracy, redundancy, and prejudgement.
Q1. What is the primary finding of S1-Bench regarding Large Reasoning Models' efficiency?
- LRMs generate outputs that are 15.5 times longer than traditional small LLMs on simple tasks
- LRMs are significantly faster at processing simple questions than traditional LLMs
- LRMs and traditional LLMs show equivalent efficiency on simple tasks

Q2. How does S1-Bench ensure that its questions are truly simple?
- By only including mathematics problems with single-digit numbers
- Through both a priori constraints and a posteriori verification using smaller LLMs
- By limiting questions to those that can be answered in one word only

Q3. What interesting phenomenon did the researchers discover about LRMs' ability to recognize question simplicity?
- LRMs completely lack the ability to identify simple questions
- LRMs can prejudge question simplicity but still exhibit inefficiency in their responses
- LRMs only recognize simplicity in English questions but not in Chinese ones

Paper 2

Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Published: 2025-04-13

Link: http://arxiv.org/pdf/2504.09643

1. 📘 Topic and Domain: The paper explores code generation using large language models with a novel reranking approach called RewardRanker.
2. 💡 Previous Research and New Ideas: The work builds upon previous code generation models and RLHF techniques, proposing a novel iterative self-training approach that uses Proximal Policy Optimization to improve reranking models rather than just generative models.
3. ❓ Problem: The paper addresses the challenge of generating high-quality code that solves complex programming tasks, particularly with decoder-based models that produce stochastic outputs where even minor errors can break entire solutions.
4. 🛠️ Methods: The authors developed an iterative self-training method combining supervised fine-tuning, reward model training, and PPO, with a cycle that incorporates hard negative examples into training to continuously improve reranking performance.
5. 📊 Results and Evaluation: Their 13.4B-parameter model outperformed a 33B-parameter model while running three times faster, achieved performance comparable to GPT-4, and surpassed it on C++ when evaluated on the MultiPL-E benchmark.
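At inference time, a reranking pipeline like this comes down to best-of-k selection: sample several candidate programs and keep the one the reward/reranker model scores highest. A minimal sketch (our own naming, with toy stand-ins for the generator and reranker, not the paper's implementation):

```python
from typing import Callable, List

def rerank_generate(
    generate: Callable[[str, int], List[str]],  # samples k candidate programs
    reward: Callable[[str, str], float],        # reranker score for (prompt, code)
    prompt: str,
    k: int = 8,
) -> str:
    """Best-of-k decoding: sample k candidates from the generator and
    return the one the reward/reranker model scores highest."""
    candidates = generate(prompt, k)
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy stand-ins: a fixed candidate list and a length-based "reward".
fake_generate = lambda prompt, k: ["pass", "x * 2", "return x * 2"][:k]
fake_reward = lambda prompt, code: float(len(code))
print(rerank_generate(fake_generate, fake_reward, "double x", k=3))
```

The design insight the paper emphasizes is that improving `reward` (the reranker) can matter more than improving `generate` alone.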

RewardRanker methodology flowchart (summarized):
1. Prepare initial datasets: SFT data (prompt-completion pairs) and alignment data (triplets).
2. (A) Supervised fine-tuning: fine-tune the base generator model on the SFT dataset to prepare it for domain-specific tasks.
3. (B) Train RewardRanker: train the reranker/reward model on the alignment triplets, learning to prefer correct over incorrect code via pairwise comparison.
4. (C) Proximal Policy Optimization: train the generator (initialized from SFT) to maximize scores from the current RewardRanker.
5. (D) Generate and evaluate new examples: the PPO-trained generator produces solutions, which are evaluated with test cases to identify positive examples and, crucially, hard negatives (high-scoring failures).
6. (E) Refine the dataset: add the generated positives and hard negatives to the alignment dataset and feed it back into RewardRanker training, closing the iterative refinement loop.
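The crucial mining step, separating passing solutions from failures that the current reward model still ranks highly, could look roughly like this. This is a sketch under our own assumptions: the names are hypothetical, and the "scores at least as high as some passing solution" threshold is our illustrative heuristic, not necessarily the paper's criterion.

```python
from typing import Callable, List, Tuple

def mine_examples(
    prompt: str,
    candidates: List[str],
    reward: Callable[[str, str], float],
    passes_tests: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split sampled solutions into positives (pass the unit tests) and
    hard negatives (fail the tests yet score at least as high as some
    passing solution under the current reward model)."""
    scored = [(c, reward(prompt, c)) for c in candidates]
    positives = [(c, s) for c, s in scored if passes_tests(c)]
    failures = [(c, s) for c, s in scored if not passes_tests(c)]
    if not positives:  # nothing passed: keep all failures as negatives
        return [], [c for c, _ in failures]
    floor = min(s for _, s in positives)
    hard_negatives = [c for c, s in failures if s >= floor]
    return [c for c, _ in positives], hard_negatives
```

Each round, the mined positives and hard negatives are appended to the alignment data and the reranker is retrained, which is what makes the loop self-improving.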
Q1. What is the primary innovation of RewardRanker compared to traditional PPO approaches?
- It focuses on optimizing the generative model with a reward model
- It emphasizes developing a robust reward/reranking model rather than just the generative model
- It eliminates the need for any reward model in code generation

Q2. What key component makes the iterative self-training cycle of RewardRanker particularly effective?
- The inclusion of hard negative examples in the training dataset
- The exclusive use of correct solutions in training
- Pre-defined test cases during inference

Q3. How did the 13.4B parameter RewardRanker model compare to larger models in the evaluation?
- It performed significantly worse but was much faster
- It matched performance but required more computational resources
- It outperformed a 33B model while being three times faster

Paper 3

SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Published: 2025-04-14

Link: http://arxiv.org/pdf/2504.10157

1. 📘 Topic and Domain: The paper introduces SocioVerse, a world model framework for social simulation using LLM-based agents to model human behavior across political, news, and economic domains.
2. 💡 Previous Research and New Ideas: The paper builds on prior social simulation research but proposes a comprehensive framework with four alignment components (social environment, user engine, scenario engine, and behavior engine) and a 10-million real user pool to enhance simulation realism.
3. ❓ Problem: The paper addresses alignment challenges between simulated environments and the real world, including maintaining up-to-date context, precisely modeling target users, aligning interaction mechanisms, and capturing diverse behavioral patterns.
4. 🛠️ Methods: The authors implemented the framework with a 10-million user pool from social media platforms, demographic annotation systems, and standardized simulation pipelines across three scenarios: presidential election prediction, breaking news feedback, and national economic surveys.
5. 📊 Results and Evaluation: Results demonstrated that SocioVerse can accurately reflect large-scale population dynamics: over 90% accuracy in election predictions, consistent user reactions to breaking news, and close alignment with real-world economic statistics. The experiments also showed that both the prior user distribution and real-world knowledge enhance simulation accuracy.
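One of the evaluation metrics the paper lists for comparing simulated against real populations is KL divergence between discrete response distributions. A minimal sketch of that comparison (our own code and example numbers, not the paper's):

```python
import math
from typing import Dict

def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) between a real-world answer distribution p and a
    simulated one q over the same discrete options; lower means the
    simulation tracks reality more closely."""
    return sum(pv * math.log(pv / max(q.get(opt, 0.0), eps))
               for opt, pv in p.items() if pv > 0)

# Hypothetical survey example: real vs. simulated approval shares.
real = {"approve": 0.6, "disapprove": 0.4}
sim = {"approve": 0.55, "disapprove": 0.45}
print(round(kl_divergence(real, sim), 4))
```

The `eps` floor guards against simulated distributions that assign zero mass to an option real users chose.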

SocioVerse methodological workflow (summarized): aligning simulation with reality, starting from real-world data sources (social media, news, statistics).
- Social environment (aligns context): collects and updates social structure, social dynamics (events), and personalized context; outputs dynamic knowledge.
- User engine (aligns agents with users): (1) builds the 10M user pool from X and Rednote data, (2) infers user labels (LLM, then human evaluation, then a classifier), (3) samples the target group (IPF or IDS depending on the task), and (4) filters anomalous data; outputs aligned user profiles.
- Scenario engine (aligns interaction structure): defines the task/query and selects a scenario template: questionnaire (1-N), in-depth interview (1-1), or behavior experiment (N-N); outputs the simulation setup.
- Behavior engine (aligns agent behavior): integrates the context, profiles, and setup, and selects an agent model (ABM, or a general, expert, or domain LLM); outputs simulated behaviors.
- Validation and application: simulated behaviors are compared against ground truth (metrics: accuracy, RMSE, NRMSE, KL divergence). Example applications: US presidential election prediction, breaking news feedback (ChatGPT), and a national economic survey (China).
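The user engine's IPF sampling step refers to iterative proportional fitting, the classic algorithm for rescaling a seed contingency table (e.g., counts over two demographic attributes from the user pool) until its row and column sums match target marginals. A toy 2-D sketch under our own naming (the paper's actual attributes and dimensionality may differ):

```python
from typing import List

def ipf(seed: List[List[float]], row_targets: List[float],
        col_targets: List[float], iters: int = 50) -> List[List[float]]:
    """Iterative proportional fitting: alternately rescale rows and
    columns of a seed table until both sets of marginals are matched."""
    t = [row[:] for row in seed]
    for _ in range(iters):
        for i, target in enumerate(row_targets):   # match row totals
            s = sum(t[i])
            if s > 0:
                t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_targets):   # match column totals
            s = sum(row[j] for row in t)
            if s > 0:
                for row in t:
                    row[j] *= target / s
    return t

# Fit a uniform seed to target marginals (rows sum to 10/10, columns to 12/8).
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], [10.0, 10.0], [12.0, 8.0])
```

Note that IPF only converges when the row and column targets have the same grand total, which holds when both are derived from the same census population.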
Q1. What is the primary innovation of SocioVerse compared to previous social simulation approaches?
- It uses more powerful LLMs than previous approaches
- It incorporates a 10-million real user pool with four alignment components
- It focuses exclusively on political simulations

Q2. In the presidential election prediction simulation, which factor had the most significant impact on improving accuracy?
- The choice of underlying LLM model
- The social environment's real-world knowledge
- The prior demographics distribution of users

Q3. What interesting pattern was observed about LLM performance in the national economic survey?
- All models performed best on housing spending and worst on daily necessities
- Models showed inconsistent performance across different spending categories
- All models performed best on daily necessities spending and worst on housing spending