2025-04-15 Papers

Paper 1

S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Published: 2025-04-14

Link: http://arxiv.org/pdf/2504.10368

1. 📘 Topic and Domain: The paper introduces S1-Bench, a benchmark for evaluating Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning.
2. 💡 Previous Research and New Ideas: The paper builds on research about LRMs' chain-of-thought capabilities but identifies a gap in evaluating their performance on simple tasks; it proposes a novel benchmark specifically designed to assess system 1 thinking capabilities.
3. ❓ Problem: The paper aims to solve the problem of LRMs' over-reliance on system 2 thinking (deliberative reasoning) when confronting extremely simple questions better suited for intuition-driven system 1 processing.
4. 🛠️ Methods: The authors constructed S1-Bench by creating simple, diverse, and naturally clear questions across multiple domains and languages, validated by both human annotators and smaller LLMs, then evaluated 22 different LRMs on this benchmark.
5. 📊 Results and Evaluation: The results revealed significant inefficiency on simple tasks: LRM outputs averaged 15.5 times the length of those from traditional small LLMs, and LRMs often identified the correct answer early yet continued deliberating unnecessarily, sometimes accumulating numerous errors along the way.
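The headline efficiency comparison reduces to simple aggregate metrics. Below is a minimal sketch (not the paper's code; the function names and toy token counts are our own assumptions) of how Pass@1 and the average-response-tokens (ART) verbosity ratio could be computed from graded outputs:

```python
from typing import List

def pass_at_1(graded: List[bool]) -> float:
    """Fraction of questions whose first (greedy) sample is correct."""
    return sum(graded) / len(graded)

def avg_response_tokens(token_counts: List[int]) -> float:
    """ART: mean number of tokens per model response."""
    return sum(token_counts) / len(token_counts)

def verbosity_ratio(lrm_tokens: List[int], llm_tokens: List[int]) -> float:
    """How many times longer the reasoning model's outputs are on average."""
    return avg_response_tokens(lrm_tokens) / avg_response_tokens(llm_tokens)

# Toy numbers chosen to echo the paper's 15.5x finding.
print(verbosity_ratio([300, 320], [20, 20]))  # 15.5
```

Accuracy and verbosity are deliberately separate metrics here, since the paper's point is that LRMs can be correct early yet still pay a large token cost.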

S1-Bench methodology flowchart (summarized):

S1-Bench construction:
1. Input: benchmark surveys; define simple subcategories.
2. Initial Q&A generation (generator models with a priori simplicity constraints).
3. Quality assessment (human annotators plus discriminator models): check clarity and uniqueness, then retain, modify, or discard each item.
4. A posteriori verification (validators: small LLMs with multi-temperature sampling); correctness judged by an evaluator (GPT-4o).
5. If any sample is incorrect or non-robust: iterative difficulty reduction (modify the Q&A), looping at most 3 times.
6. Output: the S1-Bench dataset.

LRM evaluation on S1-Bench:
1. Input: S1-Bench and 22 LRMs, configured with greedy and top-p decoding.
2. Generate responses and score them on format (S/L-Corr), efficiency (average response tokens, ART), and accuracy (Pass@1, Acc@k).
3. Main results analysis (overthinking, under-accuracy), plus in-depth analyses: efficiency (ART by question type; solution segmentation into initial vs. additional cost; redundancy via similarity), errors (thinking process vs. final answer), and simplicity prejudgement (identify prejudgements, count instances, compare ART).
4. Output: key findings on LRM inefficiency, under-accuracy, redundancy, and prejudgement.
Q1. What is the primary finding of S1-Bench regarding Large Reasoning Models' efficiency?
- LRMs generate outputs that are 15.5 times longer than traditional small LLMs on simple tasks
- LRMs are significantly faster at processing simple questions than traditional LLMs
- LRMs and traditional LLMs show equivalent efficiency on simple tasks

Q2. How does S1-Bench ensure that its questions are truly simple?
- By only including mathematics problems with single-digit numbers
- Through both a priori constraints and a posteriori verification using smaller LLMs
- By limiting questions to those that can be answered in one word only

Q3. What interesting phenomenon did the researchers discover about LRMs' ability to recognize question simplicity?
- LRMs completely lack the ability to identify simple questions
- LRMs can prejudge question simplicity but still exhibit inefficiency in their responses
- LRMs only recognize simplicity in English questions but not in Chinese ones

Paper 2

Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Published: 2025-04-13

Link: http://arxiv.org/pdf/2504.09643

1. 📘 Topic and Domain: The paper explores code generation using large language models with a novel reranking approach called RewardRanker.
2. 💡 Previous Research and New Ideas: The work builds upon previous code generation models and RLHF techniques, proposing a novel iterative self-training approach that uses Proximal Policy Optimization to improve reranking models rather than just generative models.
3. ❓ Problem: The paper addresses the challenge of generating high-quality code that solves complex programming tasks, particularly with decoder-based models that produce stochastic outputs where even minor errors can break entire solutions.
4. 🛠️ Methods: The authors developed an iterative self-training method combining supervised fine-tuning, reward model training, and PPO, with a cycle that incorporates hard negative examples into training to continuously improve reranking performance.
5. 📊 Results and Evaluation: Their 13.4B-parameter model outperformed a 33B-parameter model while running three times faster, achieved performance comparable to GPT-4, and surpassed it on C++ when evaluated on the MultiPL-E benchmark.
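At inference time, a reranking pipeline like this comes down to best-of-k selection: sample several candidate programs and keep the one the reward/reranker model scores highest. A minimal sketch (our own naming, with toy stand-ins for the generator and reranker, not the paper's implementation):

```python
from typing import Callable, List

def rerank_generate(
    generate: Callable[[str, int], List[str]],  # samples k candidate programs
    reward: Callable[[str, str], float],        # reranker score for (prompt, code)
    prompt: str,
    k: int = 8,
) -> str:
    """Best-of-k decoding: sample k candidates from the generator and
    return the one the reward/reranker model scores highest."""
    candidates = generate(prompt, k)
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy stand-ins: a fixed candidate list and a length-based "reward".
fake_generate = lambda prompt, k: ["pass", "x * 2", "return x * 2"][:k]
fake_reward = lambda prompt, code: float(len(code))
print(rerank_generate(fake_generate, fake_reward, "double x", k=3))
```

The design insight the paper emphasizes is that improving `reward` (the reranker) can matter more than improving `generate` alone.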

RewardRanker methodology flowchart (summarized):
1. Prepare initial datasets: SFT data (prompt-completion pairs) and alignment data (triplets).
2. (A) Supervised fine-tuning: fine-tune the base generator model on the SFT dataset to prepare it for domain-specific tasks.
3. (B) Train RewardRanker: train the reranker/reward model on the alignment triplets, learning to prefer correct over incorrect code via pairwise comparison.
4. (C) Proximal Policy Optimization: train the generator (initialized from SFT) to maximize scores from the current RewardRanker.
5. (D) Generate and evaluate new examples: the PPO-trained generator produces solutions, which are evaluated with test cases to identify positive examples and, crucially, hard negatives (high-scoring failures).
6. (E) Refine the dataset: add the generated positives and hard negatives to the alignment dataset and feed it back into RewardRanker training, closing the iterative refinement loop.
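The crucial mining step, separating passing solutions from failures that the current reward model still ranks highly, could look roughly like this. This is a sketch under our own assumptions: the names are hypothetical, and the "scores at least as high as some passing solution" threshold is our illustrative heuristic, not necessarily the paper's criterion.

```python
from typing import Callable, List, Tuple

def mine_examples(
    prompt: str,
    candidates: List[str],
    reward: Callable[[str, str], float],
    passes_tests: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split sampled solutions into positives (pass the unit tests) and
    hard negatives (fail the tests yet score at least as high as some
    passing solution under the current reward model)."""
    scored = [(c, reward(prompt, c)) for c in candidates]
    positives = [(c, s) for c, s in scored if passes_tests(c)]
    failures = [(c, s) for c, s in scored if not passes_tests(c)]
    if not positives:  # nothing passed: keep all failures as negatives
        return [], [c for c, _ in failures]
    floor = min(s for _, s in positives)
    hard_negatives = [c for c, s in failures if s >= floor]
    return [c for c, _ in positives], hard_negatives
```

Each round, the mined positives and hard negatives are appended to the alignment data and the reranker is retrained, which is what makes the loop self-improving.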
Q1. What is the primary innovation of RewardRanker compared to traditional PPO approaches?
- It focuses on optimizing the generative model with a reward model
- It emphasizes developing a robust reward/reranking model rather than just the generative model
- It eliminates the need for any reward model in code generation

Q2. What key component makes the iterative self-training cycle of RewardRanker particularly effective?
- The inclusion of hard negative examples in the training dataset
- The exclusive use of correct solutions in training
- Pre-defined test cases during inference

Q3. How did the 13.4B parameter RewardRanker model compare to larger models in the evaluation?
- It performed significantly worse but was much faster
- It matched performance but required more computational resources
- It outperformed a 33B model while being three times faster

Paper 3

SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Published: 2025-04-14

Link: http://arxiv.org/pdf/2504.10157

1. 📘 Topic and Domain: The paper introduces SocioVerse, a world model framework for social simulation using LLM-based agents to model human behavior across political, news, and economic domains.
2. 💡 Previous Research and New Ideas: The paper builds on prior social simulation research but proposes a comprehensive framework with four alignment components (social environment, user engine, scenario engine, and behavior engine) and a 10-million real user pool to enhance simulation realism.
3. ❓ Problem: The paper addresses alignment challenges between simulated environments and the real world, including maintaining up-to-date context, precisely modeling target users, aligning interaction mechanisms, and capturing diverse behavioral patterns.
4. 🛠️ Methods: The authors implemented the framework with a 10-million user pool from social media platforms, demographic annotation systems, and standardized simulation pipelines across three scenarios: presidential election prediction, breaking news feedback, and national economic surveys.
5. 📊 Results and Evaluation: Results demonstrated that SocioVerse can accurately reflect large-scale population dynamics: over 90% accuracy in election predictions, consistent user reactions to breaking news, and close alignment with real-world economic statistics. The experiments also showed that both the prior user distribution and real-world knowledge enhance simulation accuracy.
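One of the evaluation metrics the paper lists for comparing simulated against real populations is KL divergence between discrete response distributions. A minimal sketch of that comparison (our own code and example numbers, not the paper's):

```python
import math
from typing import Dict

def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) between a real-world answer distribution p and a
    simulated one q over the same discrete options; lower means the
    simulation tracks reality more closely."""
    return sum(pv * math.log(pv / max(q.get(opt, 0.0), eps))
               for opt, pv in p.items() if pv > 0)

# Hypothetical survey example: real vs. simulated approval shares.
real = {"approve": 0.6, "disapprove": 0.4}
sim = {"approve": 0.55, "disapprove": 0.45}
print(round(kl_divergence(real, sim), 4))
```

The `eps` floor guards against simulated distributions that assign zero mass to an option real users chose.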

SocioVerse methodological workflow (summarized): aligning simulation with reality, starting from real-world data sources (social media, news, statistics).
- Social environment (aligns context): collects and updates social structure, social dynamics (events), and personalized context; outputs dynamic knowledge.
- User engine (aligns agents with users): (1) builds the 10M user pool from X and Rednote data, (2) infers user labels (LLM, then human evaluation, then a classifier), (3) samples the target group (IPF or IDS depending on the task), and (4) filters anomalous data; outputs aligned user profiles.
- Scenario engine (aligns interaction structure): defines the task/query and selects a scenario template: questionnaire (1-N), in-depth interview (1-1), or behavior experiment (N-N); outputs the simulation setup.
- Behavior engine (aligns agent behavior): integrates the context, profiles, and setup, and selects an agent model (ABM, or a general, expert, or domain LLM); outputs simulated behaviors.
- Validation and application: simulated behaviors are compared against ground truth (metrics: accuracy, RMSE, NRMSE, KL divergence). Example applications: US presidential election prediction, breaking news feedback (ChatGPT), and a national economic survey (China).
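The user engine's IPF sampling step refers to iterative proportional fitting, the classic algorithm for rescaling a seed contingency table (e.g., counts over two demographic attributes from the user pool) until its row and column sums match target marginals. A toy 2-D sketch under our own naming (the paper's actual attributes and dimensionality may differ):

```python
from typing import List

def ipf(seed: List[List[float]], row_targets: List[float],
        col_targets: List[float], iters: int = 50) -> List[List[float]]:
    """Iterative proportional fitting: alternately rescale rows and
    columns of a seed table until both sets of marginals are matched."""
    t = [row[:] for row in seed]
    for _ in range(iters):
        for i, target in enumerate(row_targets):   # match row totals
            s = sum(t[i])
            if s > 0:
                t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_targets):   # match column totals
            s = sum(row[j] for row in t)
            if s > 0:
                for row in t:
                    row[j] *= target / s
    return t

# Fit a uniform seed to target marginals (rows sum to 10/10, columns to 12/8).
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], [10.0, 10.0], [12.0, 8.0])
```

Note that IPF only converges when the row and column targets have the same grand total, which holds when both are derived from the same census population.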
Q1. What is the primary innovation of SocioVerse compared to previous social simulation approaches?
- It uses more powerful LLMs than previous approaches
- It incorporates a 10-million real user pool with four alignment components
- It focuses exclusively on political simulations

Q2. In the presidential election prediction simulation, which factor had the most significant impact on improving accuracy?
- The choice of underlying LLM model
- The social environment's real-world knowledge
- The prior demographics distribution of users

Q3. What interesting pattern was observed about LLM performance in the national economic survey?
- All models performed best on housing spending and worst on daily necessities
- Models showed inconsistent performance across different spending categories
- All models performed best on daily necessities spending and worst on housing spending