2025-05-27 Papers


Paper 1

ARM: Adaptive Reasoning Model

Published: 2025-05-26

Link: http://arxiv.org/pdf/2505.20258

1. 📘 Topic and Domain: The paper introduces ARM (Adaptive Reasoning Model), focusing on improving the efficiency of large language models' reasoning capabilities in the domain of natural language processing and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on prior work on large reasoning models and Group Relative Policy Optimization (GRPO), the paper proposes an approach that lets models adaptively select an appropriate reasoning format based on task difficulty, rather than applying a uniform reasoning strategy to every task.
3. ❓ Problem: The paper aims to solve the "overthinking" problem in large reasoning models, where models apply unnecessarily complex reasoning to all tasks regardless of difficulty, leading to excessive token usage and computational inefficiency.
4. 🛠️ Methods: The paper uses a two-stage training approach: first applying supervised fine-tuning to teach the model four reasoning formats (Direct Answer, Short CoT, Code, and Long CoT), then implementing Ada-GRPO, an adapted version of GRPO with a format diversity reward mechanism.
5. 📊 Results and Evaluation: ARM achieved comparable accuracy while reducing token usage by ~30% on average (up to ~70% in some cases) compared to models using only Long CoT, and demonstrated a ~2× training speedup compared to traditional GRPO while maintaining performance across various reasoning tasks.
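The Ada-GRPO format diversity reward described above, r'_i = α_i(t) · r_i, can be sketched in a few lines. This is a minimal illustration of the shape of the mechanism, assuming a rarity-based bonus that decays linearly over training; the paper's exact α_i(t) schedule may differ.

```python
def ada_grpo_reward(base_reward: float, format_count: int, total_rollouts: int,
                    step: int, total_steps: int) -> float:
    """Scale a rollout's reward to keep rarer reasoning formats alive.

    Illustrates r'_i = alpha_i(t) * r_i. The rarity bonus and linear
    decay below are illustrative assumptions, not the paper's exact formula.
    """
    # Rarity bonus: formats sampled less often within the rollout group get
    # a larger multiplier, discouraging collapse onto a single format.
    rarity = total_rollouts / max(format_count, 1)
    # Decay mechanism: the diversity influence shrinks as training
    # progresses, so the final policy is driven mostly by task reward.
    decay = 1.0 - step / total_steps
    alpha = 1.0 + (rarity - 1.0) * decay
    return alpha * base_reward
```

Early in training a format sampled once out of four rollouts gets a 4× boost; by the final step α decays to 1 and the reward reduces to plain GRPO.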

ARM: Adaptive Reasoning Model (overview figure)

- Stage 1: Supervised fine-tuning on the AQuA-Rat dataset (10.8K samples) to teach four reasoning formats: Direct Answer, Short CoT, Code, and Long CoT.
- Stage 2: Ada-GRPO training with a format diversity reward, r'_i = α_i(t) · r_i, where a decay mechanism gradually reduces the diversity influence over training. Training datasets: CSQA (4.9K), GSM8K (7.4K), MATH (7.5K).
- Inference modes: Adaptive, Instruction-Guided, and Consensus-Guided.
Q1. What is the main problem that ARM (Adaptive Reasoning Model) aims to solve?
- Models taking too long to generate any response
- Models using unnecessarily complex reasoning for simple tasks
- Models being unable to handle complex mathematical problems

Q2. Which of these is NOT one of the four reasoning formats used in ARM's training?
- Mathematical Reasoning
- Direct Answer
- Code

Q3. What is the most significant efficiency improvement achieved by ARM compared to models using only Long CoT?
- Reduced training time by 50%
- Reduced token usage by up to 70%
- Improved accuracy by 30%

Paper 2

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Published: 2025-05-25

Link: http://arxiv.org/pdf/2505.19457

1. 📘 Topic and Domain: The paper introduces BizFinBench, a comprehensive benchmark for evaluating large language models' performance in real-world financial applications.
2. 💡 Previous Research and New Ideas: Previous benchmarks like FLUE and FinEval focused on general financial knowledge testing, while this paper proposes a novel business-driven benchmark with real-world financial scenarios and introduces IteraJudge, a new evaluation method.
3. ❓ Problem: The paper addresses the gap between existing financial benchmarks and real-world applications, where current evaluation methods fail to adequately assess LLMs' performance in complex, business-oriented financial tasks.
4. 🛠️ Methods: The authors constructed a dataset of 6,781 queries across 5 dimensions and 9 categories from real user interactions, and developed IteraJudge, an iterative calibration-based evaluation framework for assessing LLM performance.
5. 📊 Results and Evaluation: Testing 25 models showed that no single model dominated across all tasks, with proprietary models like ChatGPT-o3 performing best in reasoning tasks (83.58%) and open-source models like DeepSeek-R1 excelling in numerical calculations (64.04%).
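The three-step IteraJudge procedure (dimension-decoupled assessment, sequential correction generation, reference-aligned assessment) can be sketched as a loop. This is a hypothetical sketch: the `judge` callable, prompt strings, and return shape are stand-ins for the paper's actual LLM-judge implementation.

```python
def itera_judge(answer: str, reference: str, dimensions: list[str], judge) -> dict:
    """Sketch of an IteraJudge-style iterative calibration loop.

    `judge` is a hypothetical callable wrapping an LLM call; prompts and
    return types here are illustrative, not the paper's implementation.
    """
    scores = {}
    current = answer
    for dim in dimensions:
        # 1. Dimension-decoupled assessment: judge one dimension at a time
        #    so errors on one axis don't bleed into the others.
        scores[dim] = judge(f"Rate '{current}' on {dim} against '{reference}'")
        # 2. Sequential correction generation: refine the answer along this
        #    dimension before judging the next one.
        current = judge(f"Correct '{current}' for {dim} using '{reference}'")
    # 3. Reference-aligned assessment of the fully corrected answer.
    scores["overall"] = judge(f"Compare '{current}' with '{reference}'")
    return scores
```

The design point is that scoring happens per dimension on a progressively corrected answer, which reduces the bias of a single monolithic judge prompt.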

BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs (pipeline figure)

- Pipeline: data collection from the iwencai app → data processing (GPT-4o cleaning) → dataset construction (expert validation).
- Five key dimensions: Numerical Calculation, Information Extraction, Reasoning, Prediction Recognition, Question Answering.
- IteraJudge framework: (1) dimension-decoupled assessment, (2) sequential correction generation, (3) reference-aligned assessment.
Q1. What is the main innovation of BizFinBench compared to previous financial benchmarks?
- It has a larger dataset size
- It focuses on business-driven real-world scenarios
- It only evaluates open-source models

Q2. What unique evaluation method does the paper introduce to assess LLM performance?
- IteraJudge - an iterative calibration-based framework
- Traditional human evaluation only
- Simple accuracy metrics comparison

Q3. According to the paper's results, which type of models performed best in reasoning tasks?
- Open-source models like DeepSeek-R1
- Proprietary models like ChatGPT-o3
- Smaller specialized financial models

Paper 3

Lifelong Safety Alignment for Language Models

Published: 2025-05-26

Link: http://arxiv.org/pdf/2505.20259

1. 📘 Topic and Domain: The paper addresses safety alignment for Large Language Models (LLMs) through a lifelong learning framework that continuously adapts to evolving jailbreaking attacks.
2. 💡 Previous Research and New Ideas: Building on existing safety-alignment and jailbreaking research, it proposes a novel competitive framework in which a Meta-Attacker discovers new attack strategies and a Defender learns to resist them.
3. ❓ Problem: The paper aims to solve the vulnerability of static safety-aligned LLMs to new and unseen jailbreaking attacks that emerge after deployment.
4. 🛠️ Methods: The authors implement a two-stage framework: first warming up a Meta-Attacker using GPT-4 to extract insights from jailbreak research papers, then creating an iterative adversarial process where the Meta-Attacker and Defender evolve through competitive training.
5. 📊 Results and Evaluation: The initial Meta-Attacker achieved 73% attack success rate on RR and 57% transfer rate on LAT, while the evolved Defender reduced attack success to 7%, maintaining helpful capabilities across standard benchmarks.
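The iterative adversarial process above can be sketched as a loop over the success/failed buffers of (s, x, y, g) = (strategy, jailbreak prompt, model response, goal) tuples shown in the paper's figure. The `attacker`/`defender` objects and their methods are hypothetical stand-ins for the paper's training procedures, not its actual API.

```python
def lifelong_alignment(attacker, defender, strategies, max_iters: int = 10):
    """Sketch of the lifelong Meta-Attacker vs. Defender loop.

    The buffer layout (s, x, y, g) follows the paper's figure; all method
    names on `attacker` and `defender` are illustrative assumptions.
    """
    for _ in range(max_iters):
        success_buffer, failed_buffer = [], []
        for s in strategies:
            x = attacker.generate(s)          # craft a jailbreak prompt
            y = defender.respond(x)           # defended model's answer
            g = attacker.goal(s)              # the harmful goal being probed
            if attacker.is_success(y, g):
                success_buffer.append((s, x, y, g))
            else:
                failed_buffer.append((s, x, y, g))
        if not success_buffer:                # no attack landed: stop early
            break
        # Meta-Attacker evolves on what worked (with failures for rejection
        # fine-tuning); Defender does refusal training on successful attacks.
        attacker.fine_tune(success_buffer, failed_buffer)
        defender.refusal_train(success_buffer)
    return defender
```

Each side's improvement generates the other's next training signal, which is what makes the alignment "lifelong" rather than a one-shot safety pass.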

Lifelong Safety Alignment for Language Models (framework figure)

- Warm-up stage: extract attack strategies from jailbreak papers using the GPT-4o API.
- Meta-Attacker evolution: generate jailbreak questions via beam search and rejection fine-tuning.
- Defender evolution: refusal training on successful attack cases and refusal outputs.
- Success and failed buffers store (s, x, y, g) tuples; lifelong iterations continue until K or N is reached.
Q1. What is the main innovation in this paper's approach to LLM safety alignment?
- Using GPT-4 to analyze research papers
- Creating a lifelong competitive framework between an attacker and defender
- Developing new jailbreaking techniques

Q2. How does the Meta-Attacker initially learn attack strategies?
- By randomly generating attack patterns
- Through trial and error against the defender
- By extracting insights from jailbreak research papers using GPT-4

Q3. What was the final impact of the evolved Defender on attack success rate?
- Reduced it to 7% while maintaining helpful capabilities
- Eliminated all attacks but lost helpful capabilities
- Reduced it to 25% with some capability loss