2025-05-05 Papers


Paper 1

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20966

1. 📘 Topic and Domain: The paper introduces "softpick," a new attention mechanism for transformer models in deep learning, specifically focusing on improving attention computation.
2. 💡 Previous Research and New Ideas: Building on traditional softmax attention in transformers, it proposes a novel rectified normalization function that is not constrained to sum to one, as a drop-in replacement for softmax in attention.
3. ❓ Problem: The paper aims to solve two major issues in transformer models: attention sink (where attention heads allocate disproportionately large scores to the first token, e.g., BOS, regardless of its relevance) and massive activations (extremely large hidden-state values).
4. 🛠️ Methods: The authors implemented and tested softpick on 340M parameter transformer models, comparing it with traditional softmax models using the same architecture and training configuration.
5. 📊 Results and Evaluation: Results showed softpick maintained performance parity with softmax on benchmarks while achieving a 0% sink rate, reducing hidden-state kurtosis from 33,510 to 340, producing attention maps with 46.97% sparsity, and performing better under quantization.


Softpick Paper Methodology Flowchart (the Softpick Function and its Evaluation)

Problem: Softmax Issues
Standard softmax in transformer attention:
- Causes attention sink (scores concentrated on the BOS token)
- Leads to massive activations
- Hinders quantization and low-precision training

Proposed Solution: Softpick Function
Drop-in replacement for softmax:
softpick(x)_i = ReLU(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + epsilon)
Key ideas:
- ReLU numerator: enables exactly zero scores (sparsity)
- Absolute-value denominator: breaks sum-to-one normalization and preserves gradients for negative inputs

Implementation
Apply in the attention mechanism:
Attention(Q, K, V) = softpick(QK^T / sqrt(d_k)) V
Compatible with FlashAttention (an online algorithm is derived).

Experimental Setup
Train two 340M-parameter Llama-style models from scratch:
- Model 1: standard softmax attention
- Model 2: softpick attention
Dataset: FineWeb-Edu (52B tokens). Hardware: 8x H100 GPUs. Framework: flash-linear-attention / Flame.
Goal: compare performance and behavior.

Analysis & Evaluation
Compare softmax vs. softpick models on:
1. Training: loss, gradient norm
2. Benchmarks: ARC-e, Lambada, PIQA, SciQ, Wikitext
3. Quantization: HQQ, BNB, GPTQ (2-, 3-, 4-, 8-bit)
4. Internal states: attention maps (visuals, sparsity %), sink rate %, hidden-state kurtosis, min/max activations

Key Results & Implications
Softpick model performance:
- Benchmark parity: similar to or slightly better than softmax
- Quantization robustness: consistently outperforms softmax, especially at low bit-precision (e.g., 2-bit)
Softpick model properties:
- No attention sink: 0% sink rate
- No massive activations: ~100x lower kurtosis (340 vs. 33k)
- Sparse attention: ~47% sparsity in attention maps
Implications: benefits for quantization, low-precision training, sparsity, pruning, and interpretability.

Challenge / Open Problem
Long-context under-scoring:
- Softpick scores can become too small with long contexts and sparse patterns.
- This affects retrieval tasks (e.g., passkey).
- The Scalable-Softpick approach didn't solve it.
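As a concrete illustration, here is a minimal NumPy sketch of the softpick function defined above. It implements the naive formula directly and omits the numerically stabilized, FlashAttention-compatible online version the paper derives.

```python
import numpy as np

def softpick(x, eps=1e-8):
    # softpick(x)_i = ReLU(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps)
    num = np.maximum(np.exp(x) - 1.0, 0.0)        # ReLU numerator: exact zeros
    den = np.sum(np.abs(np.exp(x) - 1.0)) + eps   # |.| denominator: breaks sum-to-one
    return num / den

scores = np.array([2.0, 0.0, -1.0, -3.0])
weights = softpick(scores)
# non-positive logits get exactly zero weight (sparsity),
# and the weights need not sum to one
```

Note that an all-negative score row yields an all-zero output, which is exactly what removes the need for a sink token to absorb unused attention mass.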
Q1
1. What are the primary issues with traditional softmax attention in transformers that Softpick is designed to address?
Vanishing gradients and overfitting.
High computational cost and inability to process long sequences.
Attention sink and massive activations in hidden states.
Q2
2. Compared to softmax, what is a key characteristic observed in the attention maps generated by Softpick?
They are denser, with fewer zero-valued scores.
They exhibit significant sparsity, with many zero-valued scores.
They show higher attention scores allocated to the first token.
Q3
3. Based on the paper's findings, how does Softpick perform relative to softmax when models are quantized?
Quantized Softpick models consistently outperform quantized softmax models.
Quantized Softpick models perform worse, especially at lower bit precisions.
Quantization has a similar negative impact on both Softpick and softmax models.

Paper 2

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Published: 2025-04-30

Link: http://arxiv.org/pdf/2505.00234

1. 📘 Topic and Domain: The paper explores improving Large Language Model (LLM) agents for sequential decision-making tasks through self-generated in-context examples rather than task-specific knowledge engineering.
2. 💡 Previous Research and New Ideas: Previous research relied on task-specific knowledge engineering through prompt tuning and curated examples, while this paper proposes using the agent's own successful experiences to automatically improve performance.
3. ❓ Problem: The paper addresses how to improve LLM agent performance without relying on labor-intensive task-specific knowledge engineering and prompt tuning.
4. 🛠️ Methods: The authors developed three approaches: Traj-Bootstrap (collecting successful trajectories), +DB-Selection (population-based training to identify high-performing databases), and +Exemplar-Selection (retaining individual trajectories based on empirical utility).
5. 📊 Results and Evaluation: The methods improved test performance significantly across three benchmarks - ALFWorld (73% to 91%), Wordcraft (55% to 72%), and InterCode-SQL (75% to 81%) - matching or exceeding performance of more complex approaches that use task-specific components.


Workflow: Self-Generated In-Context Examples for LLM Agents

1. Initial Setup
- Base agent: ReAct-style (plan, reason, act)
- Input: initial human-provided examples D₀
- Environment: training tasks (T_train)
- Goal: construct an optimal trajectory database D

2a. Traj-Bootstrap (Naive Accumulation)
- Start with D₀
- Loop through training tasks: the agent attempts each task using the current D
- If the attempt succeeds, add the trajectory τ to D; otherwise discard τ
- Output: D_TB

2b. +DB-Selection (Population Training)
- Initialize N parallel databases D₁ … Dɴ
- Loop through training tasks in parallel: agent i attempts each task using Dᵢ; on success, add τ to Dᵢ
- Periodically (e.g., every 2ʲ tasks): evaluate the recent performance of each Dᵢ and replace the worst Dᵢ with a copy of the best Dᵢ
- Output: best D_DB_Sel

2c. +Exemplar-Selection (Quality Filter)
- Requires generated trajectories (e.g., from 2b); collect all successful trajectories T_all
- Compute a quality metric Q(τ) for each τ, based on retrieval frequency and success contribution
- For each unique training task t, select the τ* with the highest Q(τ) among the successful attempts for that task
- Output: D_Ex_Sel

3. Evaluation
- Use the constructed database (D_TB, D_DB_Sel, or D_Ex_Sel)
- Run the base agent on unseen test tasks (T_test)
- Measure task success rate and compare performance across methods

Agent Interaction Loop (Simplified)
1. Retrieve relevant examples from D (based on goal, plan, observation, reasoning)
2. LLM_plan: generate initial plan p
3. Loop for t = 1 … T:
   - Retrieve (based on g, p, o_t / r_t)
   - LLM_reason: generate reasoning r_t
   - Retrieve (based on g, p, r_t)
   - LLM_act: decide action a_t
   - Execute a_t to obtain o_{t+1}
(The agent uses D throughout execution.)
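The naive accumulation loop of step 2a can be sketched in a few lines of Python. The `agent.attempt` interface and the shape of the trajectories here are hypothetical stand-ins for illustration, not the paper's actual code.

```python
def traj_bootstrap(agent, train_tasks, initial_examples):
    """Traj-Bootstrap (step 2a): attempt each training task with the
    current database as in-context examples; keep only successful
    trajectories and discard failures."""
    db = list(initial_examples)  # start from the human-provided D0
    for task in train_tasks:
        trajectory, success = agent.attempt(task, examples=db)
        if success:
            db.append(trajectory)  # becomes a retrievable example for later tasks
    return db
```

Because successful trajectories are added as the loop runs, later tasks already benefit from examples generated on earlier ones, which is what lets the agent improve without any weight updates.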
Q1
1. What is the primary limitation of existing methods for improving LLM agents for sequential decision-making that this paper seeks to overcome?
Their inability to handle complex, long-horizon tasks.
Their dependence on labor-intensive, task-specific knowledge engineering.
Their high computational cost during test-time inference.
Q2
2. Which of the following is NOT one of the trajectory database construction methods proposed and evaluated in the paper?
Fine-tuning the LLM weights directly on successful trajectories.
Accumulating all successful self-generated trajectories (Traj-Bootstrap).
Selecting individual high-performing trajectories based on empirical utility (+Exemplar-Selection).
Q3
3. According to the paper's results, the performance boost from using self-generated in-context examples via Traj-Bootstrap is comparable to what alternative strategy on the benchmarks?
Using a simpler LLM but with extensive prompt tuning.
Allowing the baseline agent two to three attempts per task at test time.
Reducing the action space size to simplify the decision-making process.

Paper 3

Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20708

1. 📘 Topic and Domain: Analysis of Large Language Models' reasoning processes through examination of intermediate steps ("subthoughts") in mathematical problem-solving.
2. 💡 Previous Research and New Ideas: Based on Chain-of-Thought prompting research; proposes a novel approach to analyze intermediate reasoning steps rather than just final answers.
3. ❓ Problem: The paper addresses the limitation of evaluating LLMs solely on their final answers, potentially missing valuable information encoded within the reasoning process.
4. 🛠️ Methods: Segments reasoning traces into subthoughts, generates multiple solution completions from each intermediate point, and aggregates answers using mode frequency analysis and entropy measurements.
5. 📊 Results and Evaluation: Achieved significant accuracy improvements (up to 13% on AIME2024 and 10% on AIME2025) across various models by using the most frequent answer from subthought completions instead of final answers, with lower entropy correlating strongly with correct solutions.


Subthought Reasoning Analysis Workflow

1. Initial Trace
- Input: problem P
- Run LLM(P) with greedy decoding
- Output: full trace T and answer A_last

2. Segmentation
- Split the full trace T using linguistic markers W
- Output: subthoughts (s1, …, sn)

3. Subthought Completion
- For each i in 1…n: form the partial trace T_i = s1 + … + s_i, build the prompt P_i = Format(P, T_i), and generate the completion C_i = LLM(P_i) (using greedy or non-greedy sampling)
- Output: n responses (R1, …, Rn)

4. Answer Extraction
- Extract the final answer A_i from each response R_i
- Output: answer set A = {A1, …, An}

5. Analysis & Aggregation
- Inputs: answer set A and baseline answer A_last
- A. Analyze the evolution and distribution of answers (e.g., plot A_i vs. i, compute entropy H(A))
- B. Aggregate: compute the mode A_mode = MostFrequent(A)
- Output: insights and A_mode

6. Evaluation
- Inputs: A_mode, A_last, and ground truth A_true
- Compare accuracy: Acc_Last = Accuracy(A_last, A_true) vs. Acc_MostFreq = Accuracy(A_mode, A_true)
- Hypothesis: Acc_MostFreq >= Acc_Last

Key Elements & Concepts
- Problem (P): input question requiring reasoning
- LLM (M): language model used for generation
- Subthoughts (s1…sn): segments of the reasoning trace
- Markers (W): linguistic cues for segmentation
- A_last: answer from the initial full trace (baseline)
- A_i: answer from the completion after subthought s_i
- A_mode: most frequent answer in {A1…An} (proposed)
- Entropy H(A): measure of answer consistency
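Steps 4 and 5 above reduce to a simple mode-and-entropy computation over the extracted answers; here is a minimal sketch (the answer strings are illustrative, not taken from the paper):

```python
from collections import Counter
import math

def aggregate_answers(answers):
    """Return the most frequent answer (A_mode) and the entropy H(A)
    of the answer distribution; lower entropy indicates more consistent
    reasoning across subthoughts."""
    counts = Counter(answers)
    n = len(answers)
    mode = counts.most_common(1)[0][0]
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return mode, entropy

mode, h = aggregate_answers(["42", "42", "41", "42", "7"])
# mode == "42"; mixed answers give nonzero entropy,
# while unanimous answers give entropy 0
```

This is what makes the entropy signal useful as a confidence proxy: a model that keeps arriving at the same answer from different intermediate points yields a peaked, low-entropy distribution.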
Q1
1. According to the paper, what is a key limitation of standard LLM evaluation practices for reasoning tasks?
They only evaluate the speed of reasoning, not accuracy.
They rely solely on the final answer, overlooking the intermediate reasoning steps.
They do not use Chain-of-Thought prompting.
Q2
2. How does the proposed method in the paper aggregate the potential answers derived from completing reasoning traces at different intermediate subthoughts?
By taking the average of all extracted numerical answers.
By selecting the most frequently occurring answer (the mode).
By choosing the answer from the longest reasoning trace.
Q3
3. The paper found that the entropy of the answer distribution derived from subthought completions correlates with correctness. What does lower entropy typically indicate in this context?
The model struggled with the problem, producing many different answers.
The model's reasoning was more consistent across subthoughts, often correlating with a correct answer.
The reasoning trace was shorter than average.