2025-05-05 Papers


Paper 1

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20966

1. 📘 Topic and Domain: The paper introduces "softpick," a new attention mechanism for transformer models in deep learning, specifically focusing on improving attention computation.
2. 💡 Previous Research and New Ideas: Building on traditional softmax attention in transformers, it proposes a novel rectified normalization function that is not constrained to sum to one, as a drop-in replacement for softmax in attention.
3. ❓ Problem: The paper aims to solve two major issues in transformer models: attention sink (where attention heads allocate disproportionately large scores to the first token, e.g., BOS, regardless of its relevance) and massive activations (extremely large hidden-state values).
4. 🛠️ Methods: The authors implemented and tested softpick on 340M parameter transformer models, comparing it with traditional softmax models using the same architecture and training configuration.
5. 📊 Results and Evaluation: Results showed softpick maintained performance parity with softmax on benchmarks while achieving a 0% sink rate, reducing hidden-state kurtosis from 33,510 to 340, producing attention maps with 46.97% sparsity, and performing better under quantization.


Softpick Paper Methodology Flowchart (the Softpick Function and its Evaluation)

Problem: Softmax Issues
Standard softmax in transformer attention:
- Causes attention sink (scores concentrated on the BOS token)
- Leads to massive activations
- Hinders quantization and low-precision training

Proposed Solution: Softpick Function
Drop-in replacement for softmax:
softpick(x)_i = ReLU(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + epsilon)
Key ideas:
- ReLU numerator: enables exactly zero scores (sparsity)
- Absolute-value denominator: breaks sum-to-one normalization and preserves gradients for negative inputs

Implementation
Apply in the attention mechanism:
Attention(Q, K, V) = softpick(QK^T / sqrt(d_k)) V
Compatible with FlashAttention (an online algorithm is derived).

Experimental Setup
Train two 340M-parameter Llama-style models from scratch:
- Model 1: standard softmax attention
- Model 2: softpick attention
Dataset: FineWeb-Edu (52B tokens). Hardware: 8x H100 GPUs. Framework: flash-linear-attention / Flame.
Goal: compare performance and behavior.

Analysis & Evaluation
Compare softmax vs. softpick models on:
1. Training: loss, gradient norm
2. Benchmarks: ARC-e, Lambada, PIQA, SciQ, Wikitext
3. Quantization: HQQ, BNB, GPTQ (2-, 3-, 4-, 8-bit)
4. Internal states: attention maps (visuals, sparsity %), sink rate %, hidden-state kurtosis, min/max activations

Key Results & Implications
Softpick model performance:
- Benchmark parity: similar to or slightly better than softmax
- Quantization robustness: consistently outperforms softmax, especially at low bit-precision (e.g., 2-bit)
Softpick model properties:
- No attention sink: 0% sink rate
- No massive activations: ~100x lower kurtosis (340 vs. 33k)
- Sparse attention: ~47% sparsity in attention maps
Implications: benefits for quantization, low-precision training, sparsity, pruning, and interpretability.

Challenge / Open Problem
Long-context under-scoring:
- Softpick scores can become too small with long contexts and sparse patterns.
- This affects retrieval tasks (e.g., passkey).
- The Scalable-Softpick approach didn't solve it.
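As a concrete illustration, here is a minimal NumPy sketch of the softpick function defined above. It implements the naive formula directly and omits the numerically stabilized, FlashAttention-compatible online version the paper derives.

```python
import numpy as np

def softpick(x, eps=1e-8):
    # softpick(x)_i = ReLU(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps)
    num = np.maximum(np.exp(x) - 1.0, 0.0)        # ReLU numerator: exact zeros
    den = np.sum(np.abs(np.exp(x) - 1.0)) + eps   # |.| denominator: breaks sum-to-one
    return num / den

scores = np.array([2.0, 0.0, -1.0, -3.0])
weights = softpick(scores)
# non-positive logits get exactly zero weight (sparsity),
# and the weights need not sum to one
```

Note that an all-negative score row yields an all-zero output, which is exactly what removes the need for a sink token to absorb unused attention mass.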
Q1
1. What are the primary issues with traditional softmax attention in transformers that Softpick is designed to address?
Vanishing gradients and overfitting.
High computational cost and inability to process long sequences.
Attention sink and massive activations in hidden states.
Q2
2. Compared to softmax, what is a key characteristic observed in the attention maps generated by Softpick?
They are denser, with fewer zero-valued scores.
They exhibit significant sparsity, with many zero-valued scores.
They show higher attention scores allocated to the first token.
Q3
3. Based on the paper's findings, how does Softpick perform relative to softmax when models are quantized?
Quantized Softpick models consistently outperform quantized softmax models.
Quantized Softpick models perform worse, especially at lower bit precisions.
Quantization has a similar negative impact on both Softpick and softmax models.

Paper 2

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Published: 2025-04-30

Link: http://arxiv.org/pdf/2505.00234

1. 📘 Topic and Domain: The paper explores improving Large Language Model (LLM) agents for sequential decision-making tasks through self-generated in-context examples rather than task-specific knowledge engineering.
2. 💡 Previous Research and New Ideas: Previous research relied on task-specific knowledge engineering through prompt tuning and curated examples, while this paper proposes using the agent's own successful experiences to automatically improve performance.
3. ❓ Problem: The paper addresses how to improve LLM agent performance without relying on labor-intensive task-specific knowledge engineering and prompt tuning.
4. 🛠️ Methods: The authors developed three approaches: Traj-Bootstrap (collecting successful trajectories), +DB-Selection (population-based training to identify high-performing databases), and +Exemplar-Selection (retaining individual trajectories based on empirical utility).
5. 📊 Results and Evaluation: The methods improved test performance significantly across three benchmarks - ALFWorld (73% to 91%), Wordcraft (55% to 72%), and InterCode-SQL (75% to 81%) - matching or exceeding performance of more complex approaches that use task-specific components.


Workflow: Self-Generated In-Context Examples for LLM Agents

1. Initial Setup
- Base agent: ReAct-style (plan, reason, act)
- Input: initial human-provided examples D₀
- Environment: training tasks (T_train)
- Goal: construct an optimal trajectory database D

2a. Traj-Bootstrap (Naive Accumulation)
- Start with D₀
- Loop through training tasks: the agent attempts each task using the current D
- If the attempt succeeds, add the trajectory τ to D; otherwise discard τ
- Output: D_TB

2b. +DB-Selection (Population Training)
- Initialize N parallel databases D₁ … Dɴ
- Loop through training tasks in parallel: agent i attempts each task using Dᵢ; on success, add τ to Dᵢ
- Periodically (e.g., every 2ʲ tasks): evaluate the recent performance of each Dᵢ and replace the worst Dᵢ with a copy of the best Dᵢ
- Output: best D_DB_Sel

2c. +Exemplar-Selection (Quality Filter)
- Requires generated trajectories (e.g., from 2b); collect all successful trajectories T_all
- Compute a quality metric Q(τ) for each τ, based on retrieval frequency and success contribution
- For each unique training task t, select the τ* with the highest Q(τ) among the successful attempts for that task
- Output: D_Ex_Sel

3. Evaluation
- Use the constructed database (D_TB, D_DB_Sel, or D_Ex_Sel)
- Run the base agent on unseen test tasks (T_test)
- Measure task success rate and compare performance across methods

Agent Interaction Loop (Simplified)
1. Retrieve relevant examples from D (based on goal, plan, observation, reasoning)
2. LLM_plan: generate initial plan p
3. Loop for t = 1 … T:
   - Retrieve (based on g, p, o_t / r_t)
   - LLM_reason: generate reasoning r_t
   - Retrieve (based on g, p, r_t)
   - LLM_act: decide action a_t
   - Execute a_t to obtain o_{t+1}
(The agent uses D throughout execution.)
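The naive accumulation loop of step 2a can be sketched in a few lines of Python. The `agent.attempt` interface and the shape of the trajectories here are hypothetical stand-ins for illustration, not the paper's actual code.

```python
def traj_bootstrap(agent, train_tasks, initial_examples):
    """Traj-Bootstrap (step 2a): attempt each training task with the
    current database as in-context examples; keep only successful
    trajectories and discard failures."""
    db = list(initial_examples)  # start from the human-provided D0
    for task in train_tasks:
        trajectory, success = agent.attempt(task, examples=db)
        if success:
            db.append(trajectory)  # becomes a retrievable example for later tasks
    return db
```

Because successful trajectories are added as the loop runs, later tasks already benefit from examples generated on earlier ones, which is what lets the agent improve without any weight updates.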
Q1
1. What is the primary limitation of existing methods for improving LLM agents for sequential decision-making that this paper seeks to overcome?
Their inability to handle complex, long-horizon tasks.
Their dependence on labor-intensive, task-specific knowledge engineering.
Their high computational cost during test-time inference.
Q2
2. Which of the following is NOT one of the trajectory database construction methods proposed and evaluated in the paper?
Fine-tuning the LLM weights directly on successful trajectories.
Accumulating all successful self-generated trajectories (Traj-Bootstrap).
Selecting individual high-performing trajectories based on empirical utility (+Exemplar-Selection).
Q3
3. According to the paper's results, the performance boost from using self-generated in-context examples via Traj-Bootstrap is comparable to what alternative strategy on the benchmarks?
Using a simpler LLM but with extensive prompt tuning.
Allowing the baseline agent two to three attempts per task at test time.
Reducing the action space size to simplify the decision-making process.

Paper 3

Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20708

1. 📘 Topic and Domain: Analysis of Large Language Models' reasoning processes through examination of intermediate steps ("subthoughts") in mathematical problem-solving.
2. 💡 Previous Research and New Ideas: Based on Chain-of-Thought prompting research; proposes a novel approach to analyze intermediate reasoning steps rather than just final answers.
3. ❓ Problem: The paper addresses the limitation of evaluating LLMs solely on their final answers, potentially missing valuable information encoded within the reasoning process.
4. 🛠️ Methods: Segments reasoning traces into subthoughts, generates multiple solution completions from each intermediate point, and aggregates answers using mode frequency analysis and entropy measurements.
5. 📊 Results and Evaluation: Achieved significant accuracy improvements (up to 13% on AIME2024 and 10% on AIME2025) across various models by using the most frequent answer from subthought completions instead of final answers, with lower entropy correlating strongly with correct solutions.


Subthought Reasoning Analysis Workflow

1. Initial Trace
- Input: problem P
- Run LLM(P) with greedy decoding
- Output: full trace T and answer A_last

2. Segmentation
- Split the full trace T using linguistic markers W
- Output: subthoughts (s1, …, sn)

3. Subthought Completion
- For each i in 1…n: form the partial trace T_i = s1 + … + s_i, build the prompt P_i = Format(P, T_i), and generate the completion C_i = LLM(P_i) (using greedy or non-greedy sampling)
- Output: n responses (R1, …, Rn)

4. Answer Extraction
- Extract the final answer A_i from each response R_i
- Output: answer set A = {A1, …, An}

5. Analysis & Aggregation
- Inputs: answer set A and baseline answer A_last
- A. Analyze the evolution and distribution of answers (e.g., plot A_i vs. i, compute entropy H(A))
- B. Aggregate: compute the mode A_mode = MostFrequent(A)
- Output: insights and A_mode

6. Evaluation
- Inputs: A_mode, A_last, and ground truth A_true
- Compare accuracy: Acc_Last = Accuracy(A_last, A_true) vs. Acc_MostFreq = Accuracy(A_mode, A_true)
- Hypothesis: Acc_MostFreq >= Acc_Last

Key Elements & Concepts
- Problem (P): input question requiring reasoning
- LLM (M): language model used for generation
- Subthoughts (s1…sn): segments of the reasoning trace
- Markers (W): linguistic cues for segmentation
- A_last: answer from the initial full trace (baseline)
- A_i: answer from the completion after subthought s_i
- A_mode: most frequent answer in {A1…An} (proposed)
- Entropy H(A): measure of answer consistency
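Steps 4 and 5 above reduce to a simple mode-and-entropy computation over the extracted answers; here is a minimal sketch (the answer strings are illustrative, not taken from the paper):

```python
from collections import Counter
import math

def aggregate_answers(answers):
    """Return the most frequent answer (A_mode) and the entropy H(A)
    of the answer distribution; lower entropy indicates more consistent
    reasoning across subthoughts."""
    counts = Counter(answers)
    n = len(answers)
    mode = counts.most_common(1)[0][0]
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return mode, entropy

mode, h = aggregate_answers(["42", "42", "41", "42", "7"])
# mode == "42"; mixed answers give nonzero entropy,
# while unanimous answers give entropy 0
```

This is what makes the entropy signal useful as a confidence proxy: a model that keeps arriving at the same answer from different intermediate points yields a peaked, low-entropy distribution.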
Q1
1. According to the paper, what is a key limitation of standard LLM evaluation practices for reasoning tasks?
They only evaluate the speed of reasoning, not accuracy.
They rely solely on the final answer, overlooking the intermediate reasoning steps.
They do not use Chain-of-Thought prompting.
Q2
2. How does the proposed method in the paper aggregate the potential answers derived from completing reasoning traces at different intermediate subthoughts?
By taking the average of all extracted numerical answers.
By selecting the most frequently occurring answer (the mode).
By choosing the answer from the longest reasoning trace.
Q3
3. The paper found that the entropy of the answer distribution derived from subthought completions correlates with correctness. What does lower entropy typically indicate in this context?
The model struggled with the problem, producing many different answers.
The model's reasoning was more consistent across subthoughts, often correlating with a correct answer.
The reasoning trace was shorter than average.