1. 📘 Topic and Domain: The paper introduces "softpick," a new attention mechanism for transformer models in deep learning, focused on improving how attention scores are normalized.
2. 💡 Previous Research and New Ideas: Building on traditional softmax attention in transformers, the paper proposes softpick, a rectified normalization function whose outputs need not sum to one, as a drop-in replacement for softmax in the attention mechanism.
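To make the idea concrete, here is a minimal NumPy sketch of a rectified, not-sum-to-one normalization alongside standard softmax. The exact functional form below (a ReLU-rectified numerator over a sum of absolute values, with a small epsilon) is an assumption for illustration, not necessarily the paper's precise definition:

```python
import numpy as np

def softmax(x, axis=-1):
    # Standard softmax: strictly positive weights that always sum to one,
    # so a head can never assign exactly zero attention to a token.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def softpick_sketch(x, axis=-1, eps=1e-6):
    # Hypothetical rectified normalization (illustrative form, an assumption):
    # numerator relu(exp(x) - 1) is exactly zero for non-positive scores,
    # so weights can be sparse and need not sum to one.
    # Shifting by the max (m) keeps the exponentials numerically stable.
    m = np.max(x, axis=axis, keepdims=True)
    shifted = np.exp(x - m) - np.exp(-m)          # exp(x) - 1, rescaled by exp(-m)
    num = np.maximum(shifted, 0.0)                # rectified numerator
    den = np.abs(shifted).sum(axis=axis, keepdims=True) + eps
    return num / den

scores = np.array([2.0, 0.5, -1.0, -3.0])
print(softmax(scores))          # all positive, sums to exactly 1
print(softpick_sketch(scores))  # negative scores map to exactly 0
```

Because non-positive scores produce weights of exactly zero, a head can effectively opt out of attending, rather than being forced to dump probability mass onto an irrelevant "sink" token.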
3. ❓ Problem: The paper aims to solve two major issues in transformer models: attention sink (where attention heads allocate significant scores to irrelevant tokens) and massive activations (extremely large hidden state values).
4. 🛠️ Methods: The authors implemented and tested softpick on 340M-parameter transformer models, comparing them against softmax baselines with identical architecture and training configuration.
5. 📊 Results and Evaluation: Softpick maintained performance parity with softmax on benchmarks while achieving a 0% sink rate, reducing hidden-state kurtosis from 33,510 to 340, producing attention maps with 46.97% sparsity, and performing better under quantization.