2026-02-25 Papers


Paper 1

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Published: 2026-02-20

Link: http://arxiv.org/pdf/2602.18292

1. 📘 Topic and Domain: Language model decoding strategies viewed as optimization problems on the probability simplex.
2. 💡 Previous Research and New Ideas: Builds on classical decoding methods (greedy, softmax, top-k, top-p) by proving that they are special cases of a unified optimization framework, and introduces the Best-of-K (BoK) sampler for multi-sample generation scenarios.
3. ❓ Problem: Current decoding methods are treated as disconnected heuristics rather than principled solutions, and existing methods perform poorly in multi-sample pipelines where coverage of good alternatives matters.
4. 🛠️ Methods: Formulates decoding as regularized optimization over probability distributions, derives the KKT conditions for optimality, and uses a mirror-ascent algorithm for cases without closed-form solutions.
5. 📊 Results and Evaluation: BoK sampler improves accuracy by up to +18.6% on MATH500 at high temperatures compared to baseline, with consistent gains across three benchmarks (MATH500, GPQA, HumanEval) while adding minimal computational overhead (~1s).
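To make the unified view concrete, here is a minimal NumPy sketch (not the authors' code) of the shared recipe behind softmax, top-k, and top-p decoding: softmax the scores over a restricted support Cₜ and place zero mass elsewhere. The helper names and the example logits are illustrative.

```python
import numpy as np

def softmax(s, temperature=1.0):
    z = (s - s.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def constrained_softmax(s, support):
    # Entropy-regularized solution restricted to support C_t:
    # softmax over the allowed tokens, zero probability elsewhere.
    q = np.zeros_like(s, dtype=float)
    q[support] = softmax(s[support])
    return q

def top_k_support(s, k):
    # Indices of the k highest-scoring tokens.
    return np.argsort(s)[-k:]

def top_p_support(s, p):
    # Smallest prefix of tokens (by score) whose softmax mass reaches p.
    order = np.argsort(s)[::-1]
    cumulative = np.cumsum(softmax(s)[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return order[:cutoff]

logits = np.array([2.0, 1.0, 0.5, -1.0])
q_topk = constrained_softmax(logits, top_k_support(logits, 2))
q_topp = constrained_softmax(logits, top_p_support(logits, 0.9))
```

Under this view, changing the decoding strategy only changes the support Cₜ (top-k vs. top-p) or the regularizer, not the optimization problem itself.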

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Method flow (figure summary):

Master objective:
max_{q ∈ Δ(V)} [⟨q, sₜ⟩ − λΩ(q)] subject to q ∈ Cₜ

Closed-form special cases:
- Greedy: λ = 0
- Softmax: Ω = −H(q)
- Top-K / Top-P: support constraint Cₜ
- Sparsemax: Ω ∝ ‖q‖₂²
- BoK: Ω_BoK (no closed form; solved by mirror ascent)

KKT optimality conditions:
- Global: Σᵥ q*(v) = 1 and q*(v) ≥ 0
- Active: q*(v) > 0 ⟹ sₜ(v) − λ∇Ω(q*)(v) = η
- Inactive: q*(v) = 0 ⟹ sₜ(v) − λ∇Ω(q*)(v) ≤ η

Mirror-ascent update (for non-closed-form objectives such as BoK):
qⱼ₊₁ = qⱼ ⊙ exp(η gⱼ) / ‖qⱼ ⊙ exp(η gⱼ)‖₁
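The mirror-ascent update is exponentiated gradient ascent on the simplex. Below is a minimal NumPy sketch (not the authors' code); since the paper's Ω_BoK is not reproduced here, a simple entropy-regularized objective with a known softmax optimum is used as a stand-in gradient.

```python
import numpy as np

def mirror_ascent_step(q, g, eta):
    # Exponentiated-gradient step q <- q * exp(eta * g), renormalized,
    # which keeps q on the probability simplex at every iteration.
    q_new = q * np.exp(eta * g)
    return q_new / q_new.sum()

def mirror_ascent(grad_fn, vocab_size, eta=0.1, steps=200):
    q = np.full(vocab_size, 1.0 / vocab_size)  # uniform start on the simplex
    for _ in range(steps):
        q = mirror_ascent_step(q, grad_fn(q), eta)
    return q

# Sanity-check objective (NOT the paper's Omega_BoK): maximize <q, s> + H(q),
# whose known optimum is softmax(s).
s = np.array([1.0, 0.0, -1.0])
grad = lambda q: s - np.log(q) - 1.0  # gradient of <q, s> + H(q)
q_star = mirror_ascent(grad, 3)
```

Because the update is multiplicative, iterates stay strictly positive and normalized, which is why this geometry suits the simplex better than Euclidean projected gradient steps.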
Q1. What mathematical insight allows the paper to unify different decoding strategies like greedy, softmax, and top-k sampling?
- They all minimize the same loss function with different learning rates
- They are solutions to the same optimization problem with different regularizers on the probability simplex
- They use the same neural network architecture but with different activation functions

Q2. Why does the paper argue that projected gradient ascent is suboptimal for optimizing over probability distributions?
- It requires too much memory to store gradients for large vocabularies
- It converges too slowly compared to other optimization methods
- It implicitly assumes Euclidean geometry, which is a poor match for the simplex manifold

Q3. What is the key innovation of the Best-of-K (BoK) sampler that makes it effective for multi-sample generation?
- It optimizes for coverage by maximizing the probability that good tokens appear at least once in K samples
- It uses a transformer architecture to predict which K tokens to sample
- It pre-computes the top K tokens offline to reduce sampling latency

Paper 2

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20160

1. 📘 Topic and Domain: The paper presents tttLRM, a Large Reconstruction Model that leverages Test-Time Training (TTT) for high-resolution, long-context, and autoregressive 3D reconstruction from multiple images.
2. 💡 Previous Research and New Ideas: The paper builds on Large Reconstruction Models (LRMs) and Test-Time Training approaches, proposing a novel architecture that interprets TTT fast weights as implicit 3D representations that can be decoded into explicit formats like 3D Gaussian Splatting with linear computational complexity.
3. ❓ Problem: The paper aims to solve the limitation of existing 3D reconstruction methods that either require slow per-scene optimization or are restricted to processing only a few input views due to quadratic attention complexity.
4. 🛠️ Methods: The authors use LaCT (Large Chunk Test-Time Training) blocks with linear complexity, encode input images as tokens that update fast weights during inference, and query these weights with virtual tokens to decode into explicit 3D representations like Gaussian Splats.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance on object (GSO) and scene-level (DL3DV-140, Tanks&Temples) datasets, outperforming GS-LRM and Long-LRM in PSNR/SSIM/LPIPS metrics while supporting up to 64 input views and being hundreds of times faster than optimization-based methods.
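The fast-weight idea in item 4 can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's LaCT block: a linear map W acts as implicit memory, each incoming chunk updates it with a gradient step on a reconstruction loss, and queries read it back out. The target mapping and dimensions are illustrative.

```python
import numpy as np

def update_fast_weights(W, keys, values, lr=0.1):
    # One test-time-training step: nudge W so that W @ k ~ v for the
    # chunk's (key, value) pairs, via a gradient step on squared error.
    # Hypothetical stand-in for the paper's Update({T_i}).
    grad = (W @ keys - values) @ keys.T / keys.shape[1]
    return W - lr * grad

def apply_fast_weights(W, queries):
    # Read the implicit memory back out: Apply(W, T_i).
    return W @ queries

rng = np.random.default_rng(0)
d = 8
P = np.eye(d)[::-1]          # toy target mapping (row reversal)
W = np.zeros((d, d))
for _ in range(100):         # stream chunks, RNN-style
    K = rng.standard_normal((d, 16))
    W = update_fast_weights(W, K, P @ K)
queries = rng.standard_normal((d, 4))
out = apply_fast_weights(W, queries)
```

Because each chunk only touches W once and is then discarded, the cost grows linearly with the number of input tokens, which is the source of the O(N) complexity claim.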

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Workflow (figure summary):

Feedforward pipeline:
- Input views {I₁, I₂, …, Iₙ} with ray embeddings → patchify & tokenize → linear layer
- LaCT blocks (×L): window attention, fast-weight update W = Update({Tᵢ}), then Apply(W, Tᵢ)
- Virtual-view queries {I^v₁, I^v₂, …, I^vₘ} → 3D decode into 3D Gaussians or triplane NeRF → novel view synthesis with real-time rendering

Autoregressive mode (streaming inputs):
1. Update W with each new batch
2. Predict the Gaussians G(b) immediately
3. Maintain historical gradients

Distributed training (sequence parallelism):
- Shard tokens across GPUs
- Synchronize fast weights
- All-reduce gradients

Training loss:
L = L_RGB + λ_depth · L_depth + λ_opacity · L_opacity

Key features:
- Linear complexity O(N) vs. traditional O(N²) attention
- Supports 64+ input views (1M+ tokens)
- Flexible output: 3D Gaussians or triplane NeRF
- Fast feedforward reconstruction (~15 s for 64 views)
- Autoregressive streaming capability
- Pretrained on novel view synthesis tasks
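The training loss in the figure is a plain weighted sum of an RGB term, a depth term, and an opacity term. A minimal sketch, with illustrative λ values (the paper's actual weights and per-term loss choices are not reproduced here):

```python
import numpy as np

def training_loss(pred, target, lam_depth=0.05, lam_opacity=0.01):
    # Weighted sum from the figure: L = L_RGB + lam_depth * L_depth
    # + lam_opacity * L_opacity. The lambda values are illustrative
    # placeholders, not the paper's.
    l_rgb = np.mean((pred["rgb"] - target["rgb"]) ** 2)
    l_depth = np.mean(np.abs(pred["depth"] - target["depth"]))
    l_opacity = np.mean((pred["opacity"] - target["opacity"]) ** 2)
    return l_rgb + lam_depth * l_depth + lam_opacity * l_opacity

pred = {"rgb": np.ones((2, 2, 3)), "depth": np.ones((2, 2)),
        "opacity": np.ones((2, 2))}
target = {k: np.zeros_like(v) for k, v in pred.items()}
loss = training_loss(pred, target)
```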
Q1. What is the key architectural innovation that allows tttLRM to process long sequences of input views with linear computational complexity?
- Using LaCT (Large Chunk Test-Time Training) blocks that update fast weights as implicit 3D memory
- Implementing multi-head self-attention with kernel approximations
- Applying hierarchical pooling to reduce sequence length progressively

Q2. How does tttLRM's autoregressive reconstruction capability work when processing streaming visual inputs?
- It stores all previous views in a buffer and reprocesses them with each new input
- It incrementally updates fast weights like an RNN, immediately predicting 3D Gaussians for new views
- It waits until all views are collected before beginning the reconstruction process

Q3. What unexpected benefit did the authors discover when pretraining tttLRM with novel view synthesis tasks before fine-tuning for 3D reconstruction?
- It reduced the model size by 50% through weight pruning
- It enabled the model to work without camera pose information
- It significantly accelerated convergence and improved final reconstruction quality even for different 3D representations

Paper 3

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Published: 2026-02-13

Link: http://arxiv.org/pdf/2602.13515

1. 📘 Topic and Domain: The paper focuses on trainable sparse attention methods for accelerating video diffusion models in computer vision.
2. 💡 Previous Research and New Ideas: The paper builds on existing sparse attention methods like SpargeAttention and VSA, proposing a hybrid Top-k+Top-p masking strategy combined with velocity distillation for fine-tuning.
3. ❓ Problem: The paper aims to achieve high attention sparsity (>90%) in video diffusion models without degrading generation quality, addressing failures of existing masking rules and fine-tuning approaches.
4. 🛠️ Methods: The authors use a hybrid Top-k+Top-p masking rule, implement an efficient CUDA-based sparse attention kernel, and employ velocity distillation loss instead of standard diffusion loss for fine-tuning.
5. 📊 Results and Evaluation: SpargeAttention2 achieves 95% attention sparsity with 16.2× attention speedup and up to 4.7× end-to-end generation speedup while maintaining video quality comparable to full attention, evaluated on Wan2.1 models using VBench metrics.
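The hybrid masking rule in item 4 can be sketched on a single row of block-level attention scores. This is a NumPy illustration, not the paper's CUDA kernel, and taking the union of the two masks is an assumption about the exact combination rule:

```python
import numpy as np

def hybrid_mask(scores, k, p):
    # Keep a block if it is in the top-k by score OR inside the top-p
    # (nucleus) mass of the softmax-normalized scores. The union rule is
    # an assumption; the paper's exact combination may differ.
    n = scores.shape[0]
    mask_k = np.zeros(n, dtype=bool)
    mask_k[np.argsort(scores)[-k:]] = True
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    mask_p = np.zeros(n, dtype=bool)
    mask_p[order[:cutoff]] = True
    return mask_k | mask_p

scores = np.array([4.0, 3.9, 3.8, 0.1, 0.0])  # one row of block scores
m = hybrid_mask(scores, k=1, p=0.9)
```

In this near-uniform example, top-k alone with k=1 would keep only one of three nearly equal blocks, while the top-p term correctly keeps all three; on a sharply skewed row the top-k term dominates instead.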

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Method workflow (figure summary):

1. Analysis phase:
- Case 1: why Top-k / Top-p masking alone fails
- Case 2: why trainable sparse attention works better
- Case 3: limitations of the standard diffusion loss

2. SpargeAttention2 components:
- Hybrid masking (Top-k + Top-p): handles both uniform and skewed attention distributions
- Efficient kernel: block-sparse attention, CUDA implementation based on FlashAttention
- Velocity distillation: teacher-student setup that preserves generation quality

3. Training process:
Pre-trained model with full attention, θ_full (frozen) → replace attention with SpargeAttention2, θ_sparse → velocity distillation min ‖u_sparse − u_full‖² → optimized sparse model at 95% sparsity

4. Results: 95% sparsity, 16.2× attention speedup, 4.7× end-to-end speedup, quality preserved
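The distillation objective min ‖u_sparse − u_full‖² in the training process is a plain squared error between the student's and the frozen teacher's velocity predictions. A minimal sketch with hypothetical tensor shapes (the model forward passes are stubbed with arrays):

```python
import numpy as np

def velocity_distillation_loss(u_sparse, u_full):
    # Mean squared error between the student (sparse-attention) and the
    # frozen teacher (full-attention) velocity predictions.
    return np.mean((u_sparse - u_full) ** 2)

# Toy stand-ins for the two models' predictions on the same noisy latent
# (hypothetical shapes, not the paper's).
rng = np.random.default_rng(0)
u_full = rng.standard_normal((4, 16, 16))                    # teacher output
u_sparse = u_full + 0.1 * rng.standard_normal((4, 16, 16))   # student output
loss = velocity_distillation_loss(u_sparse, u_full)
```

Matching the teacher's outputs directly, rather than regressing the diffusion target, keeps the student anchored to the pre-trained model's behavior even when the fine-tuning data distribution shifts.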
Q1. What is the key insight behind why trainable sparse attention can achieve higher sparsity than training-free methods?
- Trainable methods use more sophisticated masking algorithms that can identify important tokens better
- Fine-tuning makes attention distributions more concentrated, reducing both dropped error and renormalization error
- Trainable methods have access to larger computational resources during inference

Q2. Why does SpargeAttention2 use velocity distillation loss instead of standard diffusion loss for fine-tuning?
- Velocity distillation is computationally faster and requires less memory
- Standard diffusion loss causes performance degradation when the fine-tuning data distribution differs from the pre-training data
- Velocity distillation allows for higher sparsity ratios in the attention mechanism

Q3. When does the hybrid Top-k+Top-p masking strategy outperform individual masking methods?
- Only when attention sparsity is below 50%
- When dealing with both uniform probability distributions (where Top-p excels) and skewed distributions (where Top-k excels)
- Exclusively during the training phase but not during inference