2026-02-25 Papers


Paper 1

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Published: 2026-02-20

Link: http://arxiv.org/pdf/2602.18292

1. 📘 Topic and Domain: Language model decoding strategies viewed as optimization problems on the probability simplex.
2. 💡 Previous Research and New Ideas: Builds on classical decoding methods (greedy, softmax, top-k, top-p) by proving that they are special cases of a unified optimization framework, and introduces the Best-of-K (BoK) sampler for multi-sample generation scenarios.
3. ❓ Problem: Current decoding methods are treated as disconnected heuristics rather than principled solutions, and existing methods perform poorly in multi-sample pipelines where coverage of good alternatives matters.
4. 🛠️ Methods: Formulates decoding as regularized optimization over probability distributions, derives the KKT conditions for optimality, and uses a mirror-ascent algorithm for cases without closed-form solutions.
5. 📊 Results and Evaluation: BoK sampler improves accuracy by up to +18.6% on MATH500 at high temperatures compared to baseline, with consistent gains across three benchmarks (MATH500, GPQA, HumanEval) while adding minimal computational overhead (~1s).
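To make the unified view concrete, here is a minimal NumPy sketch (not the authors' code) of the shared recipe behind softmax, top-k, and top-p decoding: softmax the scores over a restricted support Cₜ and place zero mass elsewhere. The helper names and the example logits are illustrative.

```python
import numpy as np

def softmax(s, temperature=1.0):
    z = (s - s.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def constrained_softmax(s, support):
    # Entropy-regularized solution restricted to support C_t:
    # softmax over the allowed tokens, zero probability elsewhere.
    q = np.zeros_like(s, dtype=float)
    q[support] = softmax(s[support])
    return q

def top_k_support(s, k):
    # Indices of the k highest-scoring tokens.
    return np.argsort(s)[-k:]

def top_p_support(s, p):
    # Smallest prefix of tokens (by score) whose softmax mass reaches p.
    order = np.argsort(s)[::-1]
    cumulative = np.cumsum(softmax(s)[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return order[:cutoff]

logits = np.array([2.0, 1.0, 0.5, -1.0])
q_topk = constrained_softmax(logits, top_k_support(logits, 2))
q_topp = constrained_softmax(logits, top_p_support(logits, 0.9))
```

Under this view, changing the decoding strategy only changes the support Cₜ (top-k vs. top-p) or the regularizer, not the optimization problem itself.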

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Method flow (figure summary):

Master objective:
max_{q ∈ Δ(V)} [⟨q, sₜ⟩ − λΩ(q)] subject to q ∈ Cₜ

Closed-form special cases:
- Greedy: λ = 0
- Softmax: Ω = −H(q)
- Top-K / Top-P: support constraint Cₜ
- Sparsemax: Ω ∝ ‖q‖₂²
- BoK: Ω_BoK (no closed form; solved by mirror ascent)

KKT optimality conditions:
- Global: Σᵥ q*(v) = 1 and q*(v) ≥ 0
- Active: q*(v) > 0 ⟹ sₜ(v) − λ∇Ω(q*)(v) = η
- Inactive: q*(v) = 0 ⟹ sₜ(v) − λ∇Ω(q*)(v) ≤ η

Mirror-ascent update (for non-closed-form objectives such as BoK):
qⱼ₊₁ = qⱼ ⊙ exp(η gⱼ) / ‖qⱼ ⊙ exp(η gⱼ)‖₁
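The mirror-ascent update is exponentiated gradient ascent on the simplex. Below is a minimal NumPy sketch (not the authors' code); since the paper's Ω_BoK is not reproduced here, a simple entropy-regularized objective with a known softmax optimum is used as a stand-in gradient.

```python
import numpy as np

def mirror_ascent_step(q, g, eta):
    # Exponentiated-gradient step q <- q * exp(eta * g), renormalized,
    # which keeps q on the probability simplex at every iteration.
    q_new = q * np.exp(eta * g)
    return q_new / q_new.sum()

def mirror_ascent(grad_fn, vocab_size, eta=0.1, steps=200):
    q = np.full(vocab_size, 1.0 / vocab_size)  # uniform start on the simplex
    for _ in range(steps):
        q = mirror_ascent_step(q, grad_fn(q), eta)
    return q

# Sanity-check objective (NOT the paper's Omega_BoK): maximize <q, s> + H(q),
# whose known optimum is softmax(s).
s = np.array([1.0, 0.0, -1.0])
grad = lambda q: s - np.log(q) - 1.0  # gradient of <q, s> + H(q)
q_star = mirror_ascent(grad, 3)
```

Because the update is multiplicative, iterates stay strictly positive and normalized, which is why this geometry suits the simplex better than Euclidean projected gradient steps.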
Q1. What mathematical insight allows the paper to unify different decoding strategies like greedy, softmax, and top-k sampling?
- They all minimize the same loss function with different learning rates
- They are solutions to the same optimization problem with different regularizers on the probability simplex
- They use the same neural network architecture but with different activation functions

Q2. Why does the paper argue that projected gradient ascent is suboptimal for optimizing over probability distributions?
- It requires too much memory to store gradients for large vocabularies
- It converges too slowly compared to other optimization methods
- It implicitly assumes Euclidean geometry, which is a poor match for the simplex manifold

Q3. What is the key innovation of the Best-of-K (BoK) sampler that makes it effective for multi-sample generation?
- It optimizes for coverage by maximizing the probability that good tokens appear at least once in K samples
- It uses a transformer architecture to predict which K tokens to sample
- It pre-computes the top K tokens offline to reduce sampling latency

Paper 2

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Published: 2026-02-23

Link: http://arxiv.org/pdf/2602.20160

1. 📘 Topic and Domain: The paper presents tttLRM, a Large Reconstruction Model that leverages Test-Time Training (TTT) for high-resolution, long-context, and autoregressive 3D reconstruction from multiple images.
2. 💡 Previous Research and New Ideas: The paper builds on Large Reconstruction Models (LRMs) and Test-Time Training approaches, proposing a novel architecture that interprets TTT fast weights as implicit 3D representations that can be decoded into explicit formats like 3D Gaussian Splatting with linear computational complexity.
3. ❓ Problem: The paper aims to solve the limitation of existing 3D reconstruction methods that either require slow per-scene optimization or are restricted to processing only a few input views due to quadratic attention complexity.
4. 🛠️ Methods: The authors use LaCT (Large Chunk Test-Time Training) blocks with linear complexity, encode input images as tokens that update fast weights during inference, and query these weights with virtual tokens to decode into explicit 3D representations like Gaussian Splats.
5. 📊 Results and Evaluation: The method achieves state-of-the-art performance on object (GSO) and scene-level (DL3DV-140, Tanks&Temples) datasets, outperforming GS-LRM and Long-LRM in PSNR/SSIM/LPIPS metrics while supporting up to 64 input views and being hundreds of times faster than optimization-based methods.
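The fast-weight idea in item 4 can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's LaCT block: a linear map W acts as implicit memory, each incoming chunk updates it with a gradient step on a reconstruction loss, and queries read it back out. The target mapping and dimensions are illustrative.

```python
import numpy as np

def update_fast_weights(W, keys, values, lr=0.1):
    # One test-time-training step: nudge W so that W @ k ~ v for the
    # chunk's (key, value) pairs, via a gradient step on squared error.
    # Hypothetical stand-in for the paper's Update({T_i}).
    grad = (W @ keys - values) @ keys.T / keys.shape[1]
    return W - lr * grad

def apply_fast_weights(W, queries):
    # Read the implicit memory back out: Apply(W, T_i).
    return W @ queries

rng = np.random.default_rng(0)
d = 8
P = np.eye(d)[::-1]          # toy target mapping (row reversal)
W = np.zeros((d, d))
for _ in range(100):         # stream chunks, RNN-style
    K = rng.standard_normal((d, 16))
    W = update_fast_weights(W, K, P @ K)
queries = rng.standard_normal((d, 4))
out = apply_fast_weights(W, queries)
```

Because each chunk only touches W once and is then discarded, the cost grows linearly with the number of input tokens, which is the source of the O(N) complexity claim.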

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Workflow (figure summary):

Feedforward pipeline:
- Input views {I₁, I₂, …, Iₙ} with ray embeddings → patchify & tokenize → linear layer
- LaCT blocks (×L): window attention, fast-weight update W = Update({Tᵢ}), then Apply(W, Tᵢ)
- Virtual-view queries {I^v₁, I^v₂, …, I^vₘ} → 3D decode into 3D Gaussians or triplane NeRF → novel view synthesis with real-time rendering

Autoregressive mode (streaming inputs):
1. Update W with each new batch
2. Predict the Gaussians G(b) immediately
3. Maintain historical gradients

Distributed training (sequence parallelism):
- Shard tokens across GPUs
- Synchronize fast weights
- All-reduce gradients

Training loss:
L = L_RGB + λ_depth · L_depth + λ_opacity · L_opacity

Key features:
- Linear complexity O(N) vs. traditional O(N²) attention
- Supports 64+ input views (1M+ tokens)
- Flexible output: 3D Gaussians or triplane NeRF
- Fast feedforward reconstruction (~15 s for 64 views)
- Autoregressive streaming capability
- Pretrained on novel view synthesis tasks
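The training loss in the figure is a plain weighted sum of an RGB term, a depth term, and an opacity term. A minimal sketch, with illustrative λ values (the paper's actual weights and per-term loss choices are not reproduced here):

```python
import numpy as np

def training_loss(pred, target, lam_depth=0.05, lam_opacity=0.01):
    # Weighted sum from the figure: L = L_RGB + lam_depth * L_depth
    # + lam_opacity * L_opacity. The lambda values are illustrative
    # placeholders, not the paper's.
    l_rgb = np.mean((pred["rgb"] - target["rgb"]) ** 2)
    l_depth = np.mean(np.abs(pred["depth"] - target["depth"]))
    l_opacity = np.mean((pred["opacity"] - target["opacity"]) ** 2)
    return l_rgb + lam_depth * l_depth + lam_opacity * l_opacity

pred = {"rgb": np.ones((2, 2, 3)), "depth": np.ones((2, 2)),
        "opacity": np.ones((2, 2))}
target = {k: np.zeros_like(v) for k, v in pred.items()}
loss = training_loss(pred, target)
```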
Q1. What is the key architectural innovation that allows tttLRM to process long sequences of input views with linear computational complexity?
- Using LaCT (Large Chunk Test-Time Training) blocks that update fast weights as implicit 3D memory
- Implementing multi-head self-attention with kernel approximations
- Applying hierarchical pooling to reduce sequence length progressively

Q2. How does tttLRM's autoregressive reconstruction capability work when processing streaming visual inputs?
- It stores all previous views in a buffer and reprocesses them with each new input
- It incrementally updates fast weights like an RNN, immediately predicting 3D Gaussians for new views
- It waits until all views are collected before beginning the reconstruction process

Q3. What unexpected benefit did the authors discover when pretraining tttLRM with novel view synthesis tasks before fine-tuning for 3D reconstruction?
- It reduced the model size by 50% through weight pruning
- It enabled the model to work without camera pose information
- It significantly accelerated convergence and improved final reconstruction quality even for different 3D representations

Paper 3

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Published: 2026-02-13

Link: http://arxiv.org/pdf/2602.13515

1. 📘 Topic and Domain: The paper focuses on trainable sparse attention methods for accelerating video diffusion models in computer vision.
2. 💡 Previous Research and New Ideas: The paper builds on existing sparse attention methods like SpargeAttention and VSA, proposing a hybrid Top-k+Top-p masking strategy combined with velocity distillation for fine-tuning.
3. ❓ Problem: The paper aims to achieve high attention sparsity (>90%) in video diffusion models without degrading generation quality, addressing failures of existing masking rules and fine-tuning approaches.
4. 🛠️ Methods: The authors use a hybrid Top-k+Top-p masking rule, implement an efficient CUDA-based sparse attention kernel, and employ velocity distillation loss instead of standard diffusion loss for fine-tuning.
5. 📊 Results and Evaluation: SpargeAttention2 achieves 95% attention sparsity with 16.2× attention speedup and up to 4.7× end-to-end generation speedup while maintaining video quality comparable to full attention, evaluated on Wan2.1 models using VBench metrics.
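The hybrid masking rule in item 4 can be sketched on a single row of block-level attention scores. This is a NumPy illustration, not the paper's CUDA kernel, and taking the union of the two masks is an assumption about the exact combination rule:

```python
import numpy as np

def hybrid_mask(scores, k, p):
    # Keep a block if it is in the top-k by score OR inside the top-p
    # (nucleus) mass of the softmax-normalized scores. The union rule is
    # an assumption; the paper's exact combination may differ.
    n = scores.shape[0]
    mask_k = np.zeros(n, dtype=bool)
    mask_k[np.argsort(scores)[-k:]] = True
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    mask_p = np.zeros(n, dtype=bool)
    mask_p[order[:cutoff]] = True
    return mask_k | mask_p

scores = np.array([4.0, 3.9, 3.8, 0.1, 0.0])  # one row of block scores
m = hybrid_mask(scores, k=1, p=0.9)
```

In this near-uniform example, top-k alone with k=1 would keep only one of three nearly equal blocks, while the top-p term correctly keeps all three; on a sharply skewed row the top-k term dominates instead.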

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Method workflow (figure summary):

1. Analysis phase:
- Case 1: why Top-k / Top-p masking alone fails
- Case 2: why trainable sparse attention works better
- Case 3: limitations of the standard diffusion loss

2. SpargeAttention2 components:
- Hybrid masking (Top-k + Top-p): handles both uniform and skewed attention distributions
- Efficient kernel: block-sparse attention, CUDA implementation based on FlashAttention
- Velocity distillation: teacher-student setup that preserves generation quality

3. Training process:
Pre-trained model with full attention, θ_full (frozen) → replace attention with SpargeAttention2, θ_sparse → velocity distillation min ‖u_sparse − u_full‖² → optimized sparse model at 95% sparsity

4. Results: 95% sparsity, 16.2× attention speedup, 4.7× end-to-end speedup, quality preserved
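The distillation objective min ‖u_sparse − u_full‖² in the training process is a plain squared error between the student's and the frozen teacher's velocity predictions. A minimal sketch with hypothetical tensor shapes (the model forward passes are stubbed with arrays):

```python
import numpy as np

def velocity_distillation_loss(u_sparse, u_full):
    # Mean squared error between the student (sparse-attention) and the
    # frozen teacher (full-attention) velocity predictions.
    return np.mean((u_sparse - u_full) ** 2)

# Toy stand-ins for the two models' predictions on the same noisy latent
# (hypothetical shapes, not the paper's).
rng = np.random.default_rng(0)
u_full = rng.standard_normal((4, 16, 16))                    # teacher output
u_sparse = u_full + 0.1 * rng.standard_normal((4, 16, 16))   # student output
loss = velocity_distillation_loss(u_sparse, u_full)
```

Matching the teacher's outputs directly, rather than regressing the diffusion target, keeps the student anchored to the pre-trained model's behavior even when the fine-tuning data distribution shifts.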
Q1. What is the key insight behind why trainable sparse attention can achieve higher sparsity than training-free methods?
- Trainable methods use more sophisticated masking algorithms that can identify important tokens better
- Fine-tuning makes attention distributions more concentrated, reducing both dropped error and renormalization error
- Trainable methods have access to larger computational resources during inference

Q2. Why does SpargeAttention2 use velocity distillation loss instead of standard diffusion loss for fine-tuning?
- Velocity distillation is computationally faster and requires less memory
- Standard diffusion loss causes performance degradation when the fine-tuning data distribution differs from the pre-training data
- Velocity distillation allows for higher sparsity ratios in the attention mechanism

Q3. When does the hybrid Top-k+Top-p masking strategy outperform individual masking methods?
- Only when attention sparsity is below 50%
- When dealing with both uniform probability distributions (where Top-p excels) and skewed distributions (where Top-k excels)
- Exclusively during the training phase but not during inference