2025-10-31 Papers

Paper 1

The End of Manual Decoding: Towards Truly End-to-End Language Models

Published: 2025-10-30

Link: http://arxiv.org/pdf/2510.26697

1. 📘 Topic and Domain: The paper focuses on improving language model decoding by introducing AutoDeco, a novel architecture that enables truly end-to-end generation in the domain of natural language processing.
2. 💡 Previous Research and New Ideas: Previous research relied on static, manually-tuned decoding parameters (temperature, top-p); the paper proposes a new dynamic approach where the model learns to predict its own decoding parameters during generation.
3. ❓ Problem: The paper addresses the inefficiency and suboptimality of manual decoding hyperparameter selection in language models: static settings require laborious hand-tuning and cannot adapt to different contexts within a single generation.
4. 🛠️ Methods: The authors developed AutoDeco, which augments transformers with lightweight prediction heads that dynamically predict temperature and top-p values at each generation step, using a differentiable soft top-p mechanism for training.
5. 📊 Results and Evaluation: AutoDeco outperformed standard decoding methods across eight benchmarks, matched oracle-tuned baselines without task-specific tuning, added only 1-2% latency overhead, and demonstrated an emergent ability to adjust generation style based on natural language commands.
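A minimal sketch of per-step dynamic decoding in the spirit of AutoDeco: lightweight heads read the hidden state and predict temperature and top-p before each token is sampled. The heads here (`temp_head`, `top_p_head`), their formulas, and all shapes are illustrative stand-ins, not the paper's trained MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def temp_head(h):
    """Hypothetical stand-in for the temperature head: maps a hidden
    state to a temperature in (0, 2). The real head is a small MLP."""
    return 2.0 / (1.0 + np.exp(-h.mean()))

def top_p_head(h, temperature):
    """Hypothetical stand-in for the top-p head: maps the hidden state
    (and the predicted temperature) to a nucleus mass in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(h.mean() + temperature)))

def autodeco_step(logits, h):
    """One decoding step with dynamically predicted temperature and top-p."""
    T = temp_head(h)
    p_nucleus = top_p_head(h, T)
    # Temperature-scaled softmax over the vocabulary.
    z = logits / T
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # Hard top-p (nucleus) filtering at inference time: keep the smallest
    # set of top tokens whose cumulative mass reaches p_nucleus.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p_nucleus) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered), T, p_nucleus

logits = rng.normal(size=50)   # toy logits over a 50-token vocabulary
h = rng.normal(size=16)        # toy hidden state
token, T, p = autodeco_step(logits, h)
```

Because the heads condition on the current hidden state, every generation step can use a different temperature and nucleus size, which is the behavior the static baselines cannot express.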

The End of Manual Decoding: Towards Truly End-to-End Language Models

AutoDeco: End-to-End Language Model Workflow

Training Strategy
• Differentiable soft top-p with temperature scaling
• End-to-end optimization with easy-token masking
• Dynamic fine-tuning

AutoDeco Heads
• Temperature head and top-p head: lightweight MLPs
• Context-specific parameter prediction

Dynamic Inference
• Hidden state computation → parameter prediction → internal probability modification
• 1-2% latency overhead

Emergent Control
• Natural language commands for instruction-based decoding control (95%+ consistency)

Core Mathematical Framework
1. Temperature scaling: p = softmax(l / T̂)
2. Differentiable soft mask (on sorted probabilities): m = exp(-α · ReLU(c - P̂))
3. Final distribution: p̃ = (p ⊙ m) / (Σ(p ⊙ m) + ε)
4. Parameter prediction: T̂_t = temp_head(h_t), P̂_t = top_p_head(h_t, T̂_t)

Data Processing
• DeepMath-103K dataset with reject-sampling trajectories
• 400 training steps, 6K training samples

Evaluation Benchmarks
• Math: AIME, BRUMO25, HMMT25
• General: GPQA, MMLU-Pro
• Code: LiveCodeBench
• Instruction: IFEval

Model Families
• Llama-Nemotron-8B, R1-Distill-Qwen-7B, Qwen3-30B-A3B-Instruct, OpenAI-GPT-OSS-20B

Key Results
• Consistently outperforms static decoding methods
• Matches oracle-tuned baselines without test-set access
• Emergent natural-language control capability
• Minimal computational overhead (1-2% latency)
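The four equations of the mathematical framework above can be sketched directly in NumPy. The mask-sharpness `alpha` and the `eps` value are assumptions; everything else follows the stated formulas.

```python
import numpy as np

def soft_top_p(logits, T_hat, P_hat, alpha=100.0, eps=1e-9):
    """Differentiable 'soft' top-p following the framework's four equations.

    T_hat and P_hat play the roles of the predicted temperature and top-p;
    alpha controls how sharply mass beyond P_hat is suppressed (assumed value).
    """
    # 1. Temperature scaling: p = softmax(l / T_hat)
    z = logits / T_hat
    p = np.exp(z - z.max())
    p /= p.sum()
    # Sort probabilities in descending order; c is the cumulative mass.
    order = np.argsort(p)[::-1]
    c = np.cumsum(p[order])
    # 2. Soft mask in sorted order: m = exp(-alpha * ReLU(c - P_hat))
    m_sorted = np.exp(-alpha * np.maximum(c - P_hat, 0.0))
    # Undo the sort so the mask lines up with the original vocabulary order.
    m = np.empty_like(m_sorted)
    m[order] = m_sorted
    # 3. Final distribution: p_tilde = (p ⊙ m) / (Σ(p ⊙ m) + eps)
    return (p * m) / ((p * m).sum() + eps)

probs = soft_top_p(np.array([3.0, 1.0, 0.5, -2.0]), T_hat=0.8, P_hat=0.9)
```

Unlike a hard top-p cutoff, the exponential mask decays smoothly with the excess cumulative mass, so gradients can flow back into the heads that predict T̂ and P̂.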
Q1
1. What surprising emergent capability did AutoDeco demonstrate during experiments?
The ability to write poetry without training
The ability to interpret natural language commands to adjust its generation style
The ability to translate between multiple languages automatically
Q2
2. What is the key innovation in AutoDeco's training process that enables end-to-end learning?
Using a larger training dataset
Implementing a new transformer architecture
Introducing a differentiable 'soft' top-p mechanism
Q3
3. What practical advantage does AutoDeco have over traditional decoding methods in terms of computational overhead?
It adds only 1-2% latency with minimal memory impact
It reduces computation time by 50%
It requires no additional computational resources
Paper 2

Kimi Linear: An Expressive, Efficient Attention Architecture

Published: 2025-10-30

Link: http://arxiv.org/pdf/2510.26692

1. 📘 Topic and Domain: The paper develops Kimi Linear, a hybrid linear attention architecture for large language models, in the domain of efficient attention mechanisms and model architecture design.
2. 💡 Previous Research and New Ideas: Building on Gated DeltaNet and prior linear attention mechanisms, the paper introduces Kimi Delta Attention (KDA), whose finer-grained gating makes more effective use of the finite RNN memory.
3. ❓ Problem: The paper addresses the computational inefficiency of standard attention in LLMs, whose quadratic time complexity is particularly costly in long-context and reinforcement learning scenarios.
4. 🛠️ Methods: The authors implement a hybrid architecture combining KDA with Multi-Head Latent Attention (MLA) in a 3:1 ratio, using specialized DPLR transition matrices and a chunkwise algorithm for efficient computation.
5. 📊 Results and Evaluation: Kimi Linear outperforms full-attention models across a range of tasks, reduces KV cache usage by up to 75%, and achieves up to 6× faster decoding throughput at 1M context length while maintaining superior performance.
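A minimal sketch of the gated delta-rule recurrence KDA builds on, S_t = Diag(α_t)(I − βkk^T)S_{t−1} + βkv^T: the per-channel decay α_t is the fine-grained gate, and the βkk^T term erases the old value stored under key k before writing the new one. Shapes, the unit-norm key, and the scalar β are illustrative assumptions here.

```python
import numpy as np

def kda_state_update(S, k, v, alpha, beta):
    """One recurrent step of a DeltaNet-style update with per-channel decay.

    S:     (d_k, d_v) matrix-valued memory state
    k:     (d_k,) unit-norm key
    v:     (d_v,) value
    alpha: (d_k,) per-channel decay in [0, 1] (the fine-grained gate)
    beta:  scalar write strength in [0, 1]
    """
    # Decay the old state channel-wise, after erasing the slot addressed by k.
    S = np.diag(alpha) @ (np.eye(len(k)) - beta * np.outer(k, k)) @ S
    # Write the new key-value association.
    return S + beta * np.outer(k, v)

d_k, d_v = 4, 3
rng = np.random.default_rng(1)
S = np.zeros((d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)          # unit-norm key (assumption)
v = rng.normal(size=d_v)
S = kda_state_update(S, k, v, alpha=np.full(d_k, 0.9), beta=1.0)
# With beta = 1 and a fresh state, reading with the same key recovers v.
readout = k @ S
```

The state S stays a fixed (d_k, d_v) matrix no matter how many tokens are processed, which is where the linear-time, constant-memory behavior comes from.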

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Linear Architecture Workflow

Input Processing
• Input tokens → neural parameterization: q, k, v, α, β computation

Kimi Delta Attention (KDA)
• Fine-grained gating with channel-wise decay and DPLR optimization
• Chunkwise algorithm using WY representation and UT transform
• State update: S_t = Diag(α_t)(I − βkk^T)S_{t−1} + βkv^T
• Output processing: RMSNorm + gating

Hybrid Architecture
• 3:1 KDA:MLA ratio
• MLA layers provide global attention, with NoPE (no position embedding)

Training Components
• Pretraining (1.4T/5.7T tokens)
• Multi-stage SFT
• RL with PTX loss

Evaluation Benchmarks
• Short context: MMLU, BBH, GSM8K
• Long context: RULER, RepoQA
• Math & code: AIME, LiveCodeBench
• Synthetic tasks: palindrome, MQAR, stack operations

Efficiency Gains
• 75% KV cache reduction
• 6× decoding speedup (1M tokens)
• Linear time complexity

Key Results
• Outperforms the MLA baseline (MMLU-Pro: 51.0 vs 47.2; RULER: 84.3 vs 81.3)

Scaling Laws
• 1.16× computational efficiency, MoE architecture, 653M to 1.7B params

Open Source
• KDA kernels, vLLM integration, pre-trained checkpoints

Final Model
• 48B total params, 3B activated params, 1M context support

Key Innovation Summary
• KDA: fine-grained channel-wise gating mechanism
• Chunkwise parallelization with DPLR optimization
• Hybrid 3:1 architecture balancing quality and efficiency
• Superior performance across short- and long-context tasks
• Significant efficiency gains: 75% memory reduction, 6× speedup
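The 3:1 layer ratio gives a quick back-of-envelope explanation of the ~75% KV-cache figure: only the MLA layers keep a per-token cache that grows with context, so roughly one layer in four contributes. This sketch assumes the constant-size KDA state is negligible at long context.

```python
def kv_cache_fraction(kda_per_group=3, mla_per_group=1):
    """Fraction of a full-attention model's growing KV cache that remains
    when only the MLA layers in each group keep a per-token cache."""
    total = kda_per_group + mla_per_group
    return mla_per_group / total

saving = 1.0 - kv_cache_fraction()   # → 0.75, i.e. ~75% reduction
```

The same arithmetic shows why the ratio is a quality-efficiency dial: a 1:1 mix would cut the cache by only 50%, while pure KDA would give 100% but lose the periodic global-attention layers.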
Q1
1. What is the main innovation of Kimi Linear's architecture compared to previous approaches?
Using pure linear attention without any global attention
Implementing a channel-wise gating mechanism in KDA
Removing all position embeddings from the model
Q2
2. What ratio does Kimi Linear use between KDA layers and global attention layers?
1:1 ratio for balanced computation
2:1 ratio to minimize memory usage
3:1 ratio for optimal performance-efficiency trade-off
Q3
3. What is the primary advantage of Kimi Linear in terms of computational efficiency?
Reduces training time by 75% compared to standard models
Achieves up to 6× faster decoding throughput for 1M context
Eliminates the need for GPU acceleration
Paper 3

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Published: 2025-10-30

Link: http://arxiv.org/pdf/2510.26768

1. 📘 Topic and Domain: The paper introduces AMO-Bench, a mathematical reasoning benchmark for evaluating Large Language Models' (LLMs) performance on high-difficulty math problems at or above International Mathematical Olympiad level.
2. 💡 Previous Research and New Ideas: Based on existing math benchmarks like AIME and MATH500 where LLMs are reaching performance saturation, this paper proposes a more challenging benchmark with entirely original problems that are cross-validated by experts.
3. ❓ Problem: The paper addresses the limitation of existing math benchmarks becoming less effective for evaluating top-tier LLMs due to performance saturation and potential data memorization issues.
4. 🛠️ Methods: The authors created 50 original math problems verified by experts, designed automatic grading methods combining parser-based and LLM-based approaches, and evaluated 26 different LLMs on the benchmark.
5. 📊 Results and Evaluation: The best-performing model achieved only 52.4% accuracy on AMO-Bench with most LLMs scoring below 40%, demonstrating significant room for improvement while showing promising scaling trends with increased test-time compute.
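A minimal sketch of the combined grading idea: dispatch numeric final answers to a parser-based check and everything else to an LLM judge. All function names are hypothetical, the parser is a toy regex, and the LLM judge is replaced by a normalized string match purely for illustration.

```python
import re

def parse_numeric(answer):
    """Toy parser: pull the last numeric value out of a model answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer)
    return float(matches[-1]) if matches else None

def llm_grade(answer, reference):
    """Stand-in for an LLM judge; here just a normalized string match."""
    return answer.strip().lower() == reference.strip().lower()

def grade(answer, reference, answer_type):
    """Dispatch: parser-based for numerical answers, LLM-based otherwise."""
    if answer_type == "numerical":
        parsed = parse_numeric(answer)
        return parsed is not None and abs(parsed - float(reference)) < 1e-9
    return llm_grade(answer, reference)

assert grade("The answer is 42.", "42", "numerical")
```

Splitting the grader this way keeps the cheap, deterministic parser on the easy answer types while reserving the expensive judge for set, variable, and descriptive answers that a regex cannot verify.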

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

AMO-Bench Construction and Evaluation Workflow

Data Creation
• Problems written by human experts from top universities
• Quality review: data correctness, MO syllabus validation
• Originality review: existing competitions, web search
• Difficulty review: IMO standard, model performance

AMO-Bench Composition
• 50 original problems
• Categories: algebra (22%), functions (26%), combinatorics (24%), number theory (18%)
• Answer types: numerical, set, variable, descriptive

Final-Answer-Based Grading
• Parser-based (39 problems), LLM-based (11 problems)

Evaluation Setup
• 26 LLMs evaluated, 32 samples per model, temperature 0.7-1.0
• Model types: proprietary, open source, reasoning/non-reasoning
• Metrics: AVG@32 (main), Pass@K (potential analysis), token-usage efficiency study

Results Analysis
• Best: 52.4% (GPT-5); most models < 40%
• High token usage
• Key findings: test-time scaling, Pass@32 > 70%, substantial room for improvement

Benchmark Comparison
• AMO-Bench vs AIME24/25, HMMT25, MATH500: significantly more challenging

Future Research Directions
• Advanced reasoning capabilities, test-time compute scaling, mathematical problem solving, error analysis and improvement
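The two sampling metrics can be computed as below. AVG@k is just the mean accuracy over k samples; for Pass@k this uses the standard unbiased combinatorial estimator, which is an assumption here since the paper's exact formula is not quoted.

```python
from math import comb

def avg_at_k(correct_flags):
    """AVG@k: mean accuracy over k sampled solutions to one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n samples of which c are correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, a model with 8 correct answers out of 32 samples scores AVG@32 = 0.25 on that problem but Pass@32 = 1.0, which is why the paper can report Pass@32 above 70% even though most models average below 40%.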
Q1
1. What unique feature distinguishes AMO-Bench from other mathematical benchmarks?
It only includes problems from past International Mathematical Olympiads
All problems are entirely original and newly crafted by experts
It focuses exclusively on geometry problems
Q2
2. What was the most surprising finding about LLMs' performance on AMO-Bench?
Models required significantly more output tokens compared to other benchmarks
All models achieved perfect scores
Open-source models consistently outperformed proprietary models
Q3
3. How does AMO-Bench handle the grading of solutions?
All solutions are manually graded by experts
Only parser-based automatic grading is used
Combines parser-based and LLM-based grading depending on answer type