2025-07-23 Papers


Paper 1

Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

Published: 2025-07-22

Link: http://arxiv.org/pdf/2507.16784

1. 📘 Topic and Domain: Development of a large language model (TIM) and inference runtime (TIMRUN) for efficient long-horizon reasoning and tool use in natural language processing.
2. 💡 Previous Research and New Ideas: Building on traditional LLM architectures and multi-agent frameworks, the paper proposes a novel approach that models reasoning as recursive trees rather than linear sequences, with dynamic pruning of completed subtasks.
3. ❓ Problem: Addresses the context window limitations of LLMs that bottleneck reasoning accuracy and efficiency, particularly for long-horizon reasoning tasks and multi-hop tool use.
4. 🛠️ Methods: Implements a Thread Inference Model (TIM) that decomposes complex tasks into subtasks, coupled with the TIMRUN inference engine, which enables dynamic memory management and efficient tool integration through subtask pruning.
5. 📊 Results and Evaluation: Achieves comparable or better performance than baseline models while using less than 50% of cache slots, maintains stable throughput with multiple tool calls, and matches performance of complex agent frameworks without requiring specialized agent design.
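The recursive-tree idea in points 2-4 can be sketched in a few lines: a minimal, illustrative model of TIM-style decomposition in which each finished subtask is pruned from working memory and only its conclusion survives. All names here are invented for illustration; the real system prunes KV-cache entries inside the TIMRUN runtime rather than Python lists.

```python
# Illustrative sketch of recursive task decomposition with subtask pruning.
# "working memory" stands in for the KV cache; names are not from the paper.
from dataclasses import dataclass, field

@dataclass
class Task:
    thought: str
    subtasks: list = field(default_factory=list)
    conclusion: str = ""

def solve(task, memory):
    """Depth-first reasoning: recurse into each subtask, then prune all of
    its intermediate entries, keeping only its conclusion in memory."""
    memory.append(task.thought)
    for sub in task.subtasks:
        mark = len(memory)          # memory size before the subtask ran
        solve(sub, memory)
        conclusion = memory[-1]     # last entry is the subtask's conclusion
        del memory[mark:]           # prune the subtask's working tokens...
        memory.append(conclusion)   # ...but keep its conclusion
    task.conclusion = f"concluded: {task.thought}"
    memory.append(task.conclusion)

root = Task("plan trip", [Task("book flight"), Task("book hotel")])
mem = []
solve(root, mem)
print(mem)  # only conclusions remain, not the subtasks' intermediate steps
```

After solving, `mem` holds the root thought plus one conclusion per task, mirroring how pruning keeps the context short no matter how deep the tree grows.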

TIM + TIMRUN: Beyond Context Limits for Long-Horizon Reasoning (figure overview)

• Data synthesis: 20k math questions, 20k research questions, 6k ToolBench questions
• TIM training: supervised fine-tuning, GRPO reinforcement learning, Thread-2 structure
• Thread-2 components: thought, tool use, subtasks, conclusion; JSON schema generation
• TIMRUN runtime: memory management, subtask pruning, tool integration
• Recursive task decomposition: each task (Task 1 ... Task N) splits into subtasks (1.1, 1.2, ...) over a working memory, with KV cache pruning and memory reuse
• Subtask pruning mechanism: rule-based pruning buffer (size 0-2), dynamic KV cache management, positional embedding reuse, up to 90% of the KV cache manipulated
• End-to-end tool integration: multi-hop tool calls in a single inference, direct parameter extraction, automatic response integration, reduced token transmission overhead
• Structured JSON generation: constrained decoding with schemas, Pydantic class definitions, recursive task hierarchy, tool parameter validation
• Mathematical reasoning results: MATH500 69.0% accuracy, AIME 2024 46.7% accuracy, 50%+ KV cache reduction
• Research task performance: Datacommons QA 67.9% accuracy, outperforms GPT-4o on BrowseComp, no task-specific prompting needed
• Key system benefits: unlimited reasoning (beyond output token limits, virtually unlimited memory), higher efficiency (improved throughput, reduced memory cost), single-model agent (complete workflow in one inference call)
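The constrained JSON generation mentioned in the figure relies on a recursive task schema. The paper uses Pydantic class definitions; this stdlib-only sketch mirrors the shape the figure describes, with field names that are guesses rather than the paper's actual schema:

```python
# Recursive schema sketch for a reasoning thread (illustrative field names).
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class ToolUse:
    tool: str
    parameters: dict

@dataclass
class Subtask:
    thought: str
    conclusion: str
    tooluse: Optional[ToolUse] = None            # optional tool call
    subtasks: list["Subtask"] = field(default_factory=list)  # recursion

step = Subtask(
    thought="Look up the 2020 population of Paris",
    conclusion="about 2.1 million",
    tooluse=ToolUse("web_search", {"q": "Paris population 2020"}),
)
print(json.dumps(asdict(step), indent=2))
```

Constrained decoding would force the model's output tokens to conform to this schema, so tool parameters can be extracted directly from the generated JSON.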
Q1
1. What is the key innovation in how TIM processes reasoning compared to traditional LLMs?
It uses a linear sequence of tokens with better compression
It models reasoning as recursive trees with prunable subtasks
It employs multiple separate models for different reasoning steps
Q2
2. What percentage of KV cache was typically pruned during TIM's operation while maintaining performance?
Around 20-30%
Around 35-45%
More than 50%
Q3
3. What unique advantage does TIMRUN provide for tool usage compared to traditional systems?
It requires fewer tools overall
It makes tool calls directly from runtime without returning to client
It creates specialized tools for each task

Paper 2

Step-Audio 2 Technical Report

Published: 2025-07-22

Link: http://arxiv.org/pdf/2507.16632

1. 📘 Topic and Domain: Step-Audio 2 is an end-to-end multi-modal large language model for audio understanding and speech conversation in the domain of artificial intelligence and speech processing.
2. 💡 Previous Research and New Ideas: Building on previous LALMs such as GPT-4o, Qwen-Audio, and Step-Audio, it introduces two main ideas: integrating discrete audio token generation into language modeling, and incorporating retrieval-augmented generation with external tools.
3. ❓ Problem: The paper addresses the challenges in achieving natural and intelligent speech interaction, particularly in handling paralinguistic information and accessing real-world textual and acoustic knowledge.
4. 🛠️ Methods: The authors used a latent audio encoder, reasoning-centric reinforcement learning, multi-stage training on 680 billion text tokens and 8 million hours of audio data, and integrated retrieval-augmented generation with external tools like web and audio search.
5. 📊 Results and Evaluation: Step-Audio 2 achieved state-of-the-art performance across various benchmarks, including ASR (3.18% WER for English, 3.11% CER for Chinese), audio understanding (77.4% on MMAU), and speech conversation tasks, outperforming both open-source and commercial solutions.
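"Integrating discrete audio token generation into language modeling" (point 2) means the decoder emits one mixed stream of text and audio tokens, with the audio tokens routed to a detokenizer for waveform synthesis. A toy sketch of demultiplexing such a stream; the `AUDIO_BASE` offset convention is an assumption for illustration, not the paper's actual vocabulary layout:

```python
# Split an interleaved decoder output into text tokens and audio codec tokens.
# Assumption: audio token ids live in a reserved range starting at AUDIO_BASE.
AUDIO_BASE = 100_000

def split_stream(tokens):
    """Route text ids to the transcript, audio ids to the detokenizer."""
    text = [t for t in tokens if t < AUDIO_BASE]
    audio = [t - AUDIO_BASE for t in tokens if t >= AUDIO_BASE]
    return text, audio

stream = [17, 100_004, 23, 100_009, 100_001, 5]
print(split_stream(stream))  # ([17, 23, 5], [4, 9, 1])
```

Because both modalities share one autoregressive stream, the model can condition its speech output on the text it is producing (and vice versa) within a single inference pass.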

Step-Audio 2 Methodology (figure overview)

Architecture: Audio Encoder (frozen, 25 Hz) -> Audio Adaptor (downsamples to 12.5 Hz) -> LLM Decoder (interleaved text + audio tokens) -> Audio Detokenizer (flow matching + HiFi-GAN); external tools: web search and audio search.

Multi-stage training pipeline:
• Pre-training Stage 1: 100B ASR tokens, adaptor training only, 12K steps, seq_len=8192
• Pre-training Stage 2: 128B text + 128B audio tokens, extended tokenizer, seq_len=16384
• Pre-training Stage 3: 800B tokens total, multi-task training (ASR, TTS, S2ST, etc.)
• Pre-training Stage 4: 200B high-quality tokens, multilingual + dialectal data, 50k unique speakers
• Supervised fine-tuning (SFT): 4B tokens for a single epoch, multi-task instruction following, reasoning-centric datasets
• Reinforcement learning: PPO stage 1 with binary rewards (60 iterations), PPO stage 2 with a learned preference model (120 iterations), GRPO for audio perception (400 iterations)

Training data summary: 1.356T tokens in total (680B text + audio), 8 million hours of speech and audio, 21 days of training.

Key innovations: end-to-end audio understanding and generation, paralinguistic information processing, retrieval-augmented generation with an audio search tool.

Evaluation: ASR (multi-language and dialect), paralinguistic understanding (11 dimensions), audio understanding (MMAU), speech translation (S2ST and S2TT), tool calling (web and audio search), speech-to-speech conversation (URO-Bench); state-of-the-art performance across all benchmarks, outperforming GPT-4o Audio, Kimi-Audio, and other SOTA models.
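The adaptor's 25 Hz -> 12.5 Hz step is a 2x temporal downsampling of the frozen encoder's feature frames. A naive frame-stacking sketch: the real adaptor is a learned module, and this stacking scheme is only one plausible way to halve the frame rate:

```python
# 2x temporal downsampling by stacking adjacent frame pairs.
# (T, D) features at 25 Hz -> (T//2, 2*D) features at 12.5 Hz.
import numpy as np

def downsample_stack(features: np.ndarray) -> np.ndarray:
    """Concatenate each pair of adjacent frames into one wider frame."""
    T, D = features.shape
    T2 = T - (T % 2)                  # drop a trailing odd frame, if any
    return features[:T2].reshape(T2 // 2, 2 * D)

x = np.random.randn(50, 8)            # 2 s of 25 Hz features, feature dim 8
y = downsample_stack(x)
print(y.shape)                        # (25, 16): 12.5 Hz frames
```

Halving the frame rate halves the sequence length the LLM decoder must attend over, which matters at the long sequence lengths (8192-16384) used in pre-training.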
Q1
1. What is the main innovation of Step-Audio 2 compared to previous audio language models?
Integration of discrete audio token generation into language modeling
Using a larger training dataset
Implementing faster processing speed
Q2
2. What unique tool did Step-Audio 2 introduce for enhancing speech interaction?
Text translation tool
Audio search tool with voice library
Image recognition tool
Q3
3. What was Step-Audio 2's word error rate (WER) for English ASR?
5.35%
4.18%
3.18%

Paper 3

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Published: 2025-07-22

Link: http://arxiv.org/pdf/2507.16814

1. 📘 Topic and Domain: Semi-off-policy reinforcement learning for enhancing visual slow-thinking reasoning capabilities in large vision-language models (LVLMs).
2. 💡 Previous Research and New Ideas: Based on previous research in on-policy and off-policy reinforcement learning for LVLMs, the paper proposes a novel semi-off-policy approach that combines on-policy visual understanding with off-policy reasoning.
3. ❓ Problem: The paper aims to solve the limitations of both on-policy RL (restricted by initial policy distribution) and off-policy RL (visual hallucination issues) in developing visual slow-thinking reasoning abilities in LVLMs.
4. 🛠️ Methods: SOPHIA combines on-policy visual understanding from the LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning trajectories, propagates visual rewards backward, and updates the LVLM policy with off-policy RL algorithms.
5. 📊 Results and Evaluation: SOPHIA improved InternVL3.0-38B by 8.50% on average across benchmarks, achieving state-of-the-art performance among open-source LVLMs and outperforming some closed-source models on MathVision (49.08%) and OlympiadBench (49.95%).
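The reward scheme in point 4 can be sketched directly: each caption's visual reward is the average outcome reward of the off-policy trajectories it produced, so captions that led to correct answers are reinforced. Function names and exact-match grading are illustrative; the paper's grading details may differ.

```python
# Sketch of SOPHIA-style reward propagation from outcomes back to captions.

def outcome_reward(answer: str, gold: str) -> float:
    """R(y) = 1 if the trajectory's final answer is correct, else 0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def visual_reward(trajectory_answers: list[str], gold: str) -> float:
    """R(c) = avg(R(y)) over the N trajectories sampled from caption c."""
    rewards = [outcome_reward(a, gold) for a in trajectory_answers]
    return sum(rewards) / len(rewards)

# One caption, N = 4 sampled reasoning trajectories, gold answer "42".
answers = ["42", "42", "41", "42"]
print(visual_reward(answers, "42"))   # 0.75
```

Since the reward is computed only from final answers, no human annotation of reasoning steps is needed, which is what makes the pipeline scalable.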

SOPHIA: Semi-off-Policy RL for Vision-Language Slow-thinking Reasoning (figure overview)

• Training data: 80K VQA questions; policy initialization: InternVL + warm-up
• Semi-off-policy sampling: on-policy visual understanding from the LVLM (K=8 image captions per question) paired with off-policy slow thinking from QwQ/R1 (N=8 reasoning trajectories per caption)
• Reward evaluation: outcome reward R(y) = 1 if the answer is correct; visual reward R(c) = avg(R(y)) over a caption's trajectories, propagated backward
• Policy optimization: filtered off-policy dataset, importance sampling with ratio π(y|x,v)/μ(y|x,v), policy-gradient update, yielding an enhanced LVLM with slow-thinking ability
• Key results: InternVL3.0-38B + SOPHIA achieves a +8.50% average improvement, 49.08% on MathVision, 49.95% on OlympiadBench; SOTA performance, outperforming GPT-4.1
• Key technical components: semi-off-policy sampling (combines on-policy visual understanding with off-policy reasoning), reward propagation (backpropagates outcome rewards to the visual side), maintained visual alignment during reasoning learning, scalable training with no human annotations or closed-source models
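The importance-sampling ratio π(y|x,v)/μ(y|x,v) in the policy update corrects for the fact that trajectories were sampled from the behavior policy μ rather than the current policy π. A minimal sketch computing it from summed token log-probabilities; the clipping threshold is an assumption added for numerical stability, not a detail from the paper:

```python
# Importance weight for off-policy policy-gradient updates:
# w = pi(y|x,v) / mu(y|x,v), computed in log space per trajectory.
import math

def importance_weight(logp_pi: list[float],
                      logp_mu: list[float],
                      clip: float = 10.0) -> float:
    """exp(sum log pi - sum log mu) over the trajectory's tokens, clipped."""
    w = math.exp(sum(logp_pi) - sum(logp_mu))
    return min(w, clip)

# Two-token trajectory: pi assigns slightly higher likelihood than mu.
print(importance_weight([-1.0, -0.5], [-1.2, -0.6]))  # exp(0.3) ~ 1.35
```

Trajectories the current policy already favors get weights above 1 and thus larger gradient contributions, which is how the off-policy reasoning data steers the LVLM.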
Q1
1. What is the main innovation of SOPHIA compared to previous approaches?
It uses only on-policy reinforcement learning
It combines on-policy visual understanding with off-policy reasoning
It relies entirely on off-policy learning from other models
Q2
2. What improvement did SOPHIA achieve on InternVL3.0-38B's performance?
An average improvement of 4.25% across benchmarks
An average improvement of 8.50% across benchmarks
An average improvement of 12.75% across benchmarks
Q3
3. How does SOPHIA handle the reward system for training?
It only uses visual rewards based on image understanding
It relies solely on outcome-based rewards for reasoning
It combines outcome-based rewards with propagated visual rewards