2025-10-06 Papers


Paper 1

LongCodeZip: Compress Long Context for Code Language Models

Published: 2025-09-30

Link: http://arxiv.org/pdf/2510.00446

1. 📘 Topic and Domain: The paper presents LongCodeZip, a context compression framework for code language models, focused on the efficient processing of long code contexts.
2. 💡 Previous Research and New Ideas: Building on existing context compression methods such as LLMLingua and on code-specific approaches, it introduces a novel two-stage compression strategy designed specifically for code, taking code structure and dependencies into account.
3. ❓ Problem: The paper addresses the challenge of handling long code contexts in language models, where processing extensive codebases leads to high API costs, increased latency, and degraded performance.
4. 🛠️ Methods: The authors use a two-stage approach: (1) coarse-grained compression, which selects relevant functions using conditional perplexity, and (2) fine-grained compression, which segments functions into blocks and selects an optimal subset under a token budget.
5. 📊 Results and Evaluation: LongCodeZip achieves up to a 5.6× compression ratio while maintaining performance across multiple tasks (code completion, summarization, question answering), consistently outperforming baselines and cutting generation time from 15.7s to 6.6s.
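The coarse-grained stage ranks each function by how much it lowers the query's perplexity, AMI(c, q) = PPL(q) - PPL(q|c), then fills the token budget greedily. A minimal sketch of that idea, with an injected `nll` scorer standing in for a real code LM and character counts as a stand-in for token counts (all names here are illustrative, not the authors' implementation):

```python
import math

def rank_functions_by_ami(functions, query, nll):
    """Rank code chunks by approximate mutual information:
    AMI(c, q) = PPL(q) - PPL(q | c). A higher score means chunk c
    makes the query q more predictable, i.e., c is more relevant.
    `nll` is any callable returning the average negative log-likelihood
    of a text given an optional context (a stand-in for a real code LM).
    """
    ppl_q = math.exp(nll(query))
    scored = []
    for c in functions:
        ppl_q_given_c = math.exp(nll(query, context=c))
        scored.append((ppl_q - ppl_q_given_c, c))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored

def select_within_budget(ranked, budget, n_tokens=len):
    """Greedy budget-constrained selection of top-ranked chunks.
    `n_tokens` defaults to character count as a crude token proxy."""
    chosen, used = [], 0
    for _score, c in ranked:
        cost = n_tokens(c)
        if used + cost <= budget:
            chosen.append(c)
            used += cost
    return chosen
```

In practice the `nll` scorer would be a forward pass of the same model family that consumes the compressed context; any tokenizer-based count can replace the character-count proxy.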


[Figure: LongCodeZip methodology flow]
- Input: long code context, task instruction, token budget B
- Stage 1, coarse-grained compression: function-level chunking (split by functions/classes); AMI-based ranking, AMI(c, q) = PPL(q) - PPL(q|c); budget-constrained selection of top-N functions
- Stage 2, fine-grained compression: perplexity-based block detection at semantic boundaries; adaptive, importance-weighted budget allocation; block selection via 0/1 knapsack (DP) to maximize relevance
- Output: compressed context, up to 5.6× compression with preserved performance
- Key innovations: conditional perplexity ranking, perplexity-based block detection, adaptive budget allocation, 0/1 knapsack optimization; training-free and model-agnostic
- Evaluation tasks: long code completion, long module summarization, repository QA (RepoQA), cross-model generalization, efficiency analysis
- Benefits: reduced API costs, faster generation, lower memory usage, preserved code structure, better than baselines
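The block-selection step of the fine-grained stage is a 0/1 knapsack: maximize total block relevance under the token budget. A self-contained sketch of the classic dynamic-programming formulation (block scores and costs are illustrative inputs; the paper's actual scoring may differ):

```python
def knapsack_select(blocks, budget):
    """0/1 knapsack over code blocks: each block is (token_cost, score).
    Returns the indices of the subset maximizing total score within the
    token budget, plus the best achievable score, via classic DP."""
    n = len(blocks)
    # dp[b] = best score achievable with budget b; keep[i][b] marks choices
    dp = [0.0] * (budget + 1)
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i, (cost, score) in enumerate(blocks):
        # iterate budgets downward so each block is used at most once
        for b in range(budget, cost - 1, -1):
            if dp[b - cost] + score > dp[b]:
                dp[b] = dp[b - cost] + score
                keep[i][b] = True
    # backtrack to recover the chosen block indices
    chosen, b = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= blocks[i][0]
    return sorted(chosen), dp[budget]
```

With blocks `[(3, 4.0), (4, 5.0), (2, 3.0)]` and a budget of 5 tokens, the DP picks blocks 0 and 2 (total score 7.0) over block 1 alone (5.0).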
Q1. What is the main innovation of LongCodeZip compared to existing code compression methods?
- It uses machine learning to automatically compress code
- It employs a two-stage compression strategy considering code-specific structures
- It focuses only on removing comments and whitespace

Q2. In the experimental results, what was the maximum compression ratio achieved by LongCodeZip while maintaining performance?
- 2.3×
- 4.1×
- 5.6×

Q3. Which of the following is NOT a component of LongCodeZip's fine-grained compression stage?
- Perplexity-based block detection
- Syntax tree parsing and optimization
- Adaptive budget allocation

Paper 2

Apriel-1.5-15b-Thinker

Published: 2025-10-01

Link: http://arxiv.org/pdf/2510.01141

1. 📘 Topic and Domain: The paper presents Apriel-1.5-15B-Thinker, a 15-billion-parameter open-weights multimodal reasoning model in the domain of artificial intelligence and large language models.
2. 💡 Previous Research and New Ideas: Built on the Pixtral-12B architecture, it introduces a novel three-stage training methodology that emphasizes mid-training design over massive scale, challenging the conventional assumption that bigger models are always better.
3. ❓ Problem: The paper addresses the challenge of creating high-performing multimodal AI models that achieve frontier-level reasoning capabilities while remaining computationally efficient enough to run on a single GPU.
4. 🛠️ Methods: The authors use a three-stage approach: depth upscaling of the base model, staged continual pre-training for foundational and visual reasoning, and high-quality supervised fine-tuning with explicit reasoning traces.
5. 📊 Results and Evaluation: The model achieves a score of 52 on the Artificial Analysis Intelligence Index, matching larger models such as DeepSeek-R1-0528, and performs within 5 points of Gemini-2.5-Flash and Claude 3.7 Sonnet across ten image benchmarks, despite its smaller size.
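The paper's exact depth-upscaling recipe isn't detailed here; one common approach is to duplicate a contiguous span of middle layers to grow the stack from 40 to 48 layers. A toy sketch under that assumption, with plain labels standing in for transformer layers:

```python
def depth_upscale(layers, target_depth):
    """Expand a decoder from len(layers) to target_depth layers by
    repeating a contiguous middle span once (a common upscaling recipe;
    the paper's exact duplication scheme may differ).
    `layers` is any list of layer objects (here, plain labels)."""
    n = len(layers)
    extra = target_depth - n
    if extra <= 0:
        return list(layers)
    # centre the duplicated span in the stack
    start = (n - extra) // 2
    end = start + extra
    # result: [0..end) + a copy of [start..end) + [end..n)
    return layers[:end] + layers[start:end] + layers[end:]
```

For a 40-layer base and a 48-layer target, this copies layers 16-23 once, after which the projection network would be realigned and training continued, as in the staged pipeline above.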


[Figure: Apriel-1.5-15B-Thinker methodology flow]
- Stage 1, architecture: base Pixtral-12B; depth upscaling from 40 to 48 layers; projection realignment
- Stage 2, CPT stage 1 (foundational reasoning): 50% text + 20% replay + 30% multimodal; sequence length 32,768
- Stage 3, CPT stage 2 (visual reasoning): synthetic data generation; vision encoder frozen; sequence length 16,384
- Stage 4, SFT: high-quality data with explicit reasoning; text-only training; 4 epochs plus merging
- Synthetic visual tasks: image reconstruction (holistic scene priors, part-whole reasoning, region masking); visual matching (correspondence, fine-grained discrimination, cross-view matching); object detection (grounding, localization, presence identification); counting (visual elements, category-specific, precise enumeration)
- Text data domains: mathematics (reasoning traces, verification); coding (execution-based quality control); science (domain expertise, synthetic generation); tool use (function calling, interactive workflows)
- Key innovation, progressive training: staged curriculum design; cost-effective scaling; no RL or preference optimization; single-GPU deployment
- Results, frontier performance: Artificial Analysis Intelligence Index 52; AIME'25 88%, MMMU 70.2%; matches DeepSeek-R1-0528 with only 15B parameters
- Technical highlights: checkpoint averaging, selective loss computation, data decontamination, LLM-as-judge verification
- Evaluation framework: text via the Artificial Analysis Intelligence Index, 10 benchmarks (MMLU-Pro, GPQA, AIME, etc.); vision via VLMEvalKit, 10 benchmarks (MMMU, MathVista, AI2D, etc.)
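Checkpoint averaging, one of the technical highlights, takes a uniform mean of parameter tensors across saved checkpoints. A minimal sketch using plain dicts of float lists as stand-ins for real model state dicts:

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints (checkpoint
    averaging / model merging). Each checkpoint is a dict mapping a
    parameter name to a flat list of floats, a stand-in for the
    tensor-valued state dicts of a real training run."""
    n = len(checkpoints)
    return {
        name: [
            sum(ck[name][i] for ck in checkpoints) / n
            for i in range(len(checkpoints[0][name]))
        ]
        for name in checkpoints[0]
    }
```

In a real pipeline the same element-wise mean would be applied to each tensor of the epoch-end state dicts before the merged weights are saved.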
Q1. What is the main innovation that distinguishes Apriel-1.5-15B-Thinker from other models?
- Its massive scale and parameter count
- Its focus on mid-training design and efficiency
- Its use of reinforcement learning techniques

Q2. During the second stage of continual pre-training (CPT), which of the model's components was frozen?
- Only the decoder was frozen
- Only the vision encoder was frozen
- Both the decoder and projection network were frozen

Q3. What is the most significant practical advantage of Apriel-1.5-15B-Thinker compared to other frontier models?
- It can run on a single GPU while maintaining competitive performance
- It has the highest accuracy on all benchmarks
- It requires no training data

Paper 3

StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Published: 2025-10-02

Link: http://arxiv.org/pdf/2510.02209

1. 📘 Topic and Domain: The paper introduces StockBench, a benchmark in the financial domain for evaluating Large Language Model (LLM) agents' ability to trade stocks profitably in real-world markets.
2. 💡 Previous Research and New Ideas: Whereas previous financial benchmarks focused mainly on static question-answering tasks, this paper proposes a dynamic benchmark that tests LLMs' actual trading capabilities under realistic market conditions.
3. ❓ Problem: The paper addresses the gap between existing financial benchmarks, which only test static knowledge, and the need to evaluate LLMs' ability to make continuous, profitable trading decisions in dynamic market environments.
4. 🛠️ Methods: The authors created a contamination-free benchmark with daily market signals (prices, fundamentals, news) in which LLM agents make sequential buy/sell/hold decisions over multiple months, evaluated with financial metrics such as cumulative return and the Sortino ratio.
5. 📊 Results and Evaluation: Most LLM agents struggled to outperform a simple buy-and-hold baseline, though some showed potential for higher returns and better risk management; Kimi-K2 and Qwen3-235B-Ins performed best among the tested models.
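The evaluation metrics can all be computed from a series of daily simple returns. The definitions below follow common usage (cumulative return by compounding, maximum drawdown from the equity curve, Sortino ratio as mean excess return over downside deviation); the benchmark's exact conventions, e.g. annualization, may differ:

```python
import math

def evaluate_strategy(daily_returns, target=0.0):
    """Compute (cumulative return, maximum drawdown, Sortino ratio)
    from a list of daily simple returns. `target` is the minimum
    acceptable return used in the Sortino denominator."""
    # equity curve via compounding; track the running peak for drawdown
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = min(max_dd, equity / peak - 1.0)  # most negative dip
    cum_return = equity - 1.0
    # Sortino ratio: only below-target returns count toward risk
    excess = [r - target for r in daily_returns]
    downside = [min(e, 0.0) ** 2 for e in excess]
    dd = math.sqrt(sum(downside) / len(downside))
    sortino = (sum(excess) / len(excess)) / dd if dd > 0 else float("inf")
    return cum_return, max_dd, sortino
```

For example, two days of +10% then -5% give a cumulative return of 4.5% and a maximum drawdown of -5%.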


[Figure: StockBench LLM-agent trading workflow]
- Back-trading environment: investment targets are the top 20 DJIA stocks, March-June 2025 (contamination-free)
- Price and fundamental data: opening prices, P/E ratio, market cap, dividend yield, 52-week high/low
- News corpus: top 5 articles from the previous 48 hours, updated daily
- Evaluation metrics: final return, maximum drawdown, Sortino ratio
- Agent workflow: (1) portfolio overview: scan all stocks, recent news, current holdings, and opening prices; (2) stock analysis: select stocks for deeper fundamental analysis; (3) decision generation: increase, decrease, or hold; (4) execution: convert decisions to share quantities, validate, and execute
- Models evaluated: proprietary (GPT-5, Claude-4-Sonnet, OpenAI-O3; most struggle vs. the baseline) and open-weight (Qwen3-235B, Kimi-K2, GLM-4.5; some outperform the baseline)
- Key findings: most LLM agents fail to beat the buy-and-hold baseline; better risk management (lower max drawdown); static QA performance ≠ trading success; reasoning models don't guarantee better trading; performance varies with market conditions
- Ablation studies: news vs. fundamentals, portfolio size impact, market condition effects
- Best performance (Kimi-K2): 1.9% return, -11.8% max drawdown, 0.042 Sortino ratio; baseline (buy-and-hold): 0.4% return, -15.2% max drawdown, 0.0155 Sortino ratio
- Future directions: enhanced architectures, more market scenarios, continuous updates
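The four-step agent workflow can be sketched as a single daily trading step, with a `decide` callable standing in for the LLM's reasoning (steps 1-3) and the rest handling execution (step 4). Position-sizing and validation rules here are illustrative, not the benchmark's exact ones:

```python
def trading_step(portfolio, cash, prices, decide):
    """One simulated trading day. `decide` maps market state to
    {ticker: 'increase' | 'decrease' | 'hold'}; decisions are then
    converted to share quantities, validated against available cash
    and holdings, and executed. All names here are illustrative."""
    decisions = decide(portfolio, cash, prices)   # steps 1-3: LLM reasoning
    for ticker, action in decisions.items():      # step 4: execution
        price = prices[ticker]
        if action == "increase" and cash >= price:
            # spend up to 10% of cash (at least one share), if affordable
            qty = int(cash * 0.1 // price) or 1
            if qty * price <= cash:
                portfolio[ticker] = portfolio.get(ticker, 0) + qty
                cash -= qty * price
        elif action == "decrease" and portfolio.get(ticker, 0) > 0:
            # liquidate the position entirely
            cash += portfolio[ticker] * price
            portfolio[ticker] = 0
        # 'hold' (or an unrecognized action) leaves the position unchanged
    return portfolio, cash
```

Running this step once per trading day over the evaluation window, then feeding the resulting daily portfolio values into the metrics above, reproduces the benchmark's back-trading loop in miniature.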
Q1. What was the main limitation discovered when testing LLM agents in StockBench?
- They couldn't process real-time market data
- Most struggled to outperform a simple buy-and-hold strategy
- They were unable to read financial news

Q2. What unique feature of StockBench sets it apart from previous financial benchmarks?
- It only tests static financial knowledge
- It focuses on single-stock trading only
- It requires continuous decision-making over multiple months in dynamic markets

Q3. During which time period was StockBench's evaluation conducted to ensure no data contamination?
- January to December 2024
- March to June 2025
- January to March 2023