2025-11-19 Papers


Paper 1

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Published: 2025-11-17

Link: http://arxiv.org/pdf/2511.13254

1. 📘 Topic and Domain: Model souping (averaging weights from multiple language models) to enhance LLM performance without expensive retraining.
2. 💡 Previous Research and New Ideas: Builds on prior research into uniform model weight averaging and proposes non-uniform weighted averaging guided by benchmark-category performance patterns.
3. ❓ Problem: Current LLM training is resource-intensive and time-consuming, requiring massive compute power and careful orchestration of training procedures.
4. 🛠️ Methods: Introduces Soup Of Category Experts (SoCE), which identifies expert models for weakly-correlated benchmark categories and combines them using optimized weighted averaging.
5. 📊 Results and Evaluation: Achieved state-of-the-art results on Berkeley Function Calling Leaderboard, with 80.68% accuracy for 70B models and 76.50% for 8B models, while showing consistent improvements across multilingual capabilities, tool calling, and math benchmarks.
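Model souping itself is just weight-space arithmetic. A minimal sketch of the non-uniform weighted average at the heart of the method (the function name and plain-dict model representation below are illustrative, not from the paper):

```python
def soup_models(state_dicts, weights):
    """Weighted average of model parameters ("model souping").

    state_dicts: list of {param_name: list-of-floats} mappings,
                 all with identical shapes (same architecture)
    weights: list of floats summing to 1.0
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    souped = {}
    for name in state_dicts[0]:
        souped[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return souped

# Uniform souping (Wortsman et al., 2022) is the special case w_i = 1/n;
# SoCE instead searches for non-uniform weights per expert model.
m1 = {"layer.w": [1.0, 2.0]}
m2 = {"layer.w": [3.0, 4.0]}
uniform = soup_models([m1, m2], [0.5, 0.5])
weighted = soup_models([m1, m2], [0.8, 0.2])
```

Because averaging happens purely in parameter space, no retraining or gradient computation is needed, which is what makes the approach cheap relative to further training.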

Figure: SoCE (Soup Of Category Experts) methodology overview

Inputs: benchmark dataset D with categories {C1, C2, ..., Ck}; candidate models M.

- Step 1 — Correlation analysis: compute Pearson correlations between category pairs and identify the set L of low-correlation categories (|ρ| < τ). Observed correlations range from 0.96–0.98 (strong) down to ~0.07 (Multi-turn vs. Live Accuracy).
- Step 2 — Expert selection: for each category Ci ∈ L, select the expert model M*i = argmax performance on Ci.
- Step 3 — Weight optimization: search weights w = {w1, ..., wl} with Σwi = 1, over the range 0.1 to 0.9 in steps of 0.1.
- Step 4 — Model souping: form the souped model Msoup = Σ w*i · M*i, a weighted combination of the expert models.

Benchmarks: BFCL (function calling), MGSM (multilingual math reasoning), ∞-Bench (long-context processing), FLORES-101 (machine translation).

Key results: BFCL 70B: 80.68% (+2.7%) and 8B: 76.50% (+5.7%), both state of the art; MGSM: 51.7% accuracy (+1.57% over the best baseline); souped models also show higher Pearson correlations (consistency) across benchmark categories.

Baselines compared: uniform souping (Wortsman et al., 2022; equal weights for all models), uniform weights with SoCE model selection, and full SoCE (model selection + optimized weights). A Shapley-value analysis quantifies individual model contributions; SoCE candidates show higher Shapley values.
Q1. What is the key innovation of SoCE compared to previous model souping approaches?
a) It uses larger language models for better performance
b) It applies non-uniform weighted averaging based on benchmark categories
c) It requires less computing resources for training

Q2. What was the accuracy improvement achieved by SoCE for 70B models on the Berkeley Function Calling Leaderboard compared to the previous best model?
a) 5.7% improvement
b) 1.2% improvement
c) 2.7% improvement

Q3. Which of the following best describes how SoCE selects models for souping?
a) It randomly selects models to combine
b) It identifies expert models for weakly-correlated benchmark categories
c) It always selects the three highest performing models overall

Paper 2

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.14295

1. 📘 Topic and Domain: The paper introduces AraLingBench, a human-annotated benchmark for evaluating Arabic language models' linguistic capabilities across grammar, morphology, spelling, reading comprehension, and syntax.
2. 💡 Previous Research and New Ideas: Previous research focused on knowledge-based benchmarks like BALSAM and CamelEval, while this paper proposes the first benchmark specifically targeting core linguistic competence rather than factual recall.
3. ❓ Problem: The paper addresses the lack of systematic evaluation methods for assessing true linguistic understanding in Arabic language models, as existing benchmarks focus mainly on knowledge and reasoning tasks.
4. 🛠️ Methods: The authors created 150 expert-designed multiple-choice questions across five linguistic categories, with rigorous quality control through expert validation and difficulty annotation, then evaluated 35 Arabic and bilingual LLMs.
5. 📊 Results and Evaluation: The evaluation revealed that current models show strong surface-level proficiency (74% accuracy for top models) but struggle with deeper grammatical and syntactic reasoning, with performance varying significantly across linguistic categories and difficulty levels.

Figure: AraLingBench construction methodology

Four construction phases:
1. Question generation — 5 Arabic linguistics experts write original question–answer pairs across 5 linguistic categories.
2. Difficulty & diversity — native speakers review for clarity and validate difficulty.
3. Expert quality control — a senior linguist verifies accuracy and category alignment.
4. Difficulty annotation — 3 independent annotators rate each question on a 3-point scale; labels are decided by majority vote.

Five linguistic categories (30 questions each): Grammar (Nahw), Morphology (Sarf), Spelling (Imlaa), Reading Comprehension (Fahm al-logha), Syntax (Tarkib Lughawi).

Evaluation setup: 35+ models, zero-shot prompting, multiple-choice format, accuracy metrics.

Four key research questions:
- RQ1 — Balanced competence? Category-level performance analysis.
- RQ2 — Skill correlations? Inter-category relationships.
- RQ3 — Benchmark predictability? Cross-benchmark comparison.
- RQ4 — Difficulty alignment? Human vs. model difficulty perception.

Overall: 150 human-annotated questions forming a diagnostic framework for Arabic LLM linguistic competence.
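The Phase 4 majority vote and the per-category accuracy metric are simple to sketch. The item schema, difficulty labels, and tie-break rule below are illustrative assumptions, not the benchmark's actual format:

```python
from collections import Counter

def majority_vote(labels):
    """Phase 4: three annotators rate difficulty; the majority wins.

    With 3 annotators on a 3-point scale, a three-way tie means all
    annotators disagree; here we fall back to the middle rating
    (an assumed tie-break, not specified in the paper).
    """
    label, n = Counter(labels).most_common(1)[0]
    return label if n > 1 else "medium"

def accuracy_by_category(items, predictions):
    """Zero-shot MCQ scoring: fraction correct per linguistic category."""
    correct, total = Counter(), Counter()
    for item, pred in zip(items, predictions):
        total[item["category"]] += 1
        if pred == item["answer"]:
            correct[item["category"]] += 1
    return {c: correct[c] / total[c] for c in total}

items = [
    {"category": "Grammar (Nahw)", "answer": "b"},
    {"category": "Grammar (Nahw)", "answer": "a"},
    {"category": "Syntax (Tarkib Lughawi)", "answer": "c"},
]
per_category = accuracy_by_category(items, ["b", "c", "c"])
difficulty = majority_vote(["easy", "hard", "easy"])
```

Reporting accuracy per category rather than as one aggregate number is what lets the paper separate surface-level proficiency (e.g. spelling) from deeper grammatical and syntactic reasoning.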
Q1. What was the most challenging linguistic category for Arabic LLMs according to the AraLingBench evaluation?
a) Reading Comprehension
b) Syntax
c) Spelling

Q2. How did the research team ensure the quality of questions in AraLingBench?
a) They used automated tools to generate and validate questions
b) They copied questions from existing Arabic textbooks
c) They employed a multi-stage process with expert linguists and native speaker validation

Q3. What surprising finding emerged from the difficulty level analysis of AraLingBench?
a) Hard questions sometimes yielded higher accuracy than Medium ones
b) All models performed consistently worse as difficulty increased
c) Easy questions were the most challenging for all models

Paper 3

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Published: 2025-11-14

Link: http://arxiv.org/pdf/2511.11793

1. 📘 Topic and Domain: Research on an open-source research agent (MiroThinker) focused on advancing tool-augmented reasoning and information-seeking capabilities in AI language models.
2. 💡 Previous Research and New Ideas: Based on previous agent foundation models and deep research models, introduces "interactive scaling" as a third dimension of performance improvement alongside model size and context length.
3. ❓ Problem: The performance gap between open-source and proprietary research agents, particularly in handling complex research tasks requiring deep reasoning and tool use.
4. 🛠️ Methods: Implements a three-stage training pipeline: supervised fine-tuning, preference optimization, and reinforcement learning, combined with a 256K context window supporting up to 600 tool calls per task.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance among open-source research agents across multiple benchmarks (GAIA: 81.9%, HLE: 37.7%, BrowseComp: 47.1%, BrowseComp-ZH: 55.6%), approaching commercial counterparts like GPT-5-high.

Figure: MiroThinker v1.0 — interactive scaling research agent workflow

Data construction: MultiDocQA (document corpus → graph construction → fact extraction → question generation); agentic trajectories via a ReAct single agent and the MiroFlow multi-agent system, with function calling over the Model Context Protocol. Open-source data (MuSiQue, HotpotQA, WebWalkerQA, etc.) is converted into agentic trajectories.

Three-stage training pipeline:
- Stage 1 — Supervised fine-tuning (SFT): learn expert thought–action–observation trajectories, L_SFT = -E[Σ_t log π_θ(T_t, A_t | x, H_<t)], with data filtering and repair for consistency.
- Stage 2 — Preference optimization (DPO): pairwise preferences based on correctness, L_PO = E[L_DPO(x, H+, H-)] + λ·L_SFT^(+), with quality control for trace completeness.
- Stage 3 — Reinforcement learning (GRPO): Group Relative Policy Optimization, L_GRPO = E[Â(x, H)·log π_θ(H | x) - β_KL·D_KL], with streaming rollout acceleration and trajectory curation.

Agent architecture: ReAct workflow (think → act → observe, iterated); a tool interface to an execution environment for file management and information retrieval; context management within 256K tokens via recency-based retention and result truncation. Interactive scaling permits up to 600 tool calls per task. Model variants: 8B, 30B, 72B (based on Qwen2.5/Qwen3).

Evaluation and results (72B variant): GAIA 81.9%, HLE 37.7%, BrowseComp 47.1%, BrowseComp-ZH 55.6%, xBench-DeepSearch 77.8%, SEAL-0 51.0% — state of the art among open-source agents.

Key findings — three scaling dimensions: model size (8B → 30B → 72B), context length (up to 256K), and interactive scaling (up to 600 tool calls). Performance grows with interaction depth, and RL training enables deeper exploration, yielding 8–10 point accuracy improvements: SFT models use ~100 tool calls per task, while RL models use up to 600 with deeper reasoning. This establishes interaction depth as a third critical dimension alongside model size and context length.
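The think → act → observe loop with an interaction budget and recency-based context retention can be sketched as follows. The `llm` and `tools` callables are stand-ins for the real model and toolchain, and the truncation limits are illustrative; only the 600-call budget and the 256K window come from the paper:

```python
def run_agent(llm, tools, task, max_tool_calls=600, context_limit=256_000):
    """Minimal ReAct loop: think -> act -> observe, until a final answer.

    llm(history) -> {"thought": str, "action": str | None,
                     "args": str, "answer": str | None}
    tools: {action_name: callable(args) -> observation string}
    """
    history = [f"task: {task}"]
    for _ in range(max_tool_calls):
        step = llm(history)
        history.append(f"thought: {step['thought']}")
        if step["action"] is None:          # agent decided to answer
            return step["answer"]
        observation = tools[step["action"]](step["args"])
        # Result truncation: cap the size of any single observation.
        history.append(f"observation: {observation[:4000]}")
        # Recency-based retention: drop the oldest turns (keeping the
        # task) once the running context exceeds the window.
        while sum(len(h) for h in history) > context_limit:
            history.pop(1)
    return None  # interaction budget exhausted

# Toy run: a scripted "llm" that searches once, then answers.
def scripted_llm(history):
    if not any(h.startswith("observation:") for h in history):
        return {"thought": "need facts", "action": "search",
                "args": "GAIA benchmark", "answer": None}
    return {"thought": "done", "action": None, "args": "",
            "answer": history[-1].removeprefix("observation: ")}

tools = {"search": lambda q: f"results for {q!r}"}
final = run_agent(scripted_llm, tools, "look something up")
```

The budget parameter makes "interactive scaling" concrete: holding the model and context fixed, raising `max_tool_calls` is the third axis along which the paper reports performance gains.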
Q1. What is the key innovation in scaling introduced by MiroThinker compared to previous models?
a) Larger model size scaling
b) Interactive scaling through deeper agent-environment interactions
c) Longer context window scaling

Q2. What is the maximum number of tool calls per task that MiroThinker can perform with its 256K context window?
a) 100 calls
b) 300 calls
c) 600 calls

Q3. Which stage in MiroThinker's training pipeline helps the model discover creative solutions through direct interaction with environments?
a) Supervised fine-tuning
b) Preference optimization
c) Reinforcement learning