2025-11-19 Papers


Paper 1

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Published: 2025-11-17

Link: http://arxiv.org/pdf/2511.13254

1. 📘 Topic and Domain: Model souping (averaging weights from multiple language models) to enhance LLM performance without expensive retraining.
2. 💡 Previous Research and New Ideas: Builds on prior research into uniform model weight averaging and proposes non-uniform weighted averaging guided by benchmark-category performance patterns.
3. ❓ Problem: Current LLM training is resource-intensive and time-consuming, requiring massive compute power and careful orchestration of training procedures.
4. 🛠️ Methods: Introduces Soup Of Category Experts (SoCE), which identifies expert models for weakly-correlated benchmark categories and combines them using optimized weighted averaging.
5. 📊 Results and Evaluation: Achieved state-of-the-art results on Berkeley Function Calling Leaderboard, with 80.68% accuracy for 70B models and 76.50% for 8B models, while showing consistent improvements across multilingual capabilities, tool calling, and math benchmarks.
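Model souping itself is just weight-space arithmetic. A minimal sketch of the non-uniform weighted average at the heart of the method (the function name and plain-dict model representation below are illustrative, not from the paper):

```python
def soup_models(state_dicts, weights):
    """Weighted average of model parameters ("model souping").

    state_dicts: list of {param_name: list-of-floats} mappings,
                 all with identical shapes (same architecture)
    weights: list of floats summing to 1.0
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    souped = {}
    for name in state_dicts[0]:
        souped[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return souped

# Uniform souping (Wortsman et al., 2022) is the special case w_i = 1/n;
# SoCE instead searches for non-uniform weights per expert model.
m1 = {"layer.w": [1.0, 2.0]}
m2 = {"layer.w": [3.0, 4.0]}
uniform = soup_models([m1, m2], [0.5, 0.5])
weighted = soup_models([m1, m2], [0.8, 0.2])
```

Because averaging happens purely in parameter space, no retraining or gradient computation is needed, which is what makes the approach cheap relative to further training.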

Figure: SoCE (Soup Of Category Experts) methodology overview

Inputs: benchmark dataset D with categories {C1, C2, ..., Ck}; candidate models M.

- Step 1 — Correlation analysis: compute Pearson correlations between category pairs and identify the set L of low-correlation categories (|ρ| < τ). Observed correlations range from 0.96–0.98 (strong) down to ~0.07 (Multi-turn vs. Live Accuracy).
- Step 2 — Expert selection: for each category Ci ∈ L, select the expert model M*i = argmax performance on Ci.
- Step 3 — Weight optimization: search weights w = {w1, ..., wl} with Σwi = 1, over the range 0.1 to 0.9 in steps of 0.1.
- Step 4 — Model souping: form the souped model Msoup = Σ w*i · M*i, a weighted combination of the expert models.

Benchmarks: BFCL (function calling), MGSM (multilingual math reasoning), ∞-Bench (long-context processing), FLORES-101 (machine translation).

Key results: BFCL 70B: 80.68% (+2.7%) and 8B: 76.50% (+5.7%), both state of the art; MGSM: 51.7% accuracy (+1.57% over the best baseline); souped models also show higher Pearson correlations (consistency) across benchmark categories.

Baselines compared: uniform souping (Wortsman et al., 2022; equal weights for all models), uniform weights with SoCE model selection, and full SoCE (model selection + optimized weights). A Shapley-value analysis quantifies individual model contributions; SoCE candidates show higher Shapley values.
Q1. What is the key innovation of SoCE compared to previous model souping approaches?
a) It uses larger language models for better performance
b) It applies non-uniform weighted averaging based on benchmark categories
c) It requires less computing resources for training

Q2. What was the accuracy improvement achieved by SoCE for 70B models on the Berkeley Function Calling Leaderboard compared to the previous best model?
a) 5.7% improvement
b) 1.2% improvement
c) 2.7% improvement

Q3. Which of the following best describes how SoCE selects models for souping?
a) It randomly selects models to combine
b) It identifies expert models for weakly-correlated benchmark categories
c) It always selects the three highest performing models overall

Paper 2

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Published: 2025-11-18

Link: http://arxiv.org/pdf/2511.14295

1. 📘 Topic and Domain: The paper introduces AraLingBench, a human-annotated benchmark for evaluating Arabic language models' linguistic capabilities across grammar, morphology, spelling, reading comprehension, and syntax.
2. 💡 Previous Research and New Ideas: Previous research focused on knowledge-based benchmarks like BALSAM and CamelEval, while this paper proposes the first benchmark specifically targeting core linguistic competence rather than factual recall.
3. ❓ Problem: The paper addresses the lack of systematic evaluation methods for assessing true linguistic understanding in Arabic language models, as existing benchmarks focus mainly on knowledge and reasoning tasks.
4. 🛠️ Methods: The authors created 150 expert-designed multiple-choice questions across five linguistic categories, with rigorous quality control through expert validation and difficulty annotation, then evaluated 35 Arabic and bilingual LLMs.
5. 📊 Results and Evaluation: The evaluation revealed that current models show strong surface-level proficiency (74% accuracy for top models) but struggle with deeper grammatical and syntactic reasoning, with performance varying significantly across linguistic categories and difficulty levels.

Figure: AraLingBench construction methodology

Four construction phases:
1. Question generation — 5 Arabic linguistics experts write original question–answer pairs across 5 linguistic categories.
2. Difficulty & diversity — native speakers review for clarity and validate difficulty.
3. Expert quality control — a senior linguist verifies accuracy and category alignment.
4. Difficulty annotation — 3 independent annotators rate each question on a 3-point scale; labels are decided by majority vote.

Five linguistic categories (30 questions each): Grammar (Nahw), Morphology (Sarf), Spelling (Imlaa), Reading Comprehension (Fahm al-logha), Syntax (Tarkib Lughawi).

Evaluation setup: 35+ models, zero-shot prompting, multiple-choice format, accuracy metrics.

Four key research questions:
- RQ1 — Balanced competence? Category-level performance analysis.
- RQ2 — Skill correlations? Inter-category relationships.
- RQ3 — Benchmark predictability? Cross-benchmark comparison.
- RQ4 — Difficulty alignment? Human vs. model difficulty perception.

Overall: 150 human-annotated questions forming a diagnostic framework for Arabic LLM linguistic competence.
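The Phase 4 majority vote and the per-category accuracy metric are simple to sketch. The item schema, difficulty labels, and tie-break rule below are illustrative assumptions, not the benchmark's actual format:

```python
from collections import Counter

def majority_vote(labels):
    """Phase 4: three annotators rate difficulty; the majority wins.

    With 3 annotators on a 3-point scale, a three-way tie means all
    annotators disagree; here we fall back to the middle rating
    (an assumed tie-break, not specified in the paper).
    """
    label, n = Counter(labels).most_common(1)[0]
    return label if n > 1 else "medium"

def accuracy_by_category(items, predictions):
    """Zero-shot MCQ scoring: fraction correct per linguistic category."""
    correct, total = Counter(), Counter()
    for item, pred in zip(items, predictions):
        total[item["category"]] += 1
        if pred == item["answer"]:
            correct[item["category"]] += 1
    return {c: correct[c] / total[c] for c in total}

items = [
    {"category": "Grammar (Nahw)", "answer": "b"},
    {"category": "Grammar (Nahw)", "answer": "a"},
    {"category": "Syntax (Tarkib Lughawi)", "answer": "c"},
]
per_category = accuracy_by_category(items, ["b", "c", "c"])
difficulty = majority_vote(["easy", "hard", "easy"])
```

Reporting accuracy per category rather than as one aggregate number is what lets the paper separate surface-level proficiency (e.g. spelling) from deeper grammatical and syntactic reasoning.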
Q1. What was the most challenging linguistic category for Arabic LLMs according to the AraLingBench evaluation?
a) Reading Comprehension
b) Syntax
c) Spelling

Q2. How did the research team ensure the quality of questions in AraLingBench?
a) They used automated tools to generate and validate questions
b) They copied questions from existing Arabic textbooks
c) They employed a multi-stage process with expert linguists and native speaker validation

Q3. What surprising finding emerged from the difficulty level analysis of AraLingBench?
a) Hard questions sometimes yielded higher accuracy than Medium ones
b) All models performed consistently worse as difficulty increased
c) Easy questions were the most challenging for all models

Paper 3

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Published: 2025-11-14

Link: http://arxiv.org/pdf/2511.11793

1. 📘 Topic and Domain: Research on an open-source research agent (MiroThinker) focused on advancing tool-augmented reasoning and information-seeking capabilities in AI language models.
2. 💡 Previous Research and New Ideas: Based on previous agent foundation models and deep research models, introduces "interactive scaling" as a third dimension of performance improvement alongside model size and context length.
3. ❓ Problem: The performance gap between open-source and proprietary research agents, particularly in handling complex research tasks requiring deep reasoning and tool use.
4. 🛠️ Methods: Implements a three-stage training pipeline: supervised fine-tuning, preference optimization, and reinforcement learning, combined with a 256K context window supporting up to 600 tool calls per task.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance among open-source research agents across multiple benchmarks (GAIA: 81.9%, HLE: 37.7%, BrowseComp: 47.1%, BrowseComp-ZH: 55.6%), approaching commercial counterparts like GPT-5-high.

Figure: MiroThinker v1.0 — interactive scaling research agent workflow

Data construction: MultiDocQA (document corpus → graph construction → fact extraction → question generation); agentic trajectories via a ReAct single agent and the MiroFlow multi-agent system, with function calling over the Model Context Protocol. Open-source data (MuSiQue, HotpotQA, WebWalkerQA, etc.) is converted into agentic trajectories.

Three-stage training pipeline:
- Stage 1 — Supervised fine-tuning (SFT): learn expert thought–action–observation trajectories, L_SFT = -E[Σ_t log π_θ(T_t, A_t | x, H_<t)], with data filtering and repair for consistency.
- Stage 2 — Preference optimization (DPO): pairwise preferences based on correctness, L_PO = E[L_DPO(x, H+, H-)] + λ·L_SFT^(+), with quality control for trace completeness.
- Stage 3 — Reinforcement learning (GRPO): Group Relative Policy Optimization, L_GRPO = E[Â(x, H)·log π_θ(H | x) - β_KL·D_KL], with streaming rollout acceleration and trajectory curation.

Agent architecture: ReAct workflow (think → act → observe, iterated); a tool interface to an execution environment for file management and information retrieval; context management within 256K tokens via recency-based retention and result truncation. Interactive scaling permits up to 600 tool calls per task. Model variants: 8B, 30B, 72B (based on Qwen2.5/Qwen3).

Evaluation and results (72B variant): GAIA 81.9%, HLE 37.7%, BrowseComp 47.1%, BrowseComp-ZH 55.6%, xBench-DeepSearch 77.8%, SEAL-0 51.0% — state of the art among open-source agents.

Key findings — three scaling dimensions: model size (8B → 30B → 72B), context length (up to 256K), and interactive scaling (up to 600 tool calls). Performance grows with interaction depth, and RL training enables deeper exploration, yielding 8–10 point accuracy improvements: SFT models use ~100 tool calls per task, while RL models use up to 600 with deeper reasoning. This establishes interaction depth as a third critical dimension alongside model size and context length.
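The think → act → observe loop with an interaction budget and recency-based context retention can be sketched as follows. The `llm` and `tools` callables are stand-ins for the real model and toolchain, and the truncation limits are illustrative; only the 600-call budget and the 256K window come from the paper:

```python
def run_agent(llm, tools, task, max_tool_calls=600, context_limit=256_000):
    """Minimal ReAct loop: think -> act -> observe, until a final answer.

    llm(history) -> {"thought": str, "action": str | None,
                     "args": str, "answer": str | None}
    tools: {action_name: callable(args) -> observation string}
    """
    history = [f"task: {task}"]
    for _ in range(max_tool_calls):
        step = llm(history)
        history.append(f"thought: {step['thought']}")
        if step["action"] is None:          # agent decided to answer
            return step["answer"]
        observation = tools[step["action"]](step["args"])
        # Result truncation: cap the size of any single observation.
        history.append(f"observation: {observation[:4000]}")
        # Recency-based retention: drop the oldest turns (keeping the
        # task) once the running context exceeds the window.
        while sum(len(h) for h in history) > context_limit:
            history.pop(1)
    return None  # interaction budget exhausted

# Toy run: a scripted "llm" that searches once, then answers.
def scripted_llm(history):
    if not any(h.startswith("observation:") for h in history):
        return {"thought": "need facts", "action": "search",
                "args": "GAIA benchmark", "answer": None}
    return {"thought": "done", "action": None, "args": "",
            "answer": history[-1].removeprefix("observation: ")}

tools = {"search": lambda q: f"results for {q!r}"}
final = run_agent(scripted_llm, tools, "look something up")
```

The budget parameter makes "interactive scaling" concrete: holding the model and context fixed, raising `max_tool_calls` is the third axis along which the paper reports performance gains.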
Q1. What is the key innovation in scaling introduced by MiroThinker compared to previous models?
a) Larger model size scaling
b) Interactive scaling through deeper agent-environment interactions
c) Longer context window scaling

Q2. What is the maximum number of tool calls per task that MiroThinker can perform with its 256K context window?
a) 100 calls
b) 300 calls
c) 600 calls

Q3. Which stage in MiroThinker's training pipeline helps the model discover creative solutions through direct interaction with environments?
a) Supervised fine-tuning
b) Preference optimization
c) Reinforcement learning