2025-09-18 Papers


Paper 1

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

Published: 2025-09-17

Link: http://arxiv.org/pdf/2509.14008

1. 📘 Topic and Domain: Development of Arabic-centric large language models (HALA) focusing on instruction-following and translation capabilities.
2. 💡 Previous Research and New Ideas: Based on existing multilingual LLMs and Arabic NLP work (e.g., AraBERT, JAIS), it proposes a translate-and-tune pipeline for creating specialized Arabic models.
3. ❓ Problem: Addressing the scarcity of high-quality Arabic instruction data and the need for better Arabic-centric language models.
4. 🛠️ Methods: Quantized the teacher translator to FP8 (≈2x faster inference), built million-scale bilingual supervision (≈1.26M EN→AR pairs), used a lightweight fine-tuned translator to render English instruction datasets into Arabic, and fine-tuned models at various scales (350M to 9B parameters), merging each back into its base with slerp.
5. 📊 Results and Evaluation: HALA models achieved state-of-the-art results in both nano (≤2B) and small (7B-9B) categories on Arabic benchmarks, outperforming base models while maintaining general capabilities.
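
The slerp merge in step 4 interpolates along the arc between the fine-tuned and base weights rather than along a straight line. A minimal per-tensor sketch (the actual merge is done with MergeKit; the flattened-tensor treatment and names `w_base`/`w_tuned` here are illustrative assumptions):

```python
import numpy as np

def slerp(w_base, w_tuned, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors."""
    a = w_base / (np.linalg.norm(w_base) + eps)
    b = w_tuned / (np.linalg.norm(w_tuned) + eps)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between tensors
    if omega < eps:  # nearly parallel: slerp degenerates to plain lerp
        return (1 - t) * w_base + t * w_tuned
    return (np.sin((1 - t) * omega) * w_base
            + np.sin(t * omega) * w_tuned) / np.sin(omega)
```

At t = 0.5 (the setting reported here), the merge sits midway along the arc, which is how the pipeline balances Arabic-specific gains against base-model strengths.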

Pipeline overview (figure summary):

- Teacher translator: Cohere Command-A-Translate, quantized to FP8 for 2x faster inference with no quality loss.
- Bilingual supervision: EN→AR translations of Open-Orca (405K pairs) and filtered OPUS-100, 1.26M examples in total after quality filtering.
- Lightweight translator: LiquidAI LFM2-1.2B fine-tuned on the bilingual data.
- Large-scale translation: the lightweight translator converts Hermes-3, SCP-116K, ReAlign-Alpaca, LaMini, Tulu-3, and Synthetic-Instruct-GPT-J into an Arabic instruction corpus of ~4.5M high-fidelity samples.
- Models: HALA-350M, HALA-700M, and HALA-1.2B (LiquidAI LFM2 bases) plus HALA-9B (FANAR architecture), each fine-tuned and then merged.
- Merging: slerp via MergeKit at t = 0.5, balancing Arabic gains with base-model strengths.
- Evaluation: AlGhafa, AraTrust, ArabicMMLU, ArbMMLU-HT, EXAMS, and MadinahQA; SOTA in the nano (≤2B) and small (≤9B) categories.
- Open release: models, data, code, evaluation, and training recipes.
- Key innovations: FP8 quantization, the translate-and-tune pipeline, million-scale bilingual supervision, slerp merging, and a language-centric "depth over breadth" approach.
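
FP8 in the e4m3 format keeps a 4-bit exponent and a 3-bit mantissa, with a maximum normal value of 448. A rough fake-quantization sketch of the rounding this imposes (ignoring subnormals and NaN encoding; the per-tensor scaling scheme here is an assumption, not the report's recipe):

```python
import numpy as np

def fake_quant_fp8_e4m3(x):
    """Simulate e4m3 rounding: scale into +-448, keep a 3-bit mantissa."""
    x = np.asarray(x, dtype=np.float32)
    amax = float(np.max(np.abs(x)))
    scale = 448.0 / max(amax, 1e-12)          # per-tensor scale into e4m3 range
    y = np.clip(x * scale, -448.0, 448.0)
    e = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** -6)))  # exponent per value
    step = 2.0 ** (e - 3)                     # value spacing with a 3-bit mantissa
    y = np.round(y / step) * step
    return (y / scale).astype(np.float32)
```

Power-of-two multiples of the tensor maximum round-trip exactly; everything else lands within roughly 1/16 relative error, which is why the report can claim 2x faster inference with no measurable quality loss.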
Q1. What innovative compression technique did HALA use to improve translation efficiency?
- Quantized model weights to 8-bit floating point (FP8)
- Used binary quantization
- Applied model pruning techniques

Q2. What was the primary challenge that HALA aimed to address in Arabic NLP?
- Lack of computing resources
- Scarcity of high-quality Arabic instruction data
- Poor model architecture design

Q3. Which technique did HALA use to balance Arabic specialization with base-model strengths?
- Layer freezing
- Knowledge distillation
- Spherical linear interpolation (slerp) merging

Paper 2

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Published: 2025-09-16

Link: http://arxiv.org/pdf/2509.13305

1. 📘 Topic and Domain: Development of WebSailor-V2, an improved open-source web agent for autonomous information-seeking and research tasks.
2. 💡 Previous Research and New Ideas: Based on the original WebSailor framework and ReAct paradigm, it introduces novel ideas including SailorFog-QA-V2 (an enhanced dataset with complex knowledge graphs) and a dual-environment reinforcement learning framework combining simulated and real-world training.
3. ❓ Problem: The paper aims to close the performance gap between open-source and proprietary web agents while addressing challenges in data quality and training scalability for autonomous research agents.
4. 🛠️ Methods: The authors use a comprehensive pipeline including: (1) SailorFog-QA-V2 dataset construction with dense knowledge graphs, (2) Supervised Fine-Tuning for initial training, and (3) a dual-environment Reinforcement Learning approach with both simulated and real-world components.
5. 📊 Results and Evaluation: WebSailor-V2 achieved state-of-the-art results on multiple benchmarks, scoring 35.3 on BrowseComp-EN, 44.1 on BrowseComp-ZH, and 30.6 on HLE, outperforming existing open-source agents and matching or exceeding some proprietary systems despite using a smaller model (30B parameters).
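
Step (1) of the methods seeds QA generation by random-walking over a dense knowledge graph to extract subgraphs. A toy sketch of that extraction step (the adjacency-list representation and helper name are assumptions, not the paper's code):

```python
import random

def random_walk_subgraph(graph, start, steps, seed=0):
    """Extract a subgraph by random-walking an adjacency-list graph."""
    rng = random.Random(seed)
    nodes, edges = {start}, set()
    current = start
    for _ in range(steps):
        neighbors = graph.get(current, [])
        if not neighbors:          # dead end: stop the walk
            break
        nxt = rng.choice(neighbors)
        edges.add((current, nxt))
        nodes.add(nxt)
        current = nxt
    return nodes, edges
```

QA pairs would then be generated over the extracted `(nodes, edges)`; because the underlying graph is densely interconnected and cyclic rather than tree-like, walks can revisit entities and surface multi-hop relations.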

Methodology overview (figure summary):

- Data construction: build dense knowledge graphs, extract subgraphs via random walks, and generate diverse-uncertainty QA pairs, yielding SailorFog-QA-V2.
- Training pipeline: SFT cold start on Qwen3-30B-A3B, then a dual-environment RL framework with GRPO-based policy optimization.
- RL environments: a simulated environment (Wikipedia) plus a real environment (web APIs), with automated data curation across both.
- Agent framework: ReAct with Search, Visit, Scholar, and Python tools.
- Symbiotic feedback loop: data and policy co-evolve for continuous improvement.
- Results: BrowseComp-EN 35.3, BrowseComp-ZH 44.1, HLE 30.6; state-of-the-art among open-source agents.
- Key innovations: dense knowledge graphs with cyclic structures, a scalable sim-plus-real RL framework, diverse uncertainty beyond obfuscation, and the data-policy feedback loop.
- WebSailor-V2-30B-A3B is competitive with proprietary systems and outperforms the 671B DeepSeek-V3.1.
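
The dual-environment design can be pictured as a rollout collector that mixes the cheap simulated (Wikipedia) environment with the real web-API environment. A minimal sketch, where `run_episode` and the fixed mixing ratio are illustrative assumptions rather than the paper's actual scheduler:

```python
import random

def collect_rollouts(policy, sim_env, real_env, n_rollouts, sim_ratio=0.5, seed=0):
    """Gather RL rollouts from a mix of simulated and real environments."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n_rollouts):
        # the fast simulated environment absorbs most exploration, while the
        # real environment keeps the policy grounded in live web behavior
        env = sim_env if rng.random() < sim_ratio else real_env
        rollouts.append(env.run_episode(policy))
    return rollouts
```

The split addresses the scalability problem named above: most training throughput comes from the simulated side, which is stable and cheap, without losing contact with real-world tool behavior.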
Q1. What is the most innovative aspect of WebSailor-V2's data generation approach compared to previous methods?
- It uses more complex obfuscation techniques
- It creates densely interconnected cyclic knowledge graphs rather than tree-like structures
- It generates larger volumes of training data

Q2. Despite using only a 30B parameter model, how did WebSailor-V2 achieve competitive performance with larger models?
- By using more sophisticated prompting techniques
- By implementing a complex multi-agent architecture
- By focusing on enhancing core information retrieval and synthesis capabilities through better data and training

Q3. What unique approach did WebSailor-V2 take to handle the challenges of RL training?
- Used only real-world environment training
- Developed a dual-environment system with both simulated and real-world components
- Relied solely on supervised learning without RL

Paper 3

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Published: 2025-09-16

Link: http://arxiv.org/pdf/2509.13313

1. 📘 Topic and Domain: The paper introduces ReSum, a paradigm for enabling long-horizon web search capabilities in Large Language Model (LLM) agents through context summarization.
2. 💡 Previous Research and New Ideas: Based on the ReAct paradigm for LLM agents, it proposes a novel approach of periodically summarizing conversation history to overcome context window limitations.
3. ❓ Problem: The paper addresses the fundamental limitation of context window size in LLM-based web agents that prevents them from conducting extended multi-turn exploration needed for complex queries.
4. 🛠️ Methods: The paper develops ReSumTool-30B for specialized summarization and ReSum-GRPO, an algorithm that integrates GRPO with segmented trajectory training to help agents adapt to summary-based reasoning.
5. 📊 Results and Evaluation: ReSum achieved an average 4.5% improvement over ReAct across three benchmarks, with further gains up to 8.2% after ReSum-GRPO training, enabling WebResummer-30B to achieve 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en.
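
The control flow behind points 3 and 4 can be sketched as a ReAct loop that resets its history whenever the context fills up; `act`, `summarize`, and `token_count` are hypothetical callables standing in for the policy model, ReSumTool-30B, and a tokenizer:

```python
def resum_loop(query, act, summarize, token_count, limit, max_turns=50):
    """ReAct loop that resets its context to (query, summary) on overflow."""
    history = [query]                      # H0 = (q)
    for _ in range(max_turns):
        thought_action, observation, answer = act(history)  # (tau_t, a_t), o_t
        if answer is not None:
            return answer
        history += [thought_action, observation]
        if token_count(history) > limit:   # context limit reached
            summary = summarize(history)   # s ~ pi_sum(. | H_t)
            history = [query, summary]     # reset: H_t <- (q, s)
    return None
```

Because the loop only swaps the history for `(q, s)`, any ReAct agent can adopt it without architectural changes, which is the plug-and-play property the paper emphasizes.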

Paradigm overview (figure summary):

- Setup: given a user query q, a traditional ReAct agent eventually overflows its context; ReSum instead summarizes periodically.
- Loop: initialize H₀ = (q); think and act (τₜ, aₜ); receive observation oₜ = R(aₜ); on reaching the context limit, ReSumTool-30B draws s ~ π_sum(·|Hₜ), extracting key evidence and identifying gaps; reset the context Hₜ ← (q, s) and continue until an answer is generated.
- ReSum-GRPO training: segment each trajectory into H⁽¹⁾, …, H⁽ᴷ⁺¹⁾ and broadcast the trajectory advantage to every segment (Âₘ⁽ⁱ⁾ = Âₘ), teaching summary-conditioned reasoning with Search and Visit tools.
- Performance: +4.5% on average over ReAct, up to +8.2% with ReSum-GRPO; WebResummer-30B reaches 33.3% on BrowseComp-zh and 18.3% on BrowseComp-en with only 1K training samples.
- Key benefits: indefinite exploration, context-constraint bypass, plug-and-play compatibility, minimal ReAct modifications, specialized summarization.
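
In ReSum-GRPO, each full trajectory's reward is normalized within its rollout group and the resulting advantage is copied to every segment produced by summarization (Âₘ⁽ⁱ⁾ = Âₘ). A minimal sketch, assuming scalar terminal rewards (the function name and inputs are illustrative):

```python
import numpy as np

def broadcast_segment_advantages(rewards, segment_counts, eps=1e-8):
    """Group-relative advantages, copied to every segment of each trajectory."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)   # GRPO group normalization
    # each trajectory was cut into K+1 segments; all segments share its advantage
    return [np.full(k, a) for a, k in zip(adv, segment_counts)]
```

Broadcasting lets the standard GRPO objective train over summary-conditioned segments without redesigning the reward, since credit for the final answer flows to every segment of the trajectory.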
Q1. What is the main challenge that ReSum aims to address in LLM-based web agents?
- Slow processing speed of web search results
- Limited context window preventing extended exploration
- High computational costs of running web agents

Q2. How does ReSumTool-30B improve upon generic LLM summarization capabilities?
- By using a much larger model architecture
- By adding complex architectural modifications
- By specializing in extracting key evidence and identifying information gaps

Q3. What makes ReSum-GRPO different from standard GRPO training?
- It uses a completely different reward calculation method
- It segments long trajectories and broadcasts advantages across segments
- It requires 10x more training data than standard GRPO