2025-05-01 Papers

Paper 1

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Published: 2025-04-30

Link: http://arxiv.org/pdf/2504.21635

1. 📘 Topic and Domain: Arabic text diacritization using small language models in natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous work in Arabic diacritization using rule-based, machine learning, and deep learning approaches; proposes a novel small language model adapted from Kuwain 1.5B for more efficient diacritization.
3. ❓ Problem: Addressing challenges in Arabic text diacritization including data scarcity, writing style differences between Classical and Modern Arabic, contextual dependencies, and benchmark limitations.
4. 🛠️ Methods: Fine-tuned a decoder-only language model (Sadeed) on carefully curated diacritized datasets and introduced a new benchmark (SadeedDiac-25) for comprehensive evaluation.
5. 📊 Results and Evaluation: Sadeed achieved results competitive with proprietary large language models and outperformed traditional models, performing especially well on Classical Arabic texts; the evaluation also exposed limitations in existing benchmarks.
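Diacritization quality is conventionally scored with Diacritic Error Rate (DER) and Word Error Rate (WER), the metrics used in the paper's evaluation. A minimal sketch, assuming gold and predicted texts share the same base characters and differ only in diacritics (real evaluations add settings such as ignoring case endings):

```python
# Minimal sketch of DER (character-level) and WER (word-level) for Arabic
# diacritization. Assumes aligned gold/predicted texts; simplified
# relative to the various evaluation settings reported in the paper.

# Arabic diacritic marks: tanwin (3), fatha, damma, kasra, shadda, sukun
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def split_word(word):
    """Return [(base_char, diacritics_string), ...] for one word."""
    slots = []
    for ch in word:
        if ch in DIACRITICS and slots:
            base, marks = slots[-1]
            slots[-1] = (base, marks + ch)
        else:
            slots.append((ch, ""))
    return slots

def der_wer(gold, pred):
    """DER: fraction of characters with wrong diacritics.
    WER: fraction of words containing at least one such error."""
    g_words, p_words = gold.split(), pred.split()
    char_errs = char_total = word_errs = 0
    for gw, pw in zip(g_words, p_words):
        word_wrong = False
        for (gb, gm), (pb, pm) in zip(split_word(gw), split_word(pw)):
            char_total += 1
            if gm != pm:
                char_errs += 1
                word_wrong = True
        word_errs += word_wrong
    return char_errs / max(char_total, 1), word_errs / max(len(g_words), 1)
```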

Sadeed Model Workflow (Arabic diacritization):
- Base model: Kuwain 1.5B SLM, a pre-trained Arabic model.
- Training data preparation: sources are Tashkeela and ATB-3; preprocessing covers text cleaning and normalization, unifying diacritics and correcting words, handling iltiqa' assakinayn, chunking into 50-60-word segments, filtering for diacritization completeness, and removing overlap with the Fadel test set; the result is a cleaned training dataset.
- Fine-tuning (Sadeed): the task is reformulated as question answering using a template and trained with next-token prediction, yielding a model fine-tuned for diacritization.
- Inference and correction: (1) input non-diacritized text in the QA template; (2) generate diacritized text with Sadeed; (3) correct hallucinations via Needleman-Wunsch alignment.
- Evaluation: WER and DER under various settings, on the Fadel (original and corrected), WikiNews, and SadeedDiac-25 benchmarks.
- SadeedDiac-25 benchmark: goal is fair and comprehensive evaluation; composition is 50% MSA and 50% Classical Arabic (curated web data, WikiNews, Fadel test); curation proceeds from diverse web collection through initial automatic diacritization with an LLM to a two-stage expert review, producing the new benchmark dataset.
- Analysis and contributions: overlap analysis of the Fadel/Abbad splits, analysis of issues in the CATT benchmark, and release of the cleaned dataset and new benchmark.
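The correction step uses Needleman-Wunsch global alignment to flag hallucinated or dropped words in the model output. A minimal sketch of that alignment between input words and generated words; the scoring values and the equality-after-stripping match criterion are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: Needleman-Wunsch alignment of non-diacritized input
# words against diacritized output words. Gap pairs with None expose
# insertions (hallucinated words) and deletions (dropped words).
# match/mismatch/gap scores are illustrative assumptions.

DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def strip_diacritics(word):
    return "".join(ch for ch in word if ch not in DIACRITICS)

def align(src, out, match=1, mismatch=-1, gap=-1):
    """Return list of (src_word_or_None, out_word_or_None) pairs."""
    n, m = len(src), len(out)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # DP score table
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if src[i-1] == strip_diacritics(out[j-1]) else mismatch
            score[i][j] = max(score[i-1][j-1] + s,
                              score[i-1][j] + gap,
                              score[i][j-1] + gap)
    # Traceback from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if src[i-1] == strip_diacritics(out[j-1]) else mismatch):
            pairs.append((src[i-1], out[j-1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((src[i-1], None)); i -= 1   # dropped input word
        else:
            pairs.append((None, out[j-1])); j -= 1   # hallucinated word
    return pairs[::-1]
```

Pairs of the form `(None, word)` mark hallucinated output that can be discarded, while `(word, None)` marks input words the model skipped.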
Q1
1. What are the three main contributions presented in the Sadeed paper to advance the field of Arabic diacritization?
A new rule-based diacritization system, a large-scale unscored Arabic corpus, and a focus on machine translation integration.
A fine-tuned small language model (Sadeed), a new comprehensive benchmark (SadeedDiac-25), and a high-quality cleaned diacritization dataset.
An improved deep learning architecture, an automatic data augmentation technique, and a novel evaluation metric for diacritics.
Q2
2. One of the key issues the Sadeed paper identifies with existing Arabic diacritization benchmarks like CATT is related to:
Their exclusive focus on modern scientific texts, neglecting historical documents.
The complete removal of punctuation marks and the presence of various errors (spelling, grammatical, diacritization).
Their over-reliance on noisy, automatically generated diacritics without expert review.
Q3
3. When evaluated on the novel SadeedDiac-25 benchmark, what was a primary limitation observed for the Sadeed model compared to leading proprietary models?
It failed to diacritize any words correctly in the Modern Standard Arabic sections.
It demonstrated a significantly higher rate of text hallucination, contributing substantially to its Word Error Rate.
It required drastically more computational resources for inference than other models.
Paper 2

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20734

1. 📘 Topic and Domain: The paper presents UniversalRAG, a framework for retrieval-augmented generation that works across multiple modalities (text, image, video) and granularities of information.
2. 💡 Previous Research and New Ideas: Previous RAG approaches were limited to single modalities or unified embeddings that suffered from modality gaps; this paper proposes modality-aware routing and multi-granular retrieval.
3. ❓ Problem: The paper aims to solve the limitations of existing RAG systems that can't effectively handle queries requiring different types of knowledge sources (text, images, videos) and different levels of detail.
4. 🛠️ Methods: The paper implements a routing mechanism that dynamically selects the most appropriate modality and granularity level for each query, maintaining separate embedding spaces for different modalities and offering both training-free and trained router options.
5. 📊 Results and Evaluation: UniversalRAG outperformed baseline approaches across 8 multimodal benchmarks, with trained routers achieving better performance on in-domain queries while GPT-4o showed stronger generalization to out-of-domain queries.

UniversalRAG Workflow (retrieval-augmented generation over diverse modalities and granularities):
- Input: user query q.
- Router module: predicts the retrieval type r (train-free: GPT-4o; trained: DistilBERT or T5).
- Modality- and granularity-specific corpora: no retrieval (r = None), paragraph (C_paragraph), document (C_document), image (C_image), clip (C_clip), or video (C_video).
- Retrieval: the selected corpus returns context c (or none).
- Generation: a large vision-language model produces the answer, a = LVLM(q, c).
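The routing stage amounts to a dispatch over per-modality corpora: classify the query once, search only the chosen corpus, then generate. A minimal sketch; the keyword heuristic stands in for the paper's train-free (GPT-4o) or trained (DistilBERT/T5) routers, and the retriever and generator are injected stubs:

```python
# Minimal sketch of UniversalRAG-style routing: one retrieval type per
# query, one corpus searched, then answer generation with the context.
# The keyword router and the callable stubs are illustrative assumptions.

RETRIEVAL_TYPES = ["none", "paragraph", "document", "image", "clip", "video"]

def route(query: str) -> str:
    """Toy router: map a query to a retrieval type by keyword cues."""
    q = query.lower()
    if any(w in q for w in ("photo", "image", "look like")):
        return "image"
    if any(w in q for w in ("scene", "clip", "moment")):
        return "clip"
    if any(w in q for w in ("movie", "video", "full match")):
        return "video"
    if any(w in q for w in ("explain", "history", "in detail")):
        return "document"
    if any(w in q for w in ("who", "when", "what year")):
        return "paragraph"
    return "none"  # parametric knowledge suffices; skip retrieval

def universal_rag(query, corpora, retrieve, generate):
    """corpora: dict type -> corpus; retrieve/generate: callables."""
    r = route(query)
    context = retrieve(corpora[r], query) if r != "none" else None
    return generate(query, context)
```

Keeping one embedding space per corpus (rather than a unified space) is what lets each retriever stay modality-specific, which is the paper's answer to the modality gap.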
Q1
1. What is a primary limitation of most existing RAG systems that UniversalRAG is designed to overcome?
They only work with large language models, not smaller ones.
They are typically limited to retrieving information from a single modality-specific corpus.
They struggle with simple, factoid-based questions.
Q2
2. How does UniversalRAG primarily address the issue of the 'modality gap' observed in unified embedding spaces?
By training a stronger multimodal encoder to align all modalities better.
By maintaining separate embedding spaces for each modality and using a router to select the appropriate one.
By retrieving content from all available modalities and then filtering based on relevance.
Q3
3. Beyond handling diverse modalities, UniversalRAG also incorporates awareness of what other data dimension to improve retrieval?
Data source reliability scores.
Temporal relevance of information.
Data granularity (e.g., paragraph vs. document, clip vs. full video).
Paper 3

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.21233

1. 📘 Topic and Domain: Exploring how to enhance mathematical reasoning capabilities in small language models (3.8B parameters) through a systematic training approach.
2. 💡 Previous Research and New Ideas: Based on Chain-of-Thought (CoT) prompting and distillation techniques from larger models, proposes a novel multi-stage training recipe specifically designed for small models.
3. ❓ Problem: Addressing the challenge of improving reasoning abilities in small language models, which is typically more difficult than in larger models due to limited model capacity.
4. 🛠️ Methods: Implements a four-stage training process: large-scale mid-training on distilled CoT data, supervised fine-tuning on high-quality CoT data, rollout-based preference learning, and reinforcement learning with verifiable rewards.
5. 📊 Results and Evaluation: The resulting Phi-4-Mini-Reasoning model outperformed larger models, achieving 94.6% on Math-500, surpassing DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points.
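The rollout-based preference learning stage pairs verified-correct rollouts (chosen) with verified-incorrect ones (rejected) for the same question. A minimal sketch of that pairing; the record field names are illustrative assumptions:

```python
# Minimal sketch: turn verified CoT rollouts into DPO preference pairs.
# Each question contributes one (chosen, rejected) pair per combination
# of a correct and an incorrect rollout. Field names are illustrative.
from itertools import product

def build_preference_pairs(rollouts):
    """rollouts: list of {"question", "answer", "correct": bool}.
    Returns list of (question, chosen_answer, rejected_answer)."""
    by_q = {}
    for r in rollouts:
        bucket = by_q.setdefault(r["question"], {"good": [], "bad": []})
        bucket["good" if r["correct"] else "bad"].append(r["answer"])
    pairs = []
    for q, g in by_q.items():
        # Questions lacking either a correct or an incorrect rollout
        # yield no pairs and are effectively filtered out.
        for chosen, rejected in product(g["good"], g["bad"]):
            pairs.append((q, chosen, rejected))
    return pairs
```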

Phi-4-Mini-Reasoning Training Workflow (a multi-stage continual training recipe for SLM math reasoning):
- Base model: Phi-4-Mini (3.8B parameters).
- Synthetic CoT data generation: (1) aggregate datasets from public sources (Bespoke, OpenThoughts, etc.) and in-house seeds; (2) generate CoT with DeepSeek-R1, roughly 8 rollouts per question; (3) verify and filter with a math tool plus GPT-4o-mini re-verification, retaining correct answers for distillation and keeping incorrect answers for DPO.
- Stage 1, distillation mid-training: embed foundational CoT capabilities via a causal LM objective on the large corpus of diverse correct CoT data, using packing mode for efficiency.
- Stage 2, distillation SFT: improve generalization and handle complexity via SFT on a compact, high-quality correct CoT subset, in non-packing mode.
- Stage 3, rollout preference learning: leverage rejected rollouts and align preferences via Direct Preference Optimization (DPO) on (correct, incorrect) pairs from high-quality questions (high-school level and above).
- Stage 4, RL with verifiable reward: improve reasoning via online exploration using a modified GRPO algorithm with a verifiable reward (+1 correct, -1 incorrect); stability enhancements include prompt optimization for uniform lengths, reward rebalancing (oversampling plus filtering), and temperature annealing.
- Final model: Phi-4-Mini-Reasoning, with enhanced math reasoning.
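The verifiable reward in the final RL stage reduces to an answer check scored +1 or -1. A minimal sketch; the `\boxed{...}` extraction is an illustrative assumption, since the paper's verifier combines a math tool with GPT-4o-mini re-verification:

```python
import re

# Minimal sketch of a verifiable reward for RL on math: extract the
# final boxed answer from a rollout and compare it with the reference,
# scoring +1 for correct and -1 otherwise (the paper's scheme).
# The \boxed{...} regex extraction is an illustrative assumption.

def extract_answer(rollout: str):
    """Pull the contents of the last-resort \\boxed{...} span, if any."""
    m = re.search(r"\\boxed\{([^}]*)\}", rollout)
    return m.group(1).strip() if m else None

def verifiable_reward(rollout: str, reference: str) -> int:
    ans = extract_answer(rollout)
    return 1 if ans == reference.strip() else -1
```

Because the reward is computed mechanically from the final answer rather than by a learned reward model, it cannot be gamed by fluent-but-wrong reasoning, which is the point of "verifiable" rewards.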
Q1
1. What is the primary challenge the paper aims to tackle regarding Small Language Models (SLMs) and reasoning?
Their inability to process natural language effectively.
Their limited capacity making it difficult to significantly improve their reasoning abilities.
The high computational cost associated with training SLMs.
Q2
2. Which of the following is NOT a distinct stage in the multi-stage training recipe proposed for Phi-4-Mini-Reasoning?
Rollout DPO leveraging a curated preference dataset.
Reinforcement Learning using a standard entropy reward.
Large-scale mid-training on diverse distilled long-CoT data.
Q3
3. How did Phi-4-Mini-Reasoning's performance on the Math-500 benchmark compare to DeepSeek-R1-Distill-Llama-8B, a larger model?
Phi-4-Mini-Reasoning performed significantly worse.
Phi-4-Mini-Reasoning achieved a slightly lower score.
Phi-4-Mini-Reasoning significantly outperformed it.