2025-05-01 Papers

Paper 1

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Published: 2025-04-30

Link: http://arxiv.org/pdf/2504.21635

1. 📘 Topic and Domain: Arabic text diacritization using small language models in natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous work in Arabic diacritization using rule-based, machine learning, and deep learning approaches; proposes a novel small language model adapted from Kuwain 1.5B for more efficient diacritization.
3. ❓ Problem: Addressing challenges in Arabic text diacritization including data scarcity, writing style differences between Classical and Modern Arabic, contextual dependencies, and benchmark limitations.
4. 🛠️ Methods: Fine-tuned a decoder-only language model (Sadeed) on carefully curated diacritized datasets and introduced a new benchmark (SadeedDiac-25) for comprehensive evaluation.
5. 📊 Results and Evaluation: Sadeed achieved results competitive with proprietary large language models and outperformed traditional models, performing especially well on Classical Arabic texts; the evaluation also exposed limitations in existing benchmarks.
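Diacritization quality is conventionally scored with Diacritic Error Rate (DER) and Word Error Rate (WER), the metrics used in the paper's evaluation. A minimal sketch, assuming gold and predicted texts share the same base characters and differ only in diacritics (real evaluations add settings such as ignoring case endings):

```python
# Minimal sketch of DER (character-level) and WER (word-level) for Arabic
# diacritization. Assumes aligned gold/predicted texts; simplified
# relative to the various evaluation settings reported in the paper.

# Arabic diacritic marks: tanwin (3), fatha, damma, kasra, shadda, sukun
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def split_word(word):
    """Return [(base_char, diacritics_string), ...] for one word."""
    slots = []
    for ch in word:
        if ch in DIACRITICS and slots:
            base, marks = slots[-1]
            slots[-1] = (base, marks + ch)
        else:
            slots.append((ch, ""))
    return slots

def der_wer(gold, pred):
    """DER: fraction of characters with wrong diacritics.
    WER: fraction of words containing at least one such error."""
    g_words, p_words = gold.split(), pred.split()
    char_errs = char_total = word_errs = 0
    for gw, pw in zip(g_words, p_words):
        word_wrong = False
        for (gb, gm), (pb, pm) in zip(split_word(gw), split_word(pw)):
            char_total += 1
            if gm != pm:
                char_errs += 1
                word_wrong = True
        word_errs += word_wrong
    return char_errs / max(char_total, 1), word_errs / max(len(g_words), 1)
```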

Sadeed Model Workflow (Arabic diacritization):
- Base model: Kuwain 1.5B SLM, a pre-trained Arabic model.
- Training data preparation: sources are Tashkeela and ATB-3; preprocessing covers text cleaning and normalization, unifying diacritics and correcting words, handling iltiqa' assakinayn, chunking into 50-60-word segments, filtering for diacritization completeness, and removing overlap with the Fadel test set; the result is a cleaned training dataset.
- Fine-tuning (Sadeed): the task is reformulated as question answering using a template and trained with next-token prediction, yielding a model fine-tuned for diacritization.
- Inference and correction: (1) input non-diacritized text in the QA template; (2) generate diacritized text with Sadeed; (3) correct hallucinations via Needleman-Wunsch alignment.
- Evaluation: WER and DER under various settings, on the Fadel (original and corrected), WikiNews, and SadeedDiac-25 benchmarks.
- SadeedDiac-25 benchmark: goal is fair and comprehensive evaluation; composition is 50% MSA and 50% Classical Arabic (curated web data, WikiNews, Fadel test); curation proceeds from diverse web collection through initial automatic diacritization with an LLM to a two-stage expert review, producing the new benchmark dataset.
- Analysis and contributions: overlap analysis of the Fadel/Abbad splits, analysis of issues in the CATT benchmark, and release of the cleaned dataset and new benchmark.
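The correction step uses Needleman-Wunsch global alignment to flag hallucinated or dropped words in the model output. A minimal sketch of that alignment between input words and generated words; the scoring values and the equality-after-stripping match criterion are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: Needleman-Wunsch alignment of non-diacritized input
# words against diacritized output words. Gap pairs with None expose
# insertions (hallucinated words) and deletions (dropped words).
# match/mismatch/gap scores are illustrative assumptions.

DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def strip_diacritics(word):
    return "".join(ch for ch in word if ch not in DIACRITICS)

def align(src, out, match=1, mismatch=-1, gap=-1):
    """Return list of (src_word_or_None, out_word_or_None) pairs."""
    n, m = len(src), len(out)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # DP score table
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if src[i-1] == strip_diacritics(out[j-1]) else mismatch
            score[i][j] = max(score[i-1][j-1] + s,
                              score[i-1][j] + gap,
                              score[i][j-1] + gap)
    # Traceback from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if src[i-1] == strip_diacritics(out[j-1]) else mismatch):
            pairs.append((src[i-1], out[j-1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((src[i-1], None)); i -= 1   # dropped input word
        else:
            pairs.append((None, out[j-1])); j -= 1   # hallucinated word
    return pairs[::-1]
```

Pairs of the form `(None, word)` mark hallucinated output that can be discarded, while `(word, None)` marks input words the model skipped.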
Q1
1. What are the three main contributions presented in the Sadeed paper to advance the field of Arabic diacritization?
A new rule-based diacritization system, a large-scale unscored Arabic corpus, and a focus on machine translation integration.
A fine-tuned small language model (Sadeed), a new comprehensive benchmark (SadeedDiac-25), and a high-quality cleaned diacritization dataset.
An improved deep learning architecture, an automatic data augmentation technique, and a novel evaluation metric for diacritics.
Q2
2. One of the key issues the Sadeed paper identifies with existing Arabic diacritization benchmarks like CATT is related to:
Their exclusive focus on modern scientific texts, neglecting historical documents.
The complete removal of punctuation marks and the presence of various errors (spelling, grammatical, diacritization).
Their over-reliance on noisy, automatically generated diacritics without expert review.
Q3
3. When evaluated on the novel SadeedDiac-25 benchmark, what was a primary limitation observed for the Sadeed model compared to leading proprietary models?
It failed to diacritize any words correctly in the Modern Standard Arabic sections.
It demonstrated a significantly higher rate of text hallucination, contributing substantially to its Word Error Rate.
It required drastically more computational resources for inference than other models.
Paper 2

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.20734

1. 📘 Topic and Domain: The paper presents UniversalRAG, a framework for retrieval-augmented generation that works across multiple modalities (text, image, video) and granularities of information.
2. 💡 Previous Research and New Ideas: Previous RAG approaches were limited to single modalities or unified embeddings that suffered from modality gaps; this paper proposes modality-aware routing and multi-granular retrieval.
3. ❓ Problem: The paper aims to solve the limitations of existing RAG systems that can't effectively handle queries requiring different types of knowledge sources (text, images, videos) and different levels of detail.
4. 🛠️ Methods: The paper implements a routing mechanism that dynamically selects the most appropriate modality and granularity level for each query, maintaining separate embedding spaces for different modalities and offering both training-free and trained router options.
5. 📊 Results and Evaluation: UniversalRAG outperformed baseline approaches across 8 multimodal benchmarks, with trained routers achieving better performance on in-domain queries while GPT-4o showed stronger generalization to out-of-domain queries.

UniversalRAG Workflow (retrieval-augmented generation over diverse modalities and granularities):
- Input: user query q.
- Router module: predicts the retrieval type r (train-free: GPT-4o; trained: DistilBERT or T5).
- Modality- and granularity-specific corpora: no retrieval (r = None), paragraph (C_paragraph), document (C_document), image (C_image), clip (C_clip), or video (C_video).
- Retrieval: the selected corpus returns context c (or none).
- Generation: a large vision-language model produces the answer, a = LVLM(q, c).
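The routing stage amounts to a dispatch over per-modality corpora: classify the query once, search only the chosen corpus, then generate. A minimal sketch; the keyword heuristic stands in for the paper's train-free (GPT-4o) or trained (DistilBERT/T5) routers, and the retriever and generator are injected stubs:

```python
# Minimal sketch of UniversalRAG-style routing: one retrieval type per
# query, one corpus searched, then answer generation with the context.
# The keyword router and the callable stubs are illustrative assumptions.

RETRIEVAL_TYPES = ["none", "paragraph", "document", "image", "clip", "video"]

def route(query: str) -> str:
    """Toy router: map a query to a retrieval type by keyword cues."""
    q = query.lower()
    if any(w in q for w in ("photo", "image", "look like")):
        return "image"
    if any(w in q for w in ("scene", "clip", "moment")):
        return "clip"
    if any(w in q for w in ("movie", "video", "full match")):
        return "video"
    if any(w in q for w in ("explain", "history", "in detail")):
        return "document"
    if any(w in q for w in ("who", "when", "what year")):
        return "paragraph"
    return "none"  # parametric knowledge suffices; skip retrieval

def universal_rag(query, corpora, retrieve, generate):
    """corpora: dict type -> corpus; retrieve/generate: callables."""
    r = route(query)
    context = retrieve(corpora[r], query) if r != "none" else None
    return generate(query, context)
```

Keeping one embedding space per corpus (rather than a unified space) is what lets each retriever stay modality-specific, which is the paper's answer to the modality gap.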
Q1
1. What is a primary limitation of most existing RAG systems that UniversalRAG is designed to overcome?
They only work with large language models, not smaller ones.
They are typically limited to retrieving information from a single modality-specific corpus.
They struggle with simple, factoid-based questions.
Q2
2. How does UniversalRAG primarily address the issue of the 'modality gap' observed in unified embedding spaces?
By training a stronger multimodal encoder to align all modalities better.
By maintaining separate embedding spaces for each modality and using a router to select the appropriate one.
By retrieving content from all available modalities and then filtering based on relevance.
Q3
3. Beyond handling diverse modalities, UniversalRAG also incorporates awareness of what other data dimension to improve retrieval?
Data source reliability scores.
Temporal relevance of information.
Data granularity (e.g., paragraph vs. document, clip vs. full video).
Paper 3

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Published: 2025-04-29

Link: http://arxiv.org/pdf/2504.21233

1. 📘 Topic and Domain: Exploring how to enhance mathematical reasoning capabilities in small language models (3.8B parameters) through a systematic training approach.
2. 💡 Previous Research and New Ideas: Based on Chain-of-Thought (CoT) prompting and distillation techniques from larger models, proposes a novel multi-stage training recipe specifically designed for small models.
3. ❓ Problem: Addressing the challenge of improving reasoning abilities in small language models, which is typically more difficult than in larger models due to limited model capacity.
4. 🛠️ Methods: Implements a four-stage training process: large-scale mid-training on distilled CoT data, supervised fine-tuning on high-quality CoT data, rollout-based preference learning, and reinforcement learning with verifiable rewards.
5. 📊 Results and Evaluation: The resulting Phi-4-Mini-Reasoning model outperformed larger models, achieving 94.6% on Math-500, surpassing DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points.
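The rollout-based preference learning stage pairs verified-correct rollouts (chosen) with verified-incorrect ones (rejected) for the same question. A minimal sketch of that pairing; the record field names are illustrative assumptions:

```python
# Minimal sketch: turn verified CoT rollouts into DPO preference pairs.
# Each question contributes one (chosen, rejected) pair per combination
# of a correct and an incorrect rollout. Field names are illustrative.
from itertools import product

def build_preference_pairs(rollouts):
    """rollouts: list of {"question", "answer", "correct": bool}.
    Returns list of (question, chosen_answer, rejected_answer)."""
    by_q = {}
    for r in rollouts:
        bucket = by_q.setdefault(r["question"], {"good": [], "bad": []})
        bucket["good" if r["correct"] else "bad"].append(r["answer"])
    pairs = []
    for q, g in by_q.items():
        # Questions lacking either a correct or an incorrect rollout
        # yield no pairs and are effectively filtered out.
        for chosen, rejected in product(g["good"], g["bad"]):
            pairs.append((q, chosen, rejected))
    return pairs
```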

Phi-4-Mini-Reasoning Training Workflow (a multi-stage continual training recipe for SLM math reasoning):
- Base model: Phi-4-Mini (3.8B parameters).
- Synthetic CoT data generation: (1) aggregate datasets from public sources (Bespoke, OpenThoughts, etc.) and in-house seeds; (2) generate CoT with DeepSeek-R1, roughly 8 rollouts per question; (3) verify and filter with a math tool plus GPT-4o-mini re-verification, retaining correct answers for distillation and keeping incorrect answers for DPO.
- Stage 1, distillation mid-training: embed foundational CoT capabilities via a causal LM objective on the large corpus of diverse correct CoT data, using packing mode for efficiency.
- Stage 2, distillation SFT: improve generalization and handle complexity via SFT on a compact, high-quality correct CoT subset, in non-packing mode.
- Stage 3, rollout preference learning: leverage rejected rollouts and align preferences via Direct Preference Optimization (DPO) on (correct, incorrect) pairs from high-quality questions (high-school level and above).
- Stage 4, RL with verifiable reward: improve reasoning via online exploration using a modified GRPO algorithm with a verifiable reward (+1 correct, -1 incorrect); stability enhancements include prompt optimization for uniform lengths, reward rebalancing (oversampling plus filtering), and temperature annealing.
- Final model: Phi-4-Mini-Reasoning, with enhanced math reasoning.
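The verifiable reward in the final RL stage reduces to an answer check scored +1 or -1. A minimal sketch; the `\boxed{...}` extraction is an illustrative assumption, since the paper's verifier combines a math tool with GPT-4o-mini re-verification:

```python
import re

# Minimal sketch of a verifiable reward for RL on math: extract the
# final boxed answer from a rollout and compare it with the reference,
# scoring +1 for correct and -1 otherwise (the paper's scheme).
# The \boxed{...} regex extraction is an illustrative assumption.

def extract_answer(rollout: str):
    """Pull the contents of the last-resort \\boxed{...} span, if any."""
    m = re.search(r"\\boxed\{([^}]*)\}", rollout)
    return m.group(1).strip() if m else None

def verifiable_reward(rollout: str, reference: str) -> int:
    ans = extract_answer(rollout)
    return 1 if ans == reference.strip() else -1
```

Because the reward is computed mechanically from the final answer rather than by a learned reward model, it cannot be gamed by fluent-but-wrong reasoning, which is the point of "verifiable" rewards.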
Q1
1. What is the primary challenge the paper aims to tackle regarding Small Language Models (SLMs) and reasoning?
Their inability to process natural language effectively.
Their limited capacity making it difficult to significantly improve their reasoning abilities.
The high computational cost associated with training SLMs.
Q2
2. Which of the following is NOT a distinct stage in the multi-stage training recipe proposed for Phi-4-Mini-Reasoning?
Rollout DPO leveraging a curated preference dataset.
Reinforcement Learning using a standard entropy reward.
Large-scale mid-training on diverse distilled long-CoT data.
Q3
3. How did Phi-4-Mini-Reasoning's performance on the Math-500 benchmark compare to DeepSeek-R1-Distill-Llama-8B, a larger model?
Phi-4-Mini-Reasoning performed significantly worse.
Phi-4-Mini-Reasoning achieved a slightly lower score.
Phi-4-Mini-Reasoning significantly outperformed it.