1. 📘 Topic and Domain: The paper introduces AraLingBench, a human-annotated benchmark for evaluating Arabic language models' linguistic capabilities across grammar, morphology, spelling, reading comprehension, and syntax.
2. 💡 Previous Research and New Ideas: Previous work relied on knowledge-oriented benchmarks such as BALSAM and CamelEval; this paper proposes the first benchmark specifically targeting core linguistic competence rather than factual recall.
3. ❓ Problem: The paper addresses the lack of systematic evaluation methods for assessing true linguistic understanding in Arabic language models, as existing benchmarks focus mainly on knowledge and reasoning tasks.
4. 🛠️ Methods: The authors created 150 expert-designed multiple-choice questions across five linguistic categories, applied rigorous quality control through expert validation and difficulty annotation, and then evaluated 35 Arabic and bilingual LLMs (a sketch of such an evaluation loop appears after this list).
5. 📊 Results and Evaluation: The evaluation revealed that current models show strong surface-level proficiency (top models reach 74% accuracy) but struggle with deeper grammatical and syntactic reasoning, with performance varying substantially across linguistic categories and difficulty levels (the grouped-accuracy breakdown is sketched below).
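A minimal sketch of the kind of multiple-choice evaluation loop such a benchmark implies. The paper does not publish its harness, so the item schema, the `query_model` callable, and the answer-extraction rule here are all illustrative assumptions, not the authors' implementation:

```python
import re

def evaluate_mcq(items, query_model):
    """Score a model on multiple-choice items.

    `items`: list of dicts with keys question, choices (list of strings),
    answer (correct letter), category, difficulty -- an assumed schema.
    `query_model`: maps a prompt string to the model's raw text reply.
    """
    results = []
    for item in items:
        letters = "ABCD"[: len(item["choices"])]
        # Render the choices as lettered options under the question.
        options = "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip(letters, item["choices"])
        )
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with the letter of the correct choice."
        )
        reply = query_model(prompt)
        # Take the first standalone choice letter in the reply.
        match = re.search(rf"\b([{letters}])\b", reply)
        predicted = match.group(1) if match else None
        results.append({
            "category": item["category"],
            "difficulty": item["difficulty"],
            "correct": predicted == item["answer"],
        })
    return results
```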
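Given per-item results like those above, the reported per-category and per-difficulty breakdown reduces to grouped accuracy. A sketch, assuming the `category` and `difficulty` field names from the harness above:

```python
from collections import defaultdict

def accuracy_by(results, key):
    """Mean accuracy of boolean `correct` flags, grouped by `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += r["correct"]  # bool counts as 0/1
    return {group: hits[group] / totals[group] for group in totals}

# e.g. accuracy_by(results, "category")  -> per-category accuracy
#      accuracy_by(results, "difficulty") -> per-difficulty accuracy
```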