2025-04-23 Papers


Paper 1

TTRL: Test-Time Reinforcement Learning

Published: 2025-04-22

Link: http://arxiv.org/pdf/2504.16084

1. 📘 Topic and Domain: The paper explores Test-Time Reinforcement Learning (TTRL) for improving Large Language Models' reasoning capabilities at inference time on unlabeled data.
2. 💡 Previous Research and New Ideas: The paper builds on Test-Time Scaling methods and reinforcement learning for reasoning, proposing a novel approach that enables LLMs to self-evolve through reinforcement learning on unlabeled test data.
3. ❓ Problem: The paper aims to solve the challenge of applying reinforcement learning during inference on unlabeled data without access to ground-truth information for reward estimation.
4. 🛠️ Methods: The authors implement TTRL by using majority voting among multiple model-generated outputs to estimate labels and compute rule-based rewards, which are then used to optimize the model through reinforcement learning.
5. 📊 Results and Evaluation: TTRL achieved significant performance improvements, boosting Qwen-2.5-Math-7B's pass@1 performance by approximately 159% on AIME 2024 and yielding an average gain of 84% across mathematical reasoning benchmarks, while consistently surpassing the Maj@N upper bound of the initial model, i.e., its own training signal.


TTRL Workflow: Test-Time Reinforcement Learning (figure summary)
- Input: unlabeled test data (e.g., prompt x) is fed to the LLM policy πθ.
- Generation: sample N candidate outputs {y1, y2, ..., yN} via repeated sampling.
- Reward estimation (no ground truth): extract answers {ŷ1, ..., ŷN}, take the most frequent answer by majority voting as the pseudo-label y*, and assign rule-based rewards R(ŷi, y*) = 1 if ŷi == y*, else 0.
- Reinforcement learning update: use R(ŷi, y*) to update πθ (e.g., with GRPO or PPO), yielding an updated policy πθ' with improved performance.
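The reward-estimation step described above is simple enough to sketch directly. The helper below is an illustrative reconstruction, not the authors' code; it assumes final answers have already been extracted from the N sampled outputs as strings, and it omits the RL update itself.

```python
from collections import Counter

def estimate_rewards(answers):
    """Majority-vote pseudo-labeling: the most frequent extracted answer
    becomes the pseudo-label y*, and each sampled answer receives a
    binary rule-based reward (1 if it matches y*, else 0)."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1 if a == pseudo_label else 0 for a in answers]
    return pseudo_label, rewards

# Hypothetical example: 8 answers extracted from N candidate outputs
answers = ["42", "42", "17", "42", "42", "7", "42", "17"]
y_star, rewards = estimate_rewards(answers)
# y_star == "42"; rewards == [1, 1, 0, 1, 1, 0, 1, 0]
```

These binary rewards would then feed a standard policy-gradient update (GRPO or PPO in the paper) in place of ground-truth-based rewards.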
Q1. What is the core challenge that Test-Time Reinforcement Learning (TTRL) aims to solve?
- Optimizing computational resources during model pre-training
- Estimating rewards during inference without access to ground-truth labels
- Generating longer chain-of-thought reasoning sequences
Q2. What surprising phenomenon did researchers observe when applying TTRL?
- TTRL models required less computational resources than traditional methods
- TTRL models could surpass their own training signal and exceed the Maj@N upper bound
- TTRL only worked on small models but failed on larger architectures
Q3. When is TTRL most likely to fail according to the paper?
- When applied to small datasets with fewer than 100 examples
- When the model lacks sufficient prior knowledge on the target task
- When the model generates outputs that are too consistent

Paper 2

Kuwain 1.5B: An Arabic SLM via Language Injection

Published: 2025-04-21

Link: http://arxiv.org/pdf/2504.15120

1. 📘 Topic and Domain: The paper introduces Kuwain 1.5B, a small language model that enhances Arabic language capabilities through a novel injection method into an existing English-centric language model.
2. 💡 Previous Research and New Ideas: The paper builds on previous work in multilingual language model adaptation but proposes a more efficient approach by injecting a new language through selective layer extension and vocabulary expansion rather than complete retraining.
3. ❓ Problem: The paper addresses how to effectively expand a monolingual language model to support a new language (Arabic) while preserving its original language (English) capabilities without expensive retraining from scratch.
4. 🛠️ Methods: The authors extended TinyLlama 1.1B by adding 8 new trainable layers, expanding its vocabulary with 26K Arabic tokens while keeping original layers frozen, and training on 90 billion Arabic tokens and 20 billion English tokens.
5. 📊 Results and Evaluation: The approach improved Arabic performance by an average of 8% across benchmarks while preserving, and slightly improving by 1%, English performance, achieving results competitive with much larger models and reducing training costs by roughly 70%.


Kuwain 1.5B Methodology: Arabic Language Injection (figure summary)
- Problem: English-centric LLMs, high adaptation cost, catastrophic forgetting.
- Inputs: base model TinyLlama 1.1B (English-centric SLM); training data of 90B Arabic tokens and 20B English tokens (20% English ratio).
1. Data preparation: cleaning and filtering (Arabic + English).
2. Vocabulary expansion: train a new Arabic SentencePiece tokenizer (26K tokens) and extend the TinyLlama tokenizer to 54K tokens total (optimized expansion ratio).
3. Model layer extension: insert new identity blocks (layers) into TinyLlama; the optimal configuration is 8 distributed layers (~30% size increase), inspired by Llama-Pro and adapted.
4. Selective continual pre-training: freeze the original TinyLlama layers; train only the 8 new layers and the expanded embeddings.
- Output: Kuwain 1.5B (Arabic-enhanced SLM).
- Evaluation and key findings: vs. base TinyLlama, Arabic improves (+8% avg) while English is preserved or slightly improved (+1% avg); vs. Kuwain-Naive (no layer extension), Kuwain avoids catastrophic forgetting of English; 20% English data is sufficient; competitive Arabic-leaderboard performance despite its small size; training cost reduced by ~70%.
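The layer-extension and freezing scheme above can be sketched schematically. This is a minimal illustration, not the authors' implementation: it assumes TinyLlama's 22 transformer layers, an even Llama-Pro-style spacing of the 8 inserted identity blocks, and hypothetical function and layer names.

```python
def insertion_points(n_original_layers, n_new_layers):
    """Evenly distribute new identity blocks across the original stack
    (Llama-Pro-style): return the original-layer indices after which
    each new layer is inserted."""
    stride = n_original_layers / n_new_layers
    return [round(stride * (i + 1)) for i in range(n_new_layers)]

def build_extended_stack(n_original_layers=22, n_new_layers=8):
    """Interleave frozen original layers with trainable inserted layers.
    Returns a list of (layer_name, trainable) tuples."""
    points = set(insertion_points(n_original_layers, n_new_layers))
    stack = []
    for i in range(1, n_original_layers + 1):
        stack.append((f"orig_{i}", False))          # original layer: frozen
        if i in points:
            stack.append((f"new_after_{i}", True))  # identity block: trainable
    return stack

stack = build_extended_stack()
# 22 frozen original layers + 8 trainable inserted layers = 30 total
```

In actual continual pre-training, only the parameters flagged trainable here (plus the expanded embedding rows) would receive gradients, which is what lets the frozen English backbone avoid catastrophic forgetting.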
Q1. What is the key innovation in Kuwain's approach to language injection?
- Training the entire model from scratch with both Arabic and English data
- Adding new trainable layers while keeping original layers frozen and expanding vocabulary
- Translating English training data into Arabic for better linguistic alignment
Q2. What was the optimal proportion of English data needed during training to maintain the model's original capabilities?
- 50% English data
- 20% English data
- 5% English data
Q3. What happened when the authors tried to stack multiple new layers consecutively instead of distributing them throughout the model?
- It improved Arabic performance but degraded English capabilities
- It led to unstable training and degraded overall performance
- It reduced training time but required more GPU memory

Paper 3

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

Published: 2025-04-21

Link: http://arxiv.org/pdf/2504.15521

1. 📘 Topic and Domain: The paper analyzes multilingual benchmarks used to evaluate natural language processing systems and language models.
2. 💡 Previous Research and New Ideas: The paper builds on previous multilingual evaluation frameworks and proposes principles for effective multilingual benchmarking, emphasizing the need for culturally authentic evaluation resources rather than mere translations.
3. ❓ Problem: The paper addresses the significant disparities in how language models perform across different languages and the limitations of current multilingual evaluation practices.
4. 🛠️ Methods: Comprehensive analysis of over 2,000 multilingual benchmarks from 148 countries published between 2021-2024, examining language distribution, task types, translation methods, and correlation with human judgments.
5. 📊 Results and Evaluation: Found that English remains overrepresented despite efforts to promote diversity; STEM-related tasks show stronger correlation with human judgments (0.70-0.85) than traditional NLP tasks (0.11-0.30); and localized benchmarks align better with human judgments (0.68) than translated ones (0.47).


Workflow: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks (figure summary)
- Data collection & preparation (Sec 3): scope limited to labeled datasets (x -> y), excluding training and unlabeled sets; ~370K arXiv cs.CL papers (2021-2024) collected via the arXiv API, filtered by LLM screening of abstracts plus manual expert review, then annotated by 3 experts using the scheme in Table 1, yielding 2,024 papers.
- PAST, analysis of collected benchmarks (Sec 4): English is overrepresented and high-resource languages far outnumber low-resource ones (Fig 2); 61.4% of benchmarks are in the original language and 13.2% human-translated, with machine-translation sources varying (Google, GPT, ...) (Fig 3); 66.5% of tasks are discriminative and 23.5% generative, with text classification dominant and QA growing (Fig 4a); dataset sizes show a growing trend, with estimated creation cost exceeding $11M (Fig 4b); public sources (news, social media) dominate domains while specialized ones are underrepresented (Fig 5); five countries lead (CN, IN, DE, UK, US), mostly via academic institutions (Fig 6).
- PRESENT, current status (Sec 5): (5.1) multilingual user interests, drawn from Chatbot Arena / WildChat (6 languages, 10K instructions each) and categorized with Qwen2.5-Max, are similar across languages, with writing dominant and commonsense and programming also high (Fig 7; a potential research-context bias is noted); (5.2) evaluating 30 LLMs on 8 benchmarks and comparing their rankings against Chatbot Arena Elo rankings via Spearman's ρ shows that STEM tasks (ARC, MGSM) correlate better, that translation quality matters but translation alone is insufficient, and that localized benchmarks such as CMMLU are crucial (Table 2).
- FUTURE, needs & directions (Sec 6, 7): effective multilingual benchmarks should be accurate and contamination-free, sufficiently challenging, practically relevant, linguistically diverse, and culturally authentic (6.1); critical research directions include addressing the NLG imbalance, improving low-resource-language representation, creating localized benchmarks, leveraging LLM-as-a-judge carefully, and developing efficient benchmarking (6.2); the call to action (Sec 7) urges global collaboration for inclusive benchmarking, human-aligned evaluation, and application-oriented benchmarking.
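The benchmark-vs-human-judgment analysis rests on Spearman's rank correlation between benchmark rankings and Elo rankings. A self-contained sketch of that statistic (assuming no tied scores, and using hypothetical example data rather than the paper's):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (no ties):
    rank both lists, then apply rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical benchmark scores vs. Arena Elo ratings for 5 models
benchmark_scores = [78.2, 65.1, 82.4, 50.3, 71.0]
elo_ratings = [1210, 1105, 1250, 1020, 1150]
rho = spearman_rho(benchmark_scores, elo_ratings)
# rho == 1.0 here, since both lists induce identical rankings
```

A ρ near 1 means the benchmark orders models the same way human preferences do; the paper's reported ranges (0.70-0.85 for STEM tasks vs. 0.11-0.30 for traditional NLP tasks) are values of this statistic.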
Q1. According to the paper, which type of tasks showed the strongest correlation with human judgments across languages?
- Traditional NLP tasks like question answering
- STEM-related tasks like mathematics and science reasoning
- Translation and cultural knowledge tasks
Q2. What did the paper identify as the 'bitter lesson' regarding multilingual benchmarks?
- Despite significant investments, English remains overrepresented in benchmarks
- Translated benchmarks are just as effective as localized ones
- Users from different countries have completely different interests when using LLMs
Q3. Which benchmark showed significantly higher correlation with Chinese human judgments compared to translated benchmarks?
- XNLI
- GlobalMMLU
- CMMLU