2025-11-14 Papers


Paper 1

Black-Box On-Policy Distillation of Large Language Models

Published: 2025-11-13

Link: http://arxiv.org/pdf/2511.10643

1. 📘 Topic and Domain: The paper discusses black-box knowledge distillation of Large Language Models (LLMs), focusing on training smaller student models using only text outputs from teacher models without access to their internal parameters.
2. 💡 Previous Research and New Ideas: Building upon previous work in knowledge distillation and generative adversarial networks, the paper introduces a novel Generative Adversarial Distillation (GAD) framework that enables on-policy learning in black-box settings.
3. ❓ Problem: The paper addresses the challenge of effectively distilling knowledge from proprietary LLMs when only their text outputs are available, without access to internal logits or parameters.
4. 🛠️ Methods: The authors implement GAD by framing the student model as a generator and training a discriminator to distinguish between teacher and student responses in a minimax game, using reinforcement learning techniques.
5. 📊 Results and Evaluation: GAD consistently outperforms sequence-level knowledge distillation (SeqKD) across multiple datasets; a GAD-trained Qwen2.5-14B-Instruct student achieves performance comparable to its GPT-5-Chat teacher on the LMSYS-Chat evaluation, validated by both automatic and human evaluations.
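Per the paper's overview, the discriminator is fit with a Bradley-Terry pairwise preference loss over (teacher response, student response) pairs. A minimal plain-Python sketch of that objective (function names and scores are illustrative, not from the paper's code):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(d_teacher: float, d_student: float) -> float:
    """Discriminator objective: prefer teacher responses over student ones.

    d_teacher / d_student are the scalar scores the discriminator assigns
    to a (prompt, response) pair; minimising -log sigma(d_t - d_s) pushes
    the teacher's score above the student's.
    """
    return -math.log(sigmoid(d_teacher - d_student))

# A discriminator that already ranks the teacher higher pays less loss.
confident = bradley_terry_loss(2.0, -1.0)   # teacher clearly preferred
undecided = bradley_terry_loss(0.0, 0.0)    # equal scores: loss = log 2
```

An undecided discriminator pays log 2 ≈ 0.693; training pushes teacher scores above student scores, which is exactly the signal the student later exploits as a reward.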


GAD workflow (reconstructed from the paper's overview figure):
- Data preparation: prompts from LMSYS-Chat-1M paired with GPT-5 responses.
- Model initialization: student generator G and discriminator D.
- Warmup stage (1 epoch): the generator is trained with a cross-entropy loss; the discriminator is trained with a Bradley-Terry loss to predict pairwise preference scores.
- GAD training loop (2 epochs): a minimax game max_G min_D V(G, D); the generator is updated with policy gradients (GRPO), using the discriminator score σ(D(y_t) - D(G(x))) as an on-policy reward.
- Evaluation: GPT-4o scoring and human assessment.
- Key innovations: black-box access, on-policy learning, an adaptive discriminator acting as a reward model, mode-seeking behavior, reward stability.
- Headline result: Qwen2.5-14B-Instruct ≈ GPT-5-Chat performance.
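The paper's overview names GRPO as the policy-gradient algorithm that turns discriminator scores into generator updates. A sketch of GRPO's group-relative reward normalization, with discriminator scores standing in as rewards (the numbers are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO: several student
    responses to the same prompt are scored by the discriminator, and each
    score is normalised against its own group, so the discriminator's
    absolute scale cancels out."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std if std > 0 else 1.0) for r in rewards]

# Discriminator scores for four sampled responses to one prompt:
advantages = grpo_advantages([0.9, 0.1, 0.5, 0.5])
```

Responses the discriminator rates above their group's mean get positive advantage and are reinforced; the normalization keeps the reward scale stable even as the discriminator itself keeps adapting.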
Q1. What is the main advantage of GAD over traditional black-box distillation methods?
a) It requires less computational resources
b) It enables on-policy learning through adversarial feedback
c) It can work with any size of language models

Q2. In the GAD framework, what role does the discriminator play?
a) It generates new training data
b) It evaluates the grammar of responses
c) It acts as an adaptive reward model that provides feedback to the student

Q3. What surprising result did the experiments show about model performance?
a) Qwen2.5-3B with GAD matched Qwen2.5-7B with SeqKD
b) GAD performed worse than baseline methods
c) The student models completely failed to learn

Paper 2

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Published: 2025-11-09

Link: http://arxiv.org/pdf/2511.08633

1. 📘 Topic and Domain: Training-free motion-controlled video generation using dual-clock denoising in the domain of computer vision and AI-generated video.
2. 💡 Previous Research and New Ideas: Builds on SDEdit's use of coarse layout cues for image editing, extends the idea to video, and introduces a novel dual-clock denoising process that lets different regions denoise at different rates.
3. ❓ Problem: Existing video generation methods lack precise motion control and require expensive model-specific fine-tuning.
4. 🛠️ Methods: Uses crude reference animations as motion guides, employs image conditioning to preserve appearance, and introduces dual-clock denoising that applies different noise schedules to motion-specified regions versus background.
5. 📊 Results and Evaluation: Outperformed existing training-based baselines on object and camera motion benchmarks, achieving better motion control and visual quality while being training-free and compatible with multiple video diffusion models.
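The SDEdit adaptation in the methods (starting denoising from the noised warped video rather than from pure noise) can be sketched as follows, assuming a standard DDPM-style variance-preserving schedule; flattened pixel lists stand in for tensors:

```python
import math
import random

def sdedit_start(v_warp, alpha_bar, seed=0):
    """SDEdit-style initialisation: denoising begins from the crude warped
    video V^w noised to an intermediate timestep, so the sample stays close
    to the user's motion layout. alpha_bar is the cumulative noise-schedule
    coefficient at that timestep (an assumption of a standard DDPM-style
    variance-preserving schedule)."""
    rng = random.Random(seed)
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in v_warp]
```

Choosing the injection timestep trades fidelity to the crude animation (high alpha_bar) against the model's freedom to fix its artifacts (low alpha_bar); the dual-clock scheme applies different choices per region.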


Time-to-Move pipeline (reconstructed from the paper's overview figure):
- Input: an image I ∈ R^(3×H×W) plus a user-supplied motion control (trajectory).
- Motion signal generation: cut-and-drag or depth warping produces a crude warped video V^w.
- SDEdit adaptation: noise injection x_t* ~ q(x_t* | V^w) starts denoising from the warped video rather than from pure noise.
- Dual-clock denoising: a strong-alignment timestep t_strong inside the motion mask and a weak-alignment timestep t_weak for the background, blended as x_{t-1} ← (1 - M) ⊙ x̂_{t-1} + M ⊙ x^w_{t-1}.
- I2V diffusion model: image conditioning p_θ(x_0 | x_t*, I) preserves appearance.
- Output: video with realistic motion and preserved identity.
- Properties: training-free (no model retraining, plug-and-play), region-dependent spatially varying control, dual timesteps, joint motion-and-appearance control via pixel-level conditioning, architecture-agnostic (compatible with SVD, CogVideoX, WAN2.2).
- Applications: object motion (local control), camera motion (global control), appearance editing (style control).
- Summary: crude animation → motion injection → region-dependent denoising → realistic video.
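The dual-clock blend x_{t-1} ← (1 - M) ⊙ x̂_{t-1} + M ⊙ x^w_{t-1} reduces to a per-pixel interpolation; a minimal sketch with flattened pixel lists standing in for tensors:

```python
def dual_clock_step(x_hat, x_warp, mask):
    """One blend of the dual-clock update: pixels inside the motion mask M
    track the warped-video branch (strong alignment), while the background
    keeps the model's own denoised prediction (weak alignment).
    x_hat, x_warp, and mask are flattened per-pixel lists of equal length;
    mask entries lie in [0, 1]."""
    return [(1.0 - m) * xh + m * xw for xh, xw, m in zip(x_hat, x_warp, mask)]

# Background pixel (m=0) keeps the model prediction; masked pixel (m=1)
# follows the warped video.
blended = dual_clock_step(x_hat=[0.2, 0.2], x_warp=[0.8, 0.8], mask=[0.0, 1.0])
```

Because the blend is purely pointwise, it slots into any diffusion sampler's denoising loop without retraining, which is what makes the method architecture-agnostic.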
Q1. What is the key innovation of the Time-to-Move framework compared to previous approaches?
a) It requires extensive model training
b) It uses dual-clock denoising with different noise schedules for different regions
c) It only works with a single specific video diffusion model

Q2. Why does the paper use crude reference animations as motion guides?
a) They are easier to produce and flexible for user intent
b) They provide perfect visual quality
c) They require specialized hardware to generate

Q3. What unique capability does Time-to-Move introduce that goes beyond existing methods?
a) Faster video generation speed
b) Smaller model size requirements
c) Joint control of both motion and appearance through pixel-level conditioning

Paper 3

Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

Published: 2025-11-10

Link: http://arxiv.org/pdf/2511.07003

1. 📘 Topic and Domain: Large-scale multilingual machine translation centered on both Chinese and English, covering 60 languages and 234 translation directions.
2. 💡 Previous Research and New Ideas: Based on previous LLM-based translation research but addresses English-centric bias by introducing Chinese as a second pivot language, while proposing Strategic Downsampling and Parallel Multilingual Prompting.
3. ❓ Problem: Addressing the challenges of broad language coverage, consistent translation quality, and English-centric bias in multilingual machine translation systems.
4. 🛠️ Methods: Used a two-stage adaptation framework combining Continued Pre-training (CPT) and Supervised Fine-tuning (SFT), with Strategic Downsampling to prevent directional degeneration and Parallel Multilingual Prompting to enhance cross-lingual transfer.
5. 📊 Results and Evaluation: The 4B model (LMT-60-4B) achieved state-of-the-art performance among comparable models, surpassing larger models like Aya-101-13B and NLLB-54B, with consistent performance across high, medium, and low-resource languages.
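Strategic Downsampling amounts to a filtered pass over the training set; a minimal sketch, assuming an illustrative record format with direction tags (the 5% keep ratio is the paper's):

```python
import random

def strategic_downsample(examples, keep_ratio=0.05, seed=0):
    """Strategic Downsampling: retain only a small fraction (5% in the
    paper) of X->En and X->Zh training pairs, so symmetric multi-way data
    does not degenerate into excessive many-to-one mappings toward the
    pivot languages. The record format with a "direction" tag is an
    illustrative assumption."""
    rng = random.Random(seed)
    kept = []
    for example in examples:
        if example["direction"] in ("x-en", "x-zh") and rng.random() > keep_ratio:
            continue
        kept.append(example)
    return kept

# All En->X pairs survive; X->En pairs are thinned to roughly 5%.
data = [{"direction": "en-x"}] * 100 + [{"direction": "x-en"}] * 100
kept = strategic_downsample(data)
```

The asymmetry is deliberate: En/Zh→X directions keep full coverage, while the many-to-one X→En/Zh directions are starved just enough to prevent the degeneration the ablations quantify.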


LMT framework (reconstructed from the paper's overview figure):
- Data curation pipeline:
  - Monolingual: SlimPajama (EN), Skywork (ZH), CulturaX (other languages).
  - Bilingual: OPUS corpus plus pseudo-synthesis (2.1B EN-X, 2.9B ZH-X).
  - Quality control: OpusFilter and CometKiwi, multi-dimensional filtering.
  - SFT dataset: Flores-200, NTREX, SMol (596K pairs).
  - Final corpus: 60 languages, 234 directions, 90B tokens.
- Two-stage training pipeline: Qwen3-Base (0.6B/1.7B/4B/8B) → Continued Pre-training (CPT) on 90B tokens (1:1:1 ratio, informative formatting) → Supervised Fine-tuning (SFT) with Strategic Downsampling and PMP integration → final LMT model.
- Strategic Downsampling: symmetric multi-way data causes excessive many-to-one mappings and degrades X→En/Zh performance; the fix is to retain only 5% of X→En/Zh data.
- Parallel Multilingual Prompting (PMP): auxiliary-language context enhances cross-lingual transfer; En↔X uses a typologically similar auxiliary language, Zh↔X uses English as pivot; training mixes 50% STP with 50% PMP.
- Performance: state of the art among models with similar coverage; LMT-60-4B surpasses Aya-101-13B and NLLB-54B, with strong parameter efficiency and robustness across all directions.
- Ablations: SD adds +11.45 (X→Zh) and +5.83 (X→En); CPT adds +3.80 to +8.23 across all directions; PMP gives a steady boost throughout. The three components are synergistic and all essential for robust MMT adaptation.
- Coverage and evaluation: 60 languages (13 high-, 18 medium-, 29 low-resource) in a Chinese-English-centric design; FLORES-200 devtest with the COMET-22 metric; includes regional Chinese languages (Uyghur, Tibetan, Mongolian, Cantonese).
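Parallel Multilingual Prompting prepends a parallel sentence in an auxiliary language as semantic guidance; a minimal sketch of such a prompt builder (the template wording is an assumption, not the paper's exact prompt):

```python
def pmp_prompt(src_lang, src_text, tgt_lang, aux_lang, aux_text):
    """Parallel Multilingual Prompting: a parallel sentence in an auxiliary
    language is prepended as semantic guidance before the translation
    request. Per the paper, En<->X uses a typologically similar auxiliary
    language, while Zh<->X uses English as pivot. The template wording here
    is illustrative."""
    return (
        f"{aux_lang}: {aux_text}\n"
        f"{src_lang}: {src_text}\n"
        f"Translate the {src_lang} sentence into {tgt_lang}.\n"
        f"{tgt_lang}:"
    )

# English -> German, with Dutch as a typologically similar auxiliary:
prompt = pmp_prompt("English", "Good morning", "German", "Dutch", "Goedemorgen")
```

Mixing these prompts with standard translation prompts at training time (the 50/50 ratio above) lets the model exploit the auxiliary context when present without depending on it at inference.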
Q1. What key phenomenon did the researchers identify and address in multilingual fine-tuning?
a) Language drift in low-resource pairs
b) Directional degeneration in X→En/Zh translations
c) Vocabulary interference between languages

Q2. What is unique about the LMT model's approach compared to previous multilingual translation models?
a) It only focuses on English-centric translation
b) It uses Chinese as the sole pivot language
c) It centers on both Chinese and English as pivot languages

Q3. How does Parallel Multilingual Prompting (PMP) enhance translation quality?
a) By using larger training datasets
b) By adding auxiliary language context as semantic guidance
c) By increasing model parameters