2026-01-27 Papers


Paper 1

Qwen3-TTS Technical Report

Published: 2026-01-21

Link: http://arxiv.org/pdf/2601.15621

1. 📘 Topic and Domain: The paper presents Qwen3-TTS, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models in the speech synthesis domain.
2. 💡 Previous Research and New Ideas: The paper builds on discrete speech tokenization and autoregressive language modeling for TTS, proposing a novel dual-track LM architecture with two new speech tokenizers (a 25 Hz semantic-focused single-codebook tokenizer and a 12.5 Hz ultra-low-latency multi-codebook tokenizer) for real-time synthesis.
3. ❓ Problem: The paper aims to solve the challenges of achieving stable, controllable, and human-like speech synthesis with low latency while supporting multiple languages, voice cloning, and fine-grained control through natural language instructions.
4. 🛠️ Methods: The authors use a dual-track autoregressive architecture with two custom tokenizers, train on over 5 million hours of speech data across 10 languages, employ a three-stage pre-training process followed by post-training with DPO and GSPO, and implement streaming capabilities through block-wise attention mechanisms.
5. 📊 Results and Evaluation: Qwen3-TTS achieves state-of-the-art performance in zero-shot voice cloning (lowest WER on Seed-TTS benchmark), superior speaker similarity across all 10 evaluated languages compared to commercial baselines, exceptional cross-lingual synthesis (66% error reduction in Chinese-to-Korean), and can generate over 10 minutes of natural speech with first-packet latency as low as 97ms.
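The latency claims above can be sanity-checked with simple arithmetic. The sketch below is an illustrative back-of-envelope model, not the report's measurement methodology: the chunk size, per-token LM time, and decoder overhead are all assumed numbers. The point it illustrates is that a lower tokenizer frame rate means fewer tokens must be generated before the first audio chunk can be decoded.

```python
import math

def first_packet_latency_ms(chunk_audio_s: float, frame_rate_hz: float,
                            lm_ms_per_token: float, decoder_ms: float) -> float:
    """Time to generate enough speech tokens to cover the first audio chunk,
    plus a fixed decoder cost. All inputs are illustrative assumptions."""
    n_tokens = math.ceil(chunk_audio_s * frame_rate_hz)
    return n_tokens * lm_ms_per_token + decoder_ms

# With a hypothetical 0.5 s first chunk, 10 ms/token LM, and 20 ms decoder,
# the 12.5 Hz tokenizer needs 7 tokens vs. 13 tokens at 25 Hz:
low = first_packet_latency_ms(0.5, 12.5, 10.0, 20.0)   # 90.0 ms under these assumptions
high = first_packet_latency_ms(0.5, 25.0, 10.0, 20.0)  # 150.0 ms under these assumptions
```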


Qwen3-TTS Workflow (figure)

Speech Tokenizers
• Qwen-TTS-Tokenizer-25Hz: 25 Hz single-codebook, semantic-acoustic balanced, Flow Matching decoder, block-wise streaming, 150 ms first-packet
• Qwen-TTS-Tokenizer-12Hz: 12.5 Hz multi-codebook, 16-layer RVQ design, causal ConvNet decoder, ultra-low latency, 97 ms first-packet

Training Pipeline
• Pre-training: S1 General Stage (5M+ hours multilingual data); S2 High-Quality Stage; S3 Long-Context Stage (up to 32,768 tokens)
• Post-training: DPO alignment, GSPO optimization, speaker fine-tuning, instruction following

Dual-Track Architecture
• Text + speech tokens, channel concatenation, real-time synthesis, MTP module (12Hz)

Data Processing
• ChatML format, 10+ languages, quality stratification, speaker embeddings

Key Features
• Voice Cloning: 3-second cloning, in-context learning
• Voice Design: natural-language description, thinking pattern
• Multilingual: 10+ languages, cross-lingual synthesis
• Streaming: low latency, real-time output
• Controllable: fine-grained control, instruction following

Model Family
• Base Models: 0.6B & 1.7B variants, 12Hz & 25Hz versions
• CustomVoice: speaker fine-tuned, high naturalness
• VoiceDesign: text-based creation, novel voice synthesis
• VoiceEditing: attribute manipulation, style control
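"Block-wise streaming" typically refers to a block-causal attention mask: each position attends within its own block and to all earlier blocks, so audio can be decoded and emitted block by block. A minimal sketch of that construction (the block size is arbitrary here, and the report's exact masking scheme may differ):

```python
import numpy as np

def block_causal_mask(seq_len: int, block: int) -> np.ndarray:
    """True where attention is allowed: the query's block index must be
    greater than or equal to the key's block index."""
    blocks = np.arange(seq_len) // block
    return blocks[:, None] >= blocks[None, :]

# With seq_len=4 and block=2, positions 0-1 see only block 0,
# while positions 2-3 see both blocks:
mask = block_causal_mask(4, 2)
```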
Q1. What unique approach does Qwen3-TTS use to achieve ultra-low first-packet latency of 97 ms?
• A dual-track architecture with a 12.5 Hz multi-codebook tokenizer and a lightweight causal ConvNet
• A single high-frequency 50 Hz tokenizer with parallel processing
• A chunk-based diffusion model with pre-computed speaker embeddings
Q2. In cross-lingual speech synthesis, what remarkable achievement did Qwen3-TTS demonstrate compared to CosyVoice3?
• 50% reduction in computational requirements for real-time synthesis
• 66% error-rate reduction in Chinese-to-Korean voice cloning (4.82 vs. 14.4)
• Perfect accent preservation across all 10 supported languages
Q3. What training innovation enables Qwen3-TTS to follow complex natural language instructions for voice design?
• A probabilistically activated "thinking pattern" during training
• Reinforcement learning from human feedback on voice quality
• Pre-training exclusively on professional voice actor recordings

Paper 2

DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal

Published: 2026-01-25

Link: http://arxiv.org/pdf/2601.18081

1. 📘 Topic and Domain: The paper focuses on automated academic rebuttal generation in the domain of natural language processing and AI-assisted peer review systems.
2. 💡 Previous Research and New Ideas: The paper builds on existing LLM-based rebuttal approaches and debate/persuasion techniques, proposing a novel four-stage agentic framework (DRPG) that explicitly plans rebuttal strategies before generation.
3. ❓ Problem: The paper aims to solve the challenge of generating high-quality academic rebuttals automatically, addressing LLMs' limitations with long-context understanding and their tendency to produce generic, unconvincing responses.
4. 🛠️ Methods: The authors use a four-component pipeline (Decompose, Retrieve, Plan, Generate) with a trained Planner module that selects optimal rebuttal perspectives based on paper content support scores.
5. 📊 Results and Evaluation: DRPG achieves 40 points higher Elo score than existing pipelines and surpasses average human performance using only an 8B model, with the Planner achieving 98% accuracy in perspective selection, evaluated through LLM-based pairwise comparison and judge scoring.
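The Plan step can be pictured as scoring each candidate perspective against the retrieved paper paragraphs and abstaining when no score clears a confidence threshold. The sketch below is a toy approximation: the paper trains a Planner module as a learned scorer, whereas here cosine similarity over placeholder embedding vectors stands in for it; the 0.8 default mirrors the confidence threshold discussed in the experiments.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_perspective(perspective_vecs, paragraph_vecs, threshold=0.8):
    """Return the index of the perspective best supported by any retrieved
    paragraph, or None if no support score clears the threshold."""
    scores = [max(cosine(p, para) for para in paragraph_vecs)
              for p in perspective_vecs]
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

At inference, a None result would mean the Planner found no perspective the paper actually supports for that review point, so the downstream Executor should not assert one.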


DRPG Framework Workflow (figure)
• Inputs: the review and the full-length paper
• Decomposer: divides the review into atomic points (e.g., "motivation not clear", "missing reference", "data size too small")
• Retriever: encodes review points and paper paragraphs, selecting relevant paragraphs by similarity score (e.g., 0.88, 0.81, 0.77, 0.71, 0.66, 0.65, 0.48, 0.40, 0.38)
• Planner: scores candidate perspectives (two LLM-generated perspectives vs. the author's actual perspective, e.g., 0.25 / 0.51 / 0.17) via a support-score calculation used at both training and inference, selecting the perspective most supported by the paper
• Executor: generates a coherent and persuasive rebuttal response (the rebuttal)
Q1. What is the key innovation that distinguishes DRPG's Planner from the Jiu-Jitsu baseline's planning approach?
• DRPG uses Monte Carlo tree search to simulate multiple rebuttal outcomes
• DRPG selects from predefined canonical rebuttal templates based on question types
• DRPG trains a content-aware scorer to evaluate perspective-paragraph support relationships
Q2. According to the experiments, what percentage of review points receive a valid perspective from the Planner when using a confidence threshold of 0.8?
• Approximately 98% of review points
• Approximately 62% of review points
• Approximately 75% of review points
Q3. Which two high-level rebuttal strategies does DRPG's Planner explicitly consider when generating perspectives?
• Clarification and Justification
• Acceptance and Rejection
• Argumentation and Concession

Paper 3

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Published: 2026-01-26

Link: http://arxiv.org/pdf/2601.18778

1. 📘 Topic and Domain: The paper investigates self-improvement in large language models through meta-reinforcement learning for mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on curriculum learning and self-play methods that use intrinsic rewards, the paper proposes grounding teacher rewards in actual student performance on hard problems rather than proxy metrics.
3. ❓ Problem: The paper addresses the challenge of training models on problems with near-zero initial success rates where standard reinforcement learning fails due to sparse rewards.
4. 🛠️ Methods: SOAR uses asymmetric teacher-student meta-RL where a teacher generates synthetic problems, a student trains on them, and the teacher is rewarded based on measurable student improvement on hard problems.
5. 📊 Results and Evaluation: On mathematical benchmarks with 0/128 baseline success, SOAR achieves 4× improvement in pass@1 and 2× in pass@32 on MATH, with generated questions transferring to unseen datasets and maintaining diversity unlike intrinsic reward methods.
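The teacher-student loop can be summarized in a few lines. The toy simulation below keeps only the structure of one outer-loop round: the "student" is reduced to a scalar skill level and "training" on synthetic pairs nudges it upward. Everything except the grounded-reward and promotion logic (R̄ > τ = 0.01, n = 64 pairs, taken from the paper) is an illustrative assumption.

```python
TAU = 0.01  # promotion threshold reported in the paper

def hard_set_pass_rate(skill: float, difficulty: float = 0.9) -> float:
    """Toy stand-in for evaluating the student on the hard dataset."""
    return max(0.0, skill - difficulty)

def soar_round(teacher_quality: float, baseline_skill: float, n_pairs: int = 64):
    """One outer-loop iteration: train the student on synthetic Q&A pairs,
    compute the grounded reward as measured improvement on the hard set,
    and promote the trained student to the new baseline only if reward > TAU."""
    trained = baseline_skill + teacher_quality * n_pairs * 0.001  # inner RL loop
    reward = hard_set_pass_rate(trained) - hard_set_pass_rate(baseline_skill)
    new_baseline = trained if reward > TAU else baseline_skill
    return new_baseline, reward
```

A high-quality teacher produces a reward above the threshold and the baseline advances; a weak teacher's improvement falls below τ and the baseline is kept, which is the promotion rule shown in the workflow.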


SOAR: Teaching Models to Teach Themselves (figure)
• Teacher model πφT generates a synthetic dataset X = {(q₁,a₁), ..., (qₙ,aₙ)} of n = 64 Q&A pairs without seeing the hard problems
• Student model πθS trains on X in an inner RL loop (RLVR)
• The student is evaluated on questions QR drawn from the hard dataset Dtrain (fail@128, i.e., 0/128 baseline success rate)
• The teacher is trained in an outer RL loop (RLOO) with a grounded reward R(Xk) based on measured student improvement
• Promotion: when the average reward exceeds the threshold (R̄ > τ = 0.01), the trained student is promoted to the new baseline
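The outer-loop teacher training uses RLOO, which treats each sampled synthetic dataset's measured student improvement as a black-box reward and baselines it leave-one-out style within a group of samples. A minimal sketch of the RLOO advantage computation (the group size and reward values below are illustrative, not from the paper):

```python
import numpy as np

def rloo_advantages(rewards) -> np.ndarray:
    """REINFORCE leave-one-out: each sample's advantage is its reward minus
    the mean reward of the *other* samples in the group."""
    r = np.asarray(rewards, dtype=float)
    loo_mean = (r.sum() - r) / (len(r) - 1)
    return r - loo_mean

# If one of four sampled datasets improves the student and three do not,
# only that dataset's generation is reinforced; the rest are discouraged:
adv = rloo_advantages([0.06, 0.0, 0.0, 0.0])
```

No learned value function is needed, which is what lets SOAR sidestep differentiating through the inner student-training loop.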
Q1. What surprising finding did the researchers discover about the quality of synthetic questions generated by SOAR?
• The questions needed to be 100% mathematically correct to improve student performance
• Only 32.8% of generated questions had fully correct solutions, yet they still enabled learning
• The teacher could already solve every hard problem for which it generated stepping-stone questions
Q2. How does SOAR avoid the computational complexity of traditional bilevel optimization?
• By using RLOO in the outer loop to treat student improvement as a black-box reward signal
• By completely eliminating the inner training loop and using direct gradient descent
• By pre-computing all possible teacher-student interactions offline
Q3. What happened when the researchers compared grounded rewards (SOAR) versus intrinsic rewards for teacher training?
• Both methods performed equally well with no significant differences
• Intrinsic rewards led to higher question correctness but worse student performance
• Grounded rewards preserved question diversity while intrinsic rewards collapsed to narrow concepts