2025-08-15 Papers

Paper 1

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10711

1. 📘 Topic and Domain: The paper introduces NextStep-1, a large-scale autoregressive model for text-to-image generation and editing, operating in the domain of artificial intelligence and computer vision.
2. 💡 Previous Research and New Ideas: Based on previous autoregressive language models and diffusion models, it proposes a novel approach using continuous tokens and flow matching for image generation, rather than traditional vector quantization or heavy diffusion models.
3. ❓ Problem: The paper aims to overcome the limitations of existing autoregressive text-to-image models, which either rely on computationally intensive diffusion models or suffer quantization loss from vector quantization.
4. 🛠️ Methods: The paper implements a 14B parameter autoregressive model with a 157M flow matching head, combining a Transformer backbone for text processing with continuous image tokens, trained on a diverse dataset including text-only corpus, image-text pairs, and interleaved data.
5. 📊 Results and Evaluation: The model achieves state-of-the-art performance for autoregressive models in text-to-image generation, scoring 0.54 on WISE, 0.67 on GenAI-Bench advanced prompts, 85.28 on DPG-Bench, and 0.417 on OneIG-Bench English prompts, while also demonstrating strong capabilities in image editing tasks.
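The division of labor above — a large causal transformer producing hidden states, plus a small flow-matching head that turns each hidden state into a continuous image token — can be sketched with a toy numpy model. This is only an illustration of the idea: the dimensions, the linear head `W`, and the Euler sampler are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (the real model pairs a 14B transformer with a 157M FM head).
HIDDEN, TOKEN = 32, 16

# Hypothetical linear "flow-matching head": predicts a velocity field
# v(x_t, t, h) from the noisy token x_t, timestep t, and transformer state h.
W = rng.normal(0, 0.02, size=(TOKEN + 1 + HIDDEN, TOKEN))

def velocity(x_t, t, h):
    inp = np.concatenate([x_t, [t], h])
    return inp @ W

def fm_loss(x1, h):
    """Flow-matching MSE: regress the predicted velocity toward the constant
    velocity (x1 - x0) of a straight noise-to-data interpolation path."""
    x0 = rng.normal(size=TOKEN)          # noise sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1          # point on the interpolation path
    target = x1 - x0
    return np.mean((velocity(x_t, t, h) - target) ** 2)

def sample_token(h, steps=8):
    """Euler integration of the learned ODE from noise to a continuous token,
    conditioned on the transformer hidden state h."""
    x = rng.normal(size=TOKEN)
    for i in range(steps):
        t = i / steps
        x = x + velocity(x, t, h) / steps
    return x

h = rng.normal(size=HIDDEN)              # stand-in for a transformer hidden state
x1 = rng.normal(size=TOKEN)              # a "clean" continuous image token
loss = fm_loss(x1, h)
token = sample_token(h)
```

The point of the sketch is the asymmetry: the expensive transformer runs once per token, while the cheap head only has to sample from the conditional distribution it defines.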

NextStep-1 Method Flow

Data Construction
• Text-only Corpus (400B)
• Image-Text Pairs (550M)
• Image-to-Image Data (1M)
• Interleaved Data (80M)
• Character-centric Dataset

Image Tokenizer
• Fine-tuned from the Flux VAE, 16-channel latents
• Channel-wise normalization
• Stochastic perturbation
• Space-to-depth transform

Text Tokenizer
• Standard discrete tokens from Qwen2.5-14B
• Language understanding and reasoning capabilities

Causal Transformer (14B)
• Initialized from Qwen2.5-14B
• Unified multimodal sequence modeling
• Next-token prediction: p(x_i | x_<i)
• 1D RoPE positional encoding

Training Recipe
• Stage 1: 256×256 (200K steps)
• Stage 2: dynamic resolution (100K steps)
• Annealing on a high-quality subset
• SFT + DPO alignment

LM Head
• Cross-entropy loss on discrete text tokens
• Standard sampling for language modeling

Flow Matching Head
• 157M parameters, MSE loss (velocity prediction)
• Continuous image tokens, patch-wise generation

Loss Function
• L_total = λ_text × L_text + λ_visual × L_visual
• Weighted combination, trained end-to-end

Image Generation
• Autoregressive patch-by-patch generation
• Classifier-free guidance for quality
• High-fidelity image synthesis

Image Editing
• NextStep-1-Edit variant
• Instruction-guided editing with competitive performance

Key Technical Innovations
• Channel-wise normalization prevents CFG instability
• Stochastic perturbation creates a robust latent space
• The lightweight FM head acts as a token sampler (157M vs. the 14B transformer)
• Pure autoregressive paradigm without heavy diffusion models
• Multi-stage curriculum learning for stable convergence
• Self-CoT reasoning enhances complex prompt understanding
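Two of the tokenizer tricks listed above are easy to make concrete. The numpy sketch below is a hedged illustration: the axes the statistics are computed over and the noise level σ are assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize_channels(latents, eps=1e-5):
    """Normalize each latent channel to zero mean / unit variance over the
    spatial dimensions; per the paper, stable token statistics are what keep
    classifier-free guidance from becoming unstable."""
    mean = latents.mean(axis=(0, 1), keepdims=True)
    std = latents.std(axis=(0, 1), keepdims=True)
    return (latents - mean) / (std + eps)

def perturb(latents, sigma=0.1):
    """Stochastic perturbation: add small Gaussian noise during training so
    the autoregressive model learns a latent space robust to small errors."""
    return latents + sigma * rng.normal(size=latents.shape)

# Toy H x W x C latent map with 16 channels, as in the fine-tuned Flux VAE.
z = rng.normal(3.0, 2.5, size=(8, 8, 16))
z_norm = normalize_channels(z)
z_train = perturb(z_norm)
```

After normalization every channel has roughly zero mean and unit variance, so the guidance-scaled token values stay in a predictable range regardless of the input image.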
Q1. What is the key innovation of NextStep-1 compared to previous autoregressive image generation models?
• It uses discrete tokens with vector quantization
• It uses continuous tokens with flow matching
• It uses pure diffusion models for generation

Q2. What was an unexpected finding about the Flow Matching Head in NextStep-1?
• Larger head sizes always produced better results
• The head size had minimal impact on generation quality
• The head could only work with small images

Q3. What counterintuitive relationship was discovered during the training of NextStep-1's tokenizer?
• Lower generation loss led to better image quality
• Higher noise in training led to worse image quality
• Higher generation loss with more noise actually improved image quality

Paper 2

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10881

1. 📘 Topic and Domain: AI-assisted cartoon animation production, specifically focusing on streamlining the process of generating cartoon videos from sparse keyframe sketches.
2. 💡 Previous Research and New Ideas: Building on prior work in video diffusion models and cartoon generation, the paper introduces a novel "post-keyframing" paradigm that unifies inbetweening and colorization into a single automated process.
3. ❓ Problem: Traditional cartoon production requires intensive manual effort in the inbetweening and colorization stages, while existing AI methods handle these stages separately, leading to error accumulation and artifacts.
4. 🛠️ Methods: Develops ToonComposer, a DiT-based model with sparse sketch injection mechanism for precise control and spatial low-rank adapter (SLRA) for cartoon domain adaptation, requiring only sparse keyframe sketches and a colored reference frame.
5. 📊 Results and Evaluation: Outperforms existing methods in both synthetic and real benchmarks (PKBench), achieving superior visual quality, motion consistency, and production efficiency, with 70.99% user preference rate for aesthetic quality.

ToonComposer Workflow

Input
• Sparse keyframe sketches + a colored reference frame
• Encoded by a VAE encoder

Sparse Sketch Injection
• Position encoding and position-aware residual
• Region-wise control

Diffusion Transformer (DiT)
• Built on the Wan 2.1 foundation model
• Stack of DiT blocks (#1 … #N)

Spatial Low-Rank Adapter (SLRA)
• Cartoon domain adaptation
• Adapts spatial behavior while preserving the temporal prior

Output
• VAE decoder → high-quality cartoon video

Datasets
• PKData: 37K cartoon clips with diverse sketches (training)
• PKBench: 30 human-drawn scenes (evaluation benchmark)

Training Objective
• Rectified flow, velocity prediction: L = E[||v_t − ε(x_in)||²]

Key Features
• Post-keyframing stage
• Unified inbetweening & colorization
• Sparse input control
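The SLRA's exact parameterization is not spelled out in this summary; the sketch below assumes a LoRA-style low-rank residual added only to a spatial projection, with the temporal path left frozen. The dimensions, rank, and zero-initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 64, 4   # feature dim and low rank (toy sizes, not the paper's values)

# Frozen spatial projection from the pretrained video DiT.
W_frozen = rng.normal(0, 0.05, size=(D, D))

# Low-rank adapter: only A and B are trained for cartoon adaptation.
A = rng.normal(0, 0.01, size=(D, R))
B = np.zeros((R, D))            # zero-init: the adapter starts as a no-op

def spatial_proj(x):
    """Spatial projection with a low-rank residual. Only spatial layers get
    an adapter; temporal layers stay untouched, so the foundation model's
    motion prior is preserved while spatial appearance shifts to cartoons."""
    return x @ W_frozen + x @ A @ B

x = rng.normal(size=(10, D))    # 10 spatial tokens from one frame
y = spatial_proj(x)
```

The zero-initialized `B` is a common LoRA design choice: before any training step, the adapted model is exactly the foundation model, so fine-tuning starts from known-good behavior.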
Q1. What is the main innovation of ToonComposer compared to previous AI-assisted cartoon production methods?
• It uses a completely new neural network architecture
• It unifies inbetweening and colorization into a single post-keyframing stage
• It requires more keyframes but produces better quality

Q2. What is the minimum input requirement for ToonComposer to generate a cartoon video sequence?
• One colored reference frame and one sketch frame
• Multiple colored frames and multiple sketches
• One sketch frame and a text prompt

Q3. What is the purpose of the Spatial Low-Rank Adapter (SLRA) in ToonComposer?
• To reduce the model's computational requirements
• To enable processing of higher resolution videos
• To adapt the model's spatial behavior to cartoons while preserving temporal priors

Paper 3

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10893

1. 📘 Topic and Domain: 3D reconstruction from streaming images/video using transformers in computer vision.
2. 💡 Previous Research and New Ideas: Building on DUSt3R's pointmap prediction approach, the paper introduces a novel decoder-only transformer architecture with causal attention for sequential processing, inspired by large language models.
3. ❓ Problem: Existing 3D reconstruction methods either require expensive global optimization or use limited memory mechanisms that don't scale well with sequence length for processing streaming inputs.
4. 🛠️ Methods: Uses a causal transformer architecture that caches features from previous frames and processes new frames sequentially, with dual coordinate prediction (local and global) and KV-cache for efficient inference.
5. 📊 Results and Evaluation: Outperforms existing methods on benchmarks such as Sintel, KITTI, and NYU-v2 for depth estimation and 7-Scenes for 3D reconstruction, while running about 40% faster than the state-of-the-art CUT3R.

STream3R Architecture

Streaming Input
• Uncalibrated images I₁, I₂, …, I_t

ViT Encoder (shared weights)
• F_t = Encoder(I_t)

Causal Transformer Decoder
• Self-attention plus causal attention over a memory cache (KV cache of previous frames' features)
• Register token on the first frame

Prediction Heads
• Local head: X̂_t^local, Ĉ_t^local (camera coordinates)
• Global head: X̂_t^global, Ĉ_t^global (world coordinates)
• Pose head: P̂_t = (R, t, f) (camera pose)

Training Objective
• L_conf = Σ (ĉ · ||x̂/ŝ − x/s||² − α log ĉ) (confidence-aware regression)
• L_pose = Σ (||q̂_t − q_t||² + ||τ̂_t/ŝ − τ_t/s||² + ||f̂_t − f_t||²)

Output
• Dense point maps + confidence maps + camera poses
• Compatible with Gaussian splatting, SLAM, and novel view synthesis

Key Features
• Causal attention with KV-cache efficiency
• LLM-style training and streaming processing
• No global alignment

Architecture Details
• Encoder: CroCo ViT (24 layers); decoder: 12 layers with causal attention
• Heads: DPT-L for regression; FlashAttention + QK-Norm
• Trained end-to-end on 29 diverse 3D datasets: 400K iterations, 8 A100 GPUs, 7 days
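The streaming mechanism above — each new frame attends causally over cached keys and values from earlier frames — can be sketched in a few lines of numpy. This is a single-head toy with made-up dimensions; the real decoder is multi-layer and uses FlashAttention with QK-Norm.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16  # toy feature dimension

# Toy projection matrices for one attention head.
Wq, Wk, Wv = (rng.normal(0, 0.1, size=(D, D)) for _ in range(3))

def attend(q, K, V):
    """Softmax attention of one query vector over all cached keys/values."""
    scores = q @ K.T / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

k_cache, v_cache = [], []

def process_frame(feat):
    """One causal streaming step: append this frame's key/value to the cache,
    then let its query attend over everything seen so far. Per-frame cost is
    O(t) in the number of cached frames, and no global re-alignment is run."""
    q = feat @ Wq
    k_cache.append(feat @ Wk)
    v_cache.append(feat @ Wv)
    return attend(q, np.stack(k_cache), np.stack(v_cache))

# Feed five "frames" one at a time, as a stream would arrive.
outs = [process_frame(rng.normal(size=D)) for _ in range(5)]
```

Because the cache only ever grows by one entry per frame, past frames are never re-encoded — the same property that lets LLM decoders generate long sequences efficiently.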
Q1. What is the main architectural innovation of STream3R compared to previous methods?
• Using a bi-directional transformer with global attention
• Using a decoder-only transformer with causal attention
• Using an RNN-based architecture with fixed memory

Q2. How does STream3R achieve efficient processing of streaming inputs?
• By using expensive global optimization for each frame
• By maintaining a fixed-size memory buffer
• By caching features from previous frames as context using a KV-cache

Q3. What performance improvement does STream3R achieve compared to the state-of-the-art CUT3R?
• 20% faster inference with slightly worse accuracy
• 40% faster inference with better accuracy
• Same speed but 40% better accuracy