2025-08-15 Papers

Paper 1

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10711

1. 📘 Topic and Domain: The paper introduces NextStep-1, a large-scale autoregressive model for text-to-image generation and editing, operating in the domain of artificial intelligence and computer vision.
2. 💡 Previous Research and New Ideas: Based on previous autoregressive language models and diffusion models, it proposes a novel approach using continuous tokens and flow matching for image generation, rather than traditional vector quantization or heavy diffusion models.
3. ❓ Problem: The paper aims to overcome the limitations of existing autoregressive text-to-image models, which either rely on computationally intensive diffusion models or suffer quantization loss from vector quantization.
4. 🛠️ Methods: The paper implements a 14B parameter autoregressive model with a 157M flow matching head, combining a Transformer backbone for text processing with continuous image tokens, trained on a diverse dataset including text-only corpus, image-text pairs, and interleaved data.
5. 📊 Results and Evaluation: The model achieves state-of-the-art performance for autoregressive models in text-to-image generation, scoring 0.54 on WISE, 0.67 on GenAI-Bench advanced prompts, 85.28 on DPG-Bench, and 0.417 on OneIG-Bench English prompts, while also demonstrating strong capabilities in image editing tasks.
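The division of labor above — a large causal transformer producing hidden states, plus a small flow-matching head that turns each hidden state into a continuous image token — can be sketched with a toy numpy model. This is only an illustration of the idea: the dimensions, the linear head `W`, and the Euler sampler are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (the real model pairs a 14B transformer with a 157M FM head).
HIDDEN, TOKEN = 32, 16

# Hypothetical linear "flow-matching head": predicts a velocity field
# v(x_t, t, h) from the noisy token x_t, timestep t, and transformer state h.
W = rng.normal(0, 0.02, size=(TOKEN + 1 + HIDDEN, TOKEN))

def velocity(x_t, t, h):
    inp = np.concatenate([x_t, [t], h])
    return inp @ W

def fm_loss(x1, h):
    """Flow-matching MSE: regress the predicted velocity toward the constant
    velocity (x1 - x0) of a straight noise-to-data interpolation path."""
    x0 = rng.normal(size=TOKEN)          # noise sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1          # point on the interpolation path
    target = x1 - x0
    return np.mean((velocity(x_t, t, h) - target) ** 2)

def sample_token(h, steps=8):
    """Euler integration of the learned ODE from noise to a continuous token,
    conditioned on the transformer hidden state h."""
    x = rng.normal(size=TOKEN)
    for i in range(steps):
        t = i / steps
        x = x + velocity(x, t, h) / steps
    return x

h = rng.normal(size=HIDDEN)              # stand-in for a transformer hidden state
x1 = rng.normal(size=TOKEN)              # a "clean" continuous image token
loss = fm_loss(x1, h)
token = sample_token(h)
```

The point of the sketch is the asymmetry: the expensive transformer runs once per token, while the cheap head only has to sample from the conditional distribution it defines.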

NextStep-1 Method Flow

Data Construction
• Text-only Corpus (400B)
• Image-Text Pairs (550M)
• Image-to-Image Data (1M)
• Interleaved Data (80M)
• Character-centric Dataset

Image Tokenizer
• Fine-tuned from the Flux VAE, 16-channel latents
• Channel-wise normalization
• Stochastic perturbation
• Space-to-depth transform

Text Tokenizer
• Standard discrete tokens from Qwen2.5-14B
• Language understanding and reasoning capabilities

Causal Transformer (14B)
• Initialized from Qwen2.5-14B
• Unified multimodal sequence modeling
• Next-token prediction: p(x_i | x_<i)
• 1D RoPE positional encoding

Training Recipe
• Stage 1: 256×256 (200K steps)
• Stage 2: dynamic resolution (100K steps)
• Annealing on a high-quality subset
• SFT + DPO alignment

LM Head
• Cross-entropy loss on discrete text tokens
• Standard sampling for language modeling

Flow Matching Head
• 157M parameters, MSE loss (velocity prediction)
• Continuous image tokens, patch-wise generation

Loss Function
• L_total = λ_text × L_text + λ_visual × L_visual
• Weighted combination, trained end-to-end

Image Generation
• Autoregressive patch-by-patch generation
• Classifier-free guidance for quality
• High-fidelity image synthesis

Image Editing
• NextStep-1-Edit variant
• Instruction-guided editing with competitive performance

Key Technical Innovations
• Channel-wise normalization prevents CFG instability
• Stochastic perturbation creates a robust latent space
• The lightweight FM head acts as a token sampler (157M vs. the 14B transformer)
• Pure autoregressive paradigm without heavy diffusion models
• Multi-stage curriculum learning for stable convergence
• Self-CoT reasoning enhances complex prompt understanding
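Two of the tokenizer tricks listed above are easy to make concrete. The numpy sketch below is a hedged illustration: the axes the statistics are computed over and the noise level σ are assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize_channels(latents, eps=1e-5):
    """Normalize each latent channel to zero mean / unit variance over the
    spatial dimensions; per the paper, stable token statistics are what keep
    classifier-free guidance from becoming unstable."""
    mean = latents.mean(axis=(0, 1), keepdims=True)
    std = latents.std(axis=(0, 1), keepdims=True)
    return (latents - mean) / (std + eps)

def perturb(latents, sigma=0.1):
    """Stochastic perturbation: add small Gaussian noise during training so
    the autoregressive model learns a latent space robust to small errors."""
    return latents + sigma * rng.normal(size=latents.shape)

# Toy H x W x C latent map with 16 channels, as in the fine-tuned Flux VAE.
z = rng.normal(3.0, 2.5, size=(8, 8, 16))
z_norm = normalize_channels(z)
z_train = perturb(z_norm)
```

After normalization every channel has roughly zero mean and unit variance, so the guidance-scaled token values stay in a predictable range regardless of the input image.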
Q1. What is the key innovation of NextStep-1 compared to previous autoregressive image generation models?
• It uses discrete tokens with vector quantization
• It uses continuous tokens with flow matching
• It uses pure diffusion models for generation

Q2. What was an unexpected finding about the Flow Matching Head in NextStep-1?
• Larger head sizes always produced better results
• The head size had minimal impact on generation quality
• The head could only work with small images

Q3. What counterintuitive relationship was discovered during the training of NextStep-1's tokenizer?
• Lower generation loss led to better image quality
• Higher noise in training led to worse image quality
• Higher generation loss with more noise actually improved image quality

Paper 2

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10881

1. 📘 Topic and Domain: AI-assisted cartoon animation production, specifically focusing on streamlining the process of generating cartoon videos from sparse keyframe sketches.
2. 💡 Previous Research and New Ideas: Building on prior work in video diffusion models and cartoon generation, the paper introduces a novel "post-keyframing" paradigm that unifies inbetweening and colorization into a single automated process.
3. ❓ Problem: Traditional cartoon production requires intensive manual effort in the inbetweening and colorization stages, while existing AI methods handle these stages separately, leading to error accumulation and artifacts.
4. 🛠️ Methods: Develops ToonComposer, a DiT-based model with sparse sketch injection mechanism for precise control and spatial low-rank adapter (SLRA) for cartoon domain adaptation, requiring only sparse keyframe sketches and a colored reference frame.
5. 📊 Results and Evaluation: Outperforms existing methods in both synthetic and real benchmarks (PKBench), achieving superior visual quality, motion consistency, and production efficiency, with 70.99% user preference rate for aesthetic quality.

ToonComposer Workflow

Input
• Sparse keyframe sketches + a colored reference frame
• Encoded by a VAE encoder

Sparse Sketch Injection
• Position encoding and position-aware residual
• Region-wise control

Diffusion Transformer (DiT)
• Built on the Wan 2.1 foundation model
• Stack of DiT blocks (#1 … #N)

Spatial Low-Rank Adapter (SLRA)
• Cartoon domain adaptation
• Adapts spatial behavior while preserving the temporal prior

Output
• VAE decoder → high-quality cartoon video

Datasets
• PKData: 37K cartoon clips with diverse sketches (training)
• PKBench: 30 human-drawn scenes (evaluation benchmark)

Training Objective
• Rectified flow, velocity prediction: L = E[||v_t − ε(x_in)||²]

Key Features
• Post-keyframing stage
• Unified inbetweening & colorization
• Sparse input control
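The SLRA's exact parameterization is not spelled out in this summary; the sketch below assumes a LoRA-style low-rank residual added only to a spatial projection, with the temporal path left frozen. The dimensions, rank, and zero-initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 64, 4   # feature dim and low rank (toy sizes, not the paper's values)

# Frozen spatial projection from the pretrained video DiT.
W_frozen = rng.normal(0, 0.05, size=(D, D))

# Low-rank adapter: only A and B are trained for cartoon adaptation.
A = rng.normal(0, 0.01, size=(D, R))
B = np.zeros((R, D))            # zero-init: the adapter starts as a no-op

def spatial_proj(x):
    """Spatial projection with a low-rank residual. Only spatial layers get
    an adapter; temporal layers stay untouched, so the foundation model's
    motion prior is preserved while spatial appearance shifts to cartoons."""
    return x @ W_frozen + x @ A @ B

x = rng.normal(size=(10, D))    # 10 spatial tokens from one frame
y = spatial_proj(x)
```

The zero-initialized `B` is a common LoRA design choice: before any training step, the adapted model is exactly the foundation model, so fine-tuning starts from known-good behavior.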
Q1. What is the main innovation of ToonComposer compared to previous AI-assisted cartoon production methods?
• It uses a completely new neural network architecture
• It unifies inbetweening and colorization into a single post-keyframing stage
• It requires more keyframes but produces better quality

Q2. What is the minimum input requirement for ToonComposer to generate a cartoon video sequence?
• One colored reference frame and one sketch frame
• Multiple colored frames and multiple sketches
• One sketch frame and a text prompt

Q3. What is the purpose of the Spatial Low-Rank Adapter (SLRA) in ToonComposer?
• To reduce the model's computational requirements
• To enable processing of higher resolution videos
• To adapt the model's spatial behavior to cartoons while preserving temporal priors

Paper 3

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Published: 2025-08-14

Link: http://arxiv.org/pdf/2508.10893

1. 📘 Topic and Domain: 3D reconstruction from streaming images/video using transformers in computer vision.
2. 💡 Previous Research and New Ideas: Building on DUSt3R's pointmap prediction approach, the paper introduces a novel decoder-only transformer architecture with causal attention for sequential processing, inspired by large language models.
3. ❓ Problem: Existing 3D reconstruction methods either require expensive global optimization or use limited memory mechanisms that don't scale well with sequence length for processing streaming inputs.
4. 🛠️ Methods: Uses a causal transformer architecture that caches features from previous frames and processes new frames sequentially, with dual coordinate prediction (local and global) and KV-cache for efficient inference.
5. 📊 Results and Evaluation: Outperforms existing methods on benchmarks such as Sintel, KITTI, and NYU-v2 for depth estimation and 7-Scenes for 3D reconstruction, while running about 40% faster than the state-of-the-art CUT3R.

STream3R Architecture

Streaming Input
• Uncalibrated images I₁, I₂, …, I_t

ViT Encoder (shared weights)
• F_t = Encoder(I_t)

Causal Transformer Decoder
• Self-attention plus causal attention over a memory cache (KV cache of previous frames' features)
• Register token on the first frame

Prediction Heads
• Local head: X̂_t^local, Ĉ_t^local (camera coordinates)
• Global head: X̂_t^global, Ĉ_t^global (world coordinates)
• Pose head: P̂_t = (R, t, f) (camera pose)

Training Objective
• L_conf = Σ (ĉ · ||x̂/ŝ − x/s||² − α log ĉ) (confidence-aware regression)
• L_pose = Σ (||q̂_t − q_t||² + ||τ̂_t/ŝ − τ_t/s||² + ||f̂_t − f_t||²)

Output
• Dense point maps + confidence maps + camera poses
• Compatible with Gaussian splatting, SLAM, and novel view synthesis

Key Features
• Causal attention with KV-cache efficiency
• LLM-style training and streaming processing
• No global alignment

Architecture Details
• Encoder: CroCo ViT (24 layers); decoder: 12 layers with causal attention
• Heads: DPT-L for regression; FlashAttention + QK-Norm
• Trained end-to-end on 29 diverse 3D datasets: 400K iterations, 8 A100 GPUs, 7 days
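The streaming mechanism above — each new frame attends causally over cached keys and values from earlier frames — can be sketched in a few lines of numpy. This is a single-head toy with made-up dimensions; the real decoder is multi-layer and uses FlashAttention with QK-Norm.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16  # toy feature dimension

# Toy projection matrices for one attention head.
Wq, Wk, Wv = (rng.normal(0, 0.1, size=(D, D)) for _ in range(3))

def attend(q, K, V):
    """Softmax attention of one query vector over all cached keys/values."""
    scores = q @ K.T / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

k_cache, v_cache = [], []

def process_frame(feat):
    """One causal streaming step: append this frame's key/value to the cache,
    then let its query attend over everything seen so far. Per-frame cost is
    O(t) in the number of cached frames, and no global re-alignment is run."""
    q = feat @ Wq
    k_cache.append(feat @ Wk)
    v_cache.append(feat @ Wv)
    return attend(q, np.stack(k_cache), np.stack(v_cache))

# Feed five "frames" one at a time, as a stream would arrive.
outs = [process_frame(rng.normal(size=D)) for _ in range(5)]
```

Because the cache only ever grows by one entry per frame, past frames are never re-encoded — the same property that lets LLM decoders generate long sequences efficiently.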
Q1. What is the main architectural innovation of STream3R compared to previous methods?
• Using a bi-directional transformer with global attention
• Using a decoder-only transformer with causal attention
• Using an RNN-based architecture with fixed memory

Q2. How does STream3R achieve efficient processing of streaming inputs?
• By using expensive global optimization for each frame
• By maintaining a fixed-size memory buffer
• By caching features from previous frames as context using a KV-cache

Q3. What performance improvement does STream3R achieve compared to the state-of-the-art CUT3R?
• 20% faster inference with slightly worse accuracy
• 40% faster inference with better accuracy
• Same speed but 40% better accuracy