2025-04-08 Papers


Paper 1

One-Minute Video Generation with Test-Time Training

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05298

1. 📘 Topic and Domain: The paper addresses one-minute video generation from text storyboards using Test-Time Training (TTT) layers to overcome the limitations of Transformer models in handling long contexts.
2. 💡 Previous Research and New Ideas: The paper builds on Diffusion Transformers but proposes TTT layers whose hidden states are themselves neural networks, in contrast to modern RNN approaches such as Mamba or DeltaNet, whose hidden states are matrices.
3. ❓ Problem: The paper aims to solve the inefficiency of self-attention in generating long videos, as traditional Transformers struggle with one-minute videos due to quadratic complexity with context length.
4. 🛠️ Methods: The authors add TTT-MLP layers to a pre-trained Diffusion Transformer (CogVideo-X 5B), fine-tune on Tom and Jerry cartoons, and implement on-chip tensor parallelism for efficiency while limiting self-attention to 3-second segments.
5. 📊 Results and Evaluation: TTT-MLP outperformed baselines (Mamba 2, Gated DeltaNet, sliding-window attention) by 34 Elo points in human evaluation across four metrics, generating more coherent videos with complex stories, though still containing some artifacts.
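The core mechanism behind these results is that a TTT layer's hidden state is a small neural network whose weights are updated by gradient descent on a self-supervised loss as the sequence is processed. A minimal NumPy sketch of that idea, not the paper's implementation (the real layer uses a learned corruption-based objective, token mini-batches of 64, and bi-directional processing; here the loss is plain reconstruction and updates are per token):

```python
import numpy as np

def ttt_mlp_layer(tokens, d_hidden=16, lr=0.1, seed=0):
    """Toy TTT layer: the hidden state is a 2-layer MLP (W1, W2),
    updated by one gradient step per token on a self-supervised
    reconstruction loss, then used to produce that token's output."""
    rng = np.random.default_rng(seed)
    d = tokens.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, d_hidden))   # hidden state, part 1
    W2 = rng.normal(0.0, 0.1, (d_hidden, d))   # hidden state, part 2
    outputs = []
    for x in tokens:
        # Forward pass of the hidden-state MLP
        h = np.maximum(x @ W1, 0.0)
        y = h @ W2
        # Manual backprop of L = ||y - x||^2 (reconstruction loss)
        g_y = 2.0 * (y - x)
        g_W2 = np.outer(h, g_y)
        g_h = (g_y @ W2.T) * (h > 0)
        g_W1 = np.outer(x, g_h)
        # "Training at test time": gradient step on the hidden state
        W1 -= lr * g_W1
        W2 -= lr * g_W2
        # Read out with the updated state
        outputs.append(np.maximum(x @ W1, 0.0) @ W2)
    return np.stack(outputs)

x = np.random.default_rng(1).normal(size=(8, 4))  # 8 tokens, dim 4
out = ttt_mlp_layer(x)

# Learnable residual gate from the paper: with alpha initialized near 0,
# tanh(alpha) ~ 0 and the TTT block starts as an identity mapping.
alpha = 0.0
gated = np.tanh(alpha) * out + x
```

The last two lines illustrate why the gate matters: with α initialized near zero, fine-tuning starts from the unmodified pre-trained Diffusion Transformer and gradually blends in the TTT output.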

One-Minute Video Generation with Test-Time Training

Workflow: One-Minute Video Generation with Test-Time Training

Problem & Goal
- Generate long (1-minute), coherent videos with complex stories; full self-attention is too costly.

Core Idea: Test-Time Training (TTT)
- An RNN layer with an expressive hidden state (an MLP).
- The hidden state is updated via gradient descent on a self-supervised loss during processing.

Starting Point
- Pre-trained Diffusion Transformer (CogVideo-X 5B), which generates 3-second clips.

Architecture Modification
1. Integrate TTT-MLP layers into the Transformer.
2. Add learnable gating: tanh(α) ⊗ TTT(X) + X, with α initialized ≈ 0.
3. Use bi-directional TTT (TTT and TTT') for the non-causal diffusion model.
Result: a modified Transformer block.

Input Processing Pipeline
1. Text prompt (Formats 1/2 → Format 3: storyboard)
2. Video segmentation (scenes → 3-second segments)
3. Tokenization (text + noisy video per segment)
4. Sequence concatenation (interleaved segments)
5. Processing strategy: local self-attention within 3-second segments; global TTT layers across the full sequence.

Dataset Creation
1. Source: ~7 hours of Tom and Jerry cartoons.
2. Preprocessing: super-resolution to 720×480.
3. Annotation: human-written storyboards (Format 3) for 3-second segments.
4. Multi-stage data: concatenate segments into 3, 9, 18, 30, and 63-second videos.

Multi-Stage Fine-Tuning Strategy
- Stage 1 (domain adaptation): train the entire model on 3-second segments, with a higher learning rate for TTT layers and gates.
- Stages 2-5 (context extension): train only the TTT layers, gates, and local attention (lower learning rate) on 9, 18, 30, and 63-second videos, gradually increasing the context length handled.

TTT Implementation & Optimization
- Inner-loop parallelization: update the TTT hidden state (W) on mini-batches of tokens (b = 64).
- On-chip tensor parallelism for GPU efficiency: shard the TTT-MLP hidden state across SMs; compute updates on-chip via SMEM/DSMEM; minimize slow HBM transfers (load/store only); use fused kernels and async transfers (ThunderKittens).

Evaluation Setup
- Baselines: local attention (no modification), TTT-Linear (simpler TTT hidden state), Mamba 2, Gated DeltaNet (matrix hidden states), and sliding-window attention.
- Protocol: blind human pairwise preference; metrics are text following, motion naturalness, aesthetics, and temporal consistency (Elo scores); an 18-second elimination round precedes the 63-second final evaluation.

Results & Limitations
- TTT-MLP significantly outperforms the baselines on 63-second videos (+34 Elo on average), especially on temporal consistency; Gated DeltaNet is stronger on shorter 18-second videos.
- Limitations: video artifacts persist (motion, aesthetics); TTT-MLP is slower than Mamba/DeltaNet (1.4× inference, 2.1× training vs. Gated DeltaNet); performance is potentially limited by the base model.

Output: one-minute coherent videos.
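The evaluation aggregates blind pairwise human preferences into Elo scores. For reference, the standard logistic Elo update for a single comparison looks like this (a generic sketch; the paper's exact aggregation procedure and K-factor are not specified here, K = 32 is an assumption):

```python
def elo_update(r_a, r_b, winner_a, k=32.0):
    """Standard logistic Elo update for one pairwise comparison."""
    e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))  # expected score of A
    s_a = 1.0 if winner_a else 0.0                      # actual score of A
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two methods start equal; A wins the blind comparison
ra, rb = elo_update(1000.0, 1000.0, winner_a=True)
print(ra, rb)  # 1016.0 984.0
```

Under this scheme, a +34 Elo gap corresponds to the stronger method being preferred in roughly 55% of pairwise comparisons.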
Q1
1. What is the key innovation that allows TTT layers to generate more coherent long videos compared to Mamba and DeltaNet?
They use a more efficient self-attention mechanism
Their hidden states are neural networks rather than matrices
They combine multiple 3-second video segments with transitions
Q2
2. Why did the authors choose Tom and Jerry cartoons as their dataset for the proof of concept?
To focus on complex, multi-scene stories with dynamic motion rather than visual realism
Because cartoon generation is easier than photorealistic video generation
To compete directly with OpenAI's Sora model which specializes in cartoons
Q3
3. What was the most significant limitation of the TTT-MLP approach compared to other methods?
It performed worse on shorter videos (18 seconds) than Gated DeltaNet
It required much more training data than other approaches
It was significantly slower in both inference and training compared to Gated DeltaNet

Paper 2

SmolVLM: Redefining small and efficient multimodal models

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05299

1. 📘 Topic and Domain: This paper introduces SmolVLM, a family of compact multimodal models for efficient vision-language understanding that can process both images and videos.
2. 💡 Previous Research and New Ideas: The paper builds on previous large-scale VLMs like Flamingo and Idefics, proposing architectural innovations specifically for small models rather than simply scaling down larger models.
3. ❓ Problem: The paper addresses the high computational requirements of current Vision-Language Models (VLMs) that limit their deployment on mobile and edge devices.
4. 🛠️ Methods: The authors systematically explore architectural configurations (balanced encoder-LM parameters), tokenization strategies (pixel shuffle), positional encoding (learned tokens), and training data composition optimized for small models.
5. 📊 Results and Evaluation: SmolVLM-256M (smallest model) uses less than 1GB GPU memory yet outperforms the 300-times larger Idefics-80B, while SmolVLM-2.2B rivals VLMs that consume twice the GPU memory, with all variants demonstrating strong performance on both image and video tasks.

SmolVLM: Redefining small and efficient multimodal models

SmolVLM Methodology Flowchart

Inputs
- Image / video plus a text prompt.

Vision Processing
1. Image splitting / video frame sampling (Finding 4: prefer splitting).
2. Vision encoder (SigLIP; Finding 1: balance with LM size) → encoded features.

Text Processing
- Text tokenizer → text embeddings.

Feature Transform
3. Pixel shuffle (Finding 3: aggressive shuffle is OK).
4. MLP projection → visual tokens.

Token Combination
- Combine/interleave visual and text tokens (Finding 5: learned positional tokens; Finding 6: media markers) into the input sequence (Finding 2: extended context).

Language Model
- SmolLM2 backbone (135M, 360M, or 1.7B; Finding 1: balance with encoder) → text output.

Key Design Choices & Findings (Architecture)
- F1: A balanced encoder-LM parameter split is crucial for small models.
- F2: Extended context length (8k/16k) significantly improves performance.
- F3: Aggressive pixel shuffle (e.g., r=4) is beneficial for smaller VLMs.
- F4: Image splitting is useful; video frame averaging is harmful for small models.

Key Design Choices & Findings (Instruction Tuning)
- F5: Learned positional tokens outperform string tokens for sub-images.
- F6: System prompts and media intro/outro tokens boost performance; masking user prompts during SFT improves generalization.
- F7: Reusing LLM-SFT text data degrades small-VLM performance.
- F8: A minimal amount of Chain-of-Thought (CoT) data is optimal; excess harms.
- F9: Moderate video sequence length (~3.5 min average) is beneficial.
- Data: two-stage training (vision → video) with specific data mixes (Fig. 8).

Resulting Models & Evaluation
- SmolVLM-256M: 93M encoder + 135M LM (0.8 GB RAM)
- SmolVLM-500M: 93M encoder + 360M LM (1.2 GB RAM)
- SmolVLM-2.2B: 400M encoder + 1.7B LM (4.9 GB RAM)
- Evaluation focus: performance (VLMEvalKit benchmarks) vs. GPU RAM usage (efficiency).
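The pixel shuffle in Finding 3 is a space-to-depth rearrangement: each r×r patch of the vision encoder's feature map is folded into the channel dimension, cutting the visual token count by r². A minimal NumPy sketch (shapes are illustrative; SmolVLM applies this to SigLIP feature maps before the MLP projection):

```python
import numpy as np

def pixel_shuffle_tokens(feats, r=4):
    """Space-to-depth "pixel shuffle": trades spatial resolution for
    channel depth, cutting the visual token count by r**2."""
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/r, W/r, r, r, C)
    return x.reshape(H // r, W // r, C * r * r)

# A 32x32 feature grid with 8 channels -> 1024 tokens before shuffling
grid = np.arange(32 * 32 * 8, dtype=np.float32).reshape(32, 32, 8)
tok = pixel_shuffle_tokens(grid, r=4)
print(grid.shape[0] * grid.shape[1], "->", tok.shape[0] * tok.shape[1])  # 1024 -> 64
```

With r=4, 16× fewer visual tokens reach the language model, which is why aggressive shuffling pays off for small VLMs whose context budget is tight.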
Q1
1. What is the main innovation of SmolVLM compared to previous Vision-Language Models?
Using larger language models with smaller vision encoders
Designing architecture specifically optimized for small-scale efficiency rather than scaling down large models
Focusing exclusively on image processing while ignoring video capabilities
Q2
2. Which tokenization strategy did the authors find most effective for small multimodal models?
Frame averaging for video processing
String-based position tokens for image splitting
Aggressive pixel shuffle with learned positional tokens
Q3
3. What surprising finding did the researchers discover about Chain-of-Thought (CoT) data when training small multimodal models?
CoT data should be completely avoided in small models
A minimal fraction (0.02-0.05%) of CoT data is optimal, while higher proportions degrade performance
CoT data should constitute at least 50% of the training mix for optimal reasoning

Paper 3

URECA: Unique Region Caption Anything

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05305

1. 📘 Topic and Domain: The paper introduces URECA, a system for generating unique captions for specific regions within images at multiple levels of granularity in the computer vision and natural language processing domain.
2. 💡 Previous Research and New Ideas: The paper builds upon previous region-level captioning research but proposes a novel dataset with unique region-caption mapping and a new model architecture that preserves spatial properties of multi-granularity regions.
3. ❓ Problem: The paper addresses the challenge of generating distinctive captions for regions at any level of granularity that uniquely describe the target region while differentiating it from surrounding areas.
4. 🛠️ Methods: The authors created a stage-wise data curation pipeline using mask tree structures to generate unique captions, and developed a model with a mask encoder and dynamic mask modeling to effectively condition regions without losing details.
5. 📊 Results and Evaluation: URECA achieved state-of-the-art performance on the authors' test dataset and demonstrated strong generalization on benchmark datasets like Visual Genome and RefCOCOg, outperforming previous methods in generating unique captions for multi-granularity regions.

URECA: Unique Region Caption Anything

URECA Paper Workflow: Method Focus

Part 1: URECA Dataset Creation
- Input: SA-1B dataset (images + multi-granularity masks).
- Stage 1 (Mask Tree Generation): build a hierarchical tree based on mask IoU (subset/superset relationships).
- Stage 2 (Top-Down Short Caption Generation): an MLLM generates short captions from root to leaves, given the parent caption and cropped/blurred images, so each caption incorporates parent context.
- Stage 3 (Bottom-Up Detailed Caption Generation): the MLLM refines captions from leaves to root, given the child captions, the short caption, and a contoured image, incorporating child details while maintaining context.
- Stage 4 (Uniqueness Refinement): identify similar regions via DINOv2 features; the MLLM refines the caption to differentiate the target, ensuring uniqueness among similar regions.
- Output: the URECA dataset (unique, multi-granularity region captions), with test-set verification via GPT-4o.

Part 2: URECA Model Architecture
- Input: image, target region mask, and query.
- Image encoder (e.g., ViT) → image tokens; query text ("Describe this region") → query tokens.
- Mask processing: dynamic masking splits the high-resolution mask into sub-masks; a CNN mask encoder produces mask tokens.
- Combine the tokens (image + mask + query) and feed them into the LLM (frozen + LoRA) to generate the caption.
- Output: a unique, multi-granularity caption.

Part 3: Training & Evaluation
- Train the URECA model on the URECA dataset (LoRA).
- Evaluate on the URECA test set, Visual Genome and RefCOCOg (zero-shot), plus ablations.
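Stage 1's mask tree can be sketched as a containment test: a mask becomes a child of the smallest mask that (approximately) contains it. A toy NumPy version, where the 0.9 containment threshold and the intersection-over-child-area test are illustrative assumptions standing in for the paper's IoU-based subset/superset relations:

```python
import numpy as np

def build_mask_tree(masks, contain_thresh=0.9):
    """Toy mask tree: mask j is a child of mask i when most of j's area
    lies inside i; each mask's parent is its smallest such container."""
    areas = [int(m.sum()) for m in masks]
    parent = [-1] * len(masks)   # -1 marks a root
    for j, mj in enumerate(masks):
        best, best_area = -1, None
        for i, mi in enumerate(masks):
            if i == j or areas[i] <= areas[j]:
                continue
            inter = int(np.logical_and(mi, mj).sum())
            # Approximate subset test: the intersection covers most of j
            if areas[j] > 0 and inter / areas[j] >= contain_thresh:
                if best_area is None or areas[i] < best_area:
                    best, best_area = i, areas[i]
        parent[j] = best
    return parent

whole = np.ones((8, 8), dtype=bool)                          # whole image
obj = np.zeros((8, 8), dtype=bool); obj[2:6, 2:6] = True     # an object
part = np.zeros((8, 8), dtype=bool); part[3:5, 3:5] = True   # a part of it
print(build_mask_tree([whole, obj, part]))  # [-1, 0, 1]
```

The resulting parent pointers give exactly the root-to-leaf paths that Stages 2 and 3 traverse when propagating context down and details up the tree.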
Q1
1. What is the primary innovation in the URECA dataset compared to previous captioning datasets?
It contains more images than any previous dataset
It ensures unique caption-region mapping across multiple granularities
It only focuses on salient objects in images
Q2
2. What technical approach does URECA use to preserve region details that previous methods often lost?
Directly overlaying contours on the original image
Translating region coordinates into natural language
Dynamic mask modeling with a high-resolution mask encoder
Q3
3. How does the URECA data curation pipeline ensure caption uniqueness?
By using human annotators to manually verify each caption
By using a stage-wise process with mask tree structures and visual similarity analysis
By limiting captions to only include object class names