2025-04-08 Papers


Paper 1

One-Minute Video Generation with Test-Time Training

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05298

1. 📘 Topic and Domain: The paper addresses one-minute video generation from text storyboards using Test-Time Training (TTT) layers to overcome the limitations of Transformer models in handling long contexts.
2. 💡 Previous Research and New Ideas: The paper builds on Diffusion Transformers but proposes TTT layers whose hidden states are themselves neural networks, in contrast to modern RNN approaches such as Mamba or DeltaNet, whose hidden states are matrices.
3. ❓ Problem: The paper aims to solve the inefficiency of self-attention in generating long videos, as traditional Transformers struggle with one-minute videos due to quadratic complexity with context length.
4. 🛠️ Methods: The authors add TTT-MLP layers to a pre-trained Diffusion Transformer (CogVideo-X 5B), fine-tune on Tom and Jerry cartoons, and implement on-chip tensor parallelism for efficiency while limiting self-attention to 3-second segments.
5. 📊 Results and Evaluation: TTT-MLP outperformed baselines (Mamba 2, Gated DeltaNet, sliding-window attention) by 34 Elo points in human evaluation across four metrics, generating more coherent videos with complex stories, though still containing some artifacts.
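The core mechanism behind these results is that a TTT layer's hidden state is a small neural network whose weights are updated by gradient descent on a self-supervised loss as the sequence is processed. A minimal NumPy sketch of that idea, not the paper's implementation (the real layer uses a learned corruption-based objective, token mini-batches of 64, and bi-directional processing; here the loss is plain reconstruction and updates are per token):

```python
import numpy as np

def ttt_mlp_layer(tokens, d_hidden=16, lr=0.1, seed=0):
    """Toy TTT layer: the hidden state is a 2-layer MLP (W1, W2),
    updated by one gradient step per token on a self-supervised
    reconstruction loss, then used to produce that token's output."""
    rng = np.random.default_rng(seed)
    d = tokens.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, d_hidden))   # hidden state, part 1
    W2 = rng.normal(0.0, 0.1, (d_hidden, d))   # hidden state, part 2
    outputs = []
    for x in tokens:
        # Forward pass of the hidden-state MLP
        h = np.maximum(x @ W1, 0.0)
        y = h @ W2
        # Manual backprop of L = ||y - x||^2 (reconstruction loss)
        g_y = 2.0 * (y - x)
        g_W2 = np.outer(h, g_y)
        g_h = (g_y @ W2.T) * (h > 0)
        g_W1 = np.outer(x, g_h)
        # "Training at test time": gradient step on the hidden state
        W1 -= lr * g_W1
        W2 -= lr * g_W2
        # Read out with the updated state
        outputs.append(np.maximum(x @ W1, 0.0) @ W2)
    return np.stack(outputs)

x = np.random.default_rng(1).normal(size=(8, 4))  # 8 tokens, dim 4
out = ttt_mlp_layer(x)

# Learnable residual gate from the paper: with alpha initialized near 0,
# tanh(alpha) ~ 0 and the TTT block starts as an identity mapping.
alpha = 0.0
gated = np.tanh(alpha) * out + x
```

The last two lines illustrate why the gate matters: with α initialized near zero, fine-tuning starts from the unmodified pre-trained Diffusion Transformer and gradually blends in the TTT output.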

One-Minute Video Generation with Test-Time Training

Workflow: One-Minute Video Generation with Test-Time Training

Problem & Goal
- Generate long (1-minute), coherent videos with complex stories; full self-attention is too costly.

Core Idea: Test-Time Training (TTT)
- An RNN layer with an expressive hidden state (an MLP).
- The hidden state is updated via gradient descent on a self-supervised loss during processing.

Starting Point
- Pre-trained Diffusion Transformer (CogVideo-X 5B), which generates 3-second clips.

Architecture Modification
1. Integrate TTT-MLP layers into the Transformer.
2. Add learnable gating: tanh(α) ⊗ TTT(X) + X, with α initialized ≈ 0.
3. Use bi-directional TTT (TTT and TTT') for the non-causal diffusion model.
Result: a modified Transformer block.

Input Processing Pipeline
1. Text prompt (Formats 1/2 → Format 3: storyboard)
2. Video segmentation (scenes → 3-second segments)
3. Tokenization (text + noisy video per segment)
4. Sequence concatenation (interleaved segments)
5. Processing strategy: local self-attention within 3-second segments; global TTT layers across the full sequence.

Dataset Creation
1. Source: ~7 hours of Tom and Jerry cartoons.
2. Preprocessing: super-resolution to 720×480.
3. Annotation: human-written storyboards (Format 3) for 3-second segments.
4. Multi-stage data: concatenate segments into 3, 9, 18, 30, and 63-second videos.

Multi-Stage Fine-Tuning Strategy
- Stage 1 (domain adaptation): train the entire model on 3-second segments, with a higher learning rate for TTT layers and gates.
- Stages 2-5 (context extension): train only the TTT layers, gates, and local attention (lower learning rate) on 9, 18, 30, and 63-second videos, gradually increasing the context length handled.

TTT Implementation & Optimization
- Inner-loop parallelization: update the TTT hidden state (W) on mini-batches of tokens (b = 64).
- On-chip tensor parallelism for GPU efficiency: shard the TTT-MLP hidden state across SMs; compute updates on-chip via SMEM/DSMEM; minimize slow HBM transfers (load/store only); use fused kernels and async transfers (ThunderKittens).

Evaluation Setup
- Baselines: local attention (no modification), TTT-Linear (simpler TTT hidden state), Mamba 2, Gated DeltaNet (matrix hidden states), and sliding-window attention.
- Protocol: blind human pairwise preference; metrics are text following, motion naturalness, aesthetics, and temporal consistency (Elo scores); an 18-second elimination round precedes the 63-second final evaluation.

Results & Limitations
- TTT-MLP significantly outperforms the baselines on 63-second videos (+34 Elo on average), especially on temporal consistency; Gated DeltaNet is stronger on shorter 18-second videos.
- Limitations: video artifacts persist (motion, aesthetics); TTT-MLP is slower than Mamba/DeltaNet (1.4× inference, 2.1× training vs. Gated DeltaNet); performance is potentially limited by the base model.

Output: one-minute coherent videos.
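The evaluation aggregates blind pairwise human preferences into Elo scores. For reference, the standard logistic Elo update for a single comparison looks like this (a generic sketch; the paper's exact aggregation procedure and K-factor are not specified here, K = 32 is an assumption):

```python
def elo_update(r_a, r_b, winner_a, k=32.0):
    """Standard logistic Elo update for one pairwise comparison."""
    e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))  # expected score of A
    s_a = 1.0 if winner_a else 0.0                      # actual score of A
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two methods start equal; A wins the blind comparison
ra, rb = elo_update(1000.0, 1000.0, winner_a=True)
print(ra, rb)  # 1016.0 984.0
```

Under this scheme, a +34 Elo gap corresponds to the stronger method being preferred in roughly 55% of pairwise comparisons.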
Q1
1. What is the key innovation that allows TTT layers to generate more coherent long videos compared to Mamba and DeltaNet?
They use a more efficient self-attention mechanism
Their hidden states are neural networks rather than matrices
They combine multiple 3-second video segments with transitions
Q2
2. Why did the authors choose Tom and Jerry cartoons as their dataset for the proof of concept?
To focus on complex, multi-scene stories with dynamic motion rather than visual realism
Because cartoon generation is easier than photorealistic video generation
To compete directly with OpenAI's Sora model which specializes in cartoons
Q3
3. What was the most significant limitation of the TTT-MLP approach compared to other methods?
It performed worse on shorter videos (18 seconds) than Gated DeltaNet
It required much more training data than other approaches
It was significantly slower in both inference and training compared to Gated DeltaNet

Paper 2

SmolVLM: Redefining small and efficient multimodal models

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05299

1. 📘 Topic and Domain: This paper introduces SmolVLM, a family of compact multimodal models for efficient vision-language understanding that can process both images and videos.
2. 💡 Previous Research and New Ideas: The paper builds on previous large-scale VLMs like Flamingo and Idefics, proposing architectural innovations specifically for small models rather than simply scaling down larger models.
3. ❓ Problem: The paper addresses the high computational requirements of current Vision-Language Models (VLMs) that limit their deployment on mobile and edge devices.
4. 🛠️ Methods: The authors systematically explore architectural configurations (balanced encoder-LM parameters), tokenization strategies (pixel shuffle), positional encoding (learned tokens), and training data composition optimized for small models.
5. 📊 Results and Evaluation: SmolVLM-256M (smallest model) uses less than 1GB GPU memory yet outperforms the 300-times larger Idefics-80B, while SmolVLM-2.2B rivals VLMs that consume twice the GPU memory, with all variants demonstrating strong performance on both image and video tasks.

SmolVLM: Redefining small and efficient multimodal models

SmolVLM Methodology Flowchart

Inputs
- Image / video plus a text prompt.

Vision Processing
1. Image splitting / video frame sampling (Finding 4: prefer splitting).
2. Vision encoder (SigLIP; Finding 1: balance with LM size) → encoded features.

Text Processing
- Text tokenizer → text embeddings.

Feature Transform
3. Pixel shuffle (Finding 3: aggressive shuffle is OK).
4. MLP projection → visual tokens.

Token Combination
- Combine/interleave visual and text tokens (Finding 5: learned positional tokens; Finding 6: media markers) into the input sequence (Finding 2: extended context).

Language Model
- SmolLM2 backbone (135M, 360M, or 1.7B; Finding 1: balance with encoder) → text output.

Key Design Choices & Findings (Architecture)
- F1: A balanced encoder-LM parameter split is crucial for small models.
- F2: Extended context length (8k/16k) significantly improves performance.
- F3: Aggressive pixel shuffle (e.g., r=4) is beneficial for smaller VLMs.
- F4: Image splitting is useful; video frame averaging is harmful for small models.

Key Design Choices & Findings (Instruction Tuning)
- F5: Learned positional tokens outperform string tokens for sub-images.
- F6: System prompts and media intro/outro tokens boost performance; masking user prompts during SFT improves generalization.
- F7: Reusing LLM-SFT text data degrades small-VLM performance.
- F8: A minimal amount of Chain-of-Thought (CoT) data is optimal; excess harms.
- F9: Moderate video sequence length (~3.5 min average) is beneficial.
- Data: two-stage training (vision → video) with specific data mixes (Fig. 8).

Resulting Models & Evaluation
- SmolVLM-256M: 93M encoder + 135M LM (0.8 GB RAM)
- SmolVLM-500M: 93M encoder + 360M LM (1.2 GB RAM)
- SmolVLM-2.2B: 400M encoder + 1.7B LM (4.9 GB RAM)
- Evaluation focus: performance (VLMEvalKit benchmarks) vs. GPU RAM usage (efficiency).
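The pixel shuffle in Finding 3 is a space-to-depth rearrangement: each r×r patch of the vision encoder's feature map is folded into the channel dimension, cutting the visual token count by r². A minimal NumPy sketch (shapes are illustrative; SmolVLM applies this to SigLIP feature maps before the MLP projection):

```python
import numpy as np

def pixel_shuffle_tokens(feats, r=4):
    """Space-to-depth "pixel shuffle": trades spatial resolution for
    channel depth, cutting the visual token count by r**2."""
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/r, W/r, r, r, C)
    return x.reshape(H // r, W // r, C * r * r)

# A 32x32 feature grid with 8 channels -> 1024 tokens before shuffling
grid = np.arange(32 * 32 * 8, dtype=np.float32).reshape(32, 32, 8)
tok = pixel_shuffle_tokens(grid, r=4)
print(grid.shape[0] * grid.shape[1], "->", tok.shape[0] * tok.shape[1])  # 1024 -> 64
```

With r=4, 16× fewer visual tokens reach the language model, which is why aggressive shuffling pays off for small VLMs whose context budget is tight.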
Q1
1. What is the main innovation of SmolVLM compared to previous Vision-Language Models?
Using larger language models with smaller vision encoders
Designing architecture specifically optimized for small-scale efficiency rather than scaling down large models
Focusing exclusively on image processing while ignoring video capabilities
Q2
2. Which tokenization strategy did the authors find most effective for small multimodal models?
Frame averaging for video processing
String-based position tokens for image splitting
Aggressive pixel shuffle with learned positional tokens
Q3
3. What surprising finding did the researchers discover about Chain-of-Thought (CoT) data when training small multimodal models?
CoT data should be completely avoided in small models
A minimal fraction (0.02-0.05%) of CoT data is optimal, while higher proportions degrade performance
CoT data should constitute at least 50% of the training mix for optimal reasoning

Paper 3

URECA: Unique Region Caption Anything

Published: 2025-04-07

Link: http://arxiv.org/pdf/2504.05305

1. 📘 Topic and Domain: The paper introduces URECA, a system for generating unique captions for specific regions within images at multiple levels of granularity in the computer vision and natural language processing domain.
2. 💡 Previous Research and New Ideas: The paper builds upon previous region-level captioning research but proposes a novel dataset with unique region-caption mapping and a new model architecture that preserves spatial properties of multi-granularity regions.
3. ❓ Problem: The paper addresses the challenge of generating distinctive captions for regions at any level of granularity that uniquely describe the target region while differentiating it from surrounding areas.
4. 🛠️ Methods: The authors created a stage-wise data curation pipeline using mask tree structures to generate unique captions, and developed a model with a mask encoder and dynamic mask modeling to effectively condition regions without losing details.
5. 📊 Results and Evaluation: URECA achieved state-of-the-art performance on the authors' test dataset and demonstrated strong generalization on benchmark datasets like Visual Genome and RefCOCOg, outperforming previous methods in generating unique captions for multi-granularity regions.

URECA: Unique Region Caption Anything

URECA Paper Workflow: Method Focus

Part 1: URECA Dataset Creation
- Input: SA-1B dataset (images + multi-granularity masks).
- Stage 1 (Mask Tree Generation): build a hierarchical tree based on mask IoU (subset/superset relationships).
- Stage 2 (Top-Down Short Caption Generation): an MLLM generates short captions from root to leaves, given the parent caption and cropped/blurred images, so each caption incorporates parent context.
- Stage 3 (Bottom-Up Detailed Caption Generation): the MLLM refines captions from leaves to root, given the child captions, the short caption, and a contoured image, incorporating child details while maintaining context.
- Stage 4 (Uniqueness Refinement): identify similar regions via DINOv2 features; the MLLM refines the caption to differentiate the target, ensuring uniqueness among similar regions.
- Output: the URECA dataset (unique, multi-granularity region captions), with test-set verification via GPT-4o.

Part 2: URECA Model Architecture
- Input: image, target region mask, and query.
- Image encoder (e.g., ViT) → image tokens; query text ("Describe this region") → query tokens.
- Mask processing: dynamic masking splits the high-resolution mask into sub-masks; a CNN mask encoder produces mask tokens.
- Combine the tokens (image + mask + query) and feed them into the LLM (frozen + LoRA) to generate the caption.
- Output: a unique, multi-granularity caption.

Part 3: Training & Evaluation
- Train the URECA model on the URECA dataset (LoRA).
- Evaluate on the URECA test set, Visual Genome and RefCOCOg (zero-shot), plus ablations.
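Stage 1's mask tree can be sketched as a containment test: a mask becomes a child of the smallest mask that (approximately) contains it. A toy NumPy version, where the 0.9 containment threshold and the intersection-over-child-area test are illustrative assumptions standing in for the paper's IoU-based subset/superset relations:

```python
import numpy as np

def build_mask_tree(masks, contain_thresh=0.9):
    """Toy mask tree: mask j is a child of mask i when most of j's area
    lies inside i; each mask's parent is its smallest such container."""
    areas = [int(m.sum()) for m in masks]
    parent = [-1] * len(masks)   # -1 marks a root
    for j, mj in enumerate(masks):
        best, best_area = -1, None
        for i, mi in enumerate(masks):
            if i == j or areas[i] <= areas[j]:
                continue
            inter = int(np.logical_and(mi, mj).sum())
            # Approximate subset test: the intersection covers most of j
            if areas[j] > 0 and inter / areas[j] >= contain_thresh:
                if best_area is None or areas[i] < best_area:
                    best, best_area = i, areas[i]
        parent[j] = best
    return parent

whole = np.ones((8, 8), dtype=bool)                          # whole image
obj = np.zeros((8, 8), dtype=bool); obj[2:6, 2:6] = True     # an object
part = np.zeros((8, 8), dtype=bool); part[3:5, 3:5] = True   # a part of it
print(build_mask_tree([whole, obj, part]))  # [-1, 0, 1]
```

The resulting parent pointers give exactly the root-to-leaf paths that Stages 2 and 3 traverse when propagating context down and details up the tree.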
Q1
1. What is the primary innovation in the URECA dataset compared to previous captioning datasets?
It contains more images than any previous dataset
It ensures unique caption-region mapping across multiple granularities
It only focuses on salient objects in images
Q2
2. What technical approach does URECA use to preserve region details that previous methods often lost?
Directly overlaying contours on the original image
Translating region coordinates into natural language
Dynamic mask modeling with a high-resolution mask encoder
Q3
3. How does the URECA data curation pipeline ensure caption uniqueness?
By using human annotators to manually verify each caption
By using a stage-wise process with mask tree structures and visual similarity analysis
By limiting captions to only include object class names