2025-04-14 Papers

Paper 1

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Published: 2025-04-11

Link: http://arxiv.org/pdf/2504.08685

1. 📘 Topic and Domain: The paper presents Seaweed-7B, a cost-effective video generation foundation model with 7 billion parameters, focusing on efficient training strategies in the domain of AI-generated video.
2. 💡 Previous Research and New Ideas: The paper builds on prior video generation models like Sora and MovieGen, proposing that medium-sized models can match or exceed larger models through optimized architecture, training strategies, and data curation.
3. ❓ Problem: The paper addresses the excessive computational costs of training and deploying video generation models, which typically require thousands of GPUs and substantial resources.
4. 🛠️ Methods: The authors trained a 7B-parameter diffusion transformer with a hybrid-stream architecture, using multi-stage training on mixed-resolution data, specialized variational autoencoder designs, and model optimization techniques to maximize efficiency.
5. 📊 Results and Evaluation: Seaweed-7B achieved performance comparable to or better than larger models trained with substantially more resources, ranking second in image-to-video generation in Elo ratings while requiring only 665,000 H100 GPU hours (27.7 days on 1,000 GPUs).
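The training-cost figure in point 5 can be sanity-checked with a quick back-of-the-envelope calculation:

```python
# Sanity check: 665,000 H100 GPU hours spread across 1,000 GPUs.
total_gpu_hours = 665_000
num_gpus = 1_000

wall_clock_hours = total_gpu_hours / num_gpus  # 665 hours of wall-clock time
wall_clock_days = wall_clock_hours / 24        # about 27.7 days

print(f"{wall_clock_days:.1f} days on {num_gpus} GPUs")  # → 27.7 days on 1000 GPUs
```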

Seaweed-7B Methodological Flowchart

1. Data Curation & Processing: raw video sources go through splitting, cropping, and quality filtering; balancing, deduplication, and synthetic data augmentation; and video captioning (CLIP + LLM, distillation) with system prompts. The curated data is produced by a high-throughput pipeline built on BMF and Ray.
2. VAE Training: a causal 3D convolutional architecture trained on mixed-resolution data (images -> videos) at high compression ratios (e.g., 64x); training is stabilized with an adversarial loss and SpectralNorm.
3. Diffusion Transformer (DiT) Training: a hybrid-stream DiT (full attention, MM-RoPE, AdaSingle) operating in the VAE latent space. Pre-training is multi-stage (low -> high resolution) and multi-task (image-only -> joint T2V/I2V); post-training applies SFT (aesthetics) plus DPO/RLHF (motion/structure).
4. Optimization & Infrastructure: training uses 3D parallelism (FSDP, Ulysses), runtime balancing, MLAC, and fused kernels (target: 38% MFU); inference uses distillation (TSCD, CFG), VAE optimization, and a prompt rephraser.
5. Output: the Seaweed-7B foundation model (7B parameters, cost-effective training) and applications enabled by lightweight finetuning or zero-shot use: image/text-to-video, human video (OmniHuman-1), subject-consistent generation (Phantom), video-audio generation (CAVP), long video/story (LCT), real-time generation (Seaweed-APT), super-resolution (SeedVR), camera control (CameraCtrl II), and video editing/transitions.
Q1. What is the primary innovation of Seaweed-7B compared to other video generation models?
- Using a new type of neural architecture never seen before in video generation
- Achieving competitive performance with a medium-sized model using significantly fewer computational resources
- Being the first model to generate videos directly from audio input

Q2. How many H100 GPU hours were required to train the Seaweed-7B model?
- 665,000 hours (equivalent to 27.7 days on 1,000 GPUs)
- 1.2 million hours (equivalent to 50 days on 1,000 GPUs)
- 6.5 million hours (equivalent to 270 days on 1,000 GPUs)

Q3. Which architectural design choice did the authors find most beneficial for efficient video generation?
- Using window attention instead of full attention for all transformer layers
- Compressing sequences within the VAE instead of using DiT patchification
- Training exclusively on low-resolution videos rather than mixed-resolution data

Paper 2

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

Published: 2025-04-10

Link: http://arxiv.org/pdf/2504.07964

1. 📘 Topic and Domain: The paper introduces C3PO (Critical-Layer, Core-Expert, Collaborative Pathway Optimization), a test-time optimization method for Mixture-of-Experts (MoE) Large Language Models to improve expert pathway selection.
2. 💡 Previous Research and New Ideas: The paper builds on MoE architectures and test-time adaptation techniques, proposing novel collaborative pathway optimization that leverages successful reference samples to re-mix expert weights during inference.
3. ❓ Problem: The paper addresses sub-optimal expert pathways in MoE LLMs, where the naive routing learned during pretraining leaves a 10-20% accuracy gap that better expert selection could recover.
4. 🛠️ Methods: The authors optimize expert routing weights at test time using three surrogate objectives: mode-finding, kernel regression, and neighborhood gradient descent, focusing only on critical layers and core experts to balance performance and efficiency.
5. 📊 Results and Evaluation: C3PO consistently improves MoE base models by 7-15% in accuracy across six benchmarks, outperforming test-time learning baselines like in-context learning and prompt tuning, and enabling MoE LLMs with 1-3B active parameters to outperform dense LLMs of 7-9B parameters.

C3PO: Test-Time Expert Re-Mixing Workflow

Input: a test sample x; the goal is to improve the model's prediction for x. The pretrained MoE LLM generates an initial pathway ω_initial, which is suboptimal (a 10-20% accuracy gap). A reference set {(xi, yi, ωi)} records samples on which the model succeeded (xi: sample, yi: label, ωi: pathway).

1. Find successful neighbors N(x): compute embeddings E(x) and E(xi), then select neighbors via kNN or an ε-ball in embedding space.
2. Collaborative Pathway Optimization (CPO), choosing one of three methods:
   A) Neighborhood gradient descent (NGD): define a surrogate loss L(ω) as the weighted average loss of the neighbors xi under the current pathway ω, then update iteratively, ω ← ω - λ∇ω L(ω) (requires backpropagation).
   B) Kernel regression: estimate a target pathway ω̂ as the weighted average of the neighbor pathways ωi with weights K(xi, x), then interpolate ω ← α*·ω_initial + (1 - α*)·ω̂ (gradient-free).
   C) Mode finding (mean-shift): locate a dense region in pathway space by computing a local average ω̄ weighted by pathway similarity K(ωi, ω), then interpolate ω ← α·ω_initial + (1 - α)·ω̄ (gradient-free).
3. Efficiency enhancement: apply the selected optimization only to critical layers (e.g., the last 5), core experts (e.g., the top-20), and the last token's pathway weights.
4. Output the optimized pathway ω_optimized: refined expert weights for the test sample x.
5. Final inference: f(x, ω_optimized).
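The gradient-free kernel-regression variant (method B in the workflow above) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: it assumes a Gaussian kernel over embedding distances and a fixed interpolation coefficient alpha, and the function and parameter names are hypothetical.

```python
import numpy as np

def kernel_regression_remix(x_emb, neighbor_embs, neighbor_pathways,
                            omega_init, alpha=0.5, bandwidth=1.0):
    """Re-mix expert routing weights for a test sample via kernel regression.

    x_emb:             (d,)   embedding of the test sample x
    neighbor_embs:     (k, d) embeddings of successful reference neighbors
    neighbor_pathways: (k, p) their recorded pathway (routing) weights
    omega_init:        (p,)   the model's initial pathway for x
    """
    # Gaussian kernel weights K(xi, x) from embedding distances
    sq_dists = np.sum((neighbor_embs - x_emb) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    w = k / k.sum()

    # Target pathway: kernel-weighted average of the neighbor pathways
    omega_hat = w @ neighbor_pathways

    # Gradient-free interpolation toward the estimated target
    return alpha * omega_init + (1.0 - alpha) * omega_hat
```

In C3PO this kind of update would be applied only to the critical layers and core experts identified in step 3, not to every routing weight in the model.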
Q1. What is the main innovation of C3PO compared to traditional test-time adaptation methods for LLMs?
- It fine-tunes all parameters in the MoE model during inference
- It optimizes expert routing weights based on similar successful samples
- It adds new experts to the model dynamically during test time

Q2. According to the paper's findings, which layer optimization strategy yielded the best performance in C3PO?
- Optimizing all 16 layers of the MoE model
- Optimizing only the first 5 layers (early layers)
- Optimizing only the last 5 layers (deep layers)

Q3. What surprising efficiency finding did the authors discover about expert selection in MoE models?
- Optimizing all 64 experts per layer is necessary for maximum performance
- Optimizing only the top-20 experts achieves the same performance as optimizing all 64 experts
- Random expert selection performs just as well as router-based selection

Paper 3

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Published: 2025-04-11

Link: http://arxiv.org/pdf/2504.08736

1. 📘 Topic and Domain: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation.
2. 💡 Previous Research and New Ideas: Based on vector-quantized tokenizer research; proposes semantic regularization to overcome the reconstruction vs. generation dilemma when scaling tokenizers.
3. ❓ Problem: Solving the dilemma where naively scaling visual tokenizers improves reconstruction quality but degrades downstream generation performance.
4. 🛠️ Methods: Introduces GigaTok with semantic regularization that aligns tokenizer features with pretrained visual representations; uses 1D tokenizers with a hybrid CNN-Transformer architecture, prioritizes decoder scaling, and employs an entropy loss.
5. 📊 Results and Evaluation: GigaTok achieves state-of-the-art performance in reconstruction, downstream autoregressive generation, and representation quality on ImageNet, with the 2.9B tokenizer enabling a 1.4B AR model to outperform previous approaches.
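Point 4 mentions an entropy loss used to keep the codebook well utilized at billion-parameter scale. A minimal sketch, assuming the loss maximizes the entropy of the batch-averaged codebook assignment distribution (the paper's exact formulation may differ, and the function name is hypothetical):

```python
import numpy as np

def codebook_entropy_loss(assign_probs, eps=1e-8):
    """Negative entropy of average codebook usage.

    assign_probs: (n, K) soft assignments of n tokens over K codebook entries.
    Minimizing the returned value pushes average usage toward uniform,
    i.e., encourages higher codebook utilization.
    """
    avg_usage = assign_probs.mean(axis=0)                    # (K,)
    entropy = -np.sum(avg_usage * np.log(avg_usage + eps))   # H(avg usage)
    return -entropy
```

With uniform assignments the loss reaches its minimum of -log(K); with all tokens collapsed onto one code it rises to about 0.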

GigaTok Methodology Flowchart

Problem: the reconstruction vs. generation dilemma: scaling tokenizers improves reconstruction but hurts generation. Investigation tool: AR probing (evaluation with a lightweight AR model). Finding: increased latent-space complexity hinders AR learning. Core solution: semantic regularization, which aligns tokenizer features with DINOv2.

1. Design & scaling: a hybrid CNN-Transformer VQ tokenizer supporting 1D (Q-Former) and 2D (ViT) backbones. Key scaling practices: (1) prefer 1D tokenizers (better scalability); (2) asymmetric scaling that prioritizes decoder size; (3) an entropy loss that stabilizes billion-scale training. Semantic regularization is applied during training to mitigate latent complexity and enable scaling.
2. Training: Stage 1 trains the GigaTok tokenizer (VQGAN loss + semantic regularization + entropy loss); Stage 2 trains the downstream AR model.
3. Evaluation: tokenizer reconstruction (rFID, LPIPS); AR probing as a proxy (gFID, validation loss, linear accuracy); and large AR models (system-level gFID, linear accuracy).
4. Outcome: GigaTok (up to 3B parameters) resolves the reconstruction vs. generation dilemma, achieving SOTA reconstruction, AR generation, and representation quality.
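The semantic regularization at the core of GigaTok aligns tokenizer features with features from a frozen pretrained encoder (DINOv2). A minimal sketch of such an alignment term, assuming a cosine-similarity formulation (the paper's exact loss form and feature projection may differ, and the function name is hypothetical):

```python
import numpy as np

def semantic_alignment_loss(tokenizer_feats, dino_feats, eps=1e-8):
    """1 - mean cosine similarity between tokenizer and frozen target features.

    tokenizer_feats: (n, d) features from the tokenizer (after any projection)
    dino_feats:      (n, d) target features from the frozen DINOv2 encoder
    """
    t = tokenizer_feats / (np.linalg.norm(tokenizer_feats, axis=-1, keepdims=True) + eps)
    s = dino_feats / (np.linalg.norm(dino_feats, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(t * s, axis=-1)))
```

During Stage 1 a term like this would be added to the VQGAN and entropy losses, pulling the tokenizer's latent space toward the semantics of the pretrained representation instead of letting its complexity grow unchecked.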
Q1. What is the key innovation in GigaTok that helps solve the reconstruction vs. generation dilemma?
- Using larger codebook sizes for vector quantization
- Semantic regularization that aligns tokenizer features with pretrained visual representations
- Implementing a pure Transformer architecture without CNN components

Q2. According to the paper, when scaling tokenizers, which architectural design choice proved most effective?
- Using 1D tokenizers with symmetric encoder-decoder scaling
- Using 2D tokenizers with larger encoders than decoders
- Using 1D tokenizers with asymmetric scaling that prioritizes decoder size

Q3. What critical component did the authors find necessary to enable convergence when training billion-scale tokenizers?
- Layer normalization in the CNN modules
- Entropy loss to encourage higher codebook utilization
- Dropout in the Transformer layers