2025-07-16 Papers


Paper 1

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Published: 2025-07-13

Link: http://arxiv.org/pdf/2507.09862

1. 📘 Topic and Domain: The paper introduces SpeakerVid-5M, a large-scale high-quality dataset for audio-visual dyadic interactive human generation in the domain of digital human technology and computer vision.
2. 💡 Previous Research and New Ideas: Building on prior GAN-based and diffusion-based virtual human generation work, this paper introduces the first large-scale dataset designed specifically for interactive virtual humans, moving beyond passive avatar driving toward autonomous engagement.
3. ❓ Problem: The paper addresses the critical lack of large-scale, high-quality open-source datasets for training interactive virtual humans, which has hindered research progress in this emerging field.
4. 🛠️ Methods: The authors curated 5.2M video clips through a comprehensive pipeline including source collection, pre-processing (scene splitting, speaker diarization, human detection, lip sync), rich multi-modal annotation, and rigorous quality filtering.
5. 📊 Results and Evaluation: The dataset comprises 8,743 hours of high-quality video spanning 83,756 unique speaker IDs, with strong visual quality (93% of footage at 1080p or above), accurate audio-visual sync, and diverse body compositions; models trained on it are evaluated with the accompanying VidChatBench benchmark.
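The final quality gates described for the high-quality SFT subset can be sketched as a simple stratification pass. A minimal sketch: the thresholds (hand blur > 0.5, face blur > 0.7, DOVER > 0.6, motion > 2) come from the paper, but the clip records and field names below are illustrative assumptions, not the authors' actual schema.

```python
# Sketch of SpeakerVid-5M's final stratification step. Thresholds are
# from the paper's SFT-subset description; the record layout is a
# hypothetical stand-in for the real annotation format.

def stratify(clips):
    """Split annotated clips into a high-quality SFT subset and a
    larger pretraining pool, mirroring the paper's quality gates."""
    sft, pretrain = [], []
    for c in clips:
        passes = (
            c["hand_blur"] > 0.5      # hands sharp enough
            and c["face_blur"] > 0.7  # face sharp enough
            and c["dover"] > 0.6      # DOVER video-quality score
            and c["motion"] > 2       # enough body motion
        )
        (sft if passes else pretrain).append(c)
    return sft, pretrain

clips = [
    {"id": "a", "hand_blur": 0.8, "face_blur": 0.9, "dover": 0.7, "motion": 3},
    {"id": "b", "hand_blur": 0.4, "face_blur": 0.9, "dover": 0.7, "motion": 3},
]
sft, pretrain = stratify(clips)
```

Clips failing any single gate fall back to the larger pretraining pool rather than being discarded, which matches the paper's two-tier split (1,368 SFT hours vs. 7,375 pretraining hours).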

Figure: SpeakerVid-5M dataset curation pipeline
1. Source data collection: 153K YouTube videos (64K hours, 93% at 1080p or above).
2. Audio-visual processing: scene splitting (SceneDetect), speaker diarization (3D-Speaker), human detection (YOLO), lip sync (SyncNet), and ID correction (ArcFace).
3. Audio-visual annotation: structured textual captions (Qwen2.5-VL), audio annotation (Whisper ASR), skeleton sequences (DWpose), and face/hand blur scores.
4. Data quality filtering: luminance, video quality (DOVER), clear score, blur, audio, and language-detection filters.
The filtered data is stratified into four branches:
- Single branch: 5.2M clips, 8.7K hours, 83K speaker IDs; monadic talking with multiple body compositions.
- Dialogue branch: 770K clip pairs, 1.8K hours, 16K speaker IDs; dyadic conversations as input-response pairs.
- Listening branch: non-speaking listener behaviors (co-present and non-co-present listening), SyncNet filtered.
- Multi-turn branch: sequential clips in temporal order for contextual and sequential multi-turn conversation continuity.
Two training splits are provided: a high-quality SFT subset (571K clips, 1,368 hours; hand blur > 0.5, face blur > 0.7, DOVER > 0.6, motion > 2) and a large-scale pretraining split (the remaining 7,375 hours, with lower quality thresholds). An autoregressive baseline combines Qwen2.5-Omni with next-chunk prediction and a diffusion MLP for joint audio-visual generation, evaluated on VidChatBench.
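The baseline's "next-chunk prediction" can be pictured as an autoregressive loop over audio-visual chunks. A toy sketch: the real model conditions a Qwen2.5-Omni backbone plus a diffusion MLP on the history; `predict_chunk` here is a hypothetical stand-in so the loop is runnable.

```python
# Toy sketch of next-chunk autoregressive generation. predict_chunk is
# a placeholder for the transformer + diffusion-MLP step; it just emits
# an incrementing chunk id.

def predict_chunk(history):
    # Stand-in for the model: next chunk depends on all prior chunks.
    return len(history)

def generate(num_chunks):
    """Generate audio-visual chunks one at a time, each conditioned on
    every previously generated chunk."""
    history = []
    for _ in range(num_chunks):
        history.append(predict_chunk(history))
    return history
```

The point of the chunked formulation is that audio and video for the next short segment are produced jointly, conditioned on the full interaction history rather than frame by frame.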
Q1. What is the main innovation of SpeakerVid-5M compared to previous datasets in the field?
- It has higher video resolution quality
- It is the first large-scale dataset specifically designed for audio-visual dyadic interaction
- It contains more diverse camera angles

Q2. In the data pre-processing pipeline, what tool is used for lip synchronization verification?
- YOLO
- ArcFace
- SyncNet

Q3. What is the total duration of video content in the SpeakerVid-5M dataset?
- 5,218 hours
- 8,743 hours
- 64,386 hours

Paper 2

Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Published: 2025-07-11

Link: http://arxiv.org/pdf/2507.08441

1. 📘 Topic and Domain: Building an efficient image tokenizer using pre-trained vision foundation models for autoregressive image generation.
2. 💡 Previous Research and New Ideas: Builds on VQGAN and vision foundation model research; proposes using frozen pre-trained vision foundation models directly as tokenizer encoders, paired with region-adaptive quantization.
3. ❓ Problem: Current image tokenizers are inefficient, require extensive training, and produce latent spaces with poor semantic quality and high redundancy.
4. 🛠️ Methods: Uses frozen vision foundation model as encoder, introduces region-adaptive quantization framework, and applies semantic reconstruction objective to preserve feature fidelity.
5. 📊 Results and Evaluation: Achieves a state-of-the-art gFID of 2.07 on ImageNet, 3× faster convergence, high-fidelity class-conditional synthesis without classifier-free guidance, and better token efficiency, using only 256 tokens versus the standard 576.
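The quantization step after the region-adaptive tokenizer is standard vector quantization: each continuous token is snapped to its nearest codebook entry. A minimal sketch with toy sizes (the paper's codebook is 16384×12):

```python
# Nearest-neighbor vector quantization, as used on VFMTok's 256
# region-adaptive tokens. Shapes here are toy examples.
import numpy as np

def quantize(tokens, codebook):
    """Return codebook indices and quantized vectors for `tokens`
    (shape [n, d]) against `codebook` (shape [K, d])."""
    # Squared L2 distance from every token to every codebook entry.
    d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
tokens = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, quant = quantize(tokens, codebook)
```

The resulting index sequence is what the LLaMA-style transformer models with next-token prediction; the paper reports 100% codebook utilization under this scheme.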

Figure: VFMTok pipeline and results
- Input: a 336×336 image is encoded by a frozen VFM (DINOv2/CLIP/SigLIP), taking multi-level features from the 6th, 12th, 18th, and 24th layers.
- Region-adaptive tokenization: deformable attention over learnable anchor queries produces 256 tokens.
- Vector quantization: codebook of size 16384×12.
- Decoding: a shared ViT decoder with mask tokens and position embeddings reconstructs both the image (pixel loss + LPIPS) and the VFM features (cosine similarity).
- Generation: a LLaMA-style transformer performs next-token prediction over the discrete tokens.
Key innovations: (1) a frozen VFM encoder, requiring no encoder training while providing rich semantic representations; (2) region-adaptive quantization, which reduces 2D-grid redundancy (only 256 tokens vs. 576 in VQGAN); (3) a dual reconstruction loss (image plus feature reconstruction) that preserves semantic fidelity.
Reported results: gFID 2.07 on ImageNet (state of the art), 3× faster convergence, CFG-free high-quality generation, 4× inference speedup, better semantic preservation (rIS 215.4), 100% codebook utilization, and outperforming LlamaGen-3B with only 1.4B parameters.
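The feature-reconstruction objective from the figure pushes decoder-predicted features toward the frozen VFM's features via cosine similarity. A minimal sketch, assuming the common `1 - cos` form of such a loss (the paper's exact weighting is not reproduced here):

```python
# Cosine-similarity feature-reconstruction loss: penalizes angular
# deviation between predicted and frozen-VFM token features.
import numpy as np

def feature_recon_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over token features [n, d]."""
    num = (pred * target).sum(-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return float((1.0 - num / den).mean())

t = np.array([[1.0, 0.0], [0.0, 2.0]])
```

Because the target features come from the frozen encoder, this term anchors the latent space to the VFM's semantics while the pixel + LPIPS term handles visual fidelity.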
Q1. What is the main innovation in how VFMTok processes image regions compared to traditional tokenizers?
- It uses random sampling of image regions
- It applies region-adaptive quantization based on semantic coherence
- It only processes regions at fixed grid locations

Q2. How many tokens does VFMTok use to represent an image compared to previous methods while achieving better performance?
- 576 tokens, same as previous methods
- 1024 tokens, more than previous methods
- 256 tokens, fewer than previous methods

Q3. What unique capability does VFMTok demonstrate regarding classifier-free guidance (CFG)?
- It requires CFG for all image generation tasks
- It can only work with limited CFG settings
- It enables high-fidelity synthesis without needing CFG

Paper 3

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10065

1. 📘 Topic and Domain: Feed-forward dynamic 3D scene reconstruction and novel view synthesis from monocular videos using motion-aware Gaussian primitives.
2. 💡 Previous Research and New Ideas: Builds on static scene reconstruction and 3D Gaussian Splatting methods; proposes novel "dynamic splatter pixels" that unify appearance, geometry, and motion modeling in a single feed-forward framework.
3. ❓ Problem: Existing methods treat 3D tasks in isolation, focus mainly on static scenes, and require costly per-scene optimization without learning prior knowledge.
4. 🛠️ Methods: Uses a transformer backbone to encode video frames and three specialized heads (depth, splatter, motion) to predict 3D Gaussian primitives and their temporal deformation, trained on diverse datasets.
5. 📊 Results and Evaluation: Achieves competitive performance on novel view synthesis and 3D point tracking benchmarks while being orders of magnitude faster (0.93s vs 10-45min per scene), and enables zero-shot applications like scene flow estimation.
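The core representation predicted by the three heads is a set of Gaussian primitives plus a per-primitive motion function evaluated at the query time. A toy sketch: linear motion is an illustrative assumption here, whereas MoVieS predicts the deformation Δx(t) with its dedicated motion head.

```python
# Sketch of the "dynamic splatter pixel" idea: a static Gaussian
# (position x, attributes a) plus a motion term that offsets it at the
# query time. Linear velocity is a toy stand-in for the learned
# deformation Δx(t).
import numpy as np

def deform(x, velocity, t):
    """Positions of Gaussian primitives at query time t, given base
    positions x [n, 3] and per-primitive velocities [n, 3]."""
    return x + velocity * t

x = np.zeros((2, 3))                              # base positions
v = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # per-primitive motion
xt = deform(x, v, 0.5)                            # positions at t = 0.5
```

Differencing the deformed positions between two query times yields per-primitive displacement, which is why scene flow estimation falls out of this representation zero-shot.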

Figure: MoVieS workflow
- Input: video frames {I_i, P_i, K_i, t_i} are patchified with positional embeddings and encoded by a ViT backbone with camera and time embeddings.
- Three prediction heads: a depth head (geometry), a splatter head (appearance attributes), and a motion head (deformation).
- Dynamic splatter pixels: g = {x, a} with motion m(t) = {Δx(t), Δa(t)}, a unified representation of appearance, geometry, and motion.
- Training datasets: RealEstate10K (static), TartanAir (depth), PointOdyssey (tracking), DynamicReplica, Spring (dynamic), VKITTI2, Stereo4D, MatrixCity.
- Multi-task loss: L = λ_d L_depth + λ_r L_rendering + λ_m L_motion, with point-wise and distribution terms.
- At a query time t_q, 3D Gaussian rendering supports novel view synthesis, 3D point tracking, scene flow estimation, and moving-object segmentation.
- Curriculum training: (1) static scenes at 224×224, (2) dynamic scenes (5→13 views), (3) high resolution (518×518); training took 5 days on 32 H20 GPUs.
- Key innovations: first feed-forward 4D reconstruction, the dynamic splatter pixel representation, joint appearance/geometry/motion modeling, orders-of-magnitude speedup, and zero-shot applications.
- Speed: 0.93 s per scene, vs. Shape-of-Motion (10 min), Splatter-a-Video (37 min), and MoSca (45 min).
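The training objective L = λ_d L_depth + λ_r L_rendering + λ_m L_motion is a plain weighted sum over per-task losses. A minimal sketch; the weights and loss values below are placeholders, not the paper's settings:

```python
# Weighted multi-task loss, matching the figure's
# L = λ_d·L_depth + λ_r·L_rendering + λ_m·L_motion.
# All numbers are illustrative placeholders.

def total_loss(losses, weights):
    """Combine per-task losses with their λ weights."""
    return sum(weights[k] * losses[k] for k in losses)

losses = {"depth": 0.2, "render": 0.5, "motion": 0.1}
weights = {"depth": 1.0, "render": 2.0, "motion": 0.5}
L = total_loss(losses, weights)
```

Balancing the three λ terms is what lets a single backbone supervise geometry (depth), appearance (rendering), and deformation (motion) jointly across the mixed static and dynamic datasets.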
Q1. What is the main innovation in how MoVieS represents dynamic 3D scenes compared to previous methods?
- Using multiple neural networks to process each frame separately
- Using dynamic splatter pixels that combine appearance, geometry and motion
- Using traditional point cloud representations with temporal interpolation

Q2. What is the most significant practical advantage of MoVieS over existing state-of-the-art methods?
- It achieves perfect reconstruction quality
- It requires no training data
- It processes scenes in under 1 second compared to 10-45 minutes

Q3. Which capability was enabled by MoVieS without requiring explicit training for it (zero-shot)?
- Scene flow estimation and moving object segmentation
- Camera pose estimation
- Face recognition in videos