2025-07-16 Papers


Paper 1

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Published: 2025-07-13

Link: http://arxiv.org/pdf/2507.09862

1. 📘 Topic and Domain: The paper introduces SpeakerVid-5M, a large-scale high-quality dataset for audio-visual dyadic interactive human generation in the domain of digital human technology and computer vision.
2. 💡 Previous Research and New Ideas: Building on prior GAN-based and diffusion-based virtual human generation work, this paper introduces the first large-scale dataset designed specifically for interactive virtual humans, moving beyond passive avatar driving toward autonomous engagement.
3. ❓ Problem: The paper addresses the critical lack of large-scale, high-quality open-source datasets for training interactive virtual humans, which has hindered research progress in this emerging field.
4. 🛠️ Methods: The authors curated 5.2M video clips through a comprehensive pipeline including source collection, pre-processing (scene splitting, speaker diarization, human detection, lip sync), rich multi-modal annotation, and rigorous quality filtering.
5. 📊 Results and Evaluation: The dataset comprises 8,743 hours of high-quality video spanning 83,756 unique speaker IDs, with strong visual quality (93% of footage at 1080p or above), accurate audio-visual sync, and diverse body compositions; models trained on it are evaluated with the accompanying VidChatBench benchmark.
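The final quality gates described for the high-quality SFT subset can be sketched as a simple stratification pass. A minimal sketch: the thresholds (hand blur > 0.5, face blur > 0.7, DOVER > 0.6, motion > 2) come from the paper, but the clip records and field names below are illustrative assumptions, not the authors' actual schema.

```python
# Sketch of SpeakerVid-5M's final stratification step. Thresholds are
# from the paper's SFT-subset description; the record layout is a
# hypothetical stand-in for the real annotation format.

def stratify(clips):
    """Split annotated clips into a high-quality SFT subset and a
    larger pretraining pool, mirroring the paper's quality gates."""
    sft, pretrain = [], []
    for c in clips:
        passes = (
            c["hand_blur"] > 0.5      # hands sharp enough
            and c["face_blur"] > 0.7  # face sharp enough
            and c["dover"] > 0.6      # DOVER video-quality score
            and c["motion"] > 2       # enough body motion
        )
        (sft if passes else pretrain).append(c)
    return sft, pretrain

clips = [
    {"id": "a", "hand_blur": 0.8, "face_blur": 0.9, "dover": 0.7, "motion": 3},
    {"id": "b", "hand_blur": 0.4, "face_blur": 0.9, "dover": 0.7, "motion": 3},
]
sft, pretrain = stratify(clips)
```

Clips failing any single gate fall back to the larger pretraining pool rather than being discarded, which matches the paper's two-tier split (1,368 SFT hours vs. 7,375 pretraining hours).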

Figure: SpeakerVid-5M dataset curation pipeline
1. Source data collection: 153K YouTube videos (64K hours, 93% at 1080p or above).
2. Audio-visual processing: scene splitting (SceneDetect), speaker diarization (3D-Speaker), human detection (YOLO), lip sync (SyncNet), and ID correction (ArcFace).
3. Audio-visual annotation: structured textual captions (Qwen2.5-VL), audio annotation (Whisper ASR), skeleton sequences (DWpose), and face/hand blur scores.
4. Data quality filtering: luminance, video quality (DOVER), clear score, blur, audio, and language-detection filters.
The filtered data is stratified into four branches:
- Single branch: 5.2M clips, 8.7K hours, 83K speaker IDs; monadic talking with multiple body compositions.
- Dialogue branch: 770K clip pairs, 1.8K hours, 16K speaker IDs; dyadic conversations as input-response pairs.
- Listening branch: non-speaking listener behaviors (co-present and non-co-present listening), SyncNet filtered.
- Multi-turn branch: sequential clips in temporal order for contextual and sequential multi-turn conversation continuity.
Two training splits are provided: a high-quality SFT subset (571K clips, 1,368 hours; hand blur > 0.5, face blur > 0.7, DOVER > 0.6, motion > 2) and a large-scale pretraining split (the remaining 7,375 hours, with lower quality thresholds). An autoregressive baseline combines Qwen2.5-Omni with next-chunk prediction and a diffusion MLP for joint audio-visual generation, evaluated on VidChatBench.
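The baseline's "next-chunk prediction" can be pictured as an autoregressive loop over audio-visual chunks. A toy sketch: the real model conditions a Qwen2.5-Omni backbone plus a diffusion MLP on the history; `predict_chunk` here is a hypothetical stand-in so the loop is runnable.

```python
# Toy sketch of next-chunk autoregressive generation. predict_chunk is
# a placeholder for the transformer + diffusion-MLP step; it just emits
# an incrementing chunk id.

def predict_chunk(history):
    # Stand-in for the model: next chunk depends on all prior chunks.
    return len(history)

def generate(num_chunks):
    """Generate audio-visual chunks one at a time, each conditioned on
    every previously generated chunk."""
    history = []
    for _ in range(num_chunks):
        history.append(predict_chunk(history))
    return history
```

The point of the chunked formulation is that audio and video for the next short segment are produced jointly, conditioned on the full interaction history rather than frame by frame.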
Q1. What is the main innovation of SpeakerVid-5M compared to previous datasets in the field?
- It has higher video resolution quality
- It is the first large-scale dataset specifically designed for audio-visual dyadic interaction
- It contains more diverse camera angles

Q2. In the data pre-processing pipeline, what tool is used for lip synchronization verification?
- YOLO
- ArcFace
- SyncNet

Q3. What is the total duration of video content in the SpeakerVid-5M dataset?
- 5,218 hours
- 8,743 hours
- 64,386 hours

Paper 2

Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Published: 2025-07-11

Link: http://arxiv.org/pdf/2507.08441

1. 📘 Topic and Domain: Building an efficient image tokenizer using pre-trained vision foundation models for autoregressive image generation.
2. 💡 Previous Research and New Ideas: Builds on VQGAN and vision foundation model research; proposes using frozen pre-trained vision foundation models directly as tokenizer encoders, paired with region-adaptive quantization.
3. ❓ Problem: Current image tokenizers are inefficient, require extensive training, and produce latent spaces with poor semantic quality and high redundancy.
4. 🛠️ Methods: Uses frozen vision foundation model as encoder, introduces region-adaptive quantization framework, and applies semantic reconstruction objective to preserve feature fidelity.
5. 📊 Results and Evaluation: Achieves a state-of-the-art gFID of 2.07 on ImageNet, 3× faster convergence, high-fidelity class-conditional synthesis without classifier-free guidance, and better token efficiency, using only 256 tokens versus the standard 576.
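The quantization step after the region-adaptive tokenizer is standard vector quantization: each continuous token is snapped to its nearest codebook entry. A minimal sketch with toy sizes (the paper's codebook is 16384×12):

```python
# Nearest-neighbor vector quantization, as used on VFMTok's 256
# region-adaptive tokens. Shapes here are toy examples.
import numpy as np

def quantize(tokens, codebook):
    """Return codebook indices and quantized vectors for `tokens`
    (shape [n, d]) against `codebook` (shape [K, d])."""
    # Squared L2 distance from every token to every codebook entry.
    d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
tokens = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, quant = quantize(tokens, codebook)
```

The resulting index sequence is what the LLaMA-style transformer models with next-token prediction; the paper reports 100% codebook utilization under this scheme.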

Figure: VFMTok pipeline and results
- Input: a 336×336 image is encoded by a frozen VFM (DINOv2/CLIP/SigLIP), taking multi-level features from the 6th, 12th, 18th, and 24th layers.
- Region-adaptive tokenization: deformable attention over learnable anchor queries produces 256 tokens.
- Vector quantization: codebook of size 16384×12.
- Decoding: a shared ViT decoder with mask tokens and position embeddings reconstructs both the image (pixel loss + LPIPS) and the VFM features (cosine similarity).
- Generation: a LLaMA-style transformer performs next-token prediction over the discrete tokens.
Key innovations: (1) a frozen VFM encoder, requiring no encoder training while providing rich semantic representations; (2) region-adaptive quantization, which reduces 2D-grid redundancy (only 256 tokens vs. 576 in VQGAN); (3) a dual reconstruction loss (image plus feature reconstruction) that preserves semantic fidelity.
Reported results: gFID 2.07 on ImageNet (state of the art), 3× faster convergence, CFG-free high-quality generation, 4× inference speedup, better semantic preservation (rIS 215.4), 100% codebook utilization, and outperforming LlamaGen-3B with only 1.4B parameters.
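The feature-reconstruction objective from the figure pushes decoder-predicted features toward the frozen VFM's features via cosine similarity. A minimal sketch, assuming the common `1 - cos` form of such a loss (the paper's exact weighting is not reproduced here):

```python
# Cosine-similarity feature-reconstruction loss: penalizes angular
# deviation between predicted and frozen-VFM token features.
import numpy as np

def feature_recon_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over token features [n, d]."""
    num = (pred * target).sum(-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return float((1.0 - num / den).mean())

t = np.array([[1.0, 0.0], [0.0, 2.0]])
```

Because the target features come from the frozen encoder, this term anchors the latent space to the VFM's semantics while the pixel + LPIPS term handles visual fidelity.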
Q1. What is the main innovation in how VFMTok processes image regions compared to traditional tokenizers?
- It uses random sampling of image regions
- It applies region-adaptive quantization based on semantic coherence
- It only processes regions at fixed grid locations

Q2. How many tokens does VFMTok use to represent an image compared to previous methods while achieving better performance?
- 576 tokens, same as previous methods
- 1024 tokens, more than previous methods
- 256 tokens, fewer than previous methods

Q3. What unique capability does VFMTok demonstrate regarding classifier-free guidance (CFG)?
- It requires CFG for all image generation tasks
- It can only work with limited CFG settings
- It enables high-fidelity synthesis without needing CFG

Paper 3

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Published: 2025-07-14

Link: http://arxiv.org/pdf/2507.10065

1. 📘 Topic and Domain: Feed-forward dynamic 3D scene reconstruction and novel view synthesis from monocular videos using motion-aware Gaussian primitives.
2. 💡 Previous Research and New Ideas: Builds on static scene reconstruction and 3D Gaussian Splatting methods; proposes novel "dynamic splatter pixels" that unify appearance, geometry, and motion modeling in a single feed-forward framework.
3. ❓ Problem: Existing methods treat 3D tasks in isolation, focus mainly on static scenes, and require costly per-scene optimization without learning prior knowledge.
4. 🛠️ Methods: Uses a transformer backbone to encode video frames and three specialized heads (depth, splatter, motion) to predict 3D Gaussian primitives and their temporal deformation, trained on diverse datasets.
5. 📊 Results and Evaluation: Achieves competitive performance on novel view synthesis and 3D point tracking benchmarks while being orders of magnitude faster (0.93s vs 10-45min per scene), and enables zero-shot applications like scene flow estimation.
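The core representation predicted by the three heads is a set of Gaussian primitives plus a per-primitive motion function evaluated at the query time. A toy sketch: linear motion is an illustrative assumption here, whereas MoVieS predicts the deformation Δx(t) with its dedicated motion head.

```python
# Sketch of the "dynamic splatter pixel" idea: a static Gaussian
# (position x, attributes a) plus a motion term that offsets it at the
# query time. Linear velocity is a toy stand-in for the learned
# deformation Δx(t).
import numpy as np

def deform(x, velocity, t):
    """Positions of Gaussian primitives at query time t, given base
    positions x [n, 3] and per-primitive velocities [n, 3]."""
    return x + velocity * t

x = np.zeros((2, 3))                              # base positions
v = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # per-primitive motion
xt = deform(x, v, 0.5)                            # positions at t = 0.5
```

Differencing the deformed positions between two query times yields per-primitive displacement, which is why scene flow estimation falls out of this representation zero-shot.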

Figure: MoVieS workflow
- Input: video frames {I_i, P_i, K_i, t_i} are patchified with positional embeddings and encoded by a ViT backbone with camera and time embeddings.
- Three prediction heads: a depth head (geometry), a splatter head (appearance attributes), and a motion head (deformation).
- Dynamic splatter pixels: g = {x, a} with motion m(t) = {Δx(t), Δa(t)}, a unified representation of appearance, geometry, and motion.
- Training datasets: RealEstate10K (static), TartanAir (depth), PointOdyssey (tracking), DynamicReplica, Spring (dynamic), VKITTI2, Stereo4D, MatrixCity.
- Multi-task loss: L = λ_d L_depth + λ_r L_rendering + λ_m L_motion, with point-wise and distribution terms.
- At a query time t_q, 3D Gaussian rendering supports novel view synthesis, 3D point tracking, scene flow estimation, and moving-object segmentation.
- Curriculum training: (1) static scenes at 224×224, (2) dynamic scenes (5→13 views), (3) high resolution (518×518); training took 5 days on 32 H20 GPUs.
- Key innovations: first feed-forward 4D reconstruction, the dynamic splatter pixel representation, joint appearance/geometry/motion modeling, orders-of-magnitude speedup, and zero-shot applications.
- Speed: 0.93 s per scene, vs. Shape-of-Motion (10 min), Splatter-a-Video (37 min), and MoSca (45 min).
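The training objective L = λ_d L_depth + λ_r L_rendering + λ_m L_motion is a plain weighted sum over per-task losses. A minimal sketch; the weights and loss values below are placeholders, not the paper's settings:

```python
# Weighted multi-task loss, matching the figure's
# L = λ_d·L_depth + λ_r·L_rendering + λ_m·L_motion.
# All numbers are illustrative placeholders.

def total_loss(losses, weights):
    """Combine per-task losses with their λ weights."""
    return sum(weights[k] * losses[k] for k in losses)

losses = {"depth": 0.2, "render": 0.5, "motion": 0.1}
weights = {"depth": 1.0, "render": 2.0, "motion": 0.5}
L = total_loss(losses, weights)
```

Balancing the three λ terms is what lets a single backbone supervise geometry (depth), appearance (rendering), and deformation (motion) jointly across the mixed static and dynamic datasets.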
Q1. What is the main innovation in how MoVieS represents dynamic 3D scenes compared to previous methods?
- Using multiple neural networks to process each frame separately
- Using dynamic splatter pixels that combine appearance, geometry and motion
- Using traditional point cloud representations with temporal interpolation

Q2. What is the most significant practical advantage of MoVieS over existing state-of-the-art methods?
- It achieves perfect reconstruction quality
- It requires no training data
- It processes scenes in under 1 second compared to 10-45 minutes

Q3. Which capability was enabled by MoVieS without requiring explicit training for it (zero-shot)?
- Scene flow estimation and moving object segmentation
- Camera pose estimation
- Face recognition in videos