1. 📘 Topic and Domain: Human image animation using diffusion transformers for generating realistic videos from single images, within the computer vision and deep learning domain.
2. 💡 Previous Research and New Ideas: Builds on prior GAN-based and diffusion-based animation methods; proposes a new hybrid motion guidance that combines implicit facial representations, 3D head spheres, and body skeletons, together with complementary appearance guidance.
3. ❓ Problem: Addresses limitations of existing methods in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence for human image animation.
4. 🛠️ Methods: Uses a DiT-based framework with hybrid motion guidance, progressive training strategy, and complementary appearance guidance through multi-reference protocols and bone length adjustment.
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods across standard metrics (FID, SSIM, PSNR, LPIPS, FVD), demonstrating finer-grained motion control, stronger identity preservation, better temporal consistency, and high fidelity in both portrait and full-body animations.
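For context on the metrics above: PSNR, one of the simpler ones, is computed directly from the pixel-wise MSE between a generated frame and its reference. A minimal NumPy sketch (illustrative only, not the paper's evaluation code):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images (higher is better)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A frame off by exactly 1 gray level everywhere has MSE = 1,
# so PSNR = 20 * log10(255) ≈ 48.13 dB.
ref = np.full((64, 64, 3), 128, dtype=np.uint8)
gen = ref + 1
print(round(psnr(ref, gen), 2))  # → 48.13
```

The other metrics are model-based: FID and FVD compare feature distributions from a pretrained network (image-level and video-level, respectively), and LPIPS measures perceptual distance in deep feature space, so they are not reducible to a closed-form pixel formula like this.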