2025-07-03 Papers

Paper 1

LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

Published: 2025-07-02

Link: http://arxiv.org/pdf/2507.01945

1. 📘 Topic and Domain: Long animation colorization using diffusion models in computer vision and animation generation.
2. 💡 Previous Research and New Ideas: Based on existing short-term animation colorization methods that use local paradigms for feature fusion, proposes a novel dynamic global-local paradigm to maintain long-term color consistency.
3. ❓ Problem: Maintaining color consistency in long animation sequences (300-1000 frames), which current methods fail to achieve because they focus on local features and short-term generation.
4. 🛠️ Methods: Introduces LongAnimation framework with three key components: SketchDiT for reference feature extraction, Dynamic Global-Local Memory for historical feature compression and fusion, and Color Consistency Reward for refining color consistency.
5. 📊 Results and Evaluation: Achieves significant improvements over previous methods, with a 35.1% improvement in short-term (14 frames) and a 49.1% improvement in long-term (500 frames) animation colorization, as measured by FVD.

[Workflow figure] Input data (reference image I, sketches {Sf}, text description) is encoded by SketchDiT via a 3D VAE encoder Ev(·) and a text encoder Et(·), then concatenated into hybrid features S(ct, ci, ck); the first segment goes directly to CogVideoX. The Dynamic Global-Local Memory (DGLM) maintains a global memory (historical features dynamically compressed by the Video-XL LVU model into a KV cache {kg, vg}, for long-term consistency) and a local memory (a KV cache {kl, vl} from recent segments, for smooth transitions and short-term features). Cross-attention with Q = Wq · S(ct, ci, ck) and K, V = [kg, kl], [vg, vl] adaptively fuses both memories into the CogVideoX DiT (with skip-layer control) for video generation. The Color Consistency Reward (CCR) aligns KV caches via a non-gradient reward during training (SketchDiT → DGLM → CCR); at inference, Color Consistency Fusion (CCF) blends latents in the late denoising stage, producing color-consistent animations of 500+ frames.
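The adaptive fusion step above can be sketched as cross-attention whose keys and values concatenate the global and local memory caches. A minimal NumPy sketch, assuming a single query projection matrix w_q and simple 2-D shapes (the paper's actual DiT integration is more involved):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_cross_attention(s, w_q, kg, vg, kl, vl):
    """Fuse global (long-term) and local (recent-segment) memory:
    Q comes from the hybrid features S; K and V concatenate the
    global cache {kg, vg} and the local cache {kl, vl}."""
    q = s @ w_q                              # (n, d): Q = Wq * S(ct, ci, ck)
    k = np.concatenate([kg, kl], axis=0)     # (m_g + m_l, d)
    v = np.concatenate([vg, vl], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores, axis=-1) @ v      # (n, d) fused features
```

Because the global cache is dynamically compressed, m_g stays bounded even as the animation grows, which is what makes 500+ frame generation tractable.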
Q1
1. What is the main innovation in LongAnimation's approach compared to previous methods?
Using multiple reference images instead of one
Dynamic extraction of global-local color features
Frame-by-frame colorization with AI
Q2
2. Why does LongAnimation perform color consistency fusion only in the late denoising stage?
To save computational resources
To maintain better brightness consistency
To speed up the generation process
Q3
3. What is the typical length of animation sequences that LongAnimation can handle compared to previous methods?
About 100 frames vs 14 frames
About 500 frames vs 100 frames
About 1000 frames vs 500 frames

Paper 2

Depth Anything at Any Condition

Published: 2025-07-02

Link: http://arxiv.org/pdf/2507.01634

1. 📘 Topic and Domain: A foundation monocular depth estimation model called DepthAnything-AC for handling diverse environmental conditions in computer vision and depth estimation.
2. 💡 Previous Research and New Ideas: Based on previous foundation MDE models like the Depth Anything series, which work well in general scenes but struggle with complex conditions; proposes new unsupervised consistency regularization and a spatial distance constraint.
3. ❓ Problem: Existing foundation MDE models perform poorly in complex real-world environments involving challenging lighting, weather conditions, and sensor distortions, while also struggling with boundary delineation and detail preservation.
4. 🛠️ Methods: Uses perturbation-based consistency framework to generate consistent predictions under different corruptions, and spatial distance constraint to enforce geometric relationships between patches; fine-tuned on 540K unlabeled images with various augmentations.
5. 📊 Results and Evaluation: Outperformed state-of-the-art approaches across multiple benchmarks including DA-2K, real-world adverse weather datasets, and synthetic corruption benchmarks, while maintaining performance on general scenes; showed particular improvements in boundary definition and detail preservation.

[Methodology figure] 540K unlabeled images are augmented with perturbations (dark, weather, blur, contrast). A frozen teacher (DepthAnything V2) processes the normal image x^w, while a trainable student (ViT-S + DPT decoder) also sees the perturbed image x^s. A spatial distance constraint SD = √(S²_p + S²_d) captures geometric relations between patches. Training minimizes L = λ₁L_c + λ₂L_kd + λ₃L_s, combining a consistency loss L_c, a knowledge distillation loss L_kd, and a spatial distance loss L_s. Key features: perturbation-based consistency, an unsupervised framework, spatial geometric relationships, semantic boundary enhancement, robustness to complex weather, fine-grained detail recovery, and zero-shot capability. Evaluation benchmarks: DA-2K (multi-condition), NuScenes and RobotCar (real complex), KITTI-C (synthetic corruption), and KITTI, NYU-D, and Sintel (general). Result: zero-shot state-of-the-art on complex conditions while maintaining general-scene performance, trained on only 540K unlabeled images (~1% of the DepthAnything training data).
Q1
1. What is the main innovation in how DepthAnything-AC handles training data compared to previous approaches?
It uses a massive labeled dataset of adverse weather conditions
It uses unsupervised learning with perturbation-based consistency on unlabeled data
It combines multiple existing datasets with manual annotations
Q2
2. Which of these is NOT one of the four typical scenarios considered for image perturbation in the paper?
Lighting conditions
Color temperature
Blurriness
Q3
3. What is the key advantage of using the Spatial Distance Constraint in the model?
It reduces the computational complexity of the model
It helps recover object boundaries and details from corrupted images
It improves the model's speed during inference

Paper 3

FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

Published: 2025-07-02

Link: http://arxiv.org/pdf/2507.01953

1. 📘 Topic and Domain: Image morphing using diffusion models in computer vision, specifically focusing on generating smooth transitions between two input images.
2. 💡 Previous Research and New Ideas: Based on previous work in image warping, GANs, and diffusion models; proposes a novel tuning-free approach that, unlike existing methods, requires no per-instance training.
3. ❓ Problem: Addressing the challenge of creating high-quality image morphing transitions between images with different semantics or layouts without requiring extensive fine-tuning or training.
4. 🛠️ Methods: Introduces FreeMorph with two key innovations: guidance-aware spherical interpolation for maintaining identity and directional transitions, and step-oriented variation trend for controlled transitions between inputs.
5. 📊 Results and Evaluation: Outperforms existing methods while being 10-50x faster (under 30 seconds per morphing sequence), achieves superior FID, PPL, and LPIPS scores, and receives 60.13% preference in user studies.

[Workflow figure] Input images I_left and I_right are captioned via LLaVA and VAE-encoded into latents z_0-left and z_0-right, which are combined by spherical interpolation. In the forward diffusion process, the first λ₁T steps use the original attention and the next λ₂T steps use prior-driven self-attention, i.e. modified self-attention modules ATT(Q, K, V) with blended features from the input images; the step-oriented variation trend injects high-frequency noise via FFT/IFFT. In the reverse denoising process, λ₃T steps apply the step-oriented variation and λ₄T steps apply spherical feature aggregation, followed by the original attention with text-conditioned features. Built on the DDIM framework (T = 50 steps, CFG = 7.5), FreeMorph generates a sequence of J = 5 intermediate images in under 30 seconds (10x-50x faster than existing methods), handles different semantics/layouts, and requires no fine-tuning.
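The spherical interpolation of the two VAE latents can be sketched as standard slerp. A minimal NumPy sketch of the base interpolation only (the paper's guidance-aware variant adds feature aggregation on top of this):

```python
import numpy as np

def slerp(z0, z1, t, eps=1e-7):
    """Spherical linear interpolation between two latent codes z0, z1
    at fraction t in [0, 1]."""
    z0f, z1f = z0.ravel(), z1.ravel()
    cos = np.dot(z0f, z1f) / (np.linalg.norm(z0f) * np.linalg.norm(z1f) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    if theta < eps:                      # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * z0 + (np.sin(t * theta) / s) * z1
```

Sampling t = j/6 for j = 1..5 yields the J = 5 intermediate latents noted above; unlike linear interpolation, slerp keeps the interpolated latents at a plausible norm for the diffusion prior.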
Q1
1. What is the main advantage of FreeMorph over previous image morphing methods?
It produces higher quality images
It is tuning-free and requires no per-instance training
It can only work with similar images
Q2
2. Which of the following is NOT one of the two key innovations introduced in FreeMorph?
Guidance-aware spherical interpolation
Step-oriented variation trend
Neural cross-attention mapping
Q3
3. How much faster is FreeMorph compared to existing methods according to the paper?
2-5x faster
10-50x faster
100-200x faster