2026-01-07 Papers


Paper 1

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03252

1. 📘 Topic and Domain: The paper focuses on monocular depth estimation using neural implicit fields in computer vision, specifically for producing arbitrary-resolution and fine-grained depth maps from single RGB images.
2. 💡 Previous Research and New Ideas: Previous work relied on discrete grid-based depth representations, which limit resolution and detail; this paper instead represents depth as a continuous neural implicit field that can be queried at any 2D coordinate.
3. ❓ Problem: The paper addresses the limitations of traditional grid-based depth estimation methods which are constrained to fixed resolutions and struggle to capture fine geometric details.
4. 🛠️ Methods: The method pairs a Vision Transformer encoder with a multi-scale local implicit decoder: features queried from multiple encoder layers are fused and passed to an MLP that predicts depth at continuous coordinates, and a depth query strategy enables uniform 3D point sampling.
5. 📊 Results and Evaluation: The approach achieves state-of-the-art performance on both synthetic (Synth4K) and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions and novel view synthesis under large viewpoint shifts.
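The continuous-coordinate query in point 4 can be sketched in a few lines: depth at any real-valued (x, y) comes from bilinearly interpolating encoder features at each scale and decoding them with an MLP. This is a minimal sketch under assumed toy shapes and a two-layer MLP; the paper's actual decoder is a hierarchical gated fusion, not this simple concatenation.

```python
import numpy as np

def bilinear_query(fmap, x, y):
    """Bilinearly interpolate a feature map of shape (H, W, C) at a
    continuous coordinate (x, y) given in pixel units."""
    h, w, _ = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bot = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    return (1 - dy) * top + dy * bot

def depth_at(fmaps, weights, x, y):
    """Toy implicit depth function d(x, y) for normalized (x, y) in [0, 1]:
    concatenate multi-scale features queried at (x, y), then run them
    through a 2-layer MLP (weights[0]: in->hidden, weights[1]: hidden->1)."""
    feats = []
    for fmap in fmaps:
        h, w, _ = fmap.shape
        # map the normalized query onto this scale's pixel grid
        feats.append(bilinear_query(fmap, x * (w - 1), y * (h - 1)))
    h1 = np.maximum(np.concatenate(feats) @ weights[0], 0.0)  # ReLU
    return float(h1 @ weights[1])
```

Because (x, y) is continuous, the same trained weights can be queried on a 1K grid or a 4K grid without retraining, which is the arbitrary-resolution property the paper emphasizes.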

Figure summary (method diagram): An input RGB image I ∈ R^(H×W×3) is encoded by a DINOv3 ViT-Large backbone, and a reassemble block builds a multi-scale feature pyramid {f^k}_{k=1}^L. For any continuous query coordinate (x, y) ∈ [0,W]×[0,H], features f^k_(x,y) are bilinearly interpolated at each scale, fused hierarchically with learnable gates and feed-forward layers (h^{k+1} = FFN_k(...)), and decoded by an MLP head into a depth value d_I(x, y) = N_θ(I, (x, y)). Training uses a sparse L1 loss on N sampled coordinates. This implicit formulation enables arbitrary-resolution output without retraining, infinite depth queries, uniform 3D point sampling, novel view synthesis, and surface normal estimation. Evaluation covers Synth4K (a high-quality 4K benchmark) and real-world datasets such as KITTI and NYUv2, with state-of-the-art results on Synth4K, superior fine-detail prediction, and improved novel view synthesis quality.
Q1
1. What is the main limitation of traditional depth estimation methods that InfiniDepth addresses?
High computational cost
Being restricted to fixed grid resolutions
Requiring multiple input images
Q2
2. Why does InfiniDepth introduce a depth query strategy for novel view synthesis?
To reduce computation time
To generate uniform 3D points on object surfaces
To improve color accuracy
Q3
3. What unique dataset did the authors create to evaluate their method?
A collection of low-resolution real photos
A dataset of synthetic indoor scenes
Synth4K: A high-quality 4K benchmark from five different games

Paper 2

LTX-2: Efficient Joint Audio-Visual Foundation Model

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03233

1. 📘 Topic and Domain: The paper introduces LTX-2, an efficient joint audio-visual foundation model that generates synchronized, high-quality video and audio from text descriptions.
2. 💡 Previous Research and New Ideas: Building on previous text-to-video and text-to-audio models, the paper introduces a dual-stream transformer architecture with an asymmetric design (14B video, 5B audio parameters) and bidirectional cross-attention for joint audio-visual generation.
3. ❓ Problem: Current text-to-video models generate silent videos lacking audio, while separate audio generation models don't capture the joint dependencies between visual and audio elements.
4. 🛠️ Methods: Uses dual-stream transformer with modality-specific VAEs, cross-attention layers, multilingual text conditioning, and modality-aware classifier-free guidance for synchronized audio-visual generation.
5. 📊 Results and Evaluation: Achieves state-of-the-art audiovisual quality among open-source systems, with quality comparable to proprietary models at 18x faster inference, and can generate up to 20 seconds of synchronized content.
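The modality-aware classifier-free guidance in point 4 combines two guidance terms, one for the text condition and one for the other modality. A minimal sketch, where `model` is a stand-in for one denoising step and passing None drops a condition (the unconditional ∅ case in the paper's formula):

```python
def modality_aware_cfg(model, x, text, modality, s_t, s_m):
    """Modality-aware CFG: push the fully conditional prediction away
    from both the text-dropped and the modality-dropped predictions:
    M^(x,t,m) = M(x,t,m) + s_t*(M(x,t,m) - M(x,None,m))
                         + s_m*(M(x,t,m) - M(x,t,None))"""
    cond = model(x, text, modality)       # M(x, t, m)
    no_text = model(x, None, modality)    # M(x, ∅, m): text dropped
    no_modal = model(x, text, None)       # M(x, t, ∅): other modality dropped
    return cond + s_t * (cond - no_text) + s_m * (cond - no_modal)
```

Setting s_m = 0 reduces this to ordinary text-only classifier-free guidance; a positive s_m additionally strengthens agreement between the audio and video streams.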

Figure summary (method flow): Raw video is encoded by a spatiotemporal causal Video VAE with 3D RoPE; raw audio by a mel-spectrogram Audio VAE with 1D temporal RoPE; the text prompt by a Gemma3-12B multi-layer feature extractor with thinking tokens. A 14B-parameter video stream and a 5B-parameter audio stream, each with self-attention, text cross-attention, and feed-forward layers, exchange information through bidirectional audio-visual cross-attention. Modality-aware classifier-free guidance combines text guidance s_t and cross-modal guidance s_m: M̂(x,t,m) = M(x,t,m) + s_t(M(x,t,m) − M(x,∅,m)) + s_m(M(x,t,m) − M(x,t,∅)). A multi-scale, multi-tile Video VAE decoder and an Audio VAE decoder with a HiFi-GAN vocoder produce HD video and 24 kHz stereo audio. Key innovations include the asymmetric dual-stream design, cross-modality AdaLN, temporal 1D RoPE, and decoupled latent spaces; the 19B-parameter model (14B video + 5B audio) runs at 1.22 s/step versus 22.3 s for Wan and supports up to 20 s of generation.
Q1
1. What is the main architectural innovation of LTX-2 compared to previous models?
Using a single unified transformer for both audio and video
Having an asymmetric dual-stream design with different capacities for audio and video
Using separate independent models for audio and video generation
Q2
2. What is the maximum duration of synchronized audio-visual content that LTX-2 can generate?
10 seconds
15 seconds
20 seconds
Q3
3. Which limitation does LTX-2 currently face?
Cannot generate high-resolution videos
Has inconsistent performance across different languages
Can only generate black and white videos

Paper 3

DreamStyle: A Unified Framework for Video Stylization

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.02785

1. 📘 Topic and Domain: Video stylization using a unified framework that supports multiple style conditions (text, style image, and first frame) for video-to-video transformation.
2. 💡 Previous Research and New Ideas: Building on previous single-condition video stylization methods, this paper introduces a unified framework that combines multiple style conditions and proposes a novel token-specific LoRA architecture with a systematic data curation pipeline.
3. ❓ Problem: Existing video stylization methods are limited to single style conditions, suffer from style inconsistency, and lack high-quality datasets for training.
4. 🛠️ Methods: Uses a two-stage training approach, continual training (CT) followed by supervised fine-tuning (SFT) on curated datasets, builds on an I2V model with a token-specific LoRA, and employs a data curation pipeline that combines image stylization and I2V generation with ControlNets.
5. 📊 Results and Evaluation: Outperforms competitors across all three stylization tasks (text-guided, style-image-guided, and first-frame-guided) in terms of style consistency and video quality, as demonstrated through both quantitative metrics and user studies.
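The token-specific LoRA in point 4 shares one low-rank down-projection across all tokens but keeps a separate up-projection per condition type (text, style image, first frame). A minimal numpy sketch with illustrative dimensions; the actual model uses rank 64 inside a DiT:

```python
import numpy as np

class TokenSpecificLoRA:
    """Sketch of a token-specific LoRA update: one shared down matrix
    W_down, plus a separate up matrix W_up^i per token type, so different
    condition tokens get different adaptations without separate full
    LoRA modules (reducing inter-token confusion)."""
    def __init__(self, d_model, rank, n_token_types, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0, 0.02, (d_model, rank))   # shared
        # one up matrix per token type, zero-initialized as in LoRA
        self.w_up = [np.zeros((rank, d_model)) for _ in range(n_token_types)]

    def delta(self, x, token_type):
        """LoRA residual for tokens x of shape (n, d_model), all of one type."""
        return x @ self.w_down @ self.w_up[token_type]

lora = TokenSpecificLoRA(d_model=16, rank=4, n_token_types=3)
x = np.ones((2, 16))
out = x + lora.delta(x, token_type=1)
```

Because the up matrices start at zero, the adapted layer initially behaves exactly like the frozen base layer, which is the standard LoRA initialization.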

Figure summary (framework workflow): The data curation pipeline first stylizes images (SDXL + InstantStyle, Seedream 4.0), then generates paired videos via I2V with depth/pose ControlNets, yielding a 40K continual-training (CT) set filtered by a VLM and a 5K supervised fine-tuning (SFT) set filtered manually. The Wan14B-I2V DiT base model receives the style image through frame concatenation and CLIP features, the first frame through image-condition channels (mask = 1.0, with the raw video channel-concatenated at mask = 0.0), and text through cross-attention. The token-specific LoRA (rank 64) shares one down matrix W_down while using token-specific up matrices W_up^i to reduce inter-token confusion. Training runs 6,000 CT iterations and then 3,000 SFT iterations with a flow-matching loss and a 1:2:1 text:style:first-frame sampling ratio. Long videos are handled by frame chaining, multi-style fusion is supported as an extended application, and evaluation uses CSD, DINO, and VBench.
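The frame-chaining idea for long videos can be sketched as a simple loop: the last stylized frame of each segment becomes the first-frame condition for the next segment. Here `stylize_clip` is a hypothetical stand-in for one run of the stylization model, not the paper's actual API:

```python
def chain_stylize(stylize_clip, frames, clip_len):
    """Frame-chaining for long videos: stylize in fixed-length segments,
    reusing the last stylized frame of each segment as the first-frame
    condition for the next, so style stays consistent across segments."""
    out, first = [], None
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        styled = stylize_clip(first, clip)   # first is None for segment 1
        out.extend(styled)
        first = styled[-1]                   # chain into the next segment
    return out
```

The design choice here mirrors the first-frame-guided task the model is already trained on, so no extra training is needed to extend beyond the base duration limit.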
Q1
1. What is the key innovation in DreamStyle's LoRA architecture that helps handle multiple style conditions?
Using multiple separate LoRA modules for each condition
Using a shared down matrix with token-specific up matrices
Using parallel LoRA paths for different tokens
Q2
2. How does DreamStyle handle long video stylization beyond the 5-second duration limit?
By processing the entire video at once with expanded memory
By using the last frame of a generated segment as the first frame for the next segment
By reducing video resolution for longer sequences
Q3
3. In the data curation pipeline, why does DreamStyle generate both stylized and raw videos using the same control conditions?
To reduce computation time during training
To create larger training datasets
To mitigate motion mismatches between paired videos