2026-01-07 Papers


Paper 1

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03252

1. 📘 Topic and Domain: The paper focuses on monocular depth estimation using neural implicit fields in computer vision, specifically for producing arbitrary-resolution and fine-grained depth maps from single RGB images.
2. 💡 Previous Research and New Ideas: Previous work relied on discrete grid-based depth representations, which limit resolution and detail; this paper instead represents depth as a continuous neural implicit field that can be queried at any 2D coordinate.
3. ❓ Problem: The paper addresses the limitations of traditional grid-based depth estimation methods which are constrained to fixed resolutions and struggle to capture fine geometric details.
4. 🛠️ Methods: The method pairs a Vision Transformer encoder with a multi-scale local implicit decoder: features queried from multiple encoder layers are fused and passed to an MLP that predicts depth at continuous coordinates, and a depth query strategy enables uniform 3D point sampling.
5. 📊 Results and Evaluation: The approach achieves state-of-the-art performance on both synthetic (Synth4K) and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions and novel view synthesis under large viewpoint shifts.
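The continuous-coordinate query in point 4 can be sketched in a few lines: depth at any real-valued (x, y) comes from bilinearly interpolating encoder features at each scale and decoding them with an MLP. This is a minimal sketch under assumed toy shapes and a two-layer MLP; the paper's actual decoder is a hierarchical gated fusion, not this simple concatenation.

```python
import numpy as np

def bilinear_query(fmap, x, y):
    """Bilinearly interpolate a feature map of shape (H, W, C) at a
    continuous coordinate (x, y) given in pixel units."""
    h, w, _ = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bot = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    return (1 - dy) * top + dy * bot

def depth_at(fmaps, weights, x, y):
    """Toy implicit depth function d(x, y) for normalized (x, y) in [0, 1]:
    concatenate multi-scale features queried at (x, y), then run them
    through a 2-layer MLP (weights[0]: in->hidden, weights[1]: hidden->1)."""
    feats = []
    for fmap in fmaps:
        h, w, _ = fmap.shape
        # map the normalized query onto this scale's pixel grid
        feats.append(bilinear_query(fmap, x * (w - 1), y * (h - 1)))
    h1 = np.maximum(np.concatenate(feats) @ weights[0], 0.0)  # ReLU
    return float(h1 @ weights[1])
```

Because (x, y) is continuous, the same trained weights can be queried on a 1K grid or a 4K grid without retraining, which is the arbitrary-resolution property the paper emphasizes.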

Figure summary (method diagram): An input RGB image I ∈ R^(H×W×3) is encoded by a DINOv3 ViT-Large backbone, and a reassemble block builds a multi-scale feature pyramid {f^k}_{k=1}^L. For any continuous query coordinate (x, y) ∈ [0,W]×[0,H], features f^k_(x,y) are bilinearly interpolated at each scale, fused hierarchically with learnable gates and feed-forward layers (h^{k+1} = FFN_k(...)), and decoded by an MLP head into a depth value d_I(x, y) = N_θ(I, (x, y)). Training uses a sparse L1 loss on N sampled coordinates. This implicit formulation enables arbitrary-resolution output without retraining, infinite depth queries, uniform 3D point sampling, novel view synthesis, and surface normal estimation. Evaluation covers Synth4K (a high-quality 4K benchmark) and real-world datasets such as KITTI and NYUv2, with state-of-the-art results on Synth4K, superior fine-detail prediction, and improved novel view synthesis quality.
Q1
1. What is the main limitation of traditional depth estimation methods that InfiniDepth addresses?
High computational cost
Being restricted to fixed grid resolutions
Requiring multiple input images
Q2
2. Why does InfiniDepth introduce a depth query strategy for novel view synthesis?
To reduce computation time
To generate uniform 3D points on object surfaces
To improve color accuracy
Q3
3. What unique dataset did the authors create to evaluate their method?
A collection of low-resolution real photos
A dataset of synthetic indoor scenes
Synth4K: A high-quality 4K benchmark from five different games

Paper 2

LTX-2: Efficient Joint Audio-Visual Foundation Model

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.03233

1. 📘 Topic and Domain: The paper introduces LTX-2, an efficient joint audio-visual foundation model that generates synchronized, high-quality video and audio from text descriptions.
2. 💡 Previous Research and New Ideas: Building on previous text-to-video and text-to-audio models, the paper introduces a dual-stream transformer architecture with an asymmetric design (14B video, 5B audio parameters) and bidirectional cross-attention for joint audio-visual generation.
3. ❓ Problem: Current text-to-video models generate silent videos lacking audio, while separate audio generation models don't capture the joint dependencies between visual and audio elements.
4. 🛠️ Methods: Uses dual-stream transformer with modality-specific VAEs, cross-attention layers, multilingual text conditioning, and modality-aware classifier-free guidance for synchronized audio-visual generation.
5. 📊 Results and Evaluation: Achieves state-of-the-art audiovisual quality among open-source systems, with quality comparable to proprietary models at 18x faster inference, and can generate up to 20 seconds of synchronized content.
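The modality-aware classifier-free guidance in point 4 combines two guidance terms, one for the text condition and one for the other modality. A minimal sketch, where `model` is a stand-in for one denoising step and passing None drops a condition (the unconditional ∅ case in the paper's formula):

```python
def modality_aware_cfg(model, x, text, modality, s_t, s_m):
    """Modality-aware CFG: push the fully conditional prediction away
    from both the text-dropped and the modality-dropped predictions:
    M^(x,t,m) = M(x,t,m) + s_t*(M(x,t,m) - M(x,None,m))
                         + s_m*(M(x,t,m) - M(x,t,None))"""
    cond = model(x, text, modality)       # M(x, t, m)
    no_text = model(x, None, modality)    # M(x, ∅, m): text dropped
    no_modal = model(x, text, None)       # M(x, t, ∅): other modality dropped
    return cond + s_t * (cond - no_text) + s_m * (cond - no_modal)
```

Setting s_m = 0 reduces this to ordinary text-only classifier-free guidance; a positive s_m additionally strengthens agreement between the audio and video streams.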

Figure summary (method flow): Raw video is encoded by a spatiotemporal causal Video VAE with 3D RoPE; raw audio by a mel-spectrogram Audio VAE with 1D temporal RoPE; the text prompt by a Gemma3-12B multi-layer feature extractor with thinking tokens. A 14B-parameter video stream and a 5B-parameter audio stream, each with self-attention, text cross-attention, and feed-forward layers, exchange information through bidirectional audio-visual cross-attention. Modality-aware classifier-free guidance combines text guidance s_t and cross-modal guidance s_m: M̂(x,t,m) = M(x,t,m) + s_t(M(x,t,m) − M(x,∅,m)) + s_m(M(x,t,m) − M(x,t,∅)). A multi-scale, multi-tile Video VAE decoder and an Audio VAE decoder with a HiFi-GAN vocoder produce HD video and 24 kHz stereo audio. Key innovations include the asymmetric dual-stream design, cross-modality AdaLN, temporal 1D RoPE, and decoupled latent spaces; the 19B-parameter model (14B video + 5B audio) runs at 1.22 s/step versus 22.3 s for Wan and supports up to 20 s of generation.
Q1
1. What is the main architectural innovation of LTX-2 compared to previous models?
Using a single unified transformer for both audio and video
Having an asymmetric dual-stream design with different capacities for audio and video
Using separate independent models for audio and video generation
Q2
2. What is the maximum duration of synchronized audio-visual content that LTX-2 can generate?
10 seconds
15 seconds
20 seconds
Q3
3. Which limitation does LTX-2 currently face?
Cannot generate high-resolution videos
Has inconsistent performance across different languages
Can only generate black and white videos

Paper 3

DreamStyle: A Unified Framework for Video Stylization

Published: 2026-01-06

Link: http://arxiv.org/pdf/2601.02785

1. 📘 Topic and Domain: Video stylization using a unified framework that supports multiple style conditions (text, style image, and first frame) for video-to-video transformation.
2. 💡 Previous Research and New Ideas: Building on previous single-condition video stylization methods, this paper introduces a unified framework that combines multiple style conditions and proposes a novel token-specific LoRA architecture with a systematic data curation pipeline.
3. ❓ Problem: Existing video stylization methods are limited to single style conditions, suffer from style inconsistency, and lack high-quality datasets for training.
4. 🛠️ Methods: Uses a two-stage training approach, continual training (CT) followed by supervised fine-tuning (SFT) on curated datasets, builds on an I2V model with a token-specific LoRA, and employs a data curation pipeline that combines image stylization and I2V generation with ControlNets.
5. 📊 Results and Evaluation: Outperforms competitors across all three stylization tasks (text-guided, style-image-guided, and first-frame-guided) in terms of style consistency and video quality, as demonstrated through both quantitative metrics and user studies.
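The token-specific LoRA in point 4 shares one low-rank down-projection across all tokens but keeps a separate up-projection per condition type (text, style image, first frame). A minimal numpy sketch with illustrative dimensions; the actual model uses rank 64 inside a DiT:

```python
import numpy as np

class TokenSpecificLoRA:
    """Sketch of a token-specific LoRA update: one shared down matrix
    W_down, plus a separate up matrix W_up^i per token type, so different
    condition tokens get different adaptations without separate full
    LoRA modules (reducing inter-token confusion)."""
    def __init__(self, d_model, rank, n_token_types, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0, 0.02, (d_model, rank))   # shared
        # one up matrix per token type, zero-initialized as in LoRA
        self.w_up = [np.zeros((rank, d_model)) for _ in range(n_token_types)]

    def delta(self, x, token_type):
        """LoRA residual for tokens x of shape (n, d_model), all of one type."""
        return x @ self.w_down @ self.w_up[token_type]

lora = TokenSpecificLoRA(d_model=16, rank=4, n_token_types=3)
x = np.ones((2, 16))
out = x + lora.delta(x, token_type=1)
```

Because the up matrices start at zero, the adapted layer initially behaves exactly like the frozen base layer, which is the standard LoRA initialization.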

Figure summary (framework workflow): The data curation pipeline first stylizes images (SDXL + InstantStyle, Seedream 4.0), then generates paired videos via I2V with depth/pose ControlNets, yielding a 40K continual-training (CT) set filtered by a VLM and a 5K supervised fine-tuning (SFT) set filtered manually. The Wan14B-I2V DiT base model receives the style image through frame concatenation and CLIP features, the first frame through image-condition channels (mask = 1.0, with the raw video channel-concatenated at mask = 0.0), and text through cross-attention. The token-specific LoRA (rank 64) shares one down matrix W_down while using token-specific up matrices W_up^i to reduce inter-token confusion. Training runs 6,000 CT iterations and then 3,000 SFT iterations with a flow-matching loss and a 1:2:1 text:style:first-frame sampling ratio. Long videos are handled by frame chaining, multi-style fusion is supported as an extended application, and evaluation uses CSD, DINO, and VBench.
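The frame-chaining idea for long videos can be sketched as a simple loop: the last stylized frame of each segment becomes the first-frame condition for the next segment. Here `stylize_clip` is a hypothetical stand-in for one run of the stylization model, not the paper's actual API:

```python
def chain_stylize(stylize_clip, frames, clip_len):
    """Frame-chaining for long videos: stylize in fixed-length segments,
    reusing the last stylized frame of each segment as the first-frame
    condition for the next, so style stays consistent across segments."""
    out, first = [], None
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        styled = stylize_clip(first, clip)   # first is None for segment 1
        out.extend(styled)
        first = styled[-1]                   # chain into the next segment
    return out
```

The design choice here mirrors the first-frame-guided task the model is already trained on, so no extra training is needed to extend beyond the base duration limit.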
Q1
1. What is the key innovation in DreamStyle's LoRA architecture that helps handle multiple style conditions?
Using multiple separate LoRA modules for each condition
Using a shared down matrix with token-specific up matrices
Using parallel LoRA paths for different tokens
Q2
2. How does DreamStyle handle long video stylization beyond the 5-second duration limit?
By processing the entire video at once with expanded memory
By using the last frame of a generated segment as the first frame for the next segment
By reducing video resolution for longer sequences
Q3
3. In the data curation pipeline, why does DreamStyle generate both stylized and raw videos using the same control conditions?
To reduce computation time during training
To create larger training datasets
To mitigate motion mismatches between paired videos