2025-07-08 Papers

Paper 1

4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Published: 2025-07-07

Link: http://arxiv.org/pdf/2507.05163

1. 📘 Topic and Domain: 4D reconstruction of high-speed dynamic scenes using asynchronous multi-camera capture and video diffusion models in computer vision.
2. 💡 Previous Research and New Ideas: Building on prior 4D Gaussian Splatting work whose capture is limited to about 30 FPS, the paper introduces a novel asynchronous capture scheme plus video-diffusion-based refinement.
3. ❓ Problem: Current 4D capture systems are limited to low frame rates (<30 FPS), making it difficult to reconstruct fast-moving scenes with high fidelity.
4. 🛠️ Methods: Combines asynchronous camera capture (staggering camera start times) with a video diffusion model for artifact removal, implemented through 4D Gaussian Splatting and LoRA-based fine-tuning.
5. 📊 Results and Evaluation: Achieves superior reconstruction quality compared to synchronous methods on both synthetic and real datasets, with significant improvements in PSNR, SSIM, and LPIPS metrics.
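
The staggered-start idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code; it assumes K camera groups sharing a per-camera frame period τ, with group i offset by i·τ/K:

```python
# Illustrative sketch (not the authors' code) of 4DSloMo-style asynchronous
# capture: K camera groups share a base frame period tau, and group i starts
# with an offset of i * tau / K, so the merged timestamps sample the scene
# at K times the per-camera frame rate: t_ij = i*(tau/K) + j*tau.

def capture_timestamps(num_groups: int, base_fps: float, num_frames: int):
    """Per-group capture times; merging all groups gives the effective rate."""
    tau = 1.0 / base_fps  # frame period of a single camera
    return [
        [i * (tau / num_groups) + j * tau for j in range(num_frames)]
        for i in range(num_groups)
    ]

# Four groups of 25 FPS cameras yield 100 FPS effective sampling:
groups = capture_timestamps(num_groups=4, base_fps=25.0, num_frames=3)
merged = sorted(t for group in groups for t in group)
# Consecutive merged timestamps are tau/K = 0.01 s apart.
```

With K = 8 groups of 25 FPS cameras the same formula gives a 200 FPS effective rate, consistent with the 100-200 FPS range reported above.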

4DSloMo: Method Workflow

1. Asynchronous capture: stagger camera start times across K camera groups, turning 25 FPS cameras into a 100-200 FPS effective capture rate; group i records frame j at ti = i·(τ/K) + j·τ.
2. Initial 4D Gaussian: GS4D reconstruction trained for 7k iterations; each Gaussian has density p(x|μ,Σ) = exp(-½(x-μ)ᵀΣ⁻¹(x-μ)); sparse views leave artifacts.
3. Video rendering: render high-frame-rate videos Vrender ∈ R^(C×T×H×W) from all camera viewpoints; these still contain floater artifacts.
4. Data curation: temporal sub-sampling of the DNA-Rendering dataset simulates asynchronous capture, yielding 750 noisy-clean training pairs.
5. Video diffusion training: LoRA fine-tuning of the DiT component of the Wan2.1 backbone with Ltune = E[||ε - εθ||²].
6. Artifact-fix model: video enhancement with temporal consistency removes floater artifacts, mapping Vrender → V̂.
7. Refined 4D Gaussian: an additional 7k iterations supervised by the diffusion output via Ldiff = ||Vrender - V̂||₁ + Lp, for enhanced visual quality.
8. High-quality 4D output: temporally consistent, artifact-reduced reconstruction of high-speed motion at a 100-200 FPS equivalent.

Key technical components: asynchronous scheme (K camera groups, effective rate × K); 4D Gaussian Splatting (spatiotemporal modeling, x = (x, y, z, t)); video diffusion (Wan2.1 + LoRA, temporal consistency); artifact removal (sparse-view handling, quality enhancement).

Performance improvements: PSNR 26.76 ↑, SSIM 0.845 ↑, LPIPS 0.293 ↓, frame rate 100-200 FPS.
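
As a toy illustration of the refinement objective Ldiff = ||Vrender - V̂||₁ + Lp, here is a minimal plain-Python sketch; the perceptual term Lp is stubbed out because its exact form is not given here:

```python
# Minimal sketch of the refinement loss L_diff = ||V_render - V_hat||_1 + L_p
# that supervises the second 4D Gaussian training round with the
# diffusion-enhanced video. Plain-Python stand-in using nested lists.

def l1_loss(v_render, v_hat):
    """Mean absolute difference between rendered and refined frames."""
    flat_r = [p for frame in v_render for p in frame]
    flat_h = [p for frame in v_hat for p in frame]
    return sum(abs(a - b) for a, b in zip(flat_r, flat_h)) / len(flat_r)

def perceptual_loss(v_render, v_hat):
    """Placeholder for L_p (e.g. an LPIPS-style feature distance); stubbed."""
    return 0.0  # assumption: the summary does not specify L_p

def refinement_loss(v_render, v_hat):
    return l1_loss(v_render, v_hat) + perceptual_loss(v_render, v_hat)
```
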
Q1
1. What is the main innovation in the camera capture system proposed by this paper?
Using specialized high-speed cameras
Staggering the start times of different cameras
Increasing the number of cameras in the setup
Q2
2. Why does the paper use a video diffusion model instead of an image diffusion model for artifact removal?
Video diffusion models are faster to train
Video diffusion models require less data
Video diffusion models maintain better temporal consistency
Q3
3. What effective frame rate can be achieved using the paper's asynchronous capture method with 8 cameras at 25 FPS?
50 FPS
100-200 FPS
25 FPS

Paper 2

Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Published: 2025-07-07

Link: http://arxiv.org/pdf/2507.05108

1. 📘 Topic and Domain: The paper focuses on historical document restoration using AI, specifically in the domain of computer vision and digital heritage preservation.
2. 💡 Previous Research and New Ideas: Based on previous work in single-modal restoration and limited-size patch restoration, this paper proposes a novel automated three-stage restoration approach that mimics historians' workflow and introduces a comprehensive full-page historical document dataset.
3. ❓ Problem: The paper addresses the limitations of existing historical document restoration methods, which handle only a single modality or limited-size patches and thus fail to provide a fully automated, comprehensive restoration solution.
4. 🛠️ Methods: The authors developed AutoHDR, a three-stage approach combining OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration, along with creating the FPHDR dataset containing both real and synthetic damaged documents.
5. 📊 Results and Evaluation: The method improved OCR accuracy from 46.83% to 84.05% for severely damaged documents, with further enhancement to 94.25% through human-machine collaboration, demonstrating superior performance in both text restoration accuracy and historical appearance preservation.

AutoHDR: Historical Document Restoration Workflow

1. Stage 1, OCR-assisted damage localization: fuse character OCR with DINO damage detection to obtain character and damage positions.
2. Stage 2, damaged content prediction: a fine-tuned Qwen2 LLM uses vision-language context and composite scoring to predict the missing text content.
3. Stage 3, appearance restoration: a patch-autoregressive diffusion model performs page-level restoration to produce the restored document image.

FPHDR dataset: 1,633 real samples with manual annotation and 6,543 synthetic samples with automated generation; damage is graded as light, medium, or severe, with character-level bounding-box annotation; the modular design supports human intervention for quality enhancement.

Performance results: OCR accuracy 46.83% → 84.05% on severely damaged documents, rising to 94.25% with human-AI collaboration; damage detection F1 94.1% (DINO-based); top-5 text-prediction accuracy 97.75% (Qwen2-7B).

Key innovations: full-page processing with context preservation (vs. patch-level methods); a fully automated, end-to-end three-stage pipeline with no manual intervention; multimodal text-plus-appearance restoration via vision-language fusion and OCR + LLM synergy; a historian-inspired workflow following the "old as old/new" cultural principle; and a collaborative modular design for expert integration and quality assurance.
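
The modular, collaboration-friendly design can be sketched as a pipeline whose stage functions are injected, so any stage's output can be replaced by a human expert's correction. All interfaces below are hypothetical; the actual stages use OCR + DINO, a fine-tuned Qwen2 LLM, and a patch-autoregressive diffusion model:

```python
# Hypothetical skeleton of AutoHDR's three-stage pipeline (interfaces are
# mine, not the paper's). Stage functions are injected, mirroring the
# modular design that lets a human expert override any intermediate result.

def auto_hdr(page_image, localize, predict, restore):
    regions = localize(page_image)               # Stage 1: damage localization
    text = predict(page_image, regions)          # Stage 2: content prediction
    return restore(page_image, regions, text)    # Stage 3: appearance restoration

# Dummy stand-ins for the real models (OCR + DINO, Qwen2, diffusion):
def demo_localize(img):
    return [{"bbox": (0, 0, 16, 16), "grade": "severe"}]

def demo_predict(img, regions):
    return {i: "?" for i, _ in enumerate(regions)}  # one character per region

def demo_restore(img, regions, text):
    return ("restored", img, len(regions))

result = auto_hdr("page.png", demo_localize, demo_predict, demo_restore)
```

Swapping a stage like `demo_predict` for an expert's transcription is the human-machine collaboration mode that the paper reports lifting OCR accuracy from 84.05% to 94.25%.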
Q1
1. What is the main innovation of AutoHDR compared to previous historical document restoration methods?
It uses higher resolution cameras to capture documents
It provides a fully automated three-stage restoration process mimicking historians' workflow
It only focuses on text restoration without considering appearance
Q2
2. How does the FPHDR dataset categorize document damage levels?
High, Medium, Low damage
Complete, Partial, Minor damage
Severe, Medium, Light damage
Q3
3. What was the improvement in OCR accuracy for severely damaged documents when using AutoHDR with human collaboration?
From 46.83% to 84.05%
From 46.83% to 94.25%
From 84.05% to 94.25%

Paper 3

ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Published: 2025-07-07

Link: http://arxiv.org/pdf/2507.04952

1. 📘 Topic and Domain: The paper introduces ArtifactsBench, a benchmark framework for evaluating Large Language Models' ability to generate interactive visual code artifacts in software development.
2. 💡 Previous Research and New Ideas: Based on existing code generation benchmarks that focus mainly on static code evaluation, this paper proposes a novel framework that evaluates both visual fidelity and interactive behavior of generated code.
3. ❓ Problem: The paper addresses the critical gap in evaluating LLMs' ability to generate dynamic, interactive visual artifacts, as current benchmarks cannot assess visual quality and interactive functionality comprehensively.
4. 🛠️ Methods: The authors developed a multi-stage evaluation pipeline using Multimodal LLMs as judges, programmatically rendering generated artifacts and capturing their dynamic behavior through temporal screenshots against fine-grained checklists.
5. 📊 Results and Evaluation: The automated evaluation achieved 94.4% ranking consistency with WebDev Arena (human preference benchmark) and over 90% agreement with human experts, while revealing that generalist models often outperform domain-specific ones in visual code generation tasks.
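
The first pipeline step, pulling a code artifact out of a raw model response, might look like the following; the exact regex is my assumption, since the paper only states that extraction is regex-based:

```python
import re

# Sketch of the "code extraction (regex)" step: grab the first fenced code
# block from a raw LLM response before handing it to the renderer. The
# pattern is an assumption, not the paper's actual implementation.

FENCE = "`" * 3  # a Markdown triple-backtick fence
FENCE_RE = re.compile(FENCE + r"[a-zA-Z]*\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_artifact(response: str):
    """Return the first fenced block's body, or None if no fence is found."""
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else None

reply = "Here is the page:\n" + FENCE + "html\n<html><body>Hi</body></html>\n" + FENCE
artifact = extract_artifact(reply)  # -> "<html><body>Hi</body></html>"
```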

ArtifactsBench Methodology

Dataset construction pipeline: (1) extraction and filtering from 4 sources; (2) manual and LLM rewrite and polish; (3) classification and difficulty filtering; (4) checklist generation; (5) quality control and final consolidation.

Dataset composition: 1,825 diverse tasks spanning 9 primary categories and 3 difficulty levels, including web apps, games, SVG, data science, simulations, management systems, and multimedia editing.

Multimodal evaluation pipeline: (1) code extraction (regex); (2) dynamic rendering (Playwright); (3) temporal screenshot capture; (4) MLLM-as-judge assessment; (5) fine-grained checklist scoring.

Validation and analysis: human expert validation on 280 sampled instances, using a double-blind protocol with multiple annotators and median scoring, reached >90% pairwise agreement; correlation with WebDev Arena reached 94.4% ranking consistency (normalized Footrule metric), aligning the benchmark with gold-standard human preferences and real-world relevance.

Evaluation dimensions: visual fidelity (layout, aesthetics, color harmony); functionality (core features, robustness); interactivity (dynamic effects, user experience); code quality (engineering practices); innovation (creativity, features).

Key findings: 30+ LLMs evaluated; generalist models outperform specialist ones; performance scales with model size; proprietary models lead; complex tasks remain challenging; first automated visual-interactive evaluation framework.
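
As a rough illustration of the ranking-consistency measure, here is a normalized Spearman footrule between two model rankings; this is one plausible reading of the "Normalized Footrule metric," and the exact normalization used in the paper is an assumption:

```python
# Illustrative normalized Spearman-footrule consistency between two model
# rankings; one plausible reading of the paper's "Normalized Footrule metric"
# (the exact normalization it uses is an assumption on my part).

def footrule_consistency(rank_a, rank_b):
    """1.0 for identical rankings, 0.0 for exactly reversed rankings."""
    n = len(rank_a)
    pos_b = {model: i for i, model in enumerate(rank_b)}
    dist = sum(abs(i - pos_b[model]) for i, model in enumerate(rank_a))
    max_dist = (n * n) // 2  # footrule distance of a fully reversed ranking
    return 1.0 - dist / max_dist

leaderboard = ["model_a", "model_b", "model_c", "model_d"]  # placeholder names
same = footrule_consistency(leaderboard, leaderboard)           # 1.0
flipped = footrule_consistency(leaderboard, leaderboard[::-1])  # 0.0
```

A value like the reported 94.4% would then mean the benchmark's leaderboard is nearly identical, position by position, to the WebDev Arena human-preference ranking.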
Q1
1. What was the most surprising finding revealed by ArtifactsBench regarding model performance?
Domain-specific models performed best at visual code generation
Generalist models outperformed specialized models
All models performed equally well on visual tasks
Q2
2. How did ArtifactsBench evaluate the dynamic behavior of generated code?
By manually testing each interaction
Through static code analysis only
By capturing temporal screenshots during programmatic rendering
Q3
3. What level of agreement did ArtifactsBench achieve with human preference benchmarks?
74.4% ranking consistency
84.4% ranking consistency
94.4% ranking consistency