2025-04-10 Papers

1/2

Paper 1

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07083

1. 📘 Topic and Domain: The paper focuses on auto-regressive camera trajectory generation for cinematography, operating in the domain of computer vision and video production.

2. 💡 Previous Research and New Ideas: The paper builds on previous trajectory generation methods that used geometric optimization, procedural systems, or diffusion models, but proposes a novel auto-regressive approach to generate more artistic and expressive camera movements.

3. ❓ Problem: The paper addresses the limitation of existing camera trajectory generation methods that lack artistic expression, directorial intent, and fine-grained textual alignment for creative video production.

4. 🛠️ Methods: The authors introduce GenDoP, an auto-regressive model that treats camera parameters as discrete tokens and leverages a decoder-only Transformer architecture, conditioned on text descriptions and optional RGBD information.

5. 📊 Results and Evaluation: GenDoP outperforms state-of-the-art methods across fine-grained textual controllability, motion stability, and complexity metrics, with extensive human validation confirming its superior performance in generating artistic, expressive camera trajectories.

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

1/2

Paper 2

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07096

1. 📘 Topic and Domain: The paper introduces OLMoTrace, a system for tracing language model outputs back to their training data in real-time.

2. 💡 Previous Research and New Ideas: The paper builds on infini-gram (a text search engine) and extends it with a novel parallel algorithm to efficiently trace language model outputs to their training data, which was previously computationally intractable at trillion-token scale.

3. ❓ Problem: The paper addresses the challenge of understanding why language models generate certain responses by tracing their outputs back to training data, which was previously impossible at scale due to computational constraints.

4. 🛠️ Methods: The authors use a five-step inference pipeline that finds maximal matching spans in LM outputs, filters for long and unique spans, retrieves enclosing documents, merges spans and documents, and ranks documents by relevance using BM25 scoring.

5. 📊 Results and Evaluation: The system achieves an average inference latency of 4.46 seconds per query on responses averaging 458 tokens, with document relevance evaluations showing the top documents displayed having an average relevance score of 1.82 (on a 0-3 scale) according to LLM-as-a-Judge evaluation.

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

1/2

Paper 3

A Unified Agentic Framework for Evaluating Conditional Image Generation

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07046

1. 📘 Topic and Domain: The paper introduces CIGEVAL, a unified agentic framework for evaluating conditional image generation across various tasks such as text-guided image generation, subject-driven image editing, and control-guided image generation.

2. 💡 Previous Research and New Ideas: The paper builds upon previous image evaluation metrics like CLIP-Score, LPIPS, and VIESCORE, but proposes a novel approach that integrates large multimodal models (LMMs) with specialized tools to overcome limitations in task specificity, explainability, and human alignment.

3. ❓ Problem: The paper addresses the challenge of developing task-agnostic, reliable, and explainable evaluation metrics for conditional image generation that can align with human judgment across diverse generation tasks.

4. 🛠️ Methods: The authors implement an agentic framework that combines LMMs (like GPT-4o or open-source models) with a multi-functional toolbox (including Grounding, Highlight, Difference, and Scene Graph tools) and fine-grained evaluation through task decomposition, tool selection, and analysis.

5. 📊 Results and Evaluation: CIGEVAL with GPT-4o achieves a Spearman correlation of 0.4625 with human assessments across seven tasks, closely matching the human-to-human correlation of 0.47, and when implemented with fine-tuned 7B open-source LMMs using only 2.3K training trajectories, it surpasses previous GPT-4o-based state-of-the-art methods.