2025-04-10 Papers

Paper 1

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07083

1. 📘 Topic and Domain: The paper focuses on auto-regressive camera trajectory generation for cinematography, operating in the domain of computer vision and video production.
2. 💡 Previous Research and New Ideas: The paper builds on previous trajectory generation methods that used geometric optimization, procedural systems, or diffusion models, but proposes a novel auto-regressive approach to generate more artistic and expressive camera movements.
3. ❓ Problem: The paper addresses the limitation of existing camera trajectory generation methods that lack artistic expression, directorial intent, and fine-grained textual alignment for creative video production.
4. 🛠️ Methods: The authors introduce GenDoP, an auto-regressive model that treats camera parameters as discrete tokens and leverages a decoder-only Transformer architecture, conditioned on text descriptions and optional RGBD information.
5. 📊 Results and Evaluation: GenDoP outperforms state-of-the-art methods across fine-grained textual controllability, motion stability, and complexity metrics, with extensive human validation confirming its superior performance in generating artistic, expressive camera trajectories.
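The auto-regressive formulation in item 4 can be pictured as a plain next-token loop: the decoder emits one discrete pose token at a time, each conditioned on the text/RGBD latent code and everything generated so far. Below is a hedged sketch of that loop; the `predict_next` stub, the toy token script, and all names are hypothetical stand-ins for GenDoP's OPT-style Transformer, not the paper's code.

```python
def generate_trajectory(predict_next, latent_code, max_tokens, eos_token):
    """Greedy auto-regressive decoding over discrete pose tokens:
    each step conditions on the latent code and all tokens so far."""
    tokens = []
    for _ in range(max_tokens):
        nxt = predict_next(latent_code, tokens)
        if nxt == eos_token:
            break
        tokens.append(nxt)
    return tokens

# Toy stub in place of the Transformer: replays a fixed script, then EOS.
def toy_model(latent, tokens, script=(5, 9, 9, 2), eos=0):
    return script[len(tokens)] if len(tokens) < len(script) else eos

traj_tokens = generate_trajectory(toy_model, latent_code=None,
                                  max_tokens=16, eos_token=0)
```

In the real model the stub would be a forward pass through the decoder-only Transformer followed by sampling over the pose-token vocabulary.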

GenDoP Methodology Flowchart

1. DataDoP Dataset Construction
- Input: raw videos (movies, documentaries).
- Pre-processing: shot segmentation (PySceneDetect); quality/semantic filtering by length, lighting, and GPT-4o motion type.
- Trajectory extraction and refinement: extract pose and depth (MonST3R); clean, smooth (Kalman filter), and interpolate trajectories.
- Motion tagging: segment trajectories; assign tags (27 translation types + 7 rotation types); combine and smooth tags.
- Caption generation (GPT-4o): motion captions (from motion tags) and directorial captions (tags + scene grid + intent).
- Output: the DataDoP dataset (trajectories, RGBD frames, captions), which provides GenDoP's training data.

2. GenDoP Trajectory Generation
- Input: a text caption (motion or directorial), with an optional initial RGBD frame.
- Multi-modal encoding: a text encoder (SD2.1-based) and RGBD encoders (CLIP-Vision-based) produce a concatenated latent code Z.
- Trajectory tokenization: canonical normalization; parameter conversion (quaternion, translation, intrinsics, scale); discretization into bins; codebook lookup, yielding pose tokens.
- Auto-regressive decoding (OPT Transformer): conditioned on the latent code Z and the previous pose tokens, the model predicts the next pose token sequentially.
- Output: the generated pose token sequence is de-tokenized into a camera trajectory.

3. Evaluation and Application
- Evaluation: metrics (CLaTr, F1), a user study (AUR), and ablation studies.
- Application: camera control for text- and image-to-video generation.
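The trajectory-tokenization step above (discretization into bins) can be sketched in a few lines: each continuous camera parameter is clipped to a canonical range and quantized into a fixed number of uniform bins, giving integer tokens the decoder can predict. The bin count, parameter range, and function names below are illustrative assumptions, not values from the paper.

```python
NUM_BINS = 256             # assumed bin count, not the paper's value
PARAM_RANGE = (-1.0, 1.0)  # parameters assumed canonically normalized

def discretize(value, lo=PARAM_RANGE[0], hi=PARAM_RANGE[1], bins=NUM_BINS):
    """Map one continuous camera parameter to a token id in [0, bins)."""
    value = min(max(value, lo), hi)          # clip to the canonical range
    t = (value - lo) / (hi - lo)             # normalize to [0, 1]
    return min(int(t * bins), bins - 1)      # uniform binning

def undiscretize(token, lo=PARAM_RANGE[0], hi=PARAM_RANGE[1], bins=NUM_BINS):
    """Invert discretize() up to quantization error (returns bin center)."""
    return lo + (token + 0.5) / bins * (hi - lo)

def tokenize_pose(pose):
    """One camera pose (e.g. quaternion + translation + intrinsics + scale)
    becomes a short sequence of discrete tokens."""
    return [discretize(p) for p in pose]

pose = [0.0, 0.707, -0.707, 0.1]
tokens = tokenize_pose(pose)
recovered = [undiscretize(t) for t in tokens]  # close to pose, within a bin
```

De-tokenization is the inverse lookup, so the generated token sequence maps back to a continuous trajectory with at most one bin width of quantization error per parameter.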
Q1. What is the primary innovation of GenDoP compared to previous camera trajectory generation methods?
- It uses a reinforcement learning approach to optimize camera movements
- It employs an auto-regressive model treating camera parameters as discrete tokens
- It introduces a diffusion-based framework with human-centric tracking

Q2. What type of camera trajectories does the DataDoP dataset focus on?
- Object/Scene-centric trajectories that focus on specific objects
- Tracking trajectories that follow moving subjects
- Free-moving trajectories that enable unrestricted 3D camera motion

Q3. How many types of captions are generated for each trajectory in the DataDoP dataset?
- One type: Technical captions describing camera parameters
- Two types: Motion captions and Directorial captions
- Three types: Translation, Rotation, and Intent captions

Paper 2

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07096

1. 📘 Topic and Domain: The paper introduces OLMoTrace, a system for tracing language model outputs back to their training data in real time.
2. 💡 Previous Research and New Ideas: The paper builds on infini-gram (a text search engine) and extends it with a novel parallel algorithm to efficiently trace language model outputs to their training data, which was previously computationally intractable at trillion-token scale.
3. ❓ Problem: The paper addresses the challenge of understanding why language models generate certain responses by tracing their outputs back to training data, which was previously impossible at scale due to computational constraints.
4. 🛠️ Methods: The authors use a five-step inference pipeline that finds maximal matching spans in LM outputs, filters for long and unique spans, retrieves enclosing documents, merges spans and documents, and ranks documents by relevance using BM25 scoring.
5. 📊 Results and Evaluation: The system achieves an average inference latency of 4.46 seconds per query on responses averaging 458 tokens; LLM-as-a-Judge evaluation of document relevance shows that the top displayed documents average 1.82 on a 0-3 relevance scale.
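Step 4 of the pipeline ranks retrieved documents with BM25. As a rough sketch of that reranking, here is a minimal stdlib implementation of standard Okapi BM25; the k1/b defaults and the toy corpus are illustrative, not the paper's configuration, and the real query is the concatenated prompt plus response.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()                       # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["the", "cat", "sat"], ["dogs", "bark"], ["cat", "cat", "videos"]]
scores = bm25_scores(["cat"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Documents with more query-term occurrences (relative to their length) score higher, which is what lets OLMoTrace surface the most relevant enclosing documents first.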

OLMoTrace Inference Pipeline: Tracing LM Outputs to Training Data

Input: LM response and user prompt.

- Step 1: Find maximal matching spans. Tokenize the LM output (Llama-2 tokenizer) and identify verbatim spans in the training data that satisfy existence, self-containedness, and maximality. Key technique: a parallel algorithm built on infini-gram (a suffix-array index over trillion-plus tokens), with a fast O(1) FIND query per suffix, run in parallel.
- Step 2: Filter for long and unique spans. Keep the top K spans with the lowest span unigram probability.
- Step 3: Retrieve enclosing documents. Retrieve up to 10 document snippets per kept span (sampled if more than 10).
- Step 4: Merge spans and documents. Merge overlapping spans for the UI, and merge snippets from the same source document.
- Step 5: Rerank and color by relevance. Rerank documents using BM25 (query: prompt + response; corpus: retrieved documents), then color document sidebars and span highlights by relevance score (high/medium/low).

Output: highlighted spans and ranked source documents.
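The span finding in Step 1 can be illustrated on a toy corpus. The real system answers each membership query in O(1) via the infini-gram suffix-array index over trillions of tokens; the naive `exists` scan below is only a stand-in for that index, and all names here are hypothetical.

```python
def contains(doc, span):
    """True if `span` occurs as a contiguous token run inside `doc`."""
    m = len(span)
    return any(doc[i:i + m] == span for i in range(len(doc) - m + 1))

def exists(span, corpus):
    """Stand-in for infini-gram's FIND query over the training corpus."""
    return any(contains(doc, span) for doc in corpus)

def maximal_matching_spans(response, corpus, min_len=2):
    """Greedily extend a verbatim match from each start position, then
    keep only spans not contained inside a longer matching span."""
    spans = []
    for i in range(len(response)):
        j = i + 1
        while j <= len(response) and exists(response[i:j], corpus):
            j += 1
        j -= 1  # last length that still matched verbatim
        if j - i >= min_len:
            spans.append((i, j))
    return [s for s in spans
            if not any(t != s and t[0] <= s[0] and s[1] <= t[1] for t in spans)]

corpus = [["the", "quick", "brown", "fox"],
          ["jumps", "over", "the", "lazy", "dog"]]
response = ["the", "quick", "brown", "cat", "jumps", "over"]
spans = maximal_matching_spans(response, corpus)  # token index ranges
```

Here "the quick brown" and "jumps over" both appear verbatim in the corpus, while shorter sub-spans are discarded by the maximality filter.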
Q1. What is the primary innovation that allows OLMoTrace to efficiently trace language model outputs back to training data?
- A novel tokenization algorithm that reduces the size of training data
- A parallel algorithm built on infini-gram that processes suffixes simultaneously
- A reinforcement learning approach that predicts likely training sources

Q2. How does OLMoTrace highlight spans in language model responses?
- Using a single color for all matching spans regardless of document relevance
- Using different colors based on the length of the matching span
- Using color saturation levels to indicate the relevance of source documents

Q3. What is the total size of the training data that OLMoTrace indexes and searches for OLMo-2-32B-Instruct?
- Approximately 460 billion tokens
- Approximately 4.6 trillion tokens
- Approximately 46 trillion tokens

Paper 3

A Unified Agentic Framework for Evaluating Conditional Image Generation

Published: 2025-04-09

Link: http://arxiv.org/pdf/2504.07046

1. 📘 Topic and Domain: The paper introduces CIGEVAL, a unified agentic framework for evaluating conditional image generation across various tasks such as text-guided image generation, subject-driven image editing, and control-guided image generation.
2. 💡 Previous Research and New Ideas: The paper builds upon previous image evaluation metrics like CLIP-Score, LPIPS, and VIESCORE, but proposes a novel approach that integrates large multimodal models (LMMs) with specialized tools to overcome limitations in task specificity, explainability, and human alignment.
3. ❓ Problem: The paper addresses the challenge of developing task-agnostic, reliable, and explainable evaluation metrics for conditional image generation that can align with human judgment across diverse generation tasks.
4. 🛠️ Methods: The authors implement an agentic framework that combines LMMs (like GPT-4o or open-source models) with a multi-functional toolbox (including Grounding, Highlight, Difference, and Scene Graph tools) and fine-grained evaluation through task decomposition, tool selection, and analysis.
5. 📊 Results and Evaluation: CIGEVAL with GPT-4o achieves a Spearman correlation of 0.4625 with human assessments across seven tasks, closely matching the human-to-human correlation of 0.47. When implemented with 7B open-source LMMs fine-tuned on only 2.3K training trajectories, it surpasses previous GPT-4o-based state-of-the-art methods.

CIGEVAL Methodology Flowchart

Input: a conditional image generation task, consisting of the generated image (O), conditions (C*), and instruction (I).

CIGEVAL agent core:
- A large multimodal model (LMM).
- A multi-functional toolbox: Grounding, Highlight, Difference, and Scene Graph tools.

Fine-grained evaluation framework:
(1) Task decomposition based on C* and I.
(2) Tool selection: the agent decides, invoking the toolbox when needed.
(3) Analysis in ReAct style (observation, thought, action), covering both the inputs and the tool outputs.
(4) Fine-grained scoring.
(5) Score aggregation (min).

Output: a rationale and a final score (0.0-1.0).

Agent tuning (for open-source LMMs):
- Generate evaluation trajectories with the GPT-4o agent.
- Filter trajectories, keeping those where the agent score approximately matches the human score.
- Run supervised fine-tuning (SFT) on the filtered data (loss on thought and action), yielding a tuned open-source LMM agent.

Evaluation: benchmarked on ImagenHub against baselines and human correlation; ablation studies validate each tool's contribution.
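The scoring path through steps (1), (4), and (5) can be sketched as follows. Only the min-aggregation comes from the flowchart; the sub-questions and the stubbed per-question scores are invented for illustration, standing in for the LMM agent's actual judgments.

```python
def evaluate(sub_questions, score_fn):
    """Score each sub-question in [0, 1], then aggregate with min,
    so one failed aspect caps the final score (step 5)."""
    sub_scores = {q: score_fn(q) for q in sub_questions}
    return min(sub_scores.values()), sub_scores

# Hypothetical decomposition of a subject-driven editing task:
questions = [
    "Does the generated image follow the edit instruction?",
    "Is the subject's identity preserved from the reference image?",
    "Is the untouched background left unchanged?",
]

# Stub standing in for the LMM's per-question judgment:
stub_scores = {questions[0]: 0.9, questions[1]: 0.6, questions[2]: 1.0}
final, per_q = evaluate(questions, stub_scores.get)
```

Taking the minimum rather than the mean makes the metric conservative: an image that fails identity preservation cannot be rescued by a perfect background score.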
Q1. What is the main innovation of CIGEVAL compared to previous image evaluation metrics?
- It uses only GPT-4o as the evaluation model
- It integrates LMMs with specialized tools in an agentic framework
- It focuses exclusively on text-guided image generation

Q2. How many training trajectories were used to fine-tune the open-source 7B LMMs in CIGEVAL?
- 47,000 trajectories
- 23,000 trajectories
- 2,300 trajectories

Q3. What tool in CIGEVAL's toolbox is used to detect subtle differences between two similar images?
- Scene Graph
- Grounding
- Difference