2025-10-07 Papers


Paper 1

Paper2Video: Automatic Video Generation from Scientific Papers

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05096

1. 📘 Topic and Domain: Automatic generation of academic presentation videos from research papers using AI agents, in the domain of computer vision and AI for research.
2. 💡 Previous Research and New Ideas: Based on prior work in slide generation and video synthesis, proposing the first comprehensive framework to generate complete academic presentations including slides, speech, talking head, and cursor movements.
3. ❓ Problem: The highly labor-intensive process of creating academic presentation videos (taking hours to produce 2-10 minute videos), which involves slide design, recording, and editing.
4. 🛠️ Methods: Developed PaperTalker, a multi-agent framework that integrates slide generation with layout refinement, subtitling, speech synthesis, cursor grounding, and talking-head rendering, while enabling parallel slide-wise generation.
5. 📊 Results and Evaluation: The system outperformed human-made presentations by 10% in PresentQuiz accuracy and achieved comparable ratings in user studies, evaluated using a new benchmark (Paper2Video) with 101 paired papers and presentations and four novel metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory).
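The slide-wise parallelism described in the methods can be sketched in a few lines; `build_slide_clip` below is a hypothetical placeholder for the per-slide work (subtitling, speech synthesis, cursor grounding, talking-head rendering), not PaperTalker's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def build_slide_clip(slide):
    # Hypothetical stand-in for the per-slide pipeline: subtitle
    # generation, speech synthesis, cursor grounding, talking head.
    return f"clip_for_{slide}"

slides = ["slide1", "slide2", "slide3"]

# Each slide is processed independently, so wall-clock time grows with
# the slowest slide rather than the slide count, given enough workers.
with ThreadPoolExecutor(max_workers=len(slides)) as pool:
    clips = list(pool.map(build_slide_clip, slides))  # order preserved
```

This independence across slides is the intuition behind the paper's reported 6× generation speedup.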

[Figure: PaperTalker multi-agent pipeline]
- Input: paper LaTeX project, author portrait, voice sample
- Slide Builder: LaTeX Beamer code generation, focused debugging, tree-search visual choice for layout refinement
- Subtitle Builder: VLM-based subtitling with visual-focus prompts and sentence-level analysis
- Cursor Builder: GUI grounding (UI-TARS), WhisperX temporal alignment, spatial-temporal sync
- Talker Builder: F5-TTS speech synthesis, Hallo2 talking-head rendering
- Parallelism: independent slide-wise generation, 6× speedup
- Output: academic slides, synchronized subtitles, cursor guidance, personalized speaker, high-quality audio
- Benchmark: Paper2Video, 101 paper-video pairs; metrics: Meta Similarity (VLM-based alignment), PresentArena (pairwise comparison), PresentQuiz (knowledge assessment), IP Memory (author recognition)
- Highlights: tree-search visual choice for layout optimization, spatial-temporal cursor alignment, parallel slide-wise generation, multi-modal long-context understanding; 10% better quiz accuracy, human-level quality
Q1
1. What is the key innovation in how PaperTalker handles slide generation compared to previous approaches?
Using XML-based templates and manual editing
Using LaTeX code with tree search visual choice for layout optimization
Using PowerPoint templates with automatic content filling
Q2
2. How does the Paper2Video benchmark evaluate a presentation video's effectiveness in promoting research visibility?
By counting the number of views and likes
By measuring download statistics of the paper
By testing if viewers can recall and pose relevant questions about the work later
Q3
3. What unique efficiency improvement does PaperTalker implement in video generation?
Using cloud computing resources
Parallelizing generation across slides for 6x speedup
Reducing video quality for faster processing

Paper 2

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05094

1. 📘 Topic and Domain: Video generation with improved reasoning capabilities through chain-of-visual-thought framework.
2. 💡 Previous Research and New Ideas: Based on recent video generation models and large language/multimodal models, proposing a novel framework called VChain that combines the reasoning capabilities of multimodal models with video generation.
3. ❓ Problem: Current video generation models struggle to produce coherent sequences with logical state transitions and causal relationships, despite having good visual quality.
4. 🛠️ Methods: Uses a three-stage approach: Visual Thought Reasoning (generating keyframes with GPT-4o), Sparse Inference-Time Tuning (fine-tuning the video generator on those keyframes via LoRA with a flow-matching loss), and Video Sampling (generating the final video with the tuned model).
5. 📊 Results and Evaluation: VChain significantly improved video generation quality across multiple metrics including physics, commonsense reasoning, and causal reasoning, achieving up to 62.12% improvement in causal reasoning compared to baselines.
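The flow-matching objective used in Sparse Inference-Time Tuning can be illustrated with a small NumPy sketch. The zero-velocity `model` below is a dummy stand-in (the paper applies this objective through LoRA adapters on a DiT video generator), so treat this as the shape of the loss, not the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, keyframe):
    """Toy flow-matching loss on one supervising keyframe (illustrative)."""
    x1 = keyframe                       # data sample: a visual keyframe
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # point on the linear noise-to-data path
    v_target = x1 - x0                  # target velocity along that path
    v_pred = model(x_t, t)              # model predicts the velocity field
    return float(np.mean((v_pred - v_target) ** 2))

# Dummy "generator" that predicts zero velocity, just to exercise the loss.
keyframe = rng.standard_normal((4, 4))
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), keyframe)
```

Because only a handful of keyframes supervise the tuning, the adaptation stays sparse and cheap relative to full retraining.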

[Figure: VChain three-stage pipeline]
- Stage 1, Visual Thought Reasoning: GPT-4o infers the consequences described by the prompt, then iteratively generates and edits keyframes to form a chain of visual thoughts
- Stage 2, Sparse Inference-Time Tuning: the keyframes supervise LoRA adaptation of a pre-trained DiT video generator under a flow-matching loss
- Stage 3, Video Sampling: textual thoughts are concatenated into the prompt and the fine-tuned model samples the final coherent video
- Technical foundation: diffusion models, flow matching, GPT-4o (multimodal LLM), LoRA (low-rank adaptation), Wan video generator
- Benefits: self-contained, efficient, no external training data, minimal overhead
Q1
1. What is the main limitation that VChain aims to address in current video generation models?
Poor visual quality and resolution
Lack of coherent causal relationships and logical state transitions
Slow processing speed and high computational requirements
Q2
2. Which component of VChain is responsible for generating the key frames that guide the video generation process?
Sparse Inference-Time Tuning
Video Sampling
Visual Thought Reasoning
Q3
3. If you want to generate a video of an ice cream melting, which aspect would VChain most likely improve compared to traditional methods?
The color accuracy of the ice cream
The gradual progression of the melting process
The background scene details

Paper 3

Imperceptible Jailbreaking against Large Language Models

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05025

1. 📘 Topic and Domain: The paper explores imperceptible jailbreaking attacks against Large Language Models (LLMs) using invisible Unicode variation selector characters.
2. 💡 Previous Research and New Ideas: Previous jailbreak research relied on visible modifications to prompts, while this paper is the first to construct jailbreaks from invisible Unicode variation selector characters.
3. ❓ Problem: The paper addresses how to create effective jailbreak prompts without any visible modifications to malicious questions, making attacks harder to detect visually.
4. 🛠️ Methods: The authors propose a chain-of-search pipeline that optimizes invisible variation selector suffixes through multiple rounds of random search, reusing successful components to improve attack effectiveness.
5. 📊 Results and Evaluation: The method achieved high attack success rates (80-100%) against four aligned LLMs while remaining visually indistinguishable from original prompts, and successfully generalized to prompt injection attacks.
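The invisibility mechanism itself is easy to demonstrate: Unicode variation selectors add code points without adding glyphs. The snippet below (with a benign placeholder string) shows only that mechanism; the paper's contribution is the chain-of-search that optimizes which selectors to append.

```python
# Unicode variation selectors U+FE00..U+FE0F (plus a supplementary block,
# U+E0100..U+E01EF) render as nothing on their own, yet appending them
# changes the code-point sequence that an LLM's tokenizer sees.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)]

original = "Describe the experiment."      # benign placeholder prompt
suffix = "".join(VARIATION_SELECTORS[:5])  # five invisible characters
modified = original + suffix

changed = modified != original             # the strings differ underneath
extra = len(modified) - len(original)      # 5 extra code points, zero glyphs
```

Rendered side by side, `original` and `modified` look identical, which is why such suffixes evade visual inspection.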

[Figure: imperceptible jailbreaking workflow]
- Input: malicious questions from AdvBench; 256 invisible Unicode variation selector characters
- Chain-of-search (R = 5 rounds): initialize random suffixes, run random search (T = 10,000 iterations) with target-start token optimization, collect successful suffixes and tokens, and bootstrap them as initialization for still-failing questions
- Parameters: suffix length L = 800-1,200 variation selectors, M = 10 modifications per step
- Output: jailbreak prompts visually identical to the originals but tokenized differently by the LLM
- Evaluation: GPT-4 judge, attack success rate (ASR) of 80-100% across four aligned LLMs; the method also generalizes to prompt injection
Q1
1. What is the key innovation of this paper's jailbreaking method compared to previous approaches?
Using Unicode variation selectors as invisible characters
Applying gradient-based optimization techniques
Creating longer prompt templates
Q2
2. In the chain-of-search pipeline, what happens when a successful jailbreak is found?
The search immediately terminates
The successful suffix and target-start tokens are reused as initialization for other cases
The model is retrained from scratch
Q3
3. What makes Llama-3.1-Instruct-8B different from other tested models in terms of attack implementation?
It requires a shorter suffix length
It needs no variation selectors
It requires a longer suffix length of 1,200 variation selectors while others need 800