2025-10-07 Papers


Paper 1

Paper2Video: Automatic Video Generation from Scientific Papers

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05096

1. 📘 Topic and Domain: Automatic generation of academic presentation videos from research papers using AI agents, in the domain of computer vision and AI for research.
2. 💡 Previous Research and New Ideas: Based on prior work in slide generation and video synthesis, proposing the first comprehensive framework to generate complete academic presentations including slides, speech, talking head, and cursor movements.
3. ❓ Problem: The highly labor-intensive process of creating academic presentation videos (taking hours to produce 2-10 minute videos), which involves slide design, recording, and editing.
4. 🛠️ Methods: Developed PaperTalker, a multi-agent framework that integrates slide generation with layout refinement, subtitling, speech synthesis, cursor grounding, and talking-head rendering, while enabling parallel slide-wise generation.
5. 📊 Results and Evaluation: The system outperformed human-made presentations by 10% in PresentQuiz accuracy and achieved comparable ratings in user studies, evaluated using a new benchmark (Paper2Video) with 101 paired papers and presentations and four novel metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory).
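The slide-wise parallelism described in the methods can be sketched in a few lines; `build_slide_clip` below is a hypothetical placeholder for the per-slide work (subtitling, speech synthesis, cursor grounding, talking-head rendering), not PaperTalker's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def build_slide_clip(slide):
    # Hypothetical stand-in for the per-slide pipeline: subtitle
    # generation, speech synthesis, cursor grounding, talking head.
    return f"clip_for_{slide}"

slides = ["slide1", "slide2", "slide3"]

# Each slide is processed independently, so wall-clock time grows with
# the slowest slide rather than the slide count, given enough workers.
with ThreadPoolExecutor(max_workers=len(slides)) as pool:
    clips = list(pool.map(build_slide_clip, slides))  # order preserved
```

This independence across slides is the intuition behind the paper's reported 6× generation speedup.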

[Figure: PaperTalker multi-agent pipeline]
- Input: paper LaTeX project, author portrait, voice sample
- Slide Builder: LaTeX Beamer code generation, focused debugging, tree-search visual choice for layout refinement
- Subtitle Builder: VLM-based subtitling with visual-focus prompts and sentence-level analysis
- Cursor Builder: GUI grounding (UI-TARS), WhisperX temporal alignment, spatial-temporal sync
- Talker Builder: F5-TTS speech synthesis, Hallo2 talking-head rendering
- Parallelism: independent slide-wise generation, 6× speedup
- Output: academic slides, synchronized subtitles, cursor guidance, personalized speaker, high-quality audio
- Benchmark: Paper2Video, 101 paper-video pairs; metrics: Meta Similarity (VLM-based alignment), PresentArena (pairwise comparison), PresentQuiz (knowledge assessment), IP Memory (author recognition)
- Highlights: tree-search visual choice for layout optimization, spatial-temporal cursor alignment, parallel slide-wise generation, multi-modal long-context understanding; 10% better quiz accuracy, human-level quality
Q1
1. What is the key innovation in how PaperTalker handles slide generation compared to previous approaches?
Using XML-based templates and manual editing
Using LaTeX code with tree search visual choice for layout optimization
Using PowerPoint templates with automatic content filling
Q2
2. How does the Paper2Video benchmark evaluate a presentation video's effectiveness in promoting research visibility?
By counting the number of views and likes
By measuring download statistics of the paper
By testing if viewers can recall and pose relevant questions about the work later
Q3
3. What unique efficiency improvement does PaperTalker implement in video generation?
Using cloud computing resources
Parallelizing generation across slides for 6x speedup
Reducing video quality for faster processing

Paper 2

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05094

1. 📘 Topic and Domain: Video generation with improved reasoning capabilities through chain-of-visual-thought framework.
2. 💡 Previous Research and New Ideas: Based on recent video generation models and large language/multimodal models, proposing a novel framework called VChain that combines the reasoning capabilities of multimodal models with video generation.
3. ❓ Problem: Current video generation models struggle to produce coherent sequences with logical state transitions and causal relationships, despite having good visual quality.
4. 🛠️ Methods: Uses a three-stage approach: Visual Thought Reasoning (generating keyframes with GPT-4o), Sparse Inference-Time Tuning (fine-tuning the video generator on those keyframes via LoRA with a flow-matching loss), and Video Sampling (generating the final video with the tuned model).
5. 📊 Results and Evaluation: VChain significantly improved video generation quality across multiple metrics including physics, commonsense reasoning, and causal reasoning, achieving up to 62.12% improvement in causal reasoning compared to baselines.
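The flow-matching objective used in Sparse Inference-Time Tuning can be illustrated with a small NumPy sketch. The zero-velocity `model` below is a dummy stand-in (the paper applies this objective through LoRA adapters on a DiT video generator), so treat this as the shape of the loss, not the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, keyframe):
    """Toy flow-matching loss on one supervising keyframe (illustrative)."""
    x1 = keyframe                       # data sample: a visual keyframe
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # point on the linear noise-to-data path
    v_target = x1 - x0                  # target velocity along that path
    v_pred = model(x_t, t)              # model predicts the velocity field
    return float(np.mean((v_pred - v_target) ** 2))

# Dummy "generator" that predicts zero velocity, just to exercise the loss.
keyframe = rng.standard_normal((4, 4))
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), keyframe)
```

Because only a handful of keyframes supervise the tuning, the adaptation stays sparse and cheap relative to full retraining.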

[Figure: VChain three-stage pipeline]
- Stage 1, Visual Thought Reasoning: GPT-4o infers the consequences described by the prompt, then iteratively generates and edits keyframes to form a chain of visual thoughts
- Stage 2, Sparse Inference-Time Tuning: the keyframes supervise LoRA adaptation of a pre-trained DiT video generator under a flow-matching loss
- Stage 3, Video Sampling: textual thoughts are concatenated into the prompt and the fine-tuned model samples the final coherent video
- Technical foundation: diffusion models, flow matching, GPT-4o (multimodal LLM), LoRA (low-rank adaptation), Wan video generator
- Benefits: self-contained, efficient, no external training data, minimal overhead
Q1
1. What is the main limitation that VChain aims to address in current video generation models?
Poor visual quality and resolution
Lack of coherent causal relationships and logical state transitions
Slow processing speed and high computational requirements
Q2
2. Which component of VChain is responsible for generating the key frames that guide the video generation process?
Sparse Inference-Time Tuning
Video Sampling
Visual Thought Reasoning
Q3
3. If you want to generate a video of an ice cream melting, which aspect would VChain most likely improve compared to traditional methods?
The color accuracy of the ice cream
The gradual progression of the melting process
The background scene details

Paper 3

Imperceptible Jailbreaking against Large Language Models

Published: 2025-10-06

Link: http://arxiv.org/pdf/2510.05025

1. 📘 Topic and Domain: The paper explores imperceptible jailbreaking attacks against Large Language Models (LLMs) using invisible Unicode variation selector characters.
2. 💡 Previous Research and New Ideas: Previous jailbreak research relied on visible modifications to prompts, while this paper is the first to construct jailbreaks from invisible Unicode variation selector characters.
3. ❓ Problem: The paper addresses how to create effective jailbreak prompts without any visible modifications to malicious questions, making attacks harder to detect visually.
4. 🛠️ Methods: The authors propose a chain-of-search pipeline that optimizes invisible variation selector suffixes through multiple rounds of random search, reusing successful components to improve attack effectiveness.
5. 📊 Results and Evaluation: The method achieved high attack success rates (80-100%) against four aligned LLMs while remaining visually indistinguishable from original prompts, and successfully generalized to prompt injection attacks.
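The invisibility mechanism itself is easy to demonstrate: Unicode variation selectors add code points without adding glyphs. The snippet below (with a benign placeholder string) shows only that mechanism; the paper's contribution is the chain-of-search that optimizes which selectors to append.

```python
# Unicode variation selectors U+FE00..U+FE0F (plus a supplementary block,
# U+E0100..U+E01EF) render as nothing on their own, yet appending them
# changes the code-point sequence that an LLM's tokenizer sees.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)]

original = "Describe the experiment."      # benign placeholder prompt
suffix = "".join(VARIATION_SELECTORS[:5])  # five invisible characters
modified = original + suffix

changed = modified != original             # the strings differ underneath
extra = len(modified) - len(original)      # 5 extra code points, zero glyphs
```

Rendered side by side, `original` and `modified` look identical, which is why such suffixes evade visual inspection.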

[Figure: imperceptible jailbreaking workflow]
- Input: malicious questions from AdvBench; 256 invisible Unicode variation selector characters
- Chain-of-search (R = 5 rounds): initialize random suffixes, run random search (T = 10,000 iterations) with target-start token optimization, collect successful suffixes and tokens, and bootstrap them as initialization for still-failing questions
- Parameters: suffix length L = 800-1,200 variation selectors, M = 10 modifications per step
- Output: jailbreak prompts visually identical to the originals but tokenized differently by the LLM
- Evaluation: GPT-4 judge, attack success rate (ASR) of 80-100% across four aligned LLMs; the method also generalizes to prompt injection
Q1
1. What is the key innovation of this paper's jailbreaking method compared to previous approaches?
Using Unicode variation selectors as invisible characters
Applying gradient-based optimization techniques
Creating longer prompt templates
Q2
2. In the chain-of-search pipeline, what happens when a successful jailbreak is found?
The search immediately terminates
The successful suffix and target-start tokens are reused as initialization for other cases
The model is retrained from scratch
Q3
3. What makes Llama-3.1-Instruct-8B different from other tested models in terms of attack implementation?
It requires a shorter suffix length
It needs no variation selectors
It requires a longer suffix length of 1,200 variation selectors while others need 800