2025-10-03 Papers


Paper 1

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Published: 2025-10-02

Link: http://arxiv.org/pdf/2510.02283

1. 📘 Topic and Domain: Long-form video generation using diffusion models, specifically focused on extending video generation beyond traditional short-duration limits.
2. 💡 Previous Research and New Ideas: Based on prior work in diffusion models and autoregressive video generation; introduces a novel approach called Self-Forcing++ that extends beyond the traditional 5-second limit of teacher models.
3. ❓ Problem: The challenge of generating high-quality long videos, as current models suffer from quality degradation, over-exposure, and error accumulation when generating videos beyond 5-10 seconds.
4. 🛠️ Methods: Uses backward noise initialization, extended distribution matching distillation, and rolling KV cache to train a student model on self-generated long rollouts while leveraging guidance from a teacher model.
5. 📊 Results and Evaluation: Achieved generation of high-quality videos up to 4 minutes and 15 seconds long (50x improvement over baseline), while maintaining visual stability and outperforming baseline methods in both fidelity and consistency metrics.
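The backward-noise-initialization step above can be sketched as follows. This is a minimal NumPy illustration assuming a standard forward-diffusion mixing rule; the function name `backward_noise_init` and the scalar `noise_level` parameterization are assumptions for illustration, not the paper's exact noising schedule:

```python
import numpy as np

def backward_noise_init(frames, noise_level, rng=None):
    """Re-inject Gaussian noise into degraded rollout frames (illustrative sketch).

    Pushing self-generated frames back toward the noisy manifold lets the
    teacher's score function provide corrective supervision on the student's
    own long-horizon rollouts. `noise_level` in [0, 1] controls the mix.
    """
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(frames.shape)
    # Forward-diffusion-style mixing: sqrt(1 - s) * clean + sqrt(s) * noise
    return np.sqrt(1.0 - noise_level) * frames + np.sqrt(noise_level) * noise
```

With `noise_level = 0` the frames pass through unchanged; with `noise_level = 1` they are replaced by pure noise, so intermediate values interpolate between the degraded rollout and the noise distribution the teacher was trained on.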

Figure: Self-Forcing++ methodology flow.

- ODE initialization: distill a bidirectional teacher into an autoregressive student.
- Student self-rollout: generate N frames (N >> T, e.g., 100 s).
- Backward noise initialization: re-inject noise into the degraded rollouts.
- Windowed sampling: take a uniform slice of K frames from the long sequence.
- Extended DMD: distribution matching against the teacher model.
- Rolling KV cache: train-inference alignment with no overlapping frames.
- GRPO enhancement: an optical-flow reward for temporal smoothness.
- Key innovation: the teacher corrects the student's own long-horizon error accumulation without long-video supervision, yielding 20x longer videos.
- Problems addressed: (1) temporal mismatch between training (5 s) and inference (100 s); (2) supervision misalignment from error accumulation in long rollouts.
- Achievements: 100 s generation (20x baseline); 4 min 15 s with scaling (50x); high visual stability; no quality degradation.
- New evaluation: a Visual Stability metric scored by Gemini-2.5-Pro, addressing VBench's bias in long-video evaluation.
- Training-budget scaling: 1x yields 5 s of coherent video; 4x adds semantic coherence; 8x adds detailed backgrounds; 20x reaches 50 s of stable video; 25x reaches 255 s of high-fidelity generation.
- Extended DMD loss: ∇_θ L_DMD^extended = E_t E_{i~Unif{1,...,N−K+1}} [∫ (s_T(Φ(G_θ(z_i), t), t) − s_{S_θ}(Φ(G_θ(z_i), t), t)) (dG_θ(z_i)/dθ) dz_i], where N >> K, enabling supervision beyond the teacher's horizon.
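The windowed-sampling step, a uniform K-frame slice drawn from an N-frame rollout matching the i ~ Unif{1, ..., N−K+1} index in the loss, can be sketched as below; `sample_window` is a hypothetical helper name:

```python
import numpy as np

def sample_window(rollout, K, rng=None):
    """Draw a uniformly random contiguous K-frame window from an N-frame rollout (N >> K)."""
    rng = np.random.default_rng(rng)
    N = len(rollout)
    # 0-indexed analogue of i ~ Unif{1, ..., N-K+1}: start index in {0, ..., N-K}
    i = int(rng.integers(0, N - K + 1))
    return rollout[i:i + K]
```

Because the start index ranges over the whole long rollout, windows deep into the sequence (where the student's errors have accumulated) receive teacher supervision even though the teacher itself was only trained on short clips.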
Q1
1. What is the main innovation of Self-Forcing++ compared to previous methods?
Using a larger transformer architecture
Training the student model on its own long, error-accumulated rollouts
Increasing the size of the training dataset
Q2
2. What is the maximum video length that Self-Forcing++ achieved in the experiments?
100 seconds
2 minutes and 30 seconds
4 minutes and 15 seconds
Q3
3. Why did the authors propose a new evaluation metric called Visual Stability?
To measure computational efficiency
Because VBench favors over-exposed and degraded frames
To evaluate audio-visual synchronization

Paper 2

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Published: 2025-10-02

Link: http://arxiv.org/pdf/2510.02314

1. 📘 Topic and Domain: A novel data poisoning attack method for 3D Gaussian Splatting (3DGS) in computer vision, specifically targeting neural rendering systems.
2. 💡 Previous Research and New Ideas: Builds on prior poisoning attacks against Neural Radiance Fields (NeRF); proposes a density-guided poisoning attack tailored to 3DGS's explicit representation, a setting that was previously unexplored.
3. ❓ Problem: Addresses the challenge of injecting visible illusory objects into specific target views of 3D Gaussian Splatting while keeping other viewpoints unaffected.
4. 🛠️ Methods: Uses Kernel Density Estimation (KDE) to identify low-density regions for placing poisoned Gaussian points, combined with adaptive noise scheduling to disrupt multi-view consistency during training.
5. 📊 Results and Evaluation: Achieves superior poisoning performance compared to baselines across multiple datasets, reaching PSNR > 25 dB on poisoned views while limiting the PSNR drop on innocent views to ≤ 3 dB, demonstrating successful illusion embedding while preserving scene fidelity.
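The KDE-based site selection from point 4 can be sketched in plain NumPy. Note that `kde_density` and `pick_poison_site` are hypothetical helper names, and the actual method additionally voxelizes the scene (AABB + grid) and restricts candidate positions to rays cast from the target camera; this sketch only shows the "place points where estimated density is lowest" idea:

```python
import numpy as np

def kde_density(points, queries, bandwidth=1.0, weights=None):
    """Weighted Gaussian KDE, mirroring f(x) = (1/|S|) * sum_s K_h(x - c(s)) * rho(s)."""
    if weights is None:
        weights = np.ones(len(points))
    diff = queries[:, None, :] - points[None, :, :]        # (m, n, dim) pairwise offsets
    sq = np.sum(diff * diff, axis=-1)                      # squared distances, (m, n)
    k = np.exp(-sq / (2.0 * bandwidth ** 2))               # isotropic Gaussian kernel
    return (k * weights[None, :]).sum(axis=1) / len(points)

def pick_poison_site(candidates, scene_points, densities, bandwidth=1.0):
    """Return the candidate position x_min = argmin f(x), i.e., the lowest-density site."""
    f = kde_density(scene_points, candidates, bandwidth, weights=densities)
    return candidates[np.argmin(f)]
```

Placing poisoned Gaussians in low-density regions means they face little competition from existing geometry when rendered from the target view, which is what makes the injected illusion both visible and hard to detect.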

Figure: StealthAttack density-guided 3DGS poisoning pipeline.

- Input: image dataset {I_k} for k = 1..N and the initial 3DGS point cloud G.
- Problem formulation: min ||Ĩ_ILL − I_ILL||² + Σ_k ||R(G̃, v_k) − R(G, v_k)||², i.e., inject the illusory object O_ILL at the target view v_p while preserving all innocent views.
- Strategy 1, density-guided point-cloud attack: compute the scene AABB and voxelize it into a grid; accumulate per-voxel density ρ(s) = Σ α(g); estimate a continuous density f(x) = (1/|S|) Σ_s K_h(x − c(s))·ρ(s) via kernel density estimation (KDE); cast rays from the target camera through the illusory object's pixels; select the optimal position x_min = argmin f(x) along those rays; insert poisoned Gaussian points at these minimum-density locations and assign them colors from the illusory object.
- Strategy 2, view-consistency disruption: apply adaptive noise I'_k = 1{v_k ≠ v_p} · CLIP(I_k + η) with η ~ N(0, σ_t²); candidate noise schedules are linear σ₀(1 − t/T), cosine σ₀·cos(πt/2T), and sqrt σ₀·√(1 − t/T); noise is strong early in optimization and gradually decays, while the poisoned view is kept clean.
- Outcome: 3DGS training under this disruption weakens multi-view consistency and preserves the injected illusion; the poisoned view shows a clear illusion O_ILL while innocent views retain high fidelity with minimal artifacts. A KDE-based metric additionally evaluates attack difficulty.
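The three noise schedules listed in the figure can be written directly; the function name `noise_sigma` is an assumption for illustration:

```python
import numpy as np

def noise_sigma(t, T, sigma0, schedule="linear"):
    """Noise strength at optimization step t for view-consistency disruption.

    All three schedules start at sigma0 (t = 0) and decay to 0 (t = T),
    so disruption is strongest early in training and fades out.
    """
    if schedule == "linear":
        return sigma0 * (1.0 - t / T)
    if schedule == "cosine":
        return sigma0 * np.cos(np.pi * t / (2.0 * T))
    if schedule == "sqrt":
        return sigma0 * np.sqrt(1.0 - t / T)
    raise ValueError(f"unknown schedule: {schedule}")
```

The schedules differ only in how fast they decay: sqrt holds noise high longest, linear is intermediate, and cosine decays slowly at first and then sharply near the end.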
Q1
1. What is the main innovation of the StealthAttack method compared to previous poisoning attacks?
It uses machine learning to generate realistic illusions
It identifies low-density regions using KDE to place poisoned points
It completely removes the need for training data
Q2
2. Why is attacking 3D Gaussian Splatting more challenging than attacking NeRF?
3DGS has stronger multi-view consistency constraints
3DGS requires more computational resources
3DGS uses simpler mathematical models
Q3
3. What metric combination defines a successful attack according to the paper?
PSNR > 30 on poisoned views with no PSNR drop on innocent views
PSNR > 25 on poisoned views with PSNR drop ≤ 3 on innocent views
PSNR > 20 on poisoned views with PSNR drop ≤ 5 on innocent views

Paper 3

Interactive Training: Feedback-Driven Neural Network Optimization

Published: 2025-10-02

Link: http://arxiv.org/pdf/2510.02297

1. 📘 Topic and Domain: The paper introduces Interactive Training, a framework for real-time, feedback-driven neural network optimization in machine learning.
2. 💡 Previous Research and New Ideas: Based on traditional static neural network training approaches, it proposes a novel interactive paradigm where humans or AI agents can dynamically intervene during the training process.
3. ❓ Problem: The paper addresses the limitations of static training paradigms that lack flexibility to respond to training issues like instabilities or underperformance without restarting the entire process.
4. 🛠️ Methods: The authors implemented a control server architecture with a React-based frontend dashboard that enables real-time monitoring and intervention through commands to adjust hyperparameters, training data, and model checkpoints.
5. 📊 Results and Evaluation: Through three case studies, they demonstrated superior training stability with human intervention, successful automated LLM-based hyperparameter adjustment, and effective real-time model adaptation using user-generated data.

Figure: Interactive Training framework workflow.

- Frontend dashboard: a React-based interface with real-time visualization and two-way communication.
- Control server: FastAPI-based, with command queues and state management; commands are dispatched via a REST API, and training updates stream back over WebSocket.
- Interactive trainer: a HuggingFace extension using callback functions to enable dynamic training.
- Supported interventions: optimizer (learning rate, momentum, weight decay); model (parameter reset, layer operations, gradient clipping); checkpoints (save/load, branching, rollback); dataset (data updates, mixing ratios, real-time data); control (pause/resume, evaluation, stop training).
- Case studies and applications: human-in-the-loop (GPT-2 on WikiText-2; an expert adjusts the learning rate based on real-time loss, achieving better convergence); LLM-in-the-loop (an automated AI agent corrects suboptimal hyperparameters and recovers from instability); real-time data updates (the NeuralOS application continuously incorporates real user interactions to improve the deployed model).
- Implementation: built on HuggingFace Transformers with minimal code changes required; open-source; WebSocket + REST API; callback-based architecture with an extensible command system.
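The command-queue pattern behind the control server and trainer callbacks might look roughly like the sketch below. All class, method, and command names here are hypothetical illustrations of the pattern, not the framework's actual API:

```python
import queue

class InteractiveTrainerSketch:
    """Minimal sketch of a trainer that drains a command queue each step.

    A control server would enqueue commands (e.g., from REST requests);
    the training loop applies them between steps, so interventions take
    effect without restarting the run.
    """

    def __init__(self, lr=1e-3):
        self.lr = lr
        self.paused = False
        self.commands = queue.Queue()  # filled asynchronously by the control server

    def submit(self, command, value=None):
        self.commands.put((command, value))

    def apply_pending(self):
        while not self.commands.empty():
            cmd, value = self.commands.get()
            if cmd == "set_lr":
                self.lr = value
            elif cmd == "pause":
                self.paused = True
            elif cmd == "resume":
                self.paused = False

    def training_step(self):
        self.apply_pending()   # check for interventions before every step
        if self.paused:
            return None        # skip the step while paused
        return self.lr         # placeholder for a real forward/backward/optimizer step
```

In the real framework this logic would live inside HuggingFace Trainer callbacks (e.g., fired at step boundaries), which is what keeps the required code changes minimal.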
Q1
1. What is the main limitation of traditional neural network training that this paper addresses?
Training takes too much computational power
Lack of flexibility to respond to training issues without restarting
Models are too large to train effectively
Q2
2. In the paper's case studies, which approach did NOT demonstrate successful interactive training?
Using LLMs to automatically adjust hyperparameters
Real-time updates with user-generated data
Using reinforcement learning for optimization
Q3
3. What analogy does the paper use to explain the difference between static and interactive training?
Driving a car vs riding a train
Baking in an oven vs cooking on a stove
Reading a book vs watching a movie