2026-02-06 Papers

1/2

Paper 1

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05386

1. 📘 Topic and Domain: The paper addresses security vulnerabilities in LLM-powered autonomous agents by proposing an intrinsic defense mechanism for real-world agent deployments.
2. 💡 Previous Research and New Ideas: Building on existing mandatory checking paradigms that forcibly trigger security validation at predefined stages, the paper proposes Intrinsic Risk Sensing (IRS) - a novel event-driven defense that embeds risk awareness directly into the agent's execution flow.
3. ❓ Problem: The paper solves the problem of excessive latency and false positives in agent security systems caused by mandatory stage-wise security checks that accumulate overhead in complex, multi-step agent workflows.
4. 🛠️ Methods: The authors use instruction-level conditioning to enable autonomous risk sensing across four critical stages (query, planning, action, observation), combined with Hierarchical Adaptive Screening that balances fast similarity matching with deep reasoning for threat validation.
5. 📊 Results and Evaluation: SPIDER-SENSE achieves the lowest Attack Success Rate and False Positive Rate on benchmarks including their new S2Bench dataset, with only 8.3% latency overhead compared to 197-381% for baseline methods.

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Spider-Sense Framework Flow Query Stage User Input q Plan Stage Internal Plan Pt Action Stage Execute at Observation Stage Receive ot Intrinsic Risk Sensing (IRS) - Continuous Monitoring φ¹ φ² φ³ φ⁴ ! Hierarchical Adaptive Screening (HAC) Coarse-grained Detection Fine-grained Analysis Vector DB Decision: ACCEPT / REJECT / SANITIZE Legend: Query Plan Action Observation Risk Triggered
Q1
1. What pop culture reference inspired the name and concept of the SPIDER-SENSE framework?
The spider's web-like neural networks used in deep learning
Spider-Man's ability to sense danger before it happens
The distributed nature of spider colonies working together
Q2
2. How does SPIDER-SENSE's latency overhead compare to existing guardrail-based defense methods?
8.3% for SPIDER-SENSE vs 197-381% for baselines
83% for SPIDER-SENSE vs 19-38% for baselines
0.83% for SPIDER-SENSE vs 97-181% for baselines
Q3
3. What are the four security-critical stages where SPIDER-SENSE monitors for potential risks?
Input, Processing, Output, Feedback
Authentication, Authorization, Execution, Logging
Query, Plan, Action, Observation
1/2

Paper 2

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.06028

1. 📘 Topic and Domain: The paper addresses autoregressive video generation with long context, focusing on maintaining consistency in long-form video synthesis within the domain of generative AI and computer vision.
2. 💡 Previous Research and New Ideas: The paper builds upon causal video diffusion models and Distribution Matching Distillation (DMD), proposing Context Forcing which resolves the student-teacher mismatch by training a long-context student via a long-context teacher aware of full generation history.
3. ❓ Problem: The paper solves the "Forgetting-Drifting Dilemma" where current models either lose track of subjects with short memory windows or accumulate errors with long contexts, limiting effective context to just a few seconds.
4. 🛠️ Methods: The method uses a two-stage curriculum with contextual distribution matching distillation, a Slow-Fast Memory architecture for context management, and Error-Recycling Fine-Tuning (ERFT) to train robust teachers.
5. 📊 Results and Evaluation: Context Forcing achieves 20+ seconds effective context length (2-10× longer than state-of-the-art), maintaining superior consistency on VBench metrics and enabling minute-level video generation with minimal drifting.

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Context Forcing Workflow Stage 1: Local Distribution Matching • Train on short windows (1-5s) • Minimize L_local = KL(p_θ(X_1:k) || p_T(X_1:k)) • Warm up causal student with DMD Stage 2: Contextual Distribution Matching • Train on long continuations (10-30s) • Minimize L_context with long teacher • Progressive rollout curriculum Context Management System Attention Sink Initial N_s tokens Stabilize attention Slow Memory Long-term buffer High-entropy frames Fast Memory Rolling FIFO queue Local context Robust Context Teacher • Error-Recycling Fine-Tuning • Process long contexts (20s+) Long Context Student • Contextual DMD training • Generate 60s+ videos Key Innovations ✓ Bounded Positional Encoding ✓ Surprisal-Based Consolidation ✓ Clean Context Policy Result: 20s+ effective context length, 2-10× improvement over SOTA
Q1
1. What is the fundamental issue that Context Forcing addresses in current autoregressive video generation methods?
The computational cost of generating videos longer than 5 seconds
The student-teacher mismatch where teachers can't access long-term history to guide students on global temporal dependencies
The lack of high-quality training data for long video sequences
Q2
2. How does the Slow-Fast Memory architecture in Context Forcing decide which frames to consolidate into long-term memory?
By randomly sampling frames at fixed intervals to ensure diversity
By comparing key vector similarity between consecutive frames and storing high-surprisal (dissimilar) frames
By using a neural network to predict which frames will be important for future generation
Q3
3. What specific training technique does Context Forcing use to make the Context Teacher robust to accumulated errors during inference?
Error-Recycling Fine-Tuning (ERFT) that injects realistic accumulated errors into the teacher's context during training
Adversarial training with a discriminator that detects synthetic frames
Multi-scale temporal augmentation with random frame dropping
1/2

Paper 3

RISE-Video: Can Video Generators Decode Implicit World Rules?

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05986

1. 📘 Topic and Domain: The paper focuses on evaluating the reasoning capabilities of Text-Image-to-Video (TI2V) generation models, specifically their ability to internalize and apply implicit world rules beyond surface-level visual quality.
2. 💡 Previous Research and New Ideas: Building on existing video generation benchmarks that emphasize perceptual quality and temporal coherence (like VBench), this work introduces a reasoning-oriented evaluation framework with 467 human-annotated samples across eight reasoning dimensions and an automated LMM-based evaluation pipeline.
3. ❓ Problem: The paper addresses the lack of evaluation protocols for assessing whether TI2V models can reliably understand and reason over implicit world rules, as current benchmarks predominantly focus on visual fidelity rather than cognitive reasoning abilities.
4. 🛠️ Methods: The authors develop RISE-Video benchmark with four evaluation metrics (Reasoning Alignment, Temporal Consistency, Physical Rationality, Visual Quality), use GPT-5 as an automated judge with specially designed prompts, and evaluate 11 state-of-the-art TI2V models across eight reasoning categories.
5. 📊 Results and Evaluation: Experiments reveal that all models achieve low accuracy scores (best: 22.5% by Hailuo 2.3), with closed-source models outperforming open-source ones; models perform better on perceptual knowledge but struggle significantly with logical capability tasks, and the LMM-based evaluation shows strong alignment with human judgments.

RISE-Video: Can Video Generators Decode Implicit World Rules?

RISE-Video: Methodology Flow Chart Data Construction Experiential Knowledge Perceptual Knowledge Temporal Knowledge Spatial Knowledge Commonsense Knowledge Societal Knowledge Subject Knowledge Logical Capability 467 Human-Annotated Samples Evaluation Metrics Reasoning Alignment • Knowledge-aware questions • Binary (Yes/No) Temporal Consistency • Non-instructed element stability • 1-5 scale Physical Rationality • Physics law adherence • 1-5 scale Visual Quality • Perceptual fidelity • Technical integrity • 1-3 scale LMM-based Evaluation Pipeline Frame Extraction Dimension-specific GPT-5 as Judge (GPT-5-mini for VQ) Automated Scoring High human alignment Evaluation Results 11 TI2V Models Evaluated Reasoning Limitations Revealed
Q1
1. What unique evaluation strategy does RISE-Video employ for 'Schematic Puzzles' like maze navigation and board games?
It uses standard LMM-as-Judge with longer prompts to describe the geometric constraints
It bypasses linguistic judging and uses specialized strategies like color-based trajectory tracking and grid-level structural alignment
It requires human annotators to manually verify each puzzle solution frame-by-frame
Q2
2. In the bottle drinking example shown in Figure 1, which models successfully demonstrated experiential reasoning by inferring the need to unscrew the bottle cap?
Only Veo 3.1 and Hailuo 2.3
All closed-source models including Kling 2.6 and Sora 2
CogVideoX1.5 and HunyuanVideo 1.5
Q3
3. How does RISE-Video weight its four evaluation metrics to calculate the overall Weighted Score?
Equal weights of 0.25 for all four metrics
Reasoning Alignment (0.4), Temporal Consistency (0.25), Physical Rationality (0.25), Visual Quality (0.1)
Visual Quality (0.4), Physical Rationality (0.3), Temporal Consistency (0.2), Reasoning Alignment (0.1)