2026-02-06 Papers

1/2

Paper 1

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05386

1. 📘 Topic and Domain: The paper addresses security vulnerabilities in LLM-powered autonomous agents by proposing an intrinsic defense mechanism for real-world agent deployments.

2. 💡 Previous Research and New Ideas: Building on existing mandatory checking paradigms that forcibly trigger security validation at predefined stages, the paper proposes Intrinsic Risk Sensing (IRS) - a novel event-driven defense that embeds risk awareness directly into the agent's execution flow.

3. ❓ Problem: The paper solves the problem of excessive latency and false positives in agent security systems caused by mandatory stage-wise security checks that accumulate overhead in complex, multi-step agent workflows.

4. 🛠️ Methods: The authors use instruction-level conditioning to enable autonomous risk sensing across four critical stages (query, planning, action, observation), combined with Hierarchical Adaptive Screening that balances fast similarity matching with deep reasoning for threat validation.

5. 📊 Results and Evaluation: SPIDER-SENSE achieves the lowest Attack Success Rate and False Positive Rate on benchmarks including their new S2Bench dataset, with only 8.3% latency overhead compared to 197-381% for baseline methods.

Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

1/2

Paper 2

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.06028

1. 📘 Topic and Domain: The paper addresses autoregressive video generation with long context, focusing on maintaining consistency in long-form video synthesis within the domain of generative AI and computer vision.

2. 💡 Previous Research and New Ideas: The paper builds upon causal video diffusion models and Distribution Matching Distillation (DMD), proposing Context Forcing which resolves the student-teacher mismatch by training a long-context student via a long-context teacher aware of full generation history.

3. ❓ Problem: The paper solves the "Forgetting-Drifting Dilemma" where current models either lose track of subjects with short memory windows or accumulate errors with long contexts, limiting effective context to just a few seconds.

4. 🛠️ Methods: The method uses a two-stage curriculum with contextual distribution matching distillation, a Slow-Fast Memory architecture for context management, and Error-Recycling Fine-Tuning (ERFT) to train robust teachers.

5. 📊 Results and Evaluation: Context Forcing achieves 20+ seconds effective context length (2-10× longer than state-of-the-art), maintaining superior consistency on VBench metrics and enabling minute-level video generation with minimal drifting.

Context Forcing: Consistent Autoregressive Video Generation with Long Context

1/2

Paper 3

RISE-Video: Can Video Generators Decode Implicit World Rules?

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05986

1. 📘 Topic and Domain: The paper focuses on evaluating the reasoning capabilities of Text-Image-to-Video (TI2V) generation models, specifically their ability to internalize and apply implicit world rules beyond surface-level visual quality.

2. 💡 Previous Research and New Ideas: Building on existing video generation benchmarks that emphasize perceptual quality and temporal coherence (like VBench), this work introduces a reasoning-oriented evaluation framework with 467 human-annotated samples across eight reasoning dimensions and an automated LMM-based evaluation pipeline.

3. ❓ Problem: The paper addresses the lack of evaluation protocols for assessing whether TI2V models can reliably understand and reason over implicit world rules, as current benchmarks predominantly focus on visual fidelity rather than cognitive reasoning abilities.

4. 🛠️ Methods: The authors develop RISE-Video benchmark with four evaluation metrics (Reasoning Alignment, Temporal Consistency, Physical Rationality, Visual Quality), use GPT-5 as an automated judge with specially designed prompts, and evaluate 11 state-of-the-art TI2V models across eight reasoning categories.

5. 📊 Results and Evaluation: Experiments reveal that all models achieve low accuracy scores (best: 22.5% by Hailuo 2.3), with closed-source models outperforming open-source ones; models perform better on perceptual knowledge but struggle significantly with logical capability tasks, and the LMM-based evaluation shows strong alignment with human judgments.