2026-03-16 Papers

Paper 1

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12255

1. 📘 Topic and Domain: The paper focuses on streaming visual-based spatial intelligence using test-time training (TTT) for multimodal large language models (MLLMs) to understand 3D spatial relationships from long-horizon video streams.
2. 💡 Previous Research and New Ideas: The paper builds on test-time training methods and existing spatial MLLMs, proposing a novel hybrid TTT architecture with spatial-predictive mechanisms using 3D spatiotemporal convolutions and dense scene-description supervision for effective fast-weight updates.
3. ❓ Problem: The paper addresses MLLMs' inability to maintain and update spatial understanding from unbounded video streams, where spatial information emerges gradually through continuous observations with changing viewpoints and occlusions.
4. 🛠️ Methods: The authors use a hybrid architecture interleaving TTT layers with self-attention layers (3:1 ratio), large-chunk updates with sliding-window attention, spatial-predictive 3D convolutions, and dense scene-description training data.
5. 📊 Results and Evaluation: Spatial-TTT-2B achieves state-of-the-art performance on VSI-Bench (64.4 Avg.), MindCube-Tiny (76.2 ACC), and VSI-SUPER benchmarks, outperforming both proprietary and open-source models while maintaining efficient memory usage for long videos.
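The two mechanisms named above, interleaving TTT layers with attention anchors at a 3:1 ratio and updating fast weights once per large chunk, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shapes, learning rate, and plain linear reconstruction loss are assumptions made for clarity.

```python
import numpy as np

def layer_schedule(depth: int, ttt_per_anchor: int = 3) -> list:
    """Interleave TTT layers with full self-attention 'anchor' layers
    at a ttt_per_anchor:1 ratio (3:1 here, i.e. 75% TTT / 25% anchor)."""
    return ["anchor" if (i + 1) % (ttt_per_anchor + 1) == 0 else "ttt"
            for i in range(depth)]

def ttt_fast_weight_update(W, chunk, target, lr=0.1):
    """One large-chunk fast-weight update at test time: a single
    gradient step on a reconstruction loss over the whole chunk,
    rather than a per-token update."""
    pred = chunk @ W                               # (chunk_len, d_out)
    grad = chunk.T @ (pred - target) / len(chunk)  # dL/dW for 0.5*MSE
    return W - lr * grad
```

With `depth=8` the schedule places anchors at layers 4 and 8, matching the 75%/25% split in the paper's architecture figure.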

Overview diagram: long-horizon video frames and a spatial question (navigation, counting, etc.) are encoded by a vision transformer and split into large chunks. The hybrid TTT architecture processes them with TTT layers (75%, using 3D spatiotemporal convolution, fast-weight updates, and sliding-window attention) interleaved with anchor layers (25%, full self-attention that preserves pretrained knowledge). A dense scene-description dataset (global context, objects, spatial relations) drives two-stage training (Stage 1: dense supervision; Stage 2: spatial VQA), producing spatial answers for navigation, counting, and spatial reasoning. Key innovations: hybrid TTT architecture, spatial-predictive mechanism, large-chunk updates, dense scene supervision, progressive training.
Q1. What is the key architectural innovation in Spatial-TTT that enables efficient processing of long-horizon spatial videos?
A hybrid architecture that interleaves TTT layers with self-attention anchor layers at a 3:1 ratio
A pure TTT architecture that completely replaces all attention layers with fast-weight networks
A standard transformer architecture with increased context window size
Q2. How does Spatial-TTT's spatial-predictive mechanism enhance the model's ability to capture geometric structure?
By using point-wise linear projections to generate Q, K, V for isolated tokens
By applying depth-wise 3D spatiotemporal convolutions with 3×3×3 kernels on Q, K, V projections
By increasing the number of attention heads in each transformer layer
Q3. What unique supervision strategy does Spatial-TTT employ to improve fast-weight update dynamics for spatial understanding?
Training exclusively on sparse spatial QA pairs with short answers like multiple-choice options
Using reinforcement learning with spatial rewards from robotic manipulation tasks
Constructing dense scene-description data requiring comprehensive 3D scene walkthroughs covering global context, object counts, and spatial relations

Paper 2

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12180

1. 📘 Topic and Domain: The paper introduces MADQA, a benchmark for evaluating multimodal agentic systems on document-intensive workflows involving complex reasoning over heterogeneous PDF collections.
2. 💡 Previous Research and New Ideas: Building on existing document QA benchmarks that focus on single documents or simple retrieval, the paper proposes a new benchmark requiring genuine multi-step planning and iterative retrieval across document collections with fully human-authored questions.
3. ❓ Problem: The paper aims to determine whether multimodal agents demonstrate strategic reasoning capabilities or merely rely on brute-force stochastic search when answering questions across document collections.
4. 🛠️ Methods: The authors created 2,250 human-authored questions over 800 PDFs, used Classical Test Theory for principled dataset splits, and introduced a novel evaluation protocol measuring accuracy-effort trade-offs using the Kuiper statistic.
5. 📊 Results and Evaluation: While the best agents matched human accuracy (~82%), they succeeded on different questions, relied on brute-force search, failed to close a 20% gap to oracle performance, and showed poor effort calibration compared to humans.
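In its textbook form, the Kuiper statistic used in the evaluation protocol is the sum of the largest positive and largest negative gaps between two cumulative curves. A minimal sketch over empirical CDFs is below; the paper's exact accuracy-effort construction may differ, so treat this as the generic statistic rather than the benchmark's protocol.

```python
import bisect

def empirical_cdf(sample, grid):
    """F(x) = fraction of sample values <= x, evaluated on grid points."""
    s, n = sorted(sample), len(sample)
    return [bisect.bisect_right(s, x) / n for x in grid]

def kuiper_statistic(a, b):
    """Kuiper statistic V = D+ + D-: the sum of the largest positive
    and largest negative deviations between two empirical CDFs.
    Unlike the Kolmogorov-Smirnov D, it is sensitive to deviations
    in both directions, which suits cumulative-difference comparisons
    such as effort-vs-accuracy curves."""
    grid = sorted(set(a) | set(b))
    fa = empirical_cdf(a, grid)
    fb = empirical_cdf(b, grid)
    d_plus = max(x - y for x, y in zip(fa, fb))
    d_minus = max(y - x for x, y in zip(fa, fb))
    return max(d_plus, 0.0) + max(d_minus, 0.0)
```

Identical samples give V = 0; fully separated samples give V = 1, so the statistic is bounded in [0, 2] and larger values mean worse alignment.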

Overview diagram: the MADQA corpus spans 800 PDFs across 13 domains (18,619 pages); human annotators wrote 2,250 questions with minimal evidence sets over 1,200+ hours. Construct-validity checks include a lexical-overlap test, 11.2% guessability, and 58% of questions requiring visual understanding. Classical Test Theory (CTT) splits the data into test (500), dev (200), and train (1,550). Six core properties: extractive, multi-hop, closed-world, grounded, agentic, visual. Baselines: BM25, an MLLM agent (iterative search + VLM), managed RAG (Gemini/OpenAI APIs), a recursive language model (RLM), and a human baseline with the same search tools. Evaluation protocol: an LLM judge for accuracy, page F1 for attribution, the Kuiper statistic for calibration, and document F1 for retrieval. Key findings: the best agents match human accuracy (82%) but use brute-force search; an 18% oracle gap remains; humans calibrate effort better.
Q1. What surprising finding did the researchers discover when comparing human and AI agent performance on MADQA?
AI agents were 50% more accurate than humans due to superior pattern recognition
Despite achieving similar accuracy (~82%), humans and agents succeeded on largely different questions with low agreement (κ=0.24)
Humans took 10x longer than AI agents to answer questions but were more accurate
Q2. Which evaluation metric did the authors introduce to measure whether agents efficiently allocate computational resources?
The Kuiper statistic, which measures effort-accuracy alignment through cumulative difference curves
The F1-score modified with a computational penalty term
A simple ratio of accuracy divided by number of API calls
Q3. What percentage of MADQA questions require visual understanding beyond plain text extraction?
Only 15% require visual artifacts like charts or checkboxes
Approximately 58% benefit from understanding structured layouts, tables, or visual artifacts
Nearly 90% are impossible without computer vision capabilities

Paper 3

LMEB: Long-horizon Memory Embedding Benchmark

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12572

1. 📘 Topic and Domain: The paper introduces LMEB (Long-horizon Memory Embedding Benchmark), a comprehensive evaluation framework for text embeddings focused on long-term, context-dependent memory retrieval tasks.
2. 💡 Previous Research and New Ideas: The paper builds on existing text embedding benchmarks like MTEB, BEIR, and MIRACL that focus on traditional passage retrieval, proposing a new benchmark that specifically evaluates models' ability to handle fragmented, temporally distant, and context-dependent memory retrieval across episodic, dialogue, semantic, and procedural memory types.
3. ❓ Problem: Current embedding benchmarks fail to adequately evaluate models' capacity to handle long-horizon memory retrieval tasks that involve recalling fragmented, context-dependent information over extended periods, leaving a gap in understanding how models perform in memory-intensive scenarios.
4. 🛠️ Methods: The authors compiled 22 datasets across 4 memory types with 193 zero-shot retrieval tasks, evaluated 15 embedding models (ranging from 239M to 12B parameters) using NDCG@10 and Recall@10 metrics, and analyzed correlations between LMEB and MTEB performance.
5. 📊 Results and Evaluation: The best model achieved 61.41 Mean (Dataset) score on NDCG@10, larger models didn't consistently outperform smaller ones, and LMEB showed orthogonality to MTEB (Pearson correlation: -0.115, Spearman: -0.130), indicating that traditional passage retrieval performance doesn't generalize to long-horizon memory retrieval.
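NDCG@10, the benchmark's main metric, discounts each document's relevance gain logarithmically by rank and normalizes by the DCG of the ideal ordering. A minimal sketch follows; the `qrels` dict mapping document IDs to graded relevance is an assumed interface, not LMEB's actual file format.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: the gain at rank i (0-based) is
    divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query: DCG of the returned ranking, normalized
    by the DCG of the ideal (relevance-sorted) ranking."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; placing the only relevant document at rank 2 instead of rank 1 scores 1/log2(3) ≈ 0.63, which is why the metric rewards placing relevant memories early.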

Overview diagram (method workflow): LMEB covers four memory types (episodic: 2 datasets, dialogue: 6, semantic: 8, procedural: 6; 22 datasets total). Datasets are processed into a unified IR format (queries.jsonl, corpus.jsonl, qrels.tsv, candidates.jsonl) and turned into 193 multi-granularity, zero-shot retrieval tasks with task instructions. Fifteen embedding models (239M to 12B parameters, max 1024 tokens) are evaluated with and without instructions using NDCG@10 (main), Recall@10, MAP, MRR, and Precision. Analysis covers performance comparison, LMEB-vs-MTEB correlation, model-size impact, and instruction effectiveness. Key findings: LMEB offers reasonable difficulty; larger models are not consistently better; LMEB is orthogonal to MTEB; instruction impact varies. Outputs: an open-source evaluation toolkit, a public leaderboard, and a standardized benchmark for long-horizon memory retrieval.
Q1. What surprising finding did the LMEB benchmark reveal about the relationship between model size and performance in long-horizon memory retrieval tasks?
Larger models consistently outperformed smaller models across all memory types
Model size had no impact on performance whatsoever
Smaller models like EmbeddingGemma-300M sometimes outperformed larger models like bge-multilingual-gemma2
Q2. How does LMEB categorize memory types along two key dimensions in its taxonomy?
Level of Abstraction and Temporal Dependency
Query Complexity and Document Length
Language Diversity and Task Difficulty
Q3. What does the near-zero correlation between LMEB and MTEB (Pearson: -0.115, Spearman: -0.130) indicate about these benchmarks?
They measure identical capabilities and can be used interchangeably
They are orthogonal, meaning excellence in traditional passage retrieval doesn't guarantee success in long-horizon memory retrieval
LMEB is simply a harder version of MTEB with longer documents