2026-03-16 Papers

Paper 1

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12255

1. 📘 Topic and Domain: The paper focuses on streaming visual-based spatial intelligence using test-time training (TTT) for multimodal large language models (MLLMs) to understand 3D spatial relationships from long-horizon video streams.
2. 💡 Previous Research and New Ideas: The paper builds on test-time training methods and existing spatial MLLMs, proposing a novel hybrid TTT architecture with spatial-predictive mechanisms using 3D spatiotemporal convolutions and dense scene-description supervision for effective fast-weight updates.
3. ❓ Problem: The paper addresses MLLMs' inability to maintain and update spatial understanding from unbounded video streams, where spatial information emerges gradually through continuous observations with changing viewpoints and occlusions.
4. 🛠️ Methods: The authors use a hybrid architecture interleaving TTT layers with self-attention layers (3:1 ratio), large-chunk updates with sliding-window attention, spatial-predictive 3D convolutions, and dense scene-description training data.
5. 📊 Results and Evaluation: Spatial-TTT-2B achieves state-of-the-art performance on VSI-Bench (64.4 Avg.), MindCube-Tiny (76.2 ACC), and VSI-SUPER benchmarks, outperforming both proprietary and open-source models while maintaining efficient memory usage for long videos.
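The two mechanisms named above, interleaving TTT layers with attention anchors at a 3:1 ratio and updating fast weights once per large chunk, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shapes, learning rate, and plain linear reconstruction loss are assumptions made for clarity.

```python
import numpy as np

def layer_schedule(depth: int, ttt_per_anchor: int = 3) -> list:
    """Interleave TTT layers with full self-attention 'anchor' layers
    at a ttt_per_anchor:1 ratio (3:1 here, i.e. 75% TTT / 25% anchor)."""
    return ["anchor" if (i + 1) % (ttt_per_anchor + 1) == 0 else "ttt"
            for i in range(depth)]

def ttt_fast_weight_update(W, chunk, target, lr=0.1):
    """One large-chunk fast-weight update at test time: a single
    gradient step on a reconstruction loss over the whole chunk,
    rather than a per-token update."""
    pred = chunk @ W                               # (chunk_len, d_out)
    grad = chunk.T @ (pred - target) / len(chunk)  # dL/dW for 0.5*MSE
    return W - lr * grad
```

With `depth=8` the schedule places anchors at layers 4 and 8, matching the 75%/25% split in the paper's architecture figure.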

Overview diagram: long-horizon video frames and a spatial question (navigation, counting, etc.) are encoded by a vision transformer and split into large chunks. The hybrid TTT architecture processes them with TTT layers (75%, using 3D spatiotemporal convolution, fast-weight updates, and sliding-window attention) interleaved with anchor layers (25%, full self-attention that preserves pretrained knowledge). A dense scene-description dataset (global context, objects, spatial relations) drives two-stage training (Stage 1: dense supervision; Stage 2: spatial VQA), producing spatial answers for navigation, counting, and spatial reasoning. Key innovations: hybrid TTT architecture, spatial-predictive mechanism, large-chunk updates, dense scene supervision, progressive training.
Q1. What is the key architectural innovation in Spatial-TTT that enables efficient processing of long-horizon spatial videos?
A hybrid architecture that interleaves TTT layers with self-attention anchor layers at a 3:1 ratio
A pure TTT architecture that completely replaces all attention layers with fast-weight networks
A standard transformer architecture with increased context window size
Q2. How does Spatial-TTT's spatial-predictive mechanism enhance the model's ability to capture geometric structure?
By using point-wise linear projections to generate Q, K, V for isolated tokens
By applying depth-wise 3D spatiotemporal convolutions with 3×3×3 kernels on Q, K, V projections
By increasing the number of attention heads in each transformer layer
Q3. What unique supervision strategy does Spatial-TTT employ to improve fast-weight update dynamics for spatial understanding?
Training exclusively on sparse spatial QA pairs with short answers like multiple-choice options
Using reinforcement learning with spatial rewards from robotic manipulation tasks
Constructing dense scene-description data requiring comprehensive 3D scene walkthroughs covering global context, object counts, and spatial relations

Paper 2

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12180

1. 📘 Topic and Domain: The paper introduces MADQA, a benchmark for evaluating multimodal agentic systems on document-intensive workflows involving complex reasoning over heterogeneous PDF collections.
2. 💡 Previous Research and New Ideas: Building on existing document QA benchmarks that focus on single documents or simple retrieval, the paper proposes a new benchmark requiring genuine multi-step planning and iterative retrieval across document collections with fully human-authored questions.
3. ❓ Problem: The paper aims to determine whether multimodal agents demonstrate strategic reasoning capabilities or merely rely on brute-force stochastic search when answering questions across document collections.
4. 🛠️ Methods: The authors created 2,250 human-authored questions over 800 PDFs, used Classical Test Theory for principled dataset splits, and introduced a novel evaluation protocol measuring accuracy-effort trade-offs using the Kuiper statistic.
5. 📊 Results and Evaluation: While the best agents matched human accuracy (~82%), they succeeded on different questions, relied on brute-force search, failed to close a 20% gap to oracle performance, and showed poor effort calibration compared to humans.
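In its textbook form, the Kuiper statistic used in the evaluation protocol is the sum of the largest positive and largest negative gaps between two cumulative curves. A minimal sketch over empirical CDFs is below; the paper's exact accuracy-effort construction may differ, so treat this as the generic statistic rather than the benchmark's protocol.

```python
import bisect

def empirical_cdf(sample, grid):
    """F(x) = fraction of sample values <= x, evaluated on grid points."""
    s, n = sorted(sample), len(sample)
    return [bisect.bisect_right(s, x) / n for x in grid]

def kuiper_statistic(a, b):
    """Kuiper statistic V = D+ + D-: the sum of the largest positive
    and largest negative deviations between two empirical CDFs.
    Unlike the Kolmogorov-Smirnov D, it is sensitive to deviations
    in both directions, which suits cumulative-difference comparisons
    such as effort-vs-accuracy curves."""
    grid = sorted(set(a) | set(b))
    fa = empirical_cdf(a, grid)
    fb = empirical_cdf(b, grid)
    d_plus = max(x - y for x, y in zip(fa, fb))
    d_minus = max(y - x for x, y in zip(fa, fb))
    return max(d_plus, 0.0) + max(d_minus, 0.0)
```

Identical samples give V = 0; fully separated samples give V = 1, so the statistic is bounded in [0, 2] and larger values mean worse alignment.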

Overview diagram: the MADQA corpus spans 800 PDFs across 13 domains (18,619 pages); human annotators wrote 2,250 questions with minimal evidence sets over 1,200+ hours. Construct-validity checks include a lexical-overlap test, 11.2% guessability, and 58% of questions requiring visual understanding. Classical Test Theory (CTT) splits the data into test (500), dev (200), and train (1,550). Six core properties: extractive, multi-hop, closed-world, grounded, agentic, visual. Baselines: BM25, an MLLM agent (iterative search + VLM), managed RAG (Gemini/OpenAI APIs), a recursive language model (RLM), and a human baseline with the same search tools. Evaluation protocol: an LLM judge for accuracy, page F1 for attribution, the Kuiper statistic for calibration, and document F1 for retrieval. Key findings: the best agents match human accuracy (82%) but use brute-force search; an 18% oracle gap remains; humans calibrate effort better.
Q1. What surprising finding did the researchers discover when comparing human and AI agent performance on MADQA?
AI agents were 50% more accurate than humans due to superior pattern recognition
Despite achieving similar accuracy (~82%), humans and agents succeeded on largely different questions with low agreement (κ=0.24)
Humans took 10x longer than AI agents to answer questions but were more accurate
Q2. Which evaluation metric did the authors introduce to measure whether agents efficiently allocate computational resources?
The Kuiper statistic, which measures effort-accuracy alignment through cumulative difference curves
The F1-score modified with a computational penalty term
A simple ratio of accuracy divided by number of API calls
Q3. What percentage of MADQA questions require visual understanding beyond plain text extraction?
Only 15% require visual artifacts like charts or checkboxes
Approximately 58% benefit from understanding structured layouts, tables, or visual artifacts
Nearly 90% are impossible without computer vision capabilities

Paper 3

LMEB: Long-horizon Memory Embedding Benchmark

Published: 2026-03-12

Link: http://arxiv.org/pdf/2603.12572

1. 📘 Topic and Domain: The paper introduces LMEB (Long-horizon Memory Embedding Benchmark), a comprehensive evaluation framework for text embeddings focused on long-term, context-dependent memory retrieval tasks.
2. 💡 Previous Research and New Ideas: The paper builds on existing text embedding benchmarks like MTEB, BEIR, and MIRACL that focus on traditional passage retrieval, proposing a new benchmark that specifically evaluates models' ability to handle fragmented, temporally distant, and context-dependent memory retrieval across episodic, dialogue, semantic, and procedural memory types.
3. ❓ Problem: Current embedding benchmarks fail to adequately evaluate models' capacity to handle long-horizon memory retrieval tasks that involve recalling fragmented, context-dependent information over extended periods, leaving a gap in understanding how models perform in memory-intensive scenarios.
4. 🛠️ Methods: The authors compiled 22 datasets across 4 memory types with 193 zero-shot retrieval tasks, evaluated 15 embedding models (ranging from 239M to 12B parameters) using NDCG@10 and Recall@10 metrics, and analyzed correlations between LMEB and MTEB performance.
5. 📊 Results and Evaluation: The best model achieved 61.41 Mean (Dataset) score on NDCG@10, larger models didn't consistently outperform smaller ones, and LMEB showed orthogonality to MTEB (Pearson correlation: -0.115, Spearman: -0.130), indicating that traditional passage retrieval performance doesn't generalize to long-horizon memory retrieval.
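NDCG@10, the benchmark's main metric, discounts each document's relevance gain logarithmically by rank and normalizes by the DCG of the ideal ordering. A minimal sketch follows; the `qrels` dict mapping document IDs to graded relevance is an assumed interface, not LMEB's actual file format.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: the gain at rank i (0-based) is
    divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query: DCG of the returned ranking, normalized
    by the DCG of the ideal (relevance-sorted) ranking."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; placing the only relevant document at rank 2 instead of rank 1 scores 1/log2(3) ≈ 0.63, which is why the metric rewards placing relevant memories early.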

Overview diagram (method workflow): LMEB covers four memory types (episodic: 2 datasets, dialogue: 6, semantic: 8, procedural: 6; 22 datasets total). Datasets are processed into a unified IR format (queries.jsonl, corpus.jsonl, qrels.tsv, candidates.jsonl) and turned into 193 multi-granularity, zero-shot retrieval tasks with task instructions. Fifteen embedding models (239M to 12B parameters, max 1024 tokens) are evaluated with and without instructions using NDCG@10 (main), Recall@10, MAP, MRR, and Precision. Analysis covers performance comparison, LMEB-vs-MTEB correlation, model-size impact, and instruction effectiveness. Key findings: LMEB offers reasonable difficulty; larger models are not consistently better; LMEB is orthogonal to MTEB; instruction impact varies. Outputs: an open-source evaluation toolkit, a public leaderboard, and a standardized benchmark for long-horizon memory retrieval.
Q1. What surprising finding did the LMEB benchmark reveal about the relationship between model size and performance in long-horizon memory retrieval tasks?
Larger models consistently outperformed smaller models across all memory types
Model size had no impact on performance whatsoever
Smaller models like EmbeddingGemma-300M sometimes outperformed larger models like bge-multilingual-gemma2
Q2. How does LMEB categorize memory types along two key dimensions in its taxonomy?
Level of Abstraction and Temporal Dependency
Query Complexity and Document Length
Language Diversity and Task Difficulty
Q3. What does the near-zero correlation between LMEB and MTEB (Pearson: -0.115, Spearman: -0.130) indicate about these benchmarks?
They measure identical capabilities and can be used interchangeably
They are orthogonal, meaning excellence in traditional passage retrieval doesn't guarantee success in long-horizon memory retrieval
LMEB is simply a harder version of MTEB with longer documents