2026-03-05 Papers

Paper 1

Helios: Real Real-Time Long Video Generation Model

Published: 2026-03-04

Link: http://arxiv.org/pdf/2603.04379

1. 📘 Topic and Domain: The paper presents Helios, a 14B parameter autoregressive diffusion model for real-time long video generation in the domain of computer vision and generative AI.
2. 💡 Previous Research and New Ideas: The paper builds on diffusion transformers and autoregressive video generation methods, proposing new ideas including Unified History Injection for infinite video generation, Easy Anti-Drifting strategies without self-forcing, and Deep Compression Flow for efficient computation.
3. ❓ Problem: The paper aims to solve the challenge of generating high-quality, temporally coherent long videos in real-time, addressing issues of drifting, computational efficiency, and the limitations of existing models that are either too slow or produce low-quality results.
4. 🛠️ Methods: The authors use an autoregressive diffusion transformer with Guidance Attention blocks, Multi-Term Memory Patchification for context compression, Pyramid Unified Predictor Corrector for multi-scale generation, and Adversarial Hierarchical Distillation to reduce sampling steps from 50 to 3.
5. 📊 Results and Evaluation: Helios achieves 19.5 FPS on a single H100 GPU while generating minute-scale videos, outperforming existing methods on the newly introduced HeliosBench across metrics including aesthetic quality, motion smoothness, semantic alignment, and naturalness, with a 128× speedup compared to baseline models.
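The autoregressive loop in item 4 can be sketched as a toy sampler: compress the frame history into a short context, then denoise each new frame in only a few steps (the paper distills 50 steps down to 3). Everything below — the function names, the mean-based "patchification", the 3-step sampler — is an illustrative stand-in under assumed semantics, not the paper's implementation.

```python
import random

def compress_history(frames, recent_k=2):
    """Toy stand-in for Multi-Term Memory Patchification: keep the most
    recent frames at full resolution, summarize older ones by their mean."""
    old, recent = frames[:-recent_k], frames[-recent_k:]
    summary = [sum(f) / len(f) for f in old]  # one scalar per old frame
    return summary, recent

def denoise(noise, context, steps=3):
    """Toy few-step sampler: pull the noisy frame toward a statistic of the
    recent context (real denoising would run the diffusion transformer)."""
    summary, recent = context
    bias = sum(map(sum, recent)) / max(1, len(recent))
    x = noise
    for _ in range(steps):
        x = [0.5 * v + 0.5 * bias / len(x) for v in x]
    return x

def generate(num_frames=5, frame_dim=4, seed=0):
    """Autoregressive generation: each new frame is denoised conditioned on
    a compressed view of everything generated so far."""
    rng = random.Random(seed)
    frames = [[rng.random() for _ in range(frame_dim)]]
    for _ in range(num_frames - 1):
        ctx = compress_history(frames)
        noise = [rng.gauss(0, 1) for _ in range(frame_dim)]
        frames.append(denoise(noise, ctx))
    return frames

video = generate()
print(len(video))
```

The point of the structure, not the arithmetic: history compression keeps per-frame cost roughly constant, which is what makes minute-scale real-time generation plausible.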

Helios real-time long-video generation workflow (transcribed from the pipeline figure):

- Unified History Injection: historical context + noisy context → autoregressive generation; supports T2V, I2V, and V2V tasks; enables infinite generation
- Representation control: Guidance Attention and adaptive sampling for high-quality generation
- Anti-drifting: Relative RoPE, First-Frame Anchor, Frame-Aware Corrupt
- Real-time generation: Multi-Term Memory Patchification, Pyramid Unified Predictor Corrector, Adversarial Hierarchical Distillation
- Three-stage training pipeline: Stage 1 — architectural adaptation (bidirectional → autoregressive); Stage 2 — token compression (reduce computation); Stage 3 — distillation (step reduction 50 → 3, eliminate CFG)
- Infrastructure optimizations: Flash Normalization, Flash RoPE, Sharded EMA, Cache Grad
- Throughput: 19.5 FPS on a single H100 GPU
Q1. What unique approach does Helios use to enable infinite video generation without relying on causal masking?
- It uses Unified History Injection with Guidance Attention to treat long-video generation as video continuation
- It employs a GPT-style architecture with self-forcing rollouts during training
- It implements error-bank mechanisms with keyframe sampling for temporal consistency

Q2. How does Helios achieve real-time performance (19.5 FPS) on a single H100 GPU despite being a 14B-parameter model?
- By using KV-cache, sparse attention, and quantization techniques
- By compressing historical/noisy contexts and reducing sampling steps from 50 to 3 via Adversarial Hierarchical Distillation
- By distributing computation across 8 GPUs with hidden-state caching

Q3. What are the three canonical manifestations of drifting that Helios identifies and addresses in long-video generation?
- Position shift, color shift, and restoration shift (blur/noise artifacts)
- Temporal jitter, semantic inconsistency, and motion blur
- Frame dropping, resolution degradation, and audio desynchronization

Paper 2

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Published: 2026-03-04

Link: http://arxiv.org/pdf/2603.03790

1. 📘 Topic and Domain: The paper focuses on benchmarking and improving text-to-structure reasoning capabilities of large language models in the domain of natural language processing and structured reasoning.
2. 💡 Previous Research and New Ideas: The paper builds on existing text-processing benchmarks and chain-of-thought prompting, proposing Structure of Thought (SoT) prompting and T2S-Bench as new approaches for explicit text structuring.
3. ❓ Problem: The paper aims to solve the lack of stable intermediate representations in complex text processing tasks, which causes unstable retrieval and uncontrollable generation in current language models.
4. 🛠️ Methods: The authors use Structure of Thought prompting to guide models in constructing node-link structures, and create T2S-Bench through extracting text-structure pairs from academic papers across 6 domains and 32 structural types.
5. 📊 Results and Evaluation: SoT prompting yields +5.7% average improvement on Qwen2.5-7B across eight tasks, while evaluation of 45 models on T2S-Bench shows only 52.1% average accuracy on multi-hop reasoning with significant room for improvement in structure extraction.
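The SoT idea of constructing an explicit node-link structure before answering can be mocked up as a prompt template plus a parser for the structured reply. The NODE/LINK/ANSWER format below is a hypothetical rendering of the idea; the paper's actual template and schema may differ.

```python
def sot_prompt(text, question):
    """Hypothetical Structure-of-Thought prompt: ask the model to emit an
    explicit node-link structure before giving its final answer."""
    return (
        "Read the passage. Before answering, list the entities as NODE "
        "lines and their relations as LINK lines, then answer.\n"
        "Format:\n"
        "NODE: <id> | <label>\n"
        "LINK: <src> -> <dst> | <relation>\n"
        "ANSWER: <final answer>\n\n"
        f"Passage:\n{text}\n\nQuestion: {question}\n"
    )

def parse_sot(reply):
    """Parse a structured reply into a node dict, a link list, and the answer."""
    nodes, links, answer = {}, [], None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("NODE:"):
            node_id, label = [p.strip() for p in line[5:].split("|", 1)]
            nodes[node_id] = label
        elif line.startswith("LINK:"):
            edge, rel = [p.strip() for p in line[5:].split("|", 1)]
            src, dst = [p.strip() for p in edge.split("->")]
            links.append((src, dst, rel))
        elif line.startswith("ANSWER:"):
            answer = line[7:].strip()
    return nodes, links, answer

reply = "NODE: a | pump\nNODE: b | valve\nLINK: a -> b | feeds\nANSWER: valve"
print(parse_sot(reply))
```

The parsed graph is exactly the "stable intermediate representation" the paper argues for: downstream retrieval and generation can operate on the nodes and links instead of raw text.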

T2S-Bench text-to-structure reasoning workflow (transcribed from the pipeline figure):

- Sample collection: model search → paper selection → PDF download → figure extraction → validity check and human filter → 672 text-structure pairs
- Multi-hop reasoning construction: fault localization, functional mapping, boundary testing, counterfactual reasoning; template-based generation with correctness, text-dependency, and human filters → T2S-Train-1.2K and T2S-Bench-MR
- E2E dataset construction: key structure extraction, model logic and consistency checks, human filter; node and link evaluation → T2S-Bench-E2E (87)
- Downstream applications: Structure of Thought (SoT) prompting and model fine-tuning; +5.7% (SoT) and +8.6% (fine-tuning) on text tasks
- Key findings: 45 models tested; 52.1% average accuracy on multi-hop reasoning; 58.1% best node accuracy; significant headroom — structure extraction remains the key bottleneck
Q1. What is the key innovation that Structure of Thought (SoT) prompting introduces compared to traditional Chain-of-Thought (CoT) prompting?
- SoT requires models to explicitly construct node-link graph structures before answering
- SoT uses multiple reasoning paths and aggregates them for better accuracy
- SoT focuses on mathematical and coding tasks rather than text processing

Q2. How was T2S-Bench constructed to ensure high structural accuracy?
- By using GPT-4 to generate synthetic text-structure pairs from scratch
- By extracting text-structure pairs from rigorously vetted academic papers
- By crowdsourcing annotations from undergraduate students

Q3. What was the most challenging aspect of structure extraction revealed by the T2S-Bench evaluation?
- Link extraction, with even top models achieving less than 50% F1 score
- Node extraction, with even Gemini-2.5-Pro achieving only 58.1% accuracy
- Domain adaptation, with models failing completely on scientific texts

Paper 3

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Published: 2026-03-03

Link: http://arxiv.org/pdf/2603.03276

1. 📘 Topic and Domain: The paper explores unified multimodal pretraining, specifically how to effectively train foundation models that jointly handle vision and language from scratch.
2. 💡 Previous Research and New Ideas: Building on the Transfusion framework (next-token prediction for language, diffusion for vision), the paper proposes using Representation Autoencoders (RAE) as a single visual encoder for both understanding and generation, and employs Mixture-of-Experts (MoE) to dynamically allocate capacity between modalities.
3. ❓ Problem: The paper addresses the lack of clarity in multimodal pretraining design space, particularly how to train models that handle both vision and language without mutual degradation or relying on pretrained language model initialization.
4. 🛠️ Methods: The authors conduct controlled from-scratch pretraining experiments using diverse data (text, video, image-text pairs, action-conditioned video), evaluate multiple visual representations, explore MoE architectures with modality-specific routing, and derive scaling laws through IsoFLOP analysis.
5. 📊 Results and Evaluation: RAE (SigLIP2) outperforms VAEs for both generation and understanding; vision and language show complementary synergy; MoE effectively balances the scaling asymmetry between modalities (vision is more data-hungry, language is more parameter-hungry); world modeling capabilities emerge naturally from general multimodal pretraining with minimal domain-specific data.
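The Transfusion-style objective combines next-token prediction for text with a flow-matching regression loss for vision, weighted by λ_LM and λ_flow. A minimal numeric sketch, with toy unbatched losses standing in for real model outputs:

```python
import math

def lm_loss(logits, target):
    """Next-token cross-entropy for a single position (toy, unbatched)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def flow_loss(pred_velocity, true_velocity):
    """Flow-matching regression: MSE between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, true_velocity)) / len(pred_velocity)

def joint_loss(text_items, image_items, lam_lm=1.0, lam_flow=1.0):
    """Transfusion-style objective: L = lam_lm * L_LM + lam_flow * L_flow."""
    l_lm = sum(lm_loss(lg, t) for lg, t in text_items) / len(text_items)
    l_flow = sum(flow_loss(p, t) for p, t in image_items) / len(image_items)
    return lam_lm * l_lm + lam_flow * l_flow

text = [([2.0, 0.5, -1.0], 0)]   # (logits, target index) per text token
imgs = [([0.1, 0.2], [0.0, 0.25])]  # (predicted, target velocity) per latent
print(joint_loss(text, imgs))
```

Because both loss terms share one decoder-only backbone, the λ weights are where the vision/language capacity trade-off shows up during training.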

Multimodal pretraining workflow (transcribed from the pipeline figure):

- Visual representation: VAE vs RAE vs raw pixels (SigLIP2, DINOv2, WebSSL); RAE optimal for both understanding and generation
- Data sources: text (DCLM), video + image-text pairs, action-conditioned video; multimodal synergy
- Architecture: modality-specific FFNs, Mixture-of-Experts; emergent modality specialization
- Scaling properties: IsoFLOP analysis; vision is data-hungry; MoE harmonizes the scaling asymmetry
- Transfusion framework: unified decoder-only transformer; next-token prediction for language; flow-matching diffusion for vision; joint loss L = λ_LM · L_LM + λ_flow · L_flow
- Training process: hybrid masking, from scratch
- World modeling: actions as text tokens; zero-shot emergence
- Evaluation: text PPL, VQA, DPGBench, GenEval
- Key findings: (1) RAE unifies visual tasks; (2) multimodal synergy exists; (3) MoE scales efficiently; (4) world modeling emerges naturally; (5) vision is data-hungry
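The MoE-based capacity balancing above can be illustrated with a top-1 router: each token's gate scores select one expert FFN, so text-heavy and vision-heavy tokens can consume different parameters. The gate weights and experts below are toy stand-ins; the paper's learned router details are not specified here.

```python
def route_token(token, gates, experts):
    """Top-1 MoE routing sketch: score each expert with a linear gate and
    dispatch the token to the highest-scoring one."""
    scores = [sum(g * t for g, t in zip(gate, token)) for gate in gates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return experts[best](token), best

# Two toy experts: gate 0 tends to win for "text-like" tokens, gate 1 for
# "vision-like" tokens (purely illustrative feature directions).
gates = [[1.0, -1.0], [-1.0, 1.0]]
experts = [
    lambda x: [2 * v for v in x],  # stand-in text expert
    lambda x: [v + 1 for v in x],  # stand-in vision expert
]

out, chosen = route_token([1.0, 0.0], gates, experts)
print(chosen, out)
```

Hard top-1 routing like this is what lets total parameters grow (language is parameter-hungry) while per-token compute stays flat — the mechanism behind "MoE harmonizes the scaling asymmetry".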
Q1. What key finding did the authors discover about the scaling properties of vision and language in unified multimodal models?
- Vision and language have identical scaling requirements, making them easy to train together
- Vision is significantly more data-hungry while language is more parameter-hungry, creating a scaling asymmetry
- Language requires more data than vision, which is why text-only models perform better

Q2. How did the authors enable world modeling capabilities in their unified multimodal framework?
- By training a separate specialized architecture with custom action encoders for navigation
- By representing actions directly as text tokens and showing that capabilities emerge from general multimodal pretraining
- By collecting massive amounts of domain-specific navigation data (>50% of training data)

Q3. What architectural choice did the authors find most effective for handling both visual understanding and generation tasks?
- Using dual representations with VAE for generation and SigLIP2 for understanding, similar to Janus
- Training separate transformer backbones for each modality with cross-attention bridges
- Using a single RAE-based encoder (SigLIP2) that excels at both tasks, simplifying the architecture