2026-03-05 Papers

Paper 1

Helios: Real Real-Time Long Video Generation Model

Published: 2026-03-04

Link: http://arxiv.org/pdf/2603.04379

1. 📘 Topic and Domain: The paper presents Helios, a 14B parameter autoregressive diffusion model for real-time long video generation in the domain of computer vision and generative AI.
2. 💡 Previous Research and New Ideas: The paper builds on diffusion transformers and autoregressive video generation methods, proposing new ideas including Unified History Injection for infinite video generation, Easy Anti-Drifting strategies without self-forcing, and Deep Compression Flow for efficient computation.
3. ❓ Problem: The paper aims to solve the challenge of generating high-quality, temporally coherent long videos in real-time, addressing issues of drifting, computational efficiency, and the limitations of existing models that are either too slow or produce low-quality results.
4. 🛠️ Methods: The authors use an autoregressive diffusion transformer with Guidance Attention blocks, Multi-Term Memory Patchification for context compression, Pyramid Unified Predictor Corrector for multi-scale generation, and Adversarial Hierarchical Distillation to reduce sampling steps from 50 to 3.
5. 📊 Results and Evaluation: Helios achieves 19.5 FPS on a single H100 GPU while generating minute-scale videos, outperforming existing methods on the newly introduced HeliosBench across metrics including aesthetic quality, motion smoothness, semantic alignment, and naturalness, with a 128× speedup compared to baseline models.
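The autoregressive loop in item 4 can be sketched as a toy sampler: compress the frame history into a short context, then denoise each new frame in only a few steps (the paper distills 50 steps down to 3). Everything below — the function names, the mean-based "patchification", the 3-step sampler — is an illustrative stand-in under assumed semantics, not the paper's implementation.

```python
import random

def compress_history(frames, recent_k=2):
    """Toy stand-in for Multi-Term Memory Patchification: keep the most
    recent frames at full resolution, summarize older ones by their mean."""
    old, recent = frames[:-recent_k], frames[-recent_k:]
    summary = [sum(f) / len(f) for f in old]  # one scalar per old frame
    return summary, recent

def denoise(noise, context, steps=3):
    """Toy few-step sampler: pull the noisy frame toward a statistic of the
    recent context (real denoising would run the diffusion transformer)."""
    summary, recent = context
    bias = sum(map(sum, recent)) / max(1, len(recent))
    x = noise
    for _ in range(steps):
        x = [0.5 * v + 0.5 * bias / len(x) for v in x]
    return x

def generate(num_frames=5, frame_dim=4, seed=0):
    """Autoregressive generation: each new frame is denoised conditioned on
    a compressed view of everything generated so far."""
    rng = random.Random(seed)
    frames = [[rng.random() for _ in range(frame_dim)]]
    for _ in range(num_frames - 1):
        ctx = compress_history(frames)
        noise = [rng.gauss(0, 1) for _ in range(frame_dim)]
        frames.append(denoise(noise, ctx))
    return frames

video = generate()
print(len(video))
```

The point of the structure, not the arithmetic: history compression keeps per-frame cost roughly constant, which is what makes minute-scale real-time generation plausible.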

Helios real-time long-video generation workflow (transcribed from the pipeline figure):

- Unified History Injection: historical context + noisy context → autoregressive generation; supports T2V, I2V, and V2V tasks; enables infinite generation
- Representation control: Guidance Attention and adaptive sampling for high-quality generation
- Anti-drifting: Relative RoPE, First-Frame Anchor, Frame-Aware Corrupt
- Real-time generation: Multi-Term Memory Patchification, Pyramid Unified Predictor Corrector, Adversarial Hierarchical Distillation
- Three-stage training pipeline: Stage 1 — architectural adaptation (bidirectional → autoregressive); Stage 2 — token compression (reduce computation); Stage 3 — distillation (step reduction 50 → 3, eliminate CFG)
- Infrastructure optimizations: Flash Normalization, Flash RoPE, Sharded EMA, Cache Grad
- Throughput: 19.5 FPS on a single H100 GPU
Q1. What unique approach does Helios use to enable infinite video generation without relying on causal masking?
- It uses Unified History Injection with Guidance Attention to treat long-video generation as video continuation
- It employs a GPT-style architecture with self-forcing rollouts during training
- It implements error-bank mechanisms with keyframe sampling for temporal consistency

Q2. How does Helios achieve real-time performance (19.5 FPS) on a single H100 GPU despite being a 14B-parameter model?
- By using KV-cache, sparse attention, and quantization techniques
- By compressing historical/noisy contexts and reducing sampling steps from 50 to 3 via Adversarial Hierarchical Distillation
- By distributing computation across 8 GPUs with hidden-state caching

Q3. What are the three canonical manifestations of drifting that Helios identifies and addresses in long-video generation?
- Position shift, color shift, and restoration shift (blur/noise artifacts)
- Temporal jitter, semantic inconsistency, and motion blur
- Frame dropping, resolution degradation, and audio desynchronization

Paper 2

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Published: 2026-03-04

Link: http://arxiv.org/pdf/2603.03790

1. 📘 Topic and Domain: The paper focuses on benchmarking and improving text-to-structure reasoning capabilities of large language models in the domain of natural language processing and structured reasoning.
2. 💡 Previous Research and New Ideas: The paper builds on existing text-processing benchmarks and chain-of-thought prompting, proposing Structure of Thought (SoT) prompting and T2S-Bench as new approaches for explicit text structuring.
3. ❓ Problem: The paper aims to solve the lack of stable intermediate representations in complex text processing tasks, which causes unstable retrieval and uncontrollable generation in current language models.
4. 🛠️ Methods: The authors use Structure of Thought prompting to guide models in constructing node-link structures, and create T2S-Bench through extracting text-structure pairs from academic papers across 6 domains and 32 structural types.
5. 📊 Results and Evaluation: SoT prompting yields +5.7% average improvement on Qwen2.5-7B across eight tasks, while evaluation of 45 models on T2S-Bench shows only 52.1% average accuracy on multi-hop reasoning with significant room for improvement in structure extraction.
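The SoT idea of constructing an explicit node-link structure before answering can be mocked up as a prompt template plus a parser for the structured reply. The NODE/LINK/ANSWER format below is a hypothetical rendering of the idea; the paper's actual template and schema may differ.

```python
def sot_prompt(text, question):
    """Hypothetical Structure-of-Thought prompt: ask the model to emit an
    explicit node-link structure before giving its final answer."""
    return (
        "Read the passage. Before answering, list the entities as NODE "
        "lines and their relations as LINK lines, then answer.\n"
        "Format:\n"
        "NODE: <id> | <label>\n"
        "LINK: <src> -> <dst> | <relation>\n"
        "ANSWER: <final answer>\n\n"
        f"Passage:\n{text}\n\nQuestion: {question}\n"
    )

def parse_sot(reply):
    """Parse a structured reply into a node dict, a link list, and the answer."""
    nodes, links, answer = {}, [], None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("NODE:"):
            node_id, label = [p.strip() for p in line[5:].split("|", 1)]
            nodes[node_id] = label
        elif line.startswith("LINK:"):
            edge, rel = [p.strip() for p in line[5:].split("|", 1)]
            src, dst = [p.strip() for p in edge.split("->")]
            links.append((src, dst, rel))
        elif line.startswith("ANSWER:"):
            answer = line[7:].strip()
    return nodes, links, answer

reply = "NODE: a | pump\nNODE: b | valve\nLINK: a -> b | feeds\nANSWER: valve"
print(parse_sot(reply))
```

The parsed graph is exactly the "stable intermediate representation" the paper argues for: downstream retrieval and generation can operate on the nodes and links instead of raw text.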

T2S-Bench text-to-structure reasoning workflow (transcribed from the pipeline figure):

- Sample collection: model search → paper selection → PDF download → figure extraction → validity check and human filter → 672 text-structure pairs
- Multi-hop reasoning construction: fault localization, functional mapping, boundary testing, counterfactual reasoning; template-based generation with correctness, text-dependency, and human filters → T2S-Train-1.2K and T2S-Bench-MR
- E2E dataset construction: key structure extraction, model logic and consistency checks, human filter; node and link evaluation → T2S-Bench-E2E (87)
- Downstream applications: Structure of Thought (SoT) prompting and model fine-tuning; +5.7% (SoT) and +8.6% (fine-tuning) on text tasks
- Key findings: 45 models tested; 52.1% average accuracy on multi-hop reasoning; 58.1% best node accuracy; significant headroom — structure extraction remains the key bottleneck
Q1. What is the key innovation that Structure of Thought (SoT) prompting introduces compared to traditional Chain-of-Thought (CoT) prompting?
- SoT requires models to explicitly construct node-link graph structures before answering
- SoT uses multiple reasoning paths and aggregates them for better accuracy
- SoT focuses on mathematical and coding tasks rather than text processing

Q2. How was T2S-Bench constructed to ensure high structural accuracy?
- By using GPT-4 to generate synthetic text-structure pairs from scratch
- By extracting text-structure pairs from rigorously vetted academic papers
- By crowdsourcing annotations from undergraduate students

Q3. What was the most challenging aspect of structure extraction revealed by the T2S-Bench evaluation?
- Link extraction, with even top models achieving less than 50% F1 score
- Node extraction, with even Gemini-2.5-Pro achieving only 58.1% accuracy
- Domain adaptation, with models failing completely on scientific texts

Paper 3

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Published: 2026-03-03

Link: http://arxiv.org/pdf/2603.03276

1. 📘 Topic and Domain: The paper explores unified multimodal pretraining, specifically how to effectively train foundation models that jointly handle vision and language from scratch.
2. 💡 Previous Research and New Ideas: Building on the Transfusion framework (next-token prediction for language, diffusion for vision), the paper proposes using Representation Autoencoders (RAE) as a single visual encoder for both understanding and generation, and employs Mixture-of-Experts (MoE) to dynamically allocate capacity between modalities.
3. ❓ Problem: The paper addresses the lack of clarity in multimodal pretraining design space, particularly how to train models that handle both vision and language without mutual degradation or relying on pretrained language model initialization.
4. 🛠️ Methods: The authors conduct controlled from-scratch pretraining experiments using diverse data (text, video, image-text pairs, action-conditioned video), evaluate multiple visual representations, explore MoE architectures with modality-specific routing, and derive scaling laws through IsoFLOP analysis.
5. 📊 Results and Evaluation: RAE (SigLIP2) outperforms VAEs for both generation and understanding; vision and language show complementary synergy; MoE effectively balances the scaling asymmetry between modalities (vision is more data-hungry, language is more parameter-hungry); world modeling capabilities emerge naturally from general multimodal pretraining with minimal domain-specific data.
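The Transfusion-style objective combines next-token prediction for text with a flow-matching regression loss for vision, weighted by λ_LM and λ_flow. A minimal numeric sketch, with toy unbatched losses standing in for real model outputs:

```python
import math

def lm_loss(logits, target):
    """Next-token cross-entropy for a single position (toy, unbatched)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def flow_loss(pred_velocity, true_velocity):
    """Flow-matching regression: MSE between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, true_velocity)) / len(pred_velocity)

def joint_loss(text_items, image_items, lam_lm=1.0, lam_flow=1.0):
    """Transfusion-style objective: L = lam_lm * L_LM + lam_flow * L_flow."""
    l_lm = sum(lm_loss(lg, t) for lg, t in text_items) / len(text_items)
    l_flow = sum(flow_loss(p, t) for p, t in image_items) / len(image_items)
    return lam_lm * l_lm + lam_flow * l_flow

text = [([2.0, 0.5, -1.0], 0)]   # (logits, target index) per text token
imgs = [([0.1, 0.2], [0.0, 0.25])]  # (predicted, target velocity) per latent
print(joint_loss(text, imgs))
```

Because both loss terms share one decoder-only backbone, the λ weights are where the vision/language capacity trade-off shows up during training.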

Multimodal pretraining workflow (transcribed from the pipeline figure):

- Visual representation: VAE vs RAE vs raw pixels (SigLIP2, DINOv2, WebSSL); RAE optimal for both understanding and generation
- Data sources: text (DCLM), video + image-text pairs, action-conditioned video; multimodal synergy
- Architecture: modality-specific FFNs, Mixture-of-Experts; emergent modality specialization
- Scaling properties: IsoFLOP analysis; vision is data-hungry; MoE harmonizes the scaling asymmetry
- Transfusion framework: unified decoder-only transformer; next-token prediction for language; flow-matching diffusion for vision; joint loss L = λ_LM · L_LM + λ_flow · L_flow
- Training process: hybrid masking, from scratch
- World modeling: actions as text tokens; zero-shot emergence
- Evaluation: text PPL, VQA, DPGBench, GenEval
- Key findings: (1) RAE unifies visual tasks; (2) multimodal synergy exists; (3) MoE scales efficiently; (4) world modeling emerges naturally; (5) vision is data-hungry
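The MoE-based capacity balancing above can be illustrated with a top-1 router: each token's gate scores select one expert FFN, so text-heavy and vision-heavy tokens can consume different parameters. The gate weights and experts below are toy stand-ins; the paper's learned router details are not specified here.

```python
def route_token(token, gates, experts):
    """Top-1 MoE routing sketch: score each expert with a linear gate and
    dispatch the token to the highest-scoring one."""
    scores = [sum(g * t for g, t in zip(gate, token)) for gate in gates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return experts[best](token), best

# Two toy experts: gate 0 tends to win for "text-like" tokens, gate 1 for
# "vision-like" tokens (purely illustrative feature directions).
gates = [[1.0, -1.0], [-1.0, 1.0]]
experts = [
    lambda x: [2 * v for v in x],  # stand-in text expert
    lambda x: [v + 1 for v in x],  # stand-in vision expert
]

out, chosen = route_token([1.0, 0.0], gates, experts)
print(chosen, out)
```

Hard top-1 routing like this is what lets total parameters grow (language is parameter-hungry) while per-token compute stays flat — the mechanism behind "MoE harmonizes the scaling asymmetry".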
Q1. What key finding did the authors discover about the scaling properties of vision and language in unified multimodal models?
- Vision and language have identical scaling requirements, making them easy to train together
- Vision is significantly more data-hungry while language is more parameter-hungry, creating a scaling asymmetry
- Language requires more data than vision, which is why text-only models perform better

Q2. How did the authors enable world modeling capabilities in their unified multimodal framework?
- By training a separate specialized architecture with custom action encoders for navigation
- By representing actions directly as text tokens and showing that capabilities emerge from general multimodal pretraining
- By collecting massive amounts of domain-specific navigation data (>50% of training data)

Q3. What architectural choice did the authors find most effective for handling both visual understanding and generation tasks?
- Using dual representations with VAE for generation and SigLIP2 for understanding, similar to Janus
- Training separate transformer backbones for each modality with cross-attention bridges
- Using a single RAE-based encoder (SigLIP2) that excels at both tasks, simplifying the architecture