2025-09-25 Papers


Paper 1

SIM-CoT: Supervised Implicit Chain-of-Thought

Published: 2025-09-24

Link: http://arxiv.org/pdf/2509.20317

1. 📘 Topic and Domain: The paper focuses on improving implicit Chain-of-Thought (CoT) reasoning in Large Language Models within the domain of natural language processing and machine learning.
2. 💡 Previous Research and New Ideas: Building on existing implicit CoT methods such as Coconut and CODI, it proposes SIM-CoT, a novel approach that introduces step-level supervision to stabilize and enrich the latent reasoning space.
3. ❓ Problem: The paper addresses the latent instability issue in implicit CoT approaches, where increasing the number of implicit reasoning tokens leads to training instability and performance collapse.
4. 🛠️ Methods: The authors implement a plug-and-play training module: an auxiliary decoder aligns each implicit token with its corresponding explicit reasoning step during training, and the decoder is removed at inference so it adds no runtime overhead.
5. 📊 Results and Evaluation: SIM-CoT improved performance across multiple models and benchmarks, achieving +8.2% improvement over Coconut on GPT-2, +3.0% over CODI on LLaMA-3.1 8B, and surpassing explicit CoT baseline by 2.1% with 2.3× greater token efficiency.
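The training-only auxiliary decoder and the combined objective L = λ_step·L_step + λ_lm·L_ans-lm can be illustrated with a minimal pure-Python sketch. All names here (`sim_cot_loss`, `toy_decoder`) are illustrative, not the paper's implementation, which operates on transformer hidden states:

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target index under a distribution."""
    return -math.log(probs[target_idx])

def sim_cot_loss(latents, step_targets, answer_probs, answer_idx,
                 aux_decoder=None, lam_step=1.0, lam_lm=1.0):
    """L = lam_step * L_step + lam_lm * L_ans-lm.

    The auxiliary decoder maps each latent z_k to a distribution over the
    k-th explicit reasoning step; it exists only during training, so at
    inference (aux_decoder=None) only the answer loss path remains.
    """
    l_step = 0.0
    if aux_decoder is not None:  # step-level supervision, training only
        for z_k, s_k in zip(latents, step_targets):
            l_step += cross_entropy(aux_decoder(z_k), s_k)
    return lam_step * l_step + lam_lm * cross_entropy(answer_probs, answer_idx)

# Stand-in decoder: maps any latent to a fixed step distribution.
def toy_decoder(z):
    return [0.8, 0.1, 0.1]

latents = [[0.0], [0.0]]  # two implicit reasoning tokens
train_loss = sim_cot_loss(latents, [0, 0], [0.1, 0.9], 1, aux_decoder=toy_decoder)
infer_loss = sim_cot_loss(latents, [0, 0], [0.1, 0.9], 1)
```

Dropping the decoder at inference removes the step term entirely, which is why SIM-CoT keeps the token efficiency of implicit CoT.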

Workflow diagram (summarized):
- Problem analysis: latent instability — information loss, semantic homogenization, shifted distance, training collapse.
- Training data: GSM8K-Aug, 385k mathematical-reasoning examples.
- SIM-CoT method: an implicit phase builds latents z_k = H_θ(U^(k-1)); an explicit phase decodes the answer p_θ(a|x, z_1:K); a training-only auxiliary decoder p_φ(s_k|z_k) provides step-level supervision with loss L = λ_step·L_step + λ_lm·L_ans-lm.
- Baselines: Coconut (answer-level), CODI (trajectory-level), SFT-CoT (explicit), iCoT.
- Evaluation: in-domain GSM8K-Aug test set; out-of-domain GSM-Hard, MultiArith, SVAMP; model scales GPT-2 and LLaMA 1B/3B/8B.
- Key results: +8.2% over Coconut, +3.0% over CODI, +2.1% over SFT-CoT with 2.3× token efficiency; stable training that scales to 8-16 latent tokens while maintaining diversity and semantic grounding; interpretable, human-readable step visualization; plug-and-play with no inference overhead.
Q1. What is the main problem that SIM-CoT aims to solve?
- High computational costs of explicit Chain-of-Thought reasoning
- Latent instability when scaling implicit reasoning tokens
- Poor performance on mathematical word problems

Q2. How does SIM-CoT maintain efficiency during inference?
- By using smaller language models
- By compressing the reasoning steps
- By removing the auxiliary decoder after training

Q3. What unique advantage does SIM-CoT provide compared to previous implicit CoT methods?
- It achieves better performance with fewer training examples
- It provides interpretability by projecting latent tokens onto explicit reasoning vocabulary
- It eliminates the need for any supervision during training

Paper 2

EmbeddingGemma: Powerful and Lightweight Text Representations

Published: 2025-09-24

Link: http://arxiv.org/pdf/2509.20354

1. 📘 Topic and Domain: Development of EmbeddingGemma, a lightweight text embedding model for natural language processing, focusing on efficient text representation.
2. 💡 Previous Research and New Ideas: Builds on the Gemma 3 language-model family and encoder-decoder models; proposes a training recipe combining encoder-decoder initialization, geometric embedding distillation, and spread-out regularization.
3. ❓ Problem: The trade-off between model capability and computational cost in text embedding models, where state-of-the-art models are too large and expensive for real-world applications.
4. 🛠️ Methods: Uses a 308M parameter model initialized from T5Gemma encoder, trained with noise-contrastive estimation loss, spread-out regularizer, and embedding matching loss, combined with model souping from multiple finetuned checkpoints.
5. 📊 Results and Evaluation: Achieves state-of-the-art results on MTEB benchmarks for models under 500M parameters, outperforming larger models and maintaining performance even with quantization and embedding truncation.
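The noise-contrastive estimation loss mentioned in point 4 can be sketched in plain Python. This is a toy InfoNCE-style stand-in over single vectors (the real model computes similarities over batched embeddings, and `nce_loss`/`tau` here are illustrative names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nce_loss(query, positive, negatives, tau=0.05):
    """InfoNCE-style contrastive loss: the positive document competes
    against negatives under a temperature tau (log-sum-exp stabilized)."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

q = [1.0, 0.0]
# Loss is near zero when the positive is close to the query...
good = nce_loss(q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])
# ...and large when a negative is closer than the positive.
bad = nce_loss(q, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
```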

Training pipeline (summarized):
- Initialization: Gemma 3 → T5Gemma encoder-decoder trained with the UL2 objective; the final architecture is a 24-layer transformer with bidirectional attention.
- Pre-finetuning on 314B tokens of large-scale unsupervised data, then finetuning on a high-quality 20B-token mixture.
- Embedding generation: query plus task prompt → mean pooling → linear projections.
- Losses: NCE contrastive loss, spread-out regularizer, and a distillation loss from a Gemini teacher.
- Robustness: model souping (parameter averaging over finetuned checkpoints) and quantization-aware training (int4/int8/mixed).
- Model: 308M parameters, 768-dim embeddings with MRL support for dimension truncation.
- Evaluation: MTEB (Multilingual, English, Code), XOR-Retrieve (cross-lingual), XTREME-UP (low-resource); state-of-the-art for models under 500M parameters, competitive with models 2× larger, robust to quantization.
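The MRL-style dimension truncation can be sketched as follows. `truncate_embedding` is a hypothetical helper; the point is that Matryoshka-trained embeddings are designed so a renormalized prefix of the vector remains a usable embedding:

```python
import math

def truncate_embedding(embedding, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates
    and rescale the result back to unit L2 norm."""
    head = embedding[:dim]
    n = math.sqrt(sum(x * x for x in head))
    return [x / n for x in head]

# Stand-in 768-dim embedding (the model's native output dimension).
full = [math.sin(i + 1) for i in range(768)]
small = truncate_embedding(full, 128)  # 6x smaller index footprint
```

Truncating to 128 dimensions trades a small amount of quality for much cheaper storage and similarity search, which is central to the on-device use case.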
Q1. What is the main innovation that allows EmbeddingGemma to achieve better performance compared to previous models of similar size?
- Using larger batch sizes during training
- Combining encoder-decoder initialization with geometric embedding distillation
- Adding more attention layers to the architecture

Q2. When EmbeddingGemma's embeddings are truncated to just 128 dimensions, what happens to its performance?
- It completely fails to work
- It maintains state-of-the-art performance for its size class
- It performs worse than random chance

Q3. What real-world application challenge does EmbeddingGemma specifically address?
- The need for models that can only work with English text
- The need for massive computing infrastructure
- The need for efficient, on-device deployment for privacy-sensitive applications

Paper 3

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Published: 2025-09-24

Link: http://arxiv.org/pdf/2509.20360

1. 📘 Topic and Domain: A unified framework called EditVerse for both image and video editing/generation using in-context learning in computer vision.
2. 💡 Previous Research and New Ideas: In contrast to previous fragmented approaches to image/video editing, proposes a unified architecture that represents all modalities (text, image, video) as a single token sequence.
3. ❓ Problem: Addresses the fragmentation and data scarcity in video editing by creating a unified framework that can transfer knowledge from image to video domain.
4. 🛠️ Methods: Uses a transformer architecture with full self-attention, interleaved text/vision inputs, 4D rotary positional embeddings, and a scalable data pipeline generating 232K video editing samples.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on EditVerseBench (their proposed benchmark), surpassing existing open-source methods and commercial models in both automated metrics and user studies.
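The unified token stream with 4D positions can be sketched as a toy illustration. `build_sequence`, the marker strings, and the position layout are hypothetical, not EditVerse's actual tokenizer; they only show the idea of interleaving text and vision tokens, each carrying a (height, width, sequential, temporal) index for 4D RoPE:

```python
def build_sequence(text_tokens, frames, height, width):
    """Build one interleaved token stream: text tokens, then a vision
    span wrapped in start/end markers. Each token gets a 4D position
    (height, width, sequential, temporal); text only advances the
    sequential axis, while vision patches also use h, w, and t."""
    tokens, positions = [], []
    seq = 0
    for tok in text_tokens:
        tokens.append(tok)
        positions.append((0, 0, seq, 0))
        seq += 1
    tokens.append("<vision_start>")
    positions.append((0, 0, seq, 0))
    seq += 1
    for t in range(frames):          # temporal axis
        for h in range(height):      # spatial axes
            for w in range(width):
                tokens.append(f"patch_{t}_{h}_{w}")
                positions.append((h, w, seq, t))
    seq += 1
    tokens.append("<vision_end>")
    positions.append((0, 0, seq, 0))
    return tokens, positions

toks, pos = build_sequence(["edit", "the", "sky"], frames=2, height=2, width=2)
```

Because text and vision share one sequence, full self-attention lets every token attend to every other, which is what enables in-context transfer from image editing data to video editing.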

Framework overview (summarized):
- Inputs: text, image, and video at arbitrary resolutions; vision is tokenized by a VAE, text by T5.
- Unified representation: a single interleaved token stream with start/end vision tokens and 4D RoPE over height, width, sequential, and temporal dimensions.
- Model: a 2B dense transformer with full self-attention, trained with a flow-matching objective; in-context learning enables cross-modal knowledge transfer from image to video.
- Data pipeline: 232K video editing, 6M image editing, 4M video generation, and 2M image generation samples, filtered with a VLM.
- Tasks: object add/remove/change, style transfer, camera movement, mask detection and propagation; emergent abilities beyond the training tasks include material change, weather effects, multi-task combination, and reference insertion.
- EditVerseBench: 100 videos (horizontal and vertical), 20 editing categories, 200 editing pairs; evaluation via VLM scoring and user studies; surpasses open-source methods and is competitive with commercial models.
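The flow-matching training objective can be illustrated in one dimension. This is a minimal sketch of the standard linear-interpolation formulation (x_t = (1-t)·x_0 + t·x_1 with velocity target x_1 - x_0), not the paper's exact training code:

```python
def flow_matching_loss(v_model, x0, x1, t):
    """Linear-path flow matching in 1D: the model predicts the velocity
    of the probability path at (x_t, t); for a linear path the target
    is the constant velocity x1 - x0."""
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    return (v_model(x_t, t) - target) ** 2

x0, x1 = 0.0, 2.0
perfect = lambda x, t: x1 - x0         # predicts the true velocity
biased = lambda x, t: 0.5 * (x1 - x0)  # underestimates it by half
loss_perfect = flow_matching_loss(perfect, x0, x1, 0.3)
loss_biased = flow_matching_loss(biased, x0, x1, 0.3)
```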
Q1. What is the key innovation in EditVerse's architectural design that enables effective knowledge transfer between image and video domains?
- Using separate neural networks for image and video processing
- Representing all modalities as a unified token sequence with interleaved design
- Implementing a cascaded pipeline of specialized models

Q2. How did EditVerse address the challenge of limited video editing training data?
- By using only synthetic data generation
- By collecting manual annotations from experts
- By developing a pipeline that generates and filters 232K video editing samples combined with image editing data

Q3. What is a key limitation of EditVerse according to the paper?
- High computational cost due to full self-attention on long sequences
- Inability to handle high-resolution videos
- Limited support for different video formats