1. 📘 Topic and Domain: Development of EmbeddingGemma, a lightweight text embedding model for natural language processing, focusing on efficient text representation.
2. 💡 Previous Research and New Ideas: Builds on the Gemma 3 language-model family and encoder-decoder models; proposes a training recipe combining encoder-decoder initialization, geometric embedding distillation, and spread-out regularization.
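The spread-out regularization mentioned above penalizes similarity between embeddings of distinct inputs, pushing them apart on the unit hypersphere. A minimal sketch of one common formulation (mean squared off-diagonal cosine similarity; the function name and exact form are illustrative, not the paper's definition):

```python
import numpy as np

def spread_out_penalty(embeddings):
    """Mean squared cosine similarity between distinct embeddings.

    Minimizing this term encourages embeddings of different inputs to
    spread out over the unit hypersphere. Illustrative formulation only.
    """
    E = np.asarray(embeddings, dtype=np.float64)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # pairwise cosine sims
    n = len(E)
    off_diag = sims[~np.eye(n, dtype=bool)]           # drop self-similarity
    return float(np.mean(off_diag ** 2))

# Orthogonal embeddings incur zero penalty; identical ones the maximum.
print(spread_out_penalty(np.eye(3)))  # → 0.0
```

In training, a term like this would be added to the main contrastive loss with a small weight.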
3. ❓ Problem: The trade-off between model capability and computational cost in text embedding models, where state-of-the-art models are too large and expensive for real-world applications.
4. 🛠️ Methods: A 308M-parameter model initialized from the T5Gemma encoder, trained with a noise-contrastive estimation (NCE) loss, a spread-out regularizer, and an embedding-matching loss, combined with model souping (parameter averaging) across multiple finetuned checkpoints.
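Model souping in its simplest form is a uniform average of the parameters of several finetuned checkpoints. A minimal sketch, assuming checkpoints are dicts of same-shaped NumPy arrays (the helper name is illustrative; the paper's souping recipe may weight checkpoints differently):

```python
import numpy as np

def soup_checkpoints(checkpoints):
    """Uniformly average the parameters of several finetuned checkpoints.

    Each checkpoint is a dict mapping parameter names to arrays; all
    checkpoints must share the same architecture (same keys and shapes).
    """
    souped = {}
    for name in checkpoints[0]:
        souped[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return souped

# Toy example: three "checkpoints", each with one 2x2 weight matrix.
ckpts = [{"w": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
souped = soup_checkpoints(ckpts)
# souped["w"] is the element-wise mean of the three matrices (all entries 2.0)
```

The averaged model often generalizes better than any single checkpoint, at no extra inference cost.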
5. 📊 Results and Evaluation: Achieves state-of-the-art results on the MTEB benchmark among models under 500M parameters, outperforming larger models, and retains performance under quantization and embedding truncation.
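Embedding truncation means keeping only the first k dimensions of the output vector and re-normalizing, in the Matryoshka style. A minimal sketch under that assumption (function name illustrative):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length.

    Matryoshka-style truncation: prefixes of the full embedding remain
    usable as smaller, cheaper embeddings.
    """
    truncated = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.array([0.6, 0.8, 0.0, 0.0])  # toy 4-d unit embedding
short = truncate_embedding(full, 2)    # 2-d prefix, still unit length
```

This lets a single model serve multiple storage/latency budgets without retraining.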