2026-03-03 Papers


Paper 1

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Published: 2026-03-02

Link: http://arxiv.org/pdf/2603.02138

1. 📘 Topic and Domain: The paper focuses on generating vector animations in Lottie format from multi-modal instructions (text, image, video) using deep learning approaches in computer vision and graphics.
2. 💡 Previous Research and New Ideas: The paper builds on prior work in vector graphics generation, video generation models, and visual autoregressive models, proposing a novel Lottie tokenizer that converts JSON files into structured command sequences and an end-to-end framework for multi-modal vector animation generation.
3. ❓ Problem: The paper addresses the challenge of generating editable, resolution-independent vector animations from multi-modal inputs, as existing methods either generate raster videos lacking editability or struggle with the complex JSON structure of Lottie files.
4. 🛠️ Methods: The authors develop OmniLottie using a specialized Lottie tokenizer for efficient representation, train on a curated MMLottie-2M dataset with 2 million animations, and employ a pretrained vision-language model (Qwen2.5-VL) for autoregressive generation.
5. 📊 Results and Evaluation: OmniLottie achieves 88.3%, 93.3%, and 88.1% success rates for text-to-Lottie, text-image-to-Lottie, and video-to-Lottie tasks respectively, significantly outperforming baselines in visual quality (lowest FVD scores) and semantic alignment metrics.
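The tokenizer's core idea, converting nested Lottie JSON into a flat command-parameter sequence, can be sketched in a few lines. This is a toy illustration under stated assumptions: the command names (CMD_ANIMATION, CMD_SHAPE) and field abbreviations follow the paper's workflow figure, but the function, the JSON schema subset, and the flattening logic are simplified stand-ins, not OmniLottie's actual tokenizer.

```python
import json

# Toy sketch: flatten a Lottie-like JSON animation into a command-parameter
# token sequence. Command names come from the paper's figure; everything
# else (schema subset, helper name) is an illustrative assumption.

def tokenize_animation(lottie: dict) -> list:
    tokens = ["CMD_ANIMATION",
              lottie["fr"], lottie["ip"], lottie["op"],   # frame rate, in/out point
              lottie["w"], lottie["h"]]                   # canvas size
    for layer in lottie.get("layers", []):
        for shape in layer.get("shapes", []):
            tokens.append("CMD_SHAPE")
            # Emit vertex coordinates as flat spatial parameters.
            for x, y in shape["vertices"]:
                tokens.extend([x, y])
    return tokens

doc = {"fr": 60, "ip": 0, "op": 120, "w": 512, "h": 512,
       "layers": [{"shapes": [{"vertices": [(10, 20), (30, 40)]}]}]}

tokens = tokenize_animation(doc)
print(tokens)
# Far fewer symbols than the raw JSON string, mirroring (in spirit, not in
# exact numbers) the ~81% token reduction the paper reports.
print(len(tokens), "tokens vs", len(json.dumps(doc)), "JSON characters")
```

The point of the command-parameter view is that a language model predicts short, structured sequences instead of verbose JSON syntax (braces, quotes, key names), which is where the reported compression comes from.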


OmniLottie: Workflow Overview

- Phase 1: MMLottie-2M Dataset Construction. Web crawling (1.2M Lottie files), SVG conversion (800K animations), data cleaning (removing non-visual content), normalization (512×512, 0-60 fps), and multi-modal annotation.
- Phase 2: Lottie Tokenization. The Lottie JSON structure (layers of 5 types, transform properties, animation keyframes, effects and masks, metadata) is parameterized via command extraction, parameter mapping, offset-based encoding, and text tokenization, achieving 81% compression. The resulting Lottie vocabulary contains command, spatial, temporal, index, and text (Qwen) tokens, producing sequences such as CMD_ANIMATION FR IP OP W H and CMD_SHAPE X Y IX IY OX OY.
- Phase 3: OmniLottie Model. A pretrained Qwen2.5-VL backbone with an extended vocabulary (+Lottie tokens), trained autoregressively with cross-entropy loss on multi-modal input (text, image, video).
- Phase 4: Generation Tasks. Text-to-Lottie, text-image-to-Lottie, and video-to-Lottie.
- Phase 5: Output and Rendering. Detokenization followed by Lottie JSON generation.
Q1. What is the key innovation of the Lottie tokenizer in OmniLottie that makes it more efficient than directly generating raw JSON?
- It compresses Lottie files using lossy compression to reduce file size by 90%
- It converts JSON into command-parameter sequences, achieving 81% token reduction while preserving vector fidelity
- It replaces vector graphics with rasterized images that are easier to generate

Q2. Why does OmniLottie exclude certain layer types (Image, Audio, Camera) from its parameterization scheme?
- These layers are too computationally expensive to process during training
- These layers contain non-vector content or 3D complexities that cannot be fully parameterized in the model's design space
- These layers were accidentally omitted and will be added in future versions

Q3. What strategy does OmniLottie use to improve motion understanding by combining different data sources?
- It only uses professionally designed Lottie animations from web platforms for authentic motion patterns
- It trains exclusively on synthetic data generated from text-to-video models
- It mixes web-crawled Lottie files with SVG-derived animations augmented with procedural motions, finding a 30% SVG mixture optimal

Paper 2

OpenAutoNLU: Open Source AutoML Library for NLU

Published: 2026-03-02

Link: http://arxiv.org/pdf/2603.01824

1. 📘 Topic and Domain: The paper presents OpenAutoNLU, an open-source AutoML library specifically designed for natural language understanding tasks including text classification and named entity recognition.
2. 💡 Previous Research and New Ideas: The paper builds on existing AutoML frameworks (AutoIntent, AutoGluon, LightAutoML, H2O) but introduces automatic data-aware training regime selection that requires no manual configuration, choosing between AncSetFit, SetFit, or full fine-tuning based on dataset characteristics.
3. ❓ Problem: The paper addresses the challenge that existing AutoML frameworks lack ease of use and NLP-centric design, requiring complex configuration and failing to automatically select appropriate training methods based on data size and label distribution.
4. 🛠️ Methods: The authors use a deterministic method selection based on minimum per-class sample count (AncSetFit for 2-5 examples, SetFit for 5-80 examples, full transformer fine-tuning for >80 examples) with integrated data quality diagnostics, configurable OOD detection, and LLM-powered data augmentation.
5. 📊 Results and Evaluation: OpenAutoNLU achieved best or tied performance on 3 out of 4 intent classification benchmarks (HWU64, MASSIVE, SNIPS) with superior OOD detection capabilities, maintaining strong in-domain classification quality while effectively detecting out-of-distribution samples without explicit OOD supervision.
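The deterministic regime selection is simple enough to sketch directly. The thresholds below are taken from the summary (2-5 examples per class selects AncSetFit, 5-80 selects SetFit, more than 80 selects full fine-tuning); the function name and its API are illustrative, not OpenAutoNLU's actual interface.

```python
from collections import Counter

# Sketch of data-aware training-regime selection: pick a method from the
# minimum per-class sample count. Thresholds follow the paper's summary;
# the function signature is a hypothetical stand-in for the library's API.

def select_training_regime(labels: list) -> str:
    n_min = min(Counter(labels).values())
    if n_min < 2:
        raise ValueError("need at least 2 examples per class")
    if n_min <= 5:
        return "AncSetFit"      # very-few-shot: anchored SetFit
    if n_min <= 80:
        return "SetFit"         # few-shot contrastive fine-tuning
    return "fine-tuning"        # enough data for full transformer training

print(select_training_regime(["a"] * 3 + ["b"] * 100))    # AncSetFit (min = 3)
print(select_training_regime(["a"] * 50 + ["b"] * 90))    # SetFit (min = 50)
print(select_training_regime(["a"] * 100 + ["b"] * 200))  # fine-tuning
```

Note that the rarest class alone drives the decision, which is why (per quiz Q3 below) underrepresented classes matter even when most classes are data-rich.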


OpenAutoNLU Workflow

- Training data passes through optional data-quality diagnostics (Retag, Uncertainty, V-Info, Cartography).
- Data-aware method selection: 2 ≤ nₘᵢₙ ≤ 5 → AncSetFit; 5 < nₘᵢₙ ≤ 80 → SetFit; nₘᵢₙ > 80 → fine-tuning.
- Data optimization: upsampling, downsampling, augmentation.
- OOD detection: Mahalanobis, max softmax, logit-based.
- HPO engine (Optuna); model export to ONNX / Torch for inference.
- LLM integration: test generation, augmentation, domain analysis.
- NER support via BIO tagging.
- Key features: automatic regime selection, integrated data-quality tools, configurable OOD detection, and a unified API for classification and NER.
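Of the OOD detectors listed in the workflow, the Mahalanobis approach is the one that needs no OOD supervision: fit a Gaussian to in-domain embeddings and flag points far from the fitted distribution. A minimal sketch on synthetic data (all numbers and shapes here are assumptions for illustration, not OpenAutoNLU's implementation):

```python
import numpy as np

# Minimal Mahalanobis-distance OOD scoring: estimate mean and covariance
# from in-domain embeddings, then score new points by their distance to
# that distribution. Larger distance = more likely out-of-distribution.

rng = np.random.default_rng(0)
in_domain = rng.normal(0.0, 1.0, size=(500, 8))   # synthetic 8-dim embeddings
mean = in_domain.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(in_domain, rowvar=False))

def mahalanobis(x: np.ndarray) -> float:
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

near = mahalanobis(rng.normal(0.0, 1.0, size=8))  # drawn from the same distribution
far = mahalanobis(np.full(8, 10.0))               # clearly out-of-distribution
print(near < far)  # the OOD point scores a larger distance
```

In practice the class-conditional variant (distance to the nearest class mean, shared covariance) is common; thresholding the score then yields an accept/reject decision without any labeled OOD examples.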
Q1. What is the key innovation of OpenAutoNLU's data-aware method selection?
- It uses a meta-learning model trained on thousands of datasets to predict the best algorithm
- It deterministically selects training methods based on minimum per-class sample count thresholds
- It runs all methods in parallel and chooses the best performer

Q2. How does OpenAutoNLU handle out-of-distribution (OOD) detection differently from AutoIntent?
- OpenAutoNLU can detect OOD samples without explicit OOD supervision during training
- OpenAutoNLU requires more OOD training examples than AutoIntent
- OpenAutoNLU only works with supervised OOD detection

Q3. What happens when OpenAutoNLU encounters a dataset where 40% of classes have fewer than 80 training examples?
- It discards the underrepresented classes and only trains on well-represented ones
- It automatically upsamples underrepresented classes using data augmentation to reach 81 examples
- It switches to a completely different algorithm designed for imbalanced datasets

Paper 3

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Published: 2026-03-02

Link: http://arxiv.org/pdf/2603.01562

1. 📘 Topic and Domain: The paper addresses rubric-based evaluation for reward models in large language model alignment, focusing on benchmarking how well models can generate and apply evaluation criteria.
2. 💡 Previous Research and New Ideas: The paper builds on existing reward model benchmarks (RewardBench, RM-Bench) and rubric-guided evaluation paradigms, proposing RubricBench as the first benchmark with human-annotated rubrics for assessing model-generated evaluation criteria.
3. ❓ Problem: The paper aims to solve the lack of reliable benchmarks for rubric-guided evaluation and the gap between model-generated and human-quality evaluation rubrics in reward modeling.
4. 🛠️ Methods: The authors curated 1,147 challenging preference pairs through multi-dimensional filtering, annotated them with expert-derived atomic rubrics, and evaluated models under three conditions: vanilla, self-generated rubrics, and human-annotated rubrics.
5. 📊 Results and Evaluation: Results show a 27% accuracy gap between model-generated and human rubrics, with rubric-aware models reaching ~58% accuracy versus 40-47% for traditional approaches, demonstrating that rubric quality is the primary bottleneck in evaluation performance.
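The rubric-guided evaluation paradigm the paper benchmarks can be sketched as scoring each candidate response against a set of atomic rubric checks and preferring the response that satisfies more of them. The checker functions and rubric items below are toy stand-ins, not RubricBench's actual annotations or judge model (which uses an LLM, not string matching):

```python
# Toy sketch of rubric-guided preference evaluation: each response is
# checked against atomic rubric items; the one satisfying more items wins.
# Real rubric application uses an LLM judge, not substring checks.

def rubric_verdict(response_a: str, response_b: str, rubrics: list) -> str:
    score_a = sum(check(response_a) for check in rubrics)
    score_b = sum(check(response_b) for check in rubrics)
    return "A" if score_a >= score_b else "B"

# The paper's annotation stage yields 2-10 atomic rubrics per item;
# here are three illustrative checks for a hypothetical coding prompt.
rubrics = [
    lambda r: "sort" in r,          # mentions the required operation
    lambda r: "O(n log n)" in r,    # states the complexity constraint
    lambda r: len(r) < 200,         # stays concise
]

a = "Use merge sort, which runs in O(n log n) time."
b = "Just loop over the list twice."
print(rubric_verdict(a, b, rubrics))  # A
```

The benchmark's three conditions differ only in where `rubrics` comes from: nowhere (vanilla, direct verdict), generated by the model itself (~58% accuracy), or authored by human experts (~85% accuracy), which is what exposes the 27% rubric gap.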


RubricBench: Methodology Flow

- Stage I: Data Curation. Source pools across 5 domains, filtered for input complexity, output surface bias, and process failures.
- Stage II: Rubric Annotation. Instruction-only analysis extracting explicit and implicit rules, yielding 2-10 atomic rubrics per item.
- Stage III: Quality Control. Expert reconciliation, structural validation, and stress testing, producing RubricBench: 1,147 samples with human rubrics.
- Evaluation framework: Vanilla (direct preference verdict, no intermediate reasoning, ~40-47% accuracy); Self-Generated Rubrics (model derives rubrics, then verifies responses, ~58% accuracy); Human Rubrics (expert-authored ground-truth constraints, ~85% accuracy).
- Key finding, the Rubric Gap: a 27% performance gap between self-generated and human-annotated rubrics.
Q1. What is the 'Rubric Gap' identified in the paper, and what does it reveal about current LLMs?
- A 27% performance deficit showing that models struggle to autonomously generate valid evaluation criteria compared to using human-annotated rubrics
- A 58% accuracy ceiling that represents the maximum performance achievable by any rubric-based evaluation system
- A 40-47% baseline accuracy that shows traditional reward models perform better than rubric-aware approaches

Q2. According to the paper, what happens when you scale test-time compute (more rubrics or refinement steps) for model-generated rubrics?
- Performance improves linearly, eventually matching human rubric quality at 32 generated rubrics
- Diminishing or even negative returns occur, while scaling human rubrics shows a robust positive correlation
- The rubric gap closes completely after 3 refinement iterations, proving compute can solve quality issues

Q3. What type of failure mode does the paper identify when models generate rubrics for impossible or underspecified tasks?
- Models correctly identify task infeasibility and generate rubrics that enforce honest refusal
- Models suffer from 'Attention Displacement', focusing on surface implementation details while missing feasibility constraints
- Models simply refuse to generate any rubrics when encountering ambiguous instructions