1. 📘 Topic and Domain: The paper focuses on automated rubric generation for evaluating open-ended language model outputs across multiple domains including medical, science, writing, instruction following, and chat.
2. 💡 Previous Research and New Ideas: Building on existing rubric-based evaluation methods and the LLM-as-a-Judge paradigm, the paper proposes a novel Coarse-to-Fine Rubric Generation framework that combines principle-guided synthesis, multi-model aggregation, and difficulty evolution to produce highly discriminative evaluation criteria.
3. ❓ Problem: The paper addresses the lack of ground truth in open-ended generation tasks and the limitations of existing rubric-based methods, including manual creation bottlenecks, narrow domain coverage, and low discriminability leading to supervision ceiling effects.
4. 🛠️ Methods: The authors employ a three-stage automated framework: (1) response-grounded, principle-guided rubric generation, (2) multi-model aggregation to reduce single-model bias, and (3) difficulty evolution to enhance discriminability. The resulting rubrics then drive two training schemes: rubric-based rejection-sampling fine-tuning (RuFT) and rubric-based reinforcement learning (RuRL).
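The three-stage pipeline and the RuFT selection step can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names are hypothetical, and the LLM judge calls are replaced by stubs so the control flow is visible.

```python
# Hypothetical sketch of the coarse-to-fine rubric pipeline.
# All names are illustrative; real LLM calls are stubbed out.

def generate_rubric(prompt, response, principles, model):
    """Stage 1: response-grounded, principle-guided generation.
    Each judge model drafts criteria conditioned on shared principles
    and the concrete response. (Stub: one criterion per principle.)"""
    return [f"{model}:{p}" for p in principles]

def aggregate(rubrics):
    """Stage 2: multi-model aggregation. Merge criteria across judge
    models and deduplicate, reducing any single model's bias."""
    merged, seen = [], set()
    for rubric in rubrics:
        for criterion in rubric:
            key = criterion.split(":", 1)[1]  # compare by content, not by model
            if key not in seen:
                seen.add(key)
                merged.append(criterion)
    return merged

def evolve_difficulty(rubric):
    """Stage 3: difficulty evolution. Rewrite each criterion to be
    stricter and more discriminative. (Stub: tag it as hardened.)"""
    return [f"[hard] {c}" for c in rubric]

def build_rubric(prompt, response, principles, judge_models):
    """Full coarse-to-fine pipeline: draft -> aggregate -> evolve."""
    drafts = [generate_rubric(prompt, response, principles, m)
              for m in judge_models]
    return evolve_difficulty(aggregate(drafts))

def rejection_sample(responses, rubric, score_fn):
    """RuFT-style data selection: score each candidate response against
    the rubric and keep the best one as a fine-tuning target."""
    return max(responses, key=lambda r: score_fn(r, rubric))
```

In this sketch, the deduplication key and the `[hard]` tag stand in for LLM-driven merging and criterion rewriting; the point is only the data flow from per-model drafts to a single discriminative rubric, and from that rubric to rejection-sampled training data.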
5. 📊 Results and Evaluation: The resulting RubricHub dataset (~110k samples) enables Qwen3-14B to achieve state-of-the-art performance on HealthBench (69.3), surpassing GPT-5 (67.2), with consistent improvements across five evaluation domains demonstrating the framework's effectiveness in unlocking model potential.