2025-07-31 Papers

Paper 1

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Published: 2025-07-30

Link: http://arxiv.org/pdf/2507.22827

1. 📘 Topic and Domain: Automating the conversion of UI designs into front-end code using vision-language models and multi-agent systems.
2. 💡 Previous Research and New Ideas: Builds on prior vision-language model and UI-to-code generation research, proposing a modular multi-agent framework that decomposes the task into grounding, planning, and generation stages.
3. ❓ Problem: Addressing the limitations of existing text-based and vision-based code generation systems that struggle with capturing spatial layouts and visual design intent in UI development.
4. 🛠️ Methods: Implements a three-stage pipeline with a grounding agent for UI component detection, a planning agent for hierarchical layout construction, and a generation agent for HTML/CSS code synthesis, plus dual-stage post-training of vision-language models.
5. 📊 Results and Evaluation: Achieves state-of-the-art performance across five metrics (block match, text similarity, position alignment, color consistency, and CLIP similarity), outperforming existing open-source models and competing with proprietary systems.
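The three-stage pipeline can be sketched as a chain of agent functions. This is an illustrative sketch only: the class and function names are hypothetical, and the grounding step stands in for the paper's actual vision-language model call.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A detected UI component: semantic label plus bounding box."""
    label: str                      # e.g. "header", "sidebar", "content"
    bbox: tuple                     # (x, y, width, height) in pixels
    children: list = field(default_factory=list)

def grounding_agent(screenshot) -> list:
    """Detect major UI regions and label them (stand-in for the VLM call)."""
    # In the paper this is a vision-language model; here we return a fixed example.
    return [Component("header", (0, 0, 1280, 80)),
            Component("sidebar", (0, 80, 240, 720)),
            Component("content", (240, 80, 1040, 720))]

def planning_agent(components: list) -> Component:
    """Organize detected components into a hierarchical layout tree."""
    root = Component("page", (0, 0, 1280, 800))
    # A real planner would apply spatial heuristics; we simply order by position.
    root.children = sorted(components, key=lambda c: (c.bbox[1], c.bbox[0]))
    return root

def generation_agent(tree: Component) -> str:
    """Emit HTML with grid placeholders for each node of the layout tree."""
    inner = "\n".join(f'  <div class="{c.label}"><!-- placeholder --></div>'
                      for c in tree.children)
    return f'<div class="page" style="display: grid;">\n{inner}\n</div>'

html = generation_agent(planning_agent(grounding_agent(screenshot=None)))
```

The split mirrors the paper's design: perception errors stay in the grounding stage, layout errors in planning, and code-style issues in generation, so each agent can be improved independently.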

[Diagram] ScreenCoder: Modular Multimodal Agents Framework
- Input: UI screenshots and design sketches.
- Grounding Agent: vision-language model for component detection and semantic labeling; outputs bounding boxes with labels (sidebar, header, navigation, content).
- Planning Agent: builds a hierarchical layout tree using CSS Grid-based design and spatial heuristics; outputs layout tree T with grid configurations.
- Generation Agent: adaptive prompt-based HTML/CSS synthesis with interactive design support; outputs HTML/CSS code with placeholders.
- Placeholder mapping: UI element detection (UIED) plus image restoration.
- Scalable data engine for VLM enhancement: generates 50K UI-code pairs from diverse domains and real-world sources; cold-start supervised fine-tuning of Qwen2.5-VL (autoregressive training), then reinforcement learning with GRPO optimization and a multi-reward system for visual-semantic alignment.
- Result: enhanced VLM with improved UI understanding, better code quality, and state-of-the-art performance.
- Evaluation metrics: block match, text similarity, position, color, CLIP.
Q1
1. What is the main limitation of existing text-based UI-to-code generation systems that ScreenCoder aims to address?
They are too slow in processing user inputs
They require extremely long and verbose prompts to capture spatial relationships
They can only work with simple layouts
Q2
2. Which stage in ScreenCoder's pipeline is responsible for organizing UI components into a hierarchical structure?
The Grounding Agent
The Planning Agent
The Generation Agent
Q3
3. How does ScreenCoder improve the training of vision-language models?
By using transfer learning from existing code repositories
By creating synthetic UI designs randomly
By functioning as a data engine to generate large-scale image-code pairs

Paper 2

VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Published: 2025-07-30

Link: http://arxiv.org/pdf/2507.22607

1. 📘 Topic and Domain: The paper focuses on multimodal reasoning in AI, specifically developing a reinforcement learning approach to improve visual-language models' reasoning capabilities across diverse tasks.
2. 💡 Previous Research and New Ideas: It builds on previous reinforcement learning work in language models and extends it to multimodal reasoning, proposing a novel progressive curriculum learning framework with dynamic length rewards.
3. ❓ Problem: The paper aims to solve the challenge of unstable performance of multimodal models across different domains and difficulty levels of reasoning tasks.
4. 🛠️ Methods: The authors developed PCuRL (Progressive Curriculum Reinforcement Learning) framework with two key components: online difficulty soft weighting for curriculum learning and dynamic length reward mechanism to adapt reasoning path lengths.
5. 📊 Results and Evaluation: VL-Cogito achieved state-of-the-art or highly competitive performance across multiple multimodal benchmarks spanning mathematics, science, logic and general understanding domains, demonstrating consistent improvements without requiring cold-start initialization.
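The two PCuRL components lend themselves to a quick sketch. The shapes below (a sine-plus-constant weight peaking at accuracy ≈ 0.5, and a cosine-shaped length reward peaking at the target length) follow the paper's description, but they are plausible interpretations rather than the paper's exact formulas.

```python
import math

def odsw_weight(acc: float) -> float:
    """Illustrative online difficulty soft weight for a rollout group with
    empirical accuracy `acc`; peaks at acc == 0.5 (maximal learnability)."""
    base = 0.1                               # constant floor so no sample is ignored
    return base + (1 - base) * math.sin(math.pi * acc)

def dynamic_length_reward(length: int, target: float) -> float:
    """Illustrative DyLR: highest when the response length matches the
    average length of correct responses (the target)."""
    if target <= 0:
        return 0.0
    ratio = min(length / target, 2.0)        # cap the deviation at 2x target
    return 0.5 * (1 + math.cos(math.pi * (ratio - 1.0)))  # peaks at ratio == 1
```

The soft weighting keeps gradient signal concentrated on problems the model solves about half the time, while the length reward discourages both truncated and padded reasoning chains.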

[Diagram] VL-Cogito: Progressive Curriculum Reinforcement Learning (PCuRL)
- Data curation: 23 datasets across 6 task categories, converted to open-ended format, with difficulty sampling (>50% accuracy filter).
- Easy stage (100 steps): ODSW-easy weighting, focus on simple tasks, accuracy + format rewards, stable foundation building.
- Medium stage (100 steps): ODSW-medium weighting, moderate difficulty, accuracy + format rewards, progressive enhancement.
- Hard stage (~200 steps, 1 epoch): ODSW-hard weighting, complex tasks, accuracy + format + dynamic length rewards, deep reasoning capability.
- Online Difficulty Soft Weighting (ODSW): dynamic weight adjustment based on rollout accuracy via a sine-plus-constant function F(Acc); emphasizes optimal learnability (Acc ≈ 0.5) with smooth transitions between difficulty levels.
- Dynamic Length Reward (DyLR): adapts reasoning length to task complexity; the target length is the average length of correct responses, and a cosine function computes the length reward, balancing efficiency with reasoning depth.
- Foundation: Group Relative Policy Optimization (GRPO) with advantage estimation within response groups and a clipped surrogate objective.
- Output: the VL-Cogito model with enhanced multimodal reasoning capabilities.
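The GRPO foundation reduces to two pieces: advantages normalized within each rollout group, and a PPO-style clipped surrogate term. A minimal sketch of both (standard GRPO form, not code from the paper):

```python
import statistics

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage: normalize each reward against its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                     # all rollouts scored equally: no signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate term for one response, given the policy probability
    ratio (new / old) and its group-relative advantage."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Because the advantage is computed within a group of responses to the same prompt, GRPO needs no separate value network, which is part of why it pairs well with curriculum-style reweighting.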
Q1
1. What is the main innovation of the PCuRL framework that distinguishes it from previous approaches?
Its use of binary weighting for task difficulty
Its dynamic length reward mechanism that adapts to task complexity
Its requirement for cold-start initialization
Q2
2. Why did the authors introduce the dynamic length reward only in the hard stage of curriculum learning?
To save computational resources in earlier stages
Because it wasn't necessary for simpler tasks
To allow free exploration in early stages while strengthening complex reasoning capabilities later
Q3
3. What unique feature of VL-Cogito's training process sets it apart from competing models like R1-VL and OpenVLThinker?
It achieves superior performance without requiring cold-start warm-up
It uses a much larger training dataset
It requires more computational resources

Paper 3

ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge

Published: 2025-07-29

Link: http://arxiv.org/pdf/2507.21990

1. 📘 Topic and Domain: The paper presents ChemDFM-R, a chemical reasoner large language model enhanced with atomized chemical knowledge, in the domain of chemistry and artificial intelligence.
2. 💡 Previous Research and New Ideas: Based on previous work in general domain reasoning LLMs and chemical LLMs, it proposes incorporating atomized functional group knowledge and developing chemical-specific reasoning capabilities through a novel training pipeline.
3. ❓ Problem: The paper addresses the limitations of current LLMs in chemistry: shallow domain understanding and limited reasoning capabilities that hinder reliable practical applications.
4. 🛠️ Methods: The method involves: 1) Constructing a functional group-centric pretraining corpus (ChemFG), 2) Domain pretraining and instruction tuning, 3) Mix-sourced distillation combining expert knowledge with general reasoning skills, 4) Domain-specific reinforcement learning.
5. 📊 Results and Evaluation: ChemDFM-R achieved state-of-the-art performance on chemical benchmarks while providing interpretable rationales. It demonstrated strong chemical reasoning capabilities and enabled reliable human-AI collaboration in chemistry research scenarios.
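The mix-sourced distillation ratio (70% direct, 22% pseudo CoT, 8% teacher-distilled) can be illustrated with a simple weighted sampler. `sample_source` is hypothetical: the paper builds a fixed training corpus with these proportions rather than sampling online.

```python
import random

# Mixing ratio reported for ChemDFM-R's mix-sourced distillation corpus.
MIX = [("direct", 0.70), ("pseudo_cot", 0.22), ("teacher", 0.08)]

def sample_source(rng: random.Random) -> str:
    """Pick which corpus a training example is drawn from, by the 70/22/8 mix."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIX:
        cumulative += weight
        if r < cumulative:
            return name
    return MIX[-1][0]                # guard against floating-point round-off

rng = random.Random(0)
counts = {name: 0 for name, _ in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

The heavy weighting toward direct examples keeps domain knowledge dominant, while the small teacher slice (DeepSeek-R1, o3-mini) injects general reasoning style without overwhelming the chemical signal.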

[Diagram] ChemDFM-R: Chemical Reasoner LLM with Atomized Knowledge
- Phase 1, ChemFG dataset construction: raw data collection (101B tokens) from literature (12M papers), molecules (30M compounds), and reactions (7M reactions); a functional group identification toolkit covering 241 functional groups.
- Phase 2, atomized knowledge enhancement: domain pre-training on ChemFG (base: Qwen2.5-14B), then instruction tuning on chemical + general tasks (1:2 ratio), yielding the ChemDFM-I model.
- Phase 3, chemical rationale learning: mix-sourced distillation (70% direct + 22% pseudo CoT + 8% teacher; teachers: DeepSeek-R1, o3-mini), followed by reinforcement learning with DAPO using format + accuracy rewards, yielding ChemDFM-R.
- Evaluation results: SciKnowEval 0.70 and ChemEval 0.78, outperforming GPT-4o and DeepSeek-R1; rationale analysis finds 67% high-quality rationales, 23% with minor flaws, and 10% with substantial issues; human-AI collaboration gains enhanced reliability, transparent reasoning, and error detection capability.
- Technical details: literature corpus of 79B tokens; 30M molecules with properties; 7M reactions with structural changes; functional group annotation >90% accuracy with expert quality control.
- Functional group categories (241 total): nitrogen (62), oxygen (36), sulfur (85), halogen (14), phosphorus (17), others (27); covers hydrocarbon, boron, silicon, organometallic, and aromatic groups, enabling precise molecular analysis and reaction mechanism understanding.
- Key innovations: functional group toolkit, mix-sourced distillation, chemical reasoning focus; positioned as the first chemical reasoner LLM with atomized knowledge and transparent reasoning.
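A toy illustration of the functional-group identification idea. Real toolkits, including the paper's 241-group toolkit, use proper substructure matching (e.g. SMARTS patterns via RDKit); this sketch only does naive substring checks on SMILES strings and the pattern table is illustrative, not the paper's.

```python
# Toy functional-group spotter: substring checks on SMILES (illustrative only;
# real chemistry toolkits use SMARTS-based substructure matching instead).
FUNCTIONAL_GROUPS = {
    "carboxylic_acid": "C(=O)O",
    "amide": "C(=O)N",
    "nitrile": "C#N",
}

def detect_groups(smiles: str) -> set:
    """Return the names of toy patterns found as substrings of `smiles`."""
    return {name for name, pattern in FUNCTIONAL_GROUPS.items()
            if pattern in smiles}
```

Annotating each molecule with its functional groups is what the paper means by "atomized" knowledge: reasoning can then proceed group by group instead of treating a molecule as an opaque string.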
Q1
1. What is the key innovation in ChemDFM-R's training data compared to previous chemical LLMs?
Using a larger volume of chemical literature
Incorporating atomized functional group knowledge
Including more chemical reaction examples
Q2
2. How does ChemDFM-R improve the reliability of human-AI collaboration?
By generating more accurate predictions
By providing interpretable reasoning chains
By having a larger knowledge base
Q3
3. What unique approach does ChemDFM-R take in its distillation process?
Using only expert-curated knowledge
Relying solely on general domain reasoning skills
Combining expert knowledge with general reasoning capabilities