2025-07-22 Papers


Paper 1

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15846

1. 📘 Topic and Domain: The paper focuses on GUI grounding - mapping natural language instructions to precise interface locations for automated computer interaction.
2. 💡 Previous Research and New Ideas: Previous work relied on binary hit-or-miss rewards for GUI interaction; this paper instead introduces continuous Gaussian reward modeling, motivated by the observation that human clicks naturally form Gaussian distributions centered on target elements.
3. ❓ Problem: The paper addresses the limitation of binary hit-or-miss reward systems that create sparse learning signals and ignore the continuous nature of spatial interactions in GUI elements.
4. 🛠️ Methods: The authors developed GUI-G², which uses dual Gaussian rewards (point rewards for precision and coverage rewards for spatial overlap) with adaptive variance mechanisms that scale based on element dimensions.
5. 📊 Results and Evaluation: The approach achieved state-of-the-art performance across three benchmarks (ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro), with up to 24.7% improvement over baseline models while using fewer parameters.

GUI-G² workflow (figure): Screenshot + instruction → policy model (Qwen2.5-VL-7B) → N sampled predictions → Gaussian reward modeling → GRPO policy update. Key components recovered from the figure:

1. Gaussian point reward (localization precision): R_point = exp(-½[(c_x^p − c_x^gt)²/σ_x² + (c_y^p − c_y^gt)²/σ_y²]); smooth exponential decay from the center with continuous gradients everywhere.
2. Gaussian coverage reward (spatial overlap): Bhattacharyya coefficient BC(N_p, N_gt) = ∫√(N_p · N_gt) dx; captures regional interactions and size/shape similarity.
3. Adaptive variance mechanism: σ_x = α·(x₂ − x₁), σ_y = α·(y₂ − y₁), with α = 0.5 found optimal; scales with element dimensions to handle diverse GUI element sizes.
4. Combined reward: R_total = ν·R_point + γ·R_coverage, optimized with GRPO (group advantage estimation, KL regularization).
5. Results: ScreenSpot 92.0% (+4.1), ScreenSpot-v2 93.3% (+3.3), ScreenSpot-Pro 47.5% (+24.7).
6. Cited benefits: dense feedback, smooth gradients, spatial awareness, scale adaptivity, better convergence, human-aligned, robust to noise, generalizable.
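The dual Gaussian reward above can be sketched in a few lines of Python. The formulas follow the figure (σ = α · element extent, with α = 0.5 reported as optimal; Bhattacharyya coefficient for the coverage term), while the combination weights ν and γ are not given numerically here, so defaulting them to 1.0 is an illustrative assumption.

```python
import math

def adaptive_sigma(x1, y1, x2, y2, alpha=0.5):
    # Spread scales with element size; the paper reports alpha = 0.5 as optimal.
    return alpha * (x2 - x1), alpha * (y2 - y1)

def point_reward(pred_center, gt_center, sigma):
    # R_point = exp(-1/2 [(cx_p - cx_gt)^2 / sx^2 + (cy_p - cy_gt)^2 / sy^2])
    (px, py), (gx, gy), (sx, sy) = pred_center, gt_center, sigma
    return math.exp(-0.5 * (((px - gx) / sx) ** 2 + ((py - gy) / sy) ** 2))

def coverage_reward(mu_p, sig_p, mu_gt, sig_gt):
    # Bhattacharyya coefficient between two axis-aligned 2-D Gaussians,
    # computed per axis and multiplied (diagonal covariances factorize).
    bc = 1.0
    for mp, sp, mg, sg in zip(mu_p, sig_p, mu_gt, sig_gt):
        var = 0.5 * (sp ** 2 + sg ** 2)
        dist = 0.125 * (mp - mg) ** 2 / var + 0.5 * math.log(var / (sp * sg))
        bc *= math.exp(-dist)
    return bc

def total_reward(pred_center, pred_sigma, gt_center, gt_sigma, nu=1.0, gamma=1.0):
    # R_total = nu * R_point + gamma * R_coverage (nu = gamma = 1.0 assumed here).
    return (nu * point_reward(pred_center, gt_center, gt_sigma)
            + gamma * coverage_reward(pred_center, pred_sigma, gt_center, gt_sigma))
```

A prediction exactly at the target center with matching spread scores 1.0 on both terms; off-center predictions decay smoothly instead of dropping to zero outside the box, which is the dense-signal property the paper argues for.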
Q1. What key insight about human behavior inspired the GUI-G² reward system?
- Humans tend to click randomly on interface elements
- Human clicks naturally form Gaussian distributions centered on target elements
- Humans prefer clicking on the edges of interface elements

Q2. Why did the authors introduce an adaptive variance mechanism in their approach?
- To make the system more computationally efficient
- To handle elements of different sizes appropriately, from tiny icons to full-screen panels
- To reduce the training time of the model

Q3. What surprising finding did the study reveal about 'thinking' in GUI grounding tasks?
- Thinking improved performance by 5%
- Thinking had no effect on performance
- Explicit reasoning significantly harmed performance while using more tokens

Paper 2

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15852

1. 📘 Topic and Domain: Video Object Segmentation (VOS) with a focus on complex temporal tracking and segmentation of target objects across video frames.
2. 💡 Previous Research and New Ideas: Builds on SAM 2 and memory-based VOS methods, proposing a concept-driven framework that uses a Large Vision-Language Model for semantic understanding rather than relying on appearance matching alone.
3. ❓ Problem: Current VOS techniques struggle with drastic visual variations, occlusions, and complex scene changes due to their reliance on appearance matching rather than conceptual understanding.
4. 🛠️ Methods: Introduces Segment Concept (SeC), which uses LVLMs to integrate visual cues across frames to build conceptual priors, combined with a scene-adaptive activation strategy that balances LVLM-based reasoning with feature matching.
5. 📊 Results and Evaluation: Achieved significant improvements over state-of-the-art approaches, including an 11.8-point improvement over SAM 2.1 on their new SeCVOS benchmark, and consistent superior performance across standard VOS benchmarks.

SeC workflow (figure): Input video frames → HSV-based scene-change detection → if no scene change, pixel-level association via SAM 2 memory; if a scene change is detected, LVLM concept extraction with InternVL 2.5 → concept guidance fused via cross-attention and pointwise addition → SAM 2 mask decoder → final segmentation mask. Additional details recovered from the figure:

1. Keyframe bank maintained as a FIFO buffer.
2. Two-stage training: memory module first, then LVLM fine-tuning.
3. SeCVOS benchmark: 160 videos with multi-scene complexity.
4. Key innovations: concept-driven approach, scene-adaptive activation, LVLM integration, progressive concept construction, robustness to appearance changes.
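SeC gates its expensive LVLM reasoning behind an HSV-based scene-change detector. The paper's exact detector and threshold are not reproduced here, so the sketch below is an assumption: a normalized HSV histogram per frame, compared by histogram intersection, with a hypothetical cut threshold.

```python
import numpy as np

def hsv_histogram(frame_hsv, bins=(8, 4, 4)):
    # Normalized 3-D histogram over the H, S, V channels of an (H, W, 3) uint8 frame.
    hist, _ = np.histogramdd(
        frame_hsv.reshape(-1, 3).astype(float),
        bins=bins, range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()

def is_scene_change(prev_hsv, cur_hsv, threshold=0.5):
    # Flag a scene cut when histogram intersection drops below the threshold.
    # The 0.5 value is an illustrative assumption, not the paper's setting.
    h1, h2 = hsv_histogram(prev_hsv), hsv_histogram(cur_hsv)
    intersection = np.minimum(h1, h2).sum()
    return intersection < threshold
```

Frames within one shot share most of their color mass, so the LVLM path would fire only on the minority of frames where the intersection collapses, which is consistent with the quiz's "less than 10% of frames" figure.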
Q1. What is the main innovation of SeC compared to traditional VOS methods?
- Using more training data
- Incorporating LVLMs for concept-level understanding
- Improving pixel-level feature matching

Q2. How often does SeC activate its LVLM-based concept reasoning during video processing?
- For every single frame
- Less than 10% of frames
- Only at the start of the video

Q3. What distinguishes the SeCVOS benchmark from existing VOS datasets?
- It has longer video sequences
- It has higher resolution videos
- It has more scene transitions and complex semantic changes

Paper 3

GR-3 Technical Report

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15493

1. 📘 Topic and Domain: Development of GR-3, a large-scale vision-language-action (VLA) model for robotic manipulation and control.
2. 💡 Previous Research and New Ideas: Builds on prior VLA models and pre-trained vision-language models, proposing co-training with web-scale vision-language data and efficient few-shot fine-tuning from human trajectory data.
3. ❓ Problem: Addressing the challenge of creating generalist robot policies that can handle novel objects, environments, and instructions while performing complex long-horizon tasks reliably.
4. 🛠️ Methods: Combines three-way training on robot trajectory data, web-scale vision-language data, and human trajectory data collected via VR devices, implemented with a flow-matching action head and RMSNorm for training stability.
5. 📊 Results and Evaluation: GR-3 outperformed the π0 baseline across three challenging tasks (pick-and-place, table bussing, cloth manipulation), showing superior instruction following and generalization to novel scenarios.
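GR-3's action head is trained with flow matching. A minimal sketch of that objective (a linear noise-to-action path whose constant velocity the network regresses) is below; the Action DiT is stubbed as a plain function, and all names are hypothetical rather than the report's API.

```python
import numpy as np

def flow_matching_loss(model, actions, cond, noise, t):
    # Conditional flow matching: interpolate between noise and the target
    # action chunk along a linear path, and regress the path's velocity.
    # `actions` is a (batch, chunk_len, dof) action chunk; `t` broadcasts over it.
    x_t = (1.0 - t) * noise + t * actions   # point on the linear probability path
    target_v = actions - noise              # constant velocity of that path
    pred_v = model(x_t, t, cond)            # the Action DiT would sit here
    return float(np.mean((pred_v - target_v) ** 2))
```

At inference time the same model would be integrated from pure noise toward an action chunk, so one trained network yields the k-length chunks the figure describes.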

GR-3 training and evaluation workflow (figure):

1. Data sources: robot trajectory data (teleoperation with task-status supervision), vision-language data (image captioning, VQA, grounding tasks), and human trajectory data collected via VR (PICO 4, ~450 trajectories/hour).
2. Architecture: Qwen2.5-VL-3B VLM backbone taking multi-camera and language input, feeding an Action DiT trained with flow matching (RMSNorm, causal attention); outputs a k-length action chunk for 19-DoF control.
3. Training objectives: flow-matching loss on robot and human data; next-token prediction on vision-language data.
4. Hardware: ByteMini robot with 22-DoF bi-manual arms, sphere wrist joints, mobile base with lift, whole-body compliance control, head and wrist RGBD cameras, and VR teleoperation.
5. Evaluation: generalizable pick-and-place (basic setting, unseen environments/instructions/objects, few-shot from human data), long-horizon table bussing (flat and instruction-following settings, multi-object, multi-destination, novel invalid tasks), and dexterous cloth manipulation (basic position and unseen instances; four milestones: pick hanger → right shoulder → left shoulder → hang).
6. Key results: GR-3 outperforms the π0 baseline across all tasks, with strong generalization, few-shot adaptation, and robust long-horizon performance.
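The RMSNorm that the report adds after linear layers normalizes each vector by its root-mean-square, with no mean subtraction, then applies a learned gain. A minimal NumPy sketch:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm over the last axis: x / RMS(x) * gain, no mean centering.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Because the output's scale is pinned to roughly unit RMS regardless of the input's magnitude, activations after each linear layer stay bounded, which is the training-stability effect the report attributes to this change.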
Q1. What unique combination of data sources does GR-3 use in its training recipe?
- Only robot trajectory data and simulation data
- Robot trajectory data, human VR data, and vision-language web data
- Only vision-language data and synthetic robot data

Q2. What key innovation in GR-3's architecture helped improve training stability and language following capabilities?
- The use of attention masks in the transformer
- The addition of RMSNorm after linear layers
- The implementation of a larger model size

Q3. How many human trajectories per object were needed to achieve successful adaptation to novel objects in the pick-and-place experiments?
- 100 trajectories
- 50 trajectories
- 10 trajectories