2025-07-22 Papers


Paper 1

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15846

1. 📘 Topic and Domain: The paper focuses on GUI grounding - mapping natural language instructions to precise interface locations for automated computer interaction.
2. 💡 Previous Research and New Ideas: Previous work relied on binary hit-or-miss rewards for GUI interaction; this paper instead introduces continuous Gaussian reward modeling, motivated by the observation that human clicks naturally form Gaussian distributions centered on target elements.
3. ❓ Problem: The paper addresses the limitation of binary hit-or-miss reward systems that create sparse learning signals and ignore the continuous nature of spatial interactions in GUI elements.
4. 🛠️ Methods: The authors developed GUI-G², which uses dual Gaussian rewards (point rewards for precision and coverage rewards for spatial overlap) with adaptive variance mechanisms that scale based on element dimensions.
5. 📊 Results and Evaluation: The approach achieved state-of-the-art performance across three benchmarks (ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro), with up to 24.7% improvement over baseline models while using fewer parameters.

GUI-G² workflow (figure): Screenshot + instruction → policy model (Qwen2.5-VL-7B) → N sampled predictions → Gaussian reward modeling → GRPO policy update. Key components recovered from the figure:

1. Gaussian point reward (localization precision): R_point = exp(-½[(c_x^p − c_x^gt)²/σ_x² + (c_y^p − c_y^gt)²/σ_y²]); smooth exponential decay from the center with continuous gradients everywhere.
2. Gaussian coverage reward (spatial overlap): Bhattacharyya coefficient BC(N_p, N_gt) = ∫√(N_p · N_gt) dx; captures regional interactions and size/shape similarity.
3. Adaptive variance mechanism: σ_x = α·(x₂ − x₁), σ_y = α·(y₂ − y₁), with α = 0.5 found optimal; scales with element dimensions to handle diverse GUI element sizes.
4. Combined reward: R_total = ν·R_point + γ·R_coverage, optimized with GRPO (group advantage estimation, KL regularization).
5. Results: ScreenSpot 92.0% (+4.1), ScreenSpot-v2 93.3% (+3.3), ScreenSpot-Pro 47.5% (+24.7).
6. Cited benefits: dense feedback, smooth gradients, spatial awareness, scale adaptivity, better convergence, human-aligned, robust to noise, generalizable.
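The dual Gaussian reward above can be sketched in a few lines of Python. The formulas follow the figure (σ = α · element extent, with α = 0.5 reported as optimal; Bhattacharyya coefficient for the coverage term), while the combination weights ν and γ are not given numerically here, so defaulting them to 1.0 is an illustrative assumption.

```python
import math

def adaptive_sigma(x1, y1, x2, y2, alpha=0.5):
    # Spread scales with element size; the paper reports alpha = 0.5 as optimal.
    return alpha * (x2 - x1), alpha * (y2 - y1)

def point_reward(pred_center, gt_center, sigma):
    # R_point = exp(-1/2 [(cx_p - cx_gt)^2 / sx^2 + (cy_p - cy_gt)^2 / sy^2])
    (px, py), (gx, gy), (sx, sy) = pred_center, gt_center, sigma
    return math.exp(-0.5 * (((px - gx) / sx) ** 2 + ((py - gy) / sy) ** 2))

def coverage_reward(mu_p, sig_p, mu_gt, sig_gt):
    # Bhattacharyya coefficient between two axis-aligned 2-D Gaussians,
    # computed per axis and multiplied (diagonal covariances factorize).
    bc = 1.0
    for mp, sp, mg, sg in zip(mu_p, sig_p, mu_gt, sig_gt):
        var = 0.5 * (sp ** 2 + sg ** 2)
        dist = 0.125 * (mp - mg) ** 2 / var + 0.5 * math.log(var / (sp * sg))
        bc *= math.exp(-dist)
    return bc

def total_reward(pred_center, pred_sigma, gt_center, gt_sigma, nu=1.0, gamma=1.0):
    # R_total = nu * R_point + gamma * R_coverage (nu = gamma = 1.0 assumed here).
    return (nu * point_reward(pred_center, gt_center, gt_sigma)
            + gamma * coverage_reward(pred_center, pred_sigma, gt_center, gt_sigma))
```

A prediction exactly at the target center with matching spread scores 1.0 on both terms; off-center predictions decay smoothly instead of dropping to zero outside the box, which is the dense-signal property the paper argues for.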
Q1. What key insight about human behavior inspired the GUI-G² reward system?
- Humans tend to click randomly on interface elements
- Human clicks naturally form Gaussian distributions centered on target elements
- Humans prefer clicking on the edges of interface elements

Q2. Why did the authors introduce an adaptive variance mechanism in their approach?
- To make the system more computationally efficient
- To handle elements of different sizes appropriately, from tiny icons to full-screen panels
- To reduce the training time of the model

Q3. What surprising finding did the study reveal about 'thinking' in GUI grounding tasks?
- Thinking improved performance by 5%
- Thinking had no effect on performance
- Explicit reasoning significantly harmed performance while using more tokens

Paper 2

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15852

1. 📘 Topic and Domain: Video Object Segmentation (VOS) with a focus on complex temporal tracking and segmentation of target objects across video frames.
2. 💡 Previous Research and New Ideas: Builds on SAM 2 and memory-based VOS methods, proposing a concept-driven framework that uses a Large Vision-Language Model for semantic understanding rather than relying on appearance matching alone.
3. ❓ Problem: Current VOS techniques struggle with drastic visual variations, occlusions, and complex scene changes due to their reliance on appearance matching rather than conceptual understanding.
4. 🛠️ Methods: Introduces Segment Concept (SeC), which uses LVLMs to integrate visual cues across frames to build conceptual priors, combined with a scene-adaptive activation strategy that balances LVLM-based reasoning with feature matching.
5. 📊 Results and Evaluation: Achieved significant improvements over state-of-the-art approaches, including an 11.8-point improvement over SAM 2.1 on their new SeCVOS benchmark, and consistent superior performance across standard VOS benchmarks.

SeC workflow (figure): Input video frames → HSV-based scene-change detection → if no scene change, pixel-level association via SAM 2 memory; if a scene change is detected, LVLM concept extraction with InternVL 2.5 → concept guidance fused via cross-attention and pointwise addition → SAM 2 mask decoder → final segmentation mask. Additional details recovered from the figure:

1. Keyframe bank maintained as a FIFO buffer.
2. Two-stage training: memory module first, then LVLM fine-tuning.
3. SeCVOS benchmark: 160 videos with multi-scene complexity.
4. Key innovations: concept-driven approach, scene-adaptive activation, LVLM integration, progressive concept construction, robustness to appearance changes.
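SeC gates its expensive LVLM reasoning behind an HSV-based scene-change detector. The paper's exact detector and threshold are not reproduced here, so the sketch below is an assumption: a normalized HSV histogram per frame, compared by histogram intersection, with a hypothetical cut threshold.

```python
import numpy as np

def hsv_histogram(frame_hsv, bins=(8, 4, 4)):
    # Normalized 3-D histogram over the H, S, V channels of an (H, W, 3) uint8 frame.
    hist, _ = np.histogramdd(
        frame_hsv.reshape(-1, 3).astype(float),
        bins=bins, range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()

def is_scene_change(prev_hsv, cur_hsv, threshold=0.5):
    # Flag a scene cut when histogram intersection drops below the threshold.
    # The 0.5 value is an illustrative assumption, not the paper's setting.
    h1, h2 = hsv_histogram(prev_hsv), hsv_histogram(cur_hsv)
    intersection = np.minimum(h1, h2).sum()
    return intersection < threshold
```

Frames within one shot share most of their color mass, so the LVLM path would fire only on the minority of frames where the intersection collapses, which is consistent with the quiz's "less than 10% of frames" figure.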
Q1. What is the main innovation of SeC compared to traditional VOS methods?
- Using more training data
- Incorporating LVLMs for concept-level understanding
- Improving pixel-level feature matching

Q2. How often does SeC activate its LVLM-based concept reasoning during video processing?
- For every single frame
- Less than 10% of frames
- Only at the start of the video

Q3. What distinguishes the SeCVOS benchmark from existing VOS datasets?
- It has longer video sequences
- It has higher resolution videos
- It has more scene transitions and complex semantic changes

Paper 3

GR-3 Technical Report

Published: 2025-07-21

Link: http://arxiv.org/pdf/2507.15493

1. 📘 Topic and Domain: Development of GR-3, a large-scale vision-language-action (VLA) model for robotic manipulation and control.
2. 💡 Previous Research and New Ideas: Builds on prior VLA models and pre-trained vision-language models, proposing co-training with web-scale vision-language data and efficient few-shot fine-tuning from human trajectory data.
3. ❓ Problem: Addressing the challenge of creating generalist robot policies that can handle novel objects, environments, and instructions while performing complex long-horizon tasks reliably.
4. 🛠️ Methods: Combines three-way training on robot trajectory data, web-scale vision-language data, and human trajectory data collected via VR devices, implemented with a flow-matching action head and RMSNorm for training stability.
5. 📊 Results and Evaluation: GR-3 outperformed the π0 baseline across three challenging tasks (pick-and-place, table bussing, cloth manipulation), showing superior instruction following and generalization to novel scenarios.
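GR-3's action head is trained with flow matching. A minimal sketch of that objective (a linear noise-to-action path whose constant velocity the network regresses) is below; the Action DiT is stubbed as a plain function, and all names are hypothetical rather than the report's API.

```python
import numpy as np

def flow_matching_loss(model, actions, cond, noise, t):
    # Conditional flow matching: interpolate between noise and the target
    # action chunk along a linear path, and regress the path's velocity.
    # `actions` is a (batch, chunk_len, dof) action chunk; `t` broadcasts over it.
    x_t = (1.0 - t) * noise + t * actions   # point on the linear probability path
    target_v = actions - noise              # constant velocity of that path
    pred_v = model(x_t, t, cond)            # the Action DiT would sit here
    return float(np.mean((pred_v - target_v) ** 2))
```

At inference time the same model would be integrated from pure noise toward an action chunk, so one trained network yields the k-length chunks the figure describes.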

GR-3 training and evaluation workflow (figure):

1. Data sources: robot trajectory data (teleoperation with task-status supervision), vision-language data (image captioning, VQA, grounding tasks), and human trajectory data collected via VR (PICO 4, ~450 trajectories/hour).
2. Architecture: Qwen2.5-VL-3B VLM backbone taking multi-camera and language input, feeding an Action DiT trained with flow matching (RMSNorm, causal attention); outputs a k-length action chunk for 19-DoF control.
3. Training objectives: flow-matching loss on robot and human data; next-token prediction on vision-language data.
4. Hardware: ByteMini robot with 22-DoF bi-manual arms, sphere wrist joints, mobile base with lift, whole-body compliance control, head and wrist RGBD cameras, and VR teleoperation.
5. Evaluation: generalizable pick-and-place (basic setting, unseen environments/instructions/objects, few-shot from human data), long-horizon table bussing (flat and instruction-following settings, multi-object, multi-destination, novel invalid tasks), and dexterous cloth manipulation (basic position and unseen instances; four milestones: pick hanger → right shoulder → left shoulder → hang).
6. Key results: GR-3 outperforms the π0 baseline across all tasks, with strong generalization, few-shot adaptation, and robust long-horizon performance.
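The RMSNorm that the report adds after linear layers normalizes each vector by its root-mean-square, with no mean subtraction, then applies a learned gain. A minimal NumPy sketch:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm over the last axis: x / RMS(x) * gain, no mean centering.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Because the output's scale is pinned to roughly unit RMS regardless of the input's magnitude, activations after each linear layer stay bounded, which is the training-stability effect the report attributes to this change.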
Q1. What unique combination of data sources does GR-3 use in its training recipe?
- Only robot trajectory data and simulation data
- Robot trajectory data, human VR data, and vision-language web data
- Only vision-language data and synthetic robot data

Q2. What key innovation in GR-3's architecture helped improve training stability and language following capabilities?
- The use of attention masks in the transformer
- The addition of RMSNorm after linear layers
- The implementation of a larger model size

Q3. How many human trajectories per object were needed to achieve successful adaptation to novel objects in the pick-and-place experiments?
- 100 trajectories
- 50 trajectories
- 10 trajectories