Accurate localization of GUI elements is essential for building effective GUI agents. Traditional methods typically regress a bounding box or a center point, but they overlook the spatial uncertainty of interactions and the visual-semantic hierarchy of the interface. Recent attention-based methods still suffer from two main problems. First, they ignore background regions, which lets attention drift away from the target. Second, they label every pixel of a UI element uniformly, so the model cannot distinguish the element's center from its edges and clicks imprecisely. To address both problems, the Valley-to-Peak (V2P) method introduces a suppression attention mechanism that reduces focus on irrelevant background areas and thereby highlights the intended GUI element.
In addition, V2P models GUI interactions with 2D Gaussian heatmaps inspired by Fitts’ Law: the weight is highest at the center of the element and decays toward its edges at a rate determined by the target’s size. This helps the model concentrate on the most critical point of the UI element and improves click precision. The method was evaluated on two benchmarks, ScreenSpot-v2 and ScreenSpot-Pro, where it scored 92.3% and 50.5%, respectively. Ablation studies confirm that each component contributes to the result, supporting V2P’s generalizability and robustness for precise GUI grounding.
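The size-dependent Gaussian labeling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `gaussian_heatmap`, the `(x0, y0, x1, y1)` box format, and the `sigma_scale` factor that ties the spread to the element's width and height are all assumptions made for the example.

```python
import numpy as np

def gaussian_heatmap(height, width, box, sigma_scale=0.25):
    """Build a normalized 2D Gaussian label for one UI element.

    box is (x0, y0, x1, y1) in pixels. sigma_scale is a hypothetical
    knob: the standard deviation along each axis is that fraction of
    the element's extent, so larger targets get a wider peak.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0

    # Spread grows with element size; the floor avoids division by zero.
    sigma_x = max((x1 - x0) * sigma_scale, 1e-6)
    sigma_y = max((y1 - y0) * sigma_scale, 1e-6)

    # Pixel coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.mgrid[0:height, 0:width]

    # Weight is 1 at the center and decays toward the edges and background.
    heat = np.exp(-0.5 * (((xs - cx) / sigma_x) ** 2
                          + ((ys - cy) / sigma_y) ** 2))

    # Normalize so the label sums to 1, like an attention distribution.
    return heat / heat.sum()

# Example: a 20x10 px button on a 60x40 px screen.
label = gaussian_heatmap(40, 60, box=(10, 10, 30, 20))
peak_row, peak_col = np.unravel_index(label.argmax(), label.shape)
```

Under these assumptions, the peak of `label` falls on the element's center, pixels near the element's edge receive less weight, and distant background pixels receive almost none. Contrast this with a uniform label, which would assign the same weight to every pixel inside the box.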
More accurate GUI element localization matters directly for user interface automation and interaction. By suppressing background distractions and refining spatial attention, V2P can improve the reliability of GUI agents in real-world applications. Future work may explore integrating V2P with other modalities or extending it to more complex GUI environments. Overall, V2P addresses key limitations of prior methods and offers a promising direction for robust GUI grounding.
👉 Read the original: arXiv AI Papers