Abstract
GroundCUA, a large-scale desktop grounding dataset, enables the development of GroundNext models that achieve state-of-the-art performance in mapping instructions to UI elements with less than one-tenth the training data of prior work.
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
Community
TL;DR: The largest human-annotated desktop-centric GUI grounding dataset.
paper: https://arxiv.org/abs/2511.07332
github: https://github.com/ServiceNow/GroundCUA/
website: https://groundcua.github.io/
model: https://huggingface.co/ServiceNow/GroundNext-7B-V0
dataset: https://huggingface.co/datasets/ServiceNow/GroundCUA
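Both the dataset and the models are on the Hugging Face Hub, so a quick first look can use the standard `datasets` and `transformers` loaders. The sketch below is a minimal example, not an official recipe: the dataset's column names, the train split, and the exact model class (`AutoModelForImageTextToText`) are assumptions; check the dataset card and model card for the real schema and recommended usage.

```python
# Minimal sketch: inspecting GroundCUA and loading a GroundNext checkpoint.
# Field names and the concrete model class are assumptions -- verify against
# the dataset/model cards on the Hub.
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForImageTextToText

# Stream the dataset to avoid downloading all 56K screenshots at once.
ds = load_dataset("ServiceNow/GroundCUA", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())  # inspect the actual fields (screenshot, annotations, ...)

# Load the 7B grounding model; the architecture-specific class may differ.
model_id = "ServiceNow/GroundNext-7B-V0"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
```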
GroundCUA introduces the first large-scale, human-demonstrated desktop grounding dataset (56K screenshots, 3.56M UI elements, 700K instruction pairs) for training computer-use agents. Built on it, the GroundNext models (3B & 7B) achieve state-of-the-art grounding performance across desktop, web, and mobile benchmarks, all trained on real human interaction data. Together, they establish a new foundation for multimodal grounding and practical AI-agent research (a sketch of the usual grounding metric follows below).
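GUI grounding benchmarks of this kind are typically scored by checking whether the model's predicted click point falls inside the target element's bounding box. Here is a minimal sketch of that convention; the `(left, top, right, bottom)` pixel box format and the helper names are illustrative assumptions, not the paper's exact evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (left, top, right, bottom) in pixels

def point_in_box(x: float, y: float, box: Box) -> bool:
    """Return True if the predicted click point lies inside the target element."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(preds: Sequence[Tuple[float, float]],
                       targets: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their ground-truth boxes."""
    hits = sum(point_in_box(x, y, box) for (x, y), box in zip(preds, targets))
    return hits / max(len(targets), 1)

# Example: one of two predictions lands inside its target box -> 0.5
print(grounding_accuracy([(120.0, 45.0), (10.0, 10.0)],
                         [(100, 30, 180, 60), (300, 300, 340, 330)]))
```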
Related papers recommended by the Librarian Bot (via the Semantic Scholar API):
- Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents (2025)
- GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding (2025)
- GUI-360°: A Comprehensive Dataset and Benchmark for Computer-Using Agents (2025)
- HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration (2025)
- UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning (2025)
- UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action (2025)
- Watch and Learn: Learning to Use Computers from Online Videos (2025)
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/grounding-computer-use-agents-on-human-demonstrations