GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Abstract
GeoVista, an agentic model integrating tool invocation and reinforcement learning, achieves high geolocalization performance on GeoBench, outperforming open-source models and matching closed-source models.
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks lack both the high-resolution imagery and the localization difficulty needed to probe deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it: a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward that leverages multi-level geographical information to improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
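The hierarchical reward idea can be illustrated with a minimal sketch: credit is granted level by level, from coarse to fine, so a prediction earns partial reward for getting the country right even when the city is wrong. The level names, weights, and gating rule below are illustrative assumptions, not the paper's actual reward function.

```python
# Hypothetical sketch of a multi-level geolocalization reward.
# Levels, weights, and the coarse-to-fine gating are assumptions for
# illustration; the paper's exact reward design may differ.

def hierarchical_reward(pred: dict, truth: dict) -> float:
    """Return a reward in [0, 1] that grows as finer levels match."""
    levels = [("country", 0.3), ("province", 0.3), ("city", 0.4)]
    reward = 0.0
    for level, weight in levels:
        # Credit is gated on coarser levels also matching, so a
        # "correct" city name in the wrong country earns nothing.
        if pred.get(level) and pred.get(level) == truth.get(level):
            reward += weight
        else:
            break
    return reward

# Country and province match, city does not -> partial credit of 0.6.
score = hierarchical_reward(
    {"country": "France", "province": "Île-de-France", "city": "Versailles"},
    {"country": "France", "province": "Île-de-France", "city": "Paris"},
)
```

Such a graded signal gives the RL stage a denser learning signal than an exact-match reward, since near-miss predictions are still distinguished from wildly wrong ones.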
Community
Introducing GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization 🌍
🤔 Ever wondered how a reasoning model can guess where a photo was taken? And how well today’s reasoning models actually perform on this task?
We present GeoVista, the first open-source reasoning model that seamlessly integrates tool use (e.g., web search) into its reasoning loop, achieving strong performance on real-world geolocalization.
We also release GeoVista-Bench (GeoBench), the first benchmark designed to evaluate agentic models’ general geolocalization ability.
Project page: https://ekonwang.github.io/geo-vista/