UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually, by adding more prompts or selecting from pre-generated masks, to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only 6K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over 11 benchmarks, UnSAMv2 improves NoC90 (5.69 → 4.75), 1-IoU (58.0 → 73.1), and AR1000 (49.6 → 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
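The abstract describes a lightweight granularity control embedding added to SAM-2 with very few extra parameters. As a rough illustration only (not the authors' implementation), the sketch below assumes a PyTorch setting in which a continuous granularity scalar is mapped by a small MLP to one extra prompt token appended to the sparse prompt embeddings before the mask decoder; the module name, dimensions, and integration point are hypothetical.

```python
import torch
import torch.nn as nn

class GranularityEmbedding(nn.Module):
    """Hypothetical sketch: map a continuous granularity value in [0, 1]
    to a single prompt-token embedding for a SAM-style mask decoder."""

    def __init__(self, embed_dim: int = 256, hidden_dim: int = 64):
        super().__init__()
        # A tiny MLP keeps the added parameter count negligible
        # relative to the frozen SAM-2 backbone.
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, granularity: torch.Tensor) -> torch.Tensor:
        # granularity: shape (B,), where 0.0 ~ coarsest and 1.0 ~ finest.
        # Returns one token per sample: shape (B, 1, embed_dim).
        return self.mlp(granularity.unsqueeze(-1)).unsqueeze(1)


# Usage sketch: append the granularity token to the existing sparse
# prompt tokens (points/boxes) before the mask decoder consumes them.
embed = GranularityEmbedding(embed_dim=256)
sparse_prompts = torch.randn(2, 3, 256)           # placeholder point/box tokens
gran_token = embed(torch.tensor([0.2, 0.9]))      # coarse vs. fine request
prompt_tokens = torch.cat([sparse_prompts, gran_token], dim=1)  # (2, 4, 256)
```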
Community
The following similar papers were recommended by the Semantic Scholar API:
- PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data (2025)
- Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking (2025)
- NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation (2025)
- Prompt-DAS: Annotation-Efficient Prompt Learning for Domain Adaptive Semantic Segmentation of Electron Microscopy Images (2025)
- CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D (2025)
- DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation (2025)
- Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation (2025)