---
library_name: transformers
license: mit
tags:
- vision
- image-segmentation
- pytorch
---
# EoMT

[![PyTorch](https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white)](https://pytorch.org/)

**EoMT (Encoder-only Mask Transformer)** is a Vision Transformer (ViT) architecture designed for high-quality and efficient image segmentation. It was introduced in the CVPR 2025 highlight paper:  
**[Your ViT is Secretly an Image Segmentation Model](https://www.tue-mps.org/eomt)**  
by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.

> **Key Insight**: Given sufficient scale and pretraining, a plain ViT with only a few additional parameters can perform segmentation without task-specific decoders or pixel fusion modules. The same model backbone supports semantic, instance, and panoptic segmentation; only the post-processing differs 🤗

The original implementation can be found in this [repository](https://github.com/tue-mps/eomt).

The Hugging Face paper page is available at this [link](https://huggingface.co/papers/2503.19108).
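
As a rough illustration of what "different post-processing" can mean, the sketch below turns per-query predictions into a semantic label map using the inference rule common to mask-classification models. It is not taken from the EoMT code base: the function name `semantic_map`, the input shapes, and the convention that the last class column means "no object" are all assumptions made for this example.

```python
import numpy as np


def semantic_map(class_logits, mask_logits):
    """Combine per-query predictions into a per-pixel semantic label map.

    Hypothetical helper for illustration only.
    class_logits: (Q, C + 1) array; the last column is ASSUMED to be "no object".
    mask_logits:  (Q, H, W) array of per-query mask logits.
    """
    # Numerically stable softmax over classes, then drop the "no object" column.
    shifted = class_logits - class_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    class_probs = probs[:, :-1]                      # (Q, C)

    # Sigmoid turns mask logits into per-pixel membership probabilities.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # (Q, H, W)

    # scores[c, h, w] = sum over q of class_probs[q, c] * mask_probs[q, h, w]
    scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)

    # Each pixel takes the class with the highest accumulated score.
    return scores.argmax(axis=0)                     # (H, W)
```

Instance and panoptic outputs would instead keep queries separate (for example, by thresholding each query's mask and score), so the backbone stays the same and only this final step changes.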

---

## Citation
If you find our work useful, please consider citing us as:
```bibtex
@inproceedings{kerssies2025eomt,
  author    = {Kerssies, Tommie and Cavagnero, Niccolò and Hermans, Alexander and Norouzi, Narges and Averta, Giuseppe and Leibe, Bastian and Dubbelman, Gijs and de Geus, Daan},
  title     = {Your ViT is Secretly an Image Segmentation Model},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}
```