---
license: apache-2.0
language: en
library_name: pytorch
pipeline_tag: object-detection
tags:
- rtdetr
- object-detection
- knowledge-distillation
- taco-dataset
- dinov3
- vit
---
# RT-DisDINOv3-ViT: A Distilled RT-DETR-L Model
This model is an **RT-DETR-L** whose backbone and encoder have been pre-trained via knowledge distillation from a **DINOv3 ViT-Base** teacher model. The distillation was performed on feature maps extracted from the [TACO (Trash Annotations in Context)](https://tacodataset.org/) dataset using the `lightly-train` framework.
This pre-trained checkpoint contains the "distilled knowledge" and is intended to be used as a starting point for fine-tuning on downstream object detection tasks.
This work is part of the **RT-DisDINOv3** project. For full details on the training pipeline, baseline comparisons, and analysis, please visit the [main GitHub repository](https://github.com/your-username/your-repo-name). <!--- <<< TODO: Add your GitHub repo link here -->
## How to Use
You can load these distilled weights and apply them to the original RT-DETR-L model's backbone and encoder before fine-tuning.
```python
import torch
from torch.hub import load_state_dict_from_url

# 1. Load the original RT-DETR-L model architecture.
#    Make sure the 'RT-DETR' repository is cloned locally or installed.
rtdetr_l = torch.hub.load('lyuwenyu/RT-DETR', 'rtdetrv2_l', pretrained=True)
model = rtdetr_l.model

# 2. Download the distilled weights from this Hugging Face Hub repository.
MODEL_URL = "https://huggingface.co/hnamt/RT-DisDINOv3-ViT-Base/resolve/main/distilled_rtdetr_vit_teacher_BEST.pth"
distilled_state_dict = load_state_dict_from_url(MODEL_URL, map_location='cpu')['model']

# 3. Load the weights into the model's backbone and encoder.
#    `strict=False` tolerates missing and unexpected keys, so only the
#    matching entries (backbone + encoder) are actually loaded; the
#    returned named tuple tells you which keys were skipped.
missing, unexpected = model.load_state_dict(distilled_state_dict, strict=False)
print(f"Loaded distilled ViT-teacher weights "
      f"({len(missing)} missing / {len(unexpected)} unexpected keys skipped).")

# 'model' is now ready for fine-tuning on your own dataset. For example:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# model.train()
# ... your fine-tuning loop ...
```
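If you prefer an explicit load over relying on `strict=False`, you can filter the checkpoint down to just the backbone and encoder entries first. A minimal sketch (the `backbone.` and `encoder.` key prefixes are assumptions for illustration; inspect `distilled_state_dict.keys()` to confirm the actual naming in the checkpoint):

```python
def filter_to_backbone_encoder(state_dict, prefixes=("backbone.", "encoder.")):
    """Keep only entries whose keys start with one of the given prefixes."""
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}

# Usage with the downloaded checkpoint:
# filtered = filter_to_backbone_encoder(distilled_state_dict)
# model.load_state_dict(filtered, strict=False)
```

This makes it obvious at a glance which parts of the network receive the distilled weights, and any typo in a key prefix shows up as an empty result rather than a silent partial load.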
## Training Details
- **Student Model**: RT-DETR-L (`rtdetrv2_l` from [lyuwenyu/RT-DETR](https://github.com/lyuwenyu/RT-DETR)).
- **Teacher Model**: DINOv3 ViT-Base (`dinov3/vitb16` via Lightly).
- **Dataset for Distillation**: TACO dataset images.
- **Distillation Procedure**: The student model's backbone and encoder were trained to minimize the Mean Squared Error (MSE) between their output feature maps and those of the teacher model, orchestrated by the `lightly-train` library.
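Conceptually, the objective above is a feature-map MSE between the student and a frozen teacher. The sketch below illustrates the idea only; it is not the `lightly-train` implementation, and the toy convolutional networks, the 1x1 projector used to align channel dimensions, and all shapes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real networks (placeholders, not RT-DETR / DINOv3).
student_backbone = nn.Conv2d(3, 256, kernel_size=3, padding=1)  # student feature extractor
teacher_backbone = nn.Conv2d(3, 768, kernel_size=3, padding=1)  # frozen teacher
projector = nn.Conv2d(256, 768, kernel_size=1)                  # aligns channel dims

for p in teacher_backbone.parameters():
    p.requires_grad = False  # the teacher is frozen during distillation

images = torch.randn(2, 3, 64, 64)
student_feats = projector(student_backbone(images))
with torch.no_grad():
    teacher_feats = teacher_backbone(images)

# The distillation objective: MSE between student and teacher feature maps.
loss = F.mse_loss(student_feats, teacher_feats)
loss.backward()  # gradients flow only into the student and projector
```

Only the student (and projector) parameters receive gradients, so minimizing this loss pulls the student's feature maps toward the teacher's representation.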
## Evaluation Results
After the distillation pre-training, the model was fine-tuned on the TACO dataset. In our experiments, this particular teacher did not yield an improvement over the baseline.

| Model | mAP@50-95 | mAP@50 | Speed (ms) | Notes |
| ----------------------------- | :-------: | :----: | :--------: | ----------------------------------- |
| RT-DETR-L (Baseline) | 2.80% | 4.60% | 50.05 | Fine-tuned from COCO pre-trained. |
| **RT-DisDINOv3 (w/ ViT)** | 2.80% | 4.20% | 49.80 | No performance improvement observed.|
## License
The weights in this repository are released under the Apache 2.0 License. Note that the upstream models used for training (RT-DETR, DINOv3) are subject to their own licenses.