---
library_name: coreml
tags:
- vision
- feature-extraction
- dinov3
- coreml
- apple-silicon
- fp16
pipeline_tag: feature-extraction
---
# DINOv3 ViT-B/16 CoreML FP16
A CoreML conversion of [facebook/dinov3-vitb16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m), optimized for Apple Silicon.
## Model Details
- **Base Model**: facebook/dinov3-vitb16-pretrain-lvd1689m
- **Framework**: CoreML
- **Precision**: FP16
- **Input Size**: 560×560
- **Model Size**: 163.9 MB
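The conversion script is not part of this repository; the sketch below shows how a package like this is typically produced with coremltools. The feature-extraction wrapper, the token layout, and the preprocessing constants are assumptions, not the exact pipeline used for this checkpoint.
```python
import torch
import coremltools as ct
from transformers import AutoModel

# torchscript=True makes the model return plain tuples, which torch.jit.trace needs
backbone = AutoModel.from_pretrained(
    "facebook/dinov3-vitb16-pretrain-lvd1689m", torchscript=True
)
backbone.eval()

class FeatureWrapper(torch.nn.Module):
    """Reshape patch tokens into a [1, embed_dim, 35, 35] feature map."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x):
        tokens = self.backbone(x)[0]        # last_hidden_state: [1, tokens, 768]
        patches = tokens[:, -35 * 35:, :]   # drop CLS/register tokens (assumed layout)
        return patches.permute(0, 2, 1).reshape(1, -1, 35, 35)

example = torch.rand(1, 3, 560, 560)
traced = torch.jit.trace(FeatureWrapper(backbone), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=example.shape, scale=1 / 255.0)],
    outputs=[ct.TensorType(name="features")],
    compute_precision=ct.precision.FLOAT16,   # FP16 weights and compute
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("dinov3_vitb16_560x560_fp16.mlpackage")
```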
## Usage
### Python (CoreML)
```python
import coremltools as ct
from PIL import Image

# Load the Core ML package
model = ct.models.MLModel("dinov3_vitb16_560x560_fp16.mlpackage")

# Prepare the input image (RGB, 560x560)
image = Image.open("image.jpg").convert("RGB").resize((560, 560))

# Extract features
output = model.predict({"image": image})
# Shape: [1, embed_dim, grid_size, grid_size]
# For ViT-B/16 at 560x560 this is [1, 768, 35, 35] (35 = 560 / 16)
features = output["features"]
```
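For image-level tasks such as retrieval, a common next step (a suggestion here, not something this card prescribes) is to pool the patch grid into a single L2-normalized embedding:
```python
import numpy as np

# "output" is the prediction dictionary from the example above
feat = np.asarray(output["features"])           # [1, embed_dim, grid, grid]
embedding = feat.mean(axis=(2, 3)).squeeze(0)   # global average pool -> [embed_dim]
embedding /= np.linalg.norm(embedding) + 1e-12  # L2-normalize for cosine search
```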
### Swift/iOS
```swift
import CoreML
import UIKit

// Load model
guard let model = try? MLModel(contentsOf: modelURL) else {
    fatalError("Failed to load model")
}

// Prepare image
guard let image = UIImage(named: "image.jpg"),
      let cgImage = image.cgImage else {
    fatalError("Failed to load image")
}

// Wrap the image in a feature provider; the input name "image" and its
// size constraint come from the model description
let constraint = model.modelDescription
    .inputDescriptionsByName["image"]!.imageConstraint!
let imageValue = try MLFeatureValue(cgImage: cgImage, constraint: constraint)
let input = try MLDictionaryFeatureProvider(dictionary: ["image": imageValue])

// Extract features
let output = try model.prediction(from: input)
let features = output.featureValue(for: "features")?.multiArrayValue
```
## Performance
Performance metrics on Apple Silicon:
### CoreML Performance
- **Throughput**: 12.19 FPS
- **Latency**: 82.04 ± 0.91 ms
- **Min Latency**: 80.72 ms
- **Max Latency**: 85.60 ms
### Speedup vs PyTorch
- **PyTorch**: 3.91 FPS
- **CoreML**: 12.19 FPS
- **Speedup**: 3.12x faster ⚡
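The benchmark script is not published with the model; a minimal timing loop of the kind that produces numbers like these might look as follows (warm-up and iteration counts are assumptions):
```python
import time
import numpy as np
import coremltools as ct
from PIL import Image

model = ct.models.MLModel("dinov3_vitb16_560x560_fp16.mlpackage")
image = Image.open("image.jpg").convert("RGB").resize((560, 560))

# Warm-up so one-time compilation and caching do not skew the numbers
for _ in range(5):
    model.predict({"image": image})

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    model.predict({"image": image})
    latencies_ms.append((time.perf_counter() - start) * 1e3)

lat = np.asarray(latencies_ms)
print(f"Latency: {lat.mean():.2f} ± {lat.std():.2f} ms")
print(f"Throughput: {1e3 / lat.mean():.2f} FPS")
```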
### Feature Accuracy
- **Cosine Similarity**: 0.4406 (vs PyTorch)
- **Correlation**: 0.4406
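The card does not say how these metrics were computed; one plausible reading (an assumption) is that both are taken over the flattened CoreML and PyTorch feature maps:
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened feature maps."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two flattened feature maps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
```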
## License
This model is released under the DINOv3 License. See [LICENSE.md](LICENSE.md) for details.
## Citation
```bibtex
@article{dinov3,
  title={DINOv3: A Versatile Vision Foundation Model},
  author={Meta AI Research},
  journal={arXiv preprint arXiv:2508.10104},
  year={2025}
}
```
**Reference**: [DINOv3 Paper](https://arxiv.org/pdf/2508.10104)
Key contributions:
- **Gram anchoring** strategy for high-quality dense feature maps (see the sketch after this list)
- Self-supervised learning on 1.689B images
- Superior performance on dense vision tasks
- Versatile across tasks and domains without fine-tuning
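Gram anchoring regularizes the Gram matrix of patch features during training. As an illustration of the object involved (not the training loss itself), such a Gram matrix can be computed from this model's output like this:
```python
import numpy as np

def gram_matrix(feat: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between all patch features.

    feat: [1, embed_dim, grid, grid] feature map from the model above.
    """
    c = feat.shape[1]
    patches = feat.reshape(c, -1).T                               # [grid*grid, embed_dim]
    patches /= np.linalg.norm(patches, axis=1, keepdims=True) + 1e-12
    return patches @ patches.T                                    # [grid*grid, grid*grid]
```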
## Demo Images
### Input Image
<div align="center">
<img src="demo_image.png" alt="Demo Input Image" width="500"/>
</div>
*Sample input image for feature extraction demonstration*
### Feature Visualization
<div align="center">
<img src="dinov3_feature_comparison.png" alt="Feature Comparison Visualization" width="800"/>
</div>
The visualization shows:
- **PCA projection** of high-dimensional features (RGB visualization; a sketch follows below)
- **Feature channel activations** showing spatial patterns
- **Gram matrix analysis** for object similarity detection
- **Side-by-side comparison** with PyTorch reference implementation
This visualization demonstrates that the CoreML conversion preserves the semantic structure and feature quality of the original DINOv3 model.
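The PCA view in such figures is typically produced by projecting each patch feature onto the top three principal components and mapping them to RGB. A minimal sketch (an assumption, not the exact script that generated the figure):
```python
import numpy as np

def pca_rgb(feat: np.ndarray) -> np.ndarray:
    """Project a [1, C, H, W] feature map to an RGB image via PCA."""
    _, c, h, w = feat.shape
    x = feat.reshape(c, h * w).T                 # [H*W, C]
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    rgb = x @ vt[:3].T                           # top-3 principal components
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-12)
    return rgb.reshape(h, w, 3)                  # values in [0, 1]
```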
## Powered By DINOv3
🌟 **This model is powered by DINOv3** 🌟
Converted by [Aegis AI](https://github.com/Aegis-AI/Aegis-AI) for optimized Apple Silicon deployment.
## Related Models
- Original PyTorch Model: [facebook/dinov3-vitb16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitb16-pretrain-lvd1689m)
- DINOv3 License: https://ai.meta.com/resources/models-and-libraries/dinov3-license/
---
*Last updated: 2025-11-03*