CLIP ViT-B/16 Model

This is a CLIP (Contrastive Language-Image Pre-training) model with a ViT-B/16 image encoder, trained on the DataComp-12M dataset using OpenCLIP.

Training Details

  • Architecture: ViT-B/16
  • Dataset: DataComp-12M
  • Batch size per GPU: 512
  • Number of GPUs: 4
  • Total epochs: 20
  • Precision: amp
  • Total samples seen across all epochs: 203,020,740 (see the arithmetic sketch below)
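
The effective global batch size and per-epoch sample budget follow directly from these settings. Below is a quick sketch of the arithmetic; all values are taken from the list above, and the variable names are only illustrative:

# Derived training quantities (values from the Training Details list above).
per_gpu_batch = 512
num_gpus = 4
epochs = 20
total_samples_seen = 203_020_740

global_batch = per_gpu_batch * num_gpus              # 2,048 samples per optimizer step
samples_per_epoch = total_samples_seen // epochs     # 10,151,037 (the --train-num-samples value below)
steps_per_epoch = samples_per_epoch // global_batch  # roughly 4,956 optimizer steps per epoch
print(global_batch, samples_per_epoch, steps_per_epoch)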

Training Command

torchrun --nproc_per_node 4 -m open_clip_train.main \
--save-frequency 1 \
--train-data '/pasteur2/u/yuhuiz/yiming/datacomp_12m/processed_dataset/{00000000..00001023}.tar' \
--train-num-samples $((203020740 / 20)) \
--local-loss \
--gather-with-grad \
--warmup 1000 \
--dataset-type webdataset \
--batch-size 512 \
--epochs 20 \
--model ViT-B-16 \
--precision amp \
--seed 0 \
--workers 4
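
For reference, here is a minimal sketch of loading the resulting checkpoint and running zero-shot classification with OpenCLIP. The checkpoint path, image file, and candidate labels are placeholders, not part of this card:

import torch
from PIL import Image
import open_clip

# Load the trained weights; replace the path with your local checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='/path/to/checkpoints/epoch_20.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()

# Example inputs; replace with your own image and label prompts.
image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled to logits, then softmax over the candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)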