CLIP ViT-B/16 Model

This is a CLIP (Contrastive Language-Image Pre-training) model with a ViT-B/16 image encoder, trained on the DataComp-12M dataset using OpenCLIP.

Training Details

  • Architecture: ViT-B/16
  • Dataset: DataComp-12M
  • Batch size per GPU: 512
  • Number of GPUs: 4
  • Total epochs: 20
  • Precision: amp
  • Total samples seen across all epochs: 203,020,740 (see the arithmetic sketch below)
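
The effective global batch size and per-epoch sample budget follow directly from these settings. Below is a quick sketch of the arithmetic; all values are taken from the list above, and the variable names are only illustrative:

# Derived training quantities (values from the Training Details list above).
per_gpu_batch = 512
num_gpus = 4
epochs = 20
total_samples_seen = 203_020_740

global_batch = per_gpu_batch * num_gpus              # 2,048 samples per optimizer step
samples_per_epoch = total_samples_seen // epochs     # 10,151,037 (the --train-num-samples value below)
steps_per_epoch = samples_per_epoch // global_batch  # roughly 4,956 optimizer steps per epoch
print(global_batch, samples_per_epoch, steps_per_epoch)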

Training Command

torchrun --nproc_per_node 4 -m open_clip_train.main \
--save-frequency 1 \
--train-data '/pasteur2/u/yuhuiz/yiming/datacomp_12m/processed_dataset/{00000000..00001023}.tar' \
--train-num-samples $((203020740 / 20)) \
--local-loss \
--gather-with-grad \
--warmup 1000 \
--dataset-type webdataset \
--batch-size 512 \
--epochs 20 \
--model ViT-B-16 \
--precision amp \
--seed 0 \
--workers 4
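
For reference, here is a minimal sketch of loading the resulting checkpoint and running zero-shot classification with OpenCLIP. The checkpoint path, image file, and candidate labels are placeholders, not part of this card:

import torch
from PIL import Image
import open_clip

# Load the trained weights; replace the path with your local checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='/path/to/checkpoints/epoch_20.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()

# Example inputs; replace with your own image and label prompts.
image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled to logits, then softmax over the candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)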