BitGen Stage 1: Vision-Language Pre-training
Model Description
BitGen is a tiny, efficient vision-language model designed for edge devices and resource-constrained environments. This is the Stage 1 checkpoint, which covers vision-language pre-training on MS-COCO captions.
Architecture
BitGen combines three powerful components:
- BitMar Encoder-Decoder: 1.58-bit quantized transformer (BitNet b1.58) for extreme efficiency
- FIBER Cross-Modal Fusion: Queue-based contrastive learning for vision-language alignment
- Larimar GPM: Generative Parametric Memory for episodic memory and reasoning
Model Size (Tiny Configuration)
- Embedding Dimension: 128
- Encoder Layers: 3
- Decoder Layers: 2
- Attention Heads: 4
- FFN Dimension: 256
- Vocabulary Size: 50257 (GPT-2 tokenizer)
- Memory Slots: 32
- Max Sequence Length: 256
- Total Parameters: ~5-10M (tiny enough for edge devices!)
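For reference, the configuration above maps onto a config object roughly like the sketch below. Only the values are taken from the list; the class and field names are illustrative, not BitGen's actual config class.

```python
from dataclasses import dataclass

@dataclass
class BitGenTinyConfig:
    # Values taken from the tiny configuration listed above;
    # the class and field names are illustrative only.
    embed_dim: int = 128
    encoder_layers: int = 3
    decoder_layers: int = 2
    num_heads: int = 4
    ffn_dim: int = 256
    vocab_size: int = 50257   # GPT-2 BPE tokenizer
    memory_slots: int = 32
    max_seq_len: int = 256
```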
Training Data
- Dataset: MS-COCO Captions (validated subset)
- Image-Caption Pairs: ~118k training samples
- Tokenizer: GPT-2 BPE tokenizer
Training Objectives
Image-Text Contrastive (ITC) Loss [Weight: 1.0 - PRIMARY]
- FIBER-style queue-based contrastive learning
- Aligns vision and language representations
- Hard negative mining from queue
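A minimal sketch of how queue-based ITC can be computed; this is not BitGen's actual implementation. The temperature (0.5) and queue size (4096) come from the hyperparameters listed below, the function and argument names are illustrative, only the text-to-image direction is shown, and queue entries simply serve as extra negatives here (the explicit hard-negative mining step is omitted).

```python
import torch
import torch.nn.functional as F

def itc_loss(text_feat, image_feat, image_queue, temperature=0.5):
    """Queue-based image-text contrastive loss (illustrative sketch).

    text_feat:   (B, D) text embeddings for the current batch
    image_feat:  (B, D) image embeddings for the current batch
    image_queue: (Q, D) image embeddings from previous batches (Q = 4096 here)
    """
    text_feat = F.normalize(text_feat, dim=-1)
    image_feat = F.normalize(image_feat, dim=-1)
    image_queue = F.normalize(image_queue, dim=-1)

    # Candidates = in-batch images followed by queued images
    candidates = torch.cat([image_feat, image_queue], dim=0)   # (B+Q, D)
    logits = text_feat @ candidates.T / temperature            # (B, B+Q)

    # The positive for text i is image i; everything else is a negative
    targets = torch.arange(text_feat.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```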
Image-Text Matching (ITM) Loss [Weight: 0.5]
- Binary classification with hard negatives
- Learns fine-grained image-caption associations
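A rough sketch of ITM with in-batch hard negatives, again illustrative rather than the model's actual code; `fuse_fn` and `itm_head` are placeholders for the cross-modal fusion module and the binary matching head.

```python
import torch
import torch.nn.functional as F

def itm_loss(fuse_fn, itm_head, text_feat, image_feat, sim_matrix):
    """Image-text matching with in-batch hard negatives (illustrative sketch).

    fuse_fn:    placeholder for cross-modal fusion over a (text, image) pair
    itm_head:   placeholder binary classifier over the fused embedding
    sim_matrix: (B, B) text-to-image similarities used to mine hard negatives
    """
    B = text_feat.size(0)

    # For each text, pick the most similar *wrong* image as a hard negative
    sim = sim_matrix.clone()
    sim.fill_diagonal_(float('-inf'))
    hard_neg_idx = sim.argmax(dim=1)                               # (B,)

    pos_pairs = fuse_fn(text_feat, image_feat)                     # matched pairs
    neg_pairs = fuse_fn(text_feat, image_feat[hard_neg_idx])       # mismatched pairs

    logits = itm_head(torch.cat([pos_pairs, neg_pairs], dim=0))    # (2B, 2)
    labels = torch.cat([torch.ones(B), torch.zeros(B)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```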
Text Reconstruction Loss [Weight: 0.0 - AUXILIARY]
- Decoder reconstructs captions from fused features
- Maintains language understanding
- Label smoothing (0.1) to prevent mode collapse
Memory KL Divergence [Weight: 0.1]
- Larimar GPM episodic memory regularization
- Bayesian inference over memory parameters
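Putting the four objectives together, the Stage 1 loss is a weighted sum. A minimal sketch of the composition using the weights listed above (the function and dictionary names are illustrative):

```python
# Loss weights from the objectives above (illustrative composition)
LOSS_WEIGHTS = {"itc": 1.0, "itm": 0.5, "text_recon": 0.0, "memory_kl": 0.1}

def total_loss(losses):
    """Combine the individual objectives with the Stage 1 weights.

    `losses` maps objective names to scalar tensors. With these weights,
    ITC dominates, ITM contributes half as much, text reconstruction is
    tracked but not optimized (weight 0.0), and the memory KL term acts
    as a light regularizer.
    """
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())
```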
Key Features
- Tiny Model: Suitable for edge devices (Raspberry Pi, mobile phones)
- 1.58-bit Quantization: Extreme efficiency via BitNet b1.58
- Vision-Language Alignment: FIBER-style contrastive learning
- Episodic Memory: Larimar GPM for memory-augmented reasoning
- Hard Negative Mining: ITM loss for robust alignment
- DINOv2 Vision Encoder: State-of-the-art vision features (trainable)
Usage
Note: This repository contains only the latest checkpoint. Each training epoch overwrites the previous model file (pytorch_model.bin) to save storage; Git history preserves all earlier versions.
Loading the Model
```python
from transformers import AutoModel
import torch

# Load the model from the HuggingFace Hub (always the latest checkpoint)
model = AutoModel.from_pretrained("babylm-ntust/BitGen-PreReasoning-stage1")
model.eval()
```
Inference Example
```python
from transformers import GPT2Tokenizer
from PIL import Image
import torch
import torchvision.transforms as transforms

# Setup
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Load image and caption
image = Image.open("path/to/image.jpg").convert('RGB')
caption = "A cat sitting on a couch"

# Prepare inputs
image_tensor = transform(image).unsqueeze(0)
tokens = tokenizer(caption, return_tensors='pt', padding=True, truncation=True, max_length=256)
input_ids = tokens['input_ids']

# Forward pass
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        return_contrastive_features=True,
    )

# Get image-text similarity from the contrastive features
text_feat = outputs['contrastive_features']['text_features']
image_feat = outputs['contrastive_features']['image_features']
similarity = (text_feat @ image_feat.T).item()
print(f"Similarity: {similarity:.4f}")
```
Training Details
Hyperparameters
- Batch Size: 128 (effective: 256)
- Learning Rate: 0.0002
- Optimizer: AdamW (weight_decay=0.02)
- Gradient Accumulation: 2 steps
- Max Gradient Norm: 1.0
- Mixed Precision: AMP
- Temperature: 0.5
- Queue Size: 4096
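A sketch of how these hyperparameters fit into a single training step; `dataloader`, `compute_losses`, and `total_loss` are placeholders, not BitGen's actual training code.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.02)
scaler = torch.cuda.amp.GradScaler()   # mixed precision (AMP)
accum_steps = 2                        # 128 per step -> 256 effective batch size
max_grad_norm = 1.0

for step, batch in enumerate(dataloader):          # `dataloader` is a placeholder
    with torch.cuda.amp.autocast():
        losses = compute_losses(model, batch)      # placeholder for the objectives above
        loss = total_loss(losses) / accum_steps

    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```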
Training Schedule
- Warmup Steps: 1000
- Scheduler: Cosine decay to a minimum LR of 0.1 × the initial LR
- Early Stopping: Patience = 5 epochs
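This schedule corresponds to linear warmup followed by cosine decay down to 10% of the initial learning rate. A minimal sketch of the multiplier; the total step count is chosen arbitrarily for illustration.

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(step, warmup_steps=1000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup, then cosine decay to 10% of the initial LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_ratio + (1.0 - min_ratio) * cosine

# `optimizer` is from the training-step sketch above; call scheduler.step() once per step
scheduler = LambdaLR(optimizer, lr_lambda)
```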
Limitations and Biases
Limitations
- Tiny Model: Designed for efficiency, not SOTA performance
- English Only: Trained on English captions
- Stage 1 Only: Pre-training phase; reasoning module in Stage 2
- Limited Context: Max sequence length of 256 tokens
- COCO-Centric: Training data from MS-COCO
Biases
- Dataset bias from MS-COCO (Western-centric, object-focused)
- Vision bias from DINOv2 training data
- Language bias from GPT-2 tokenizer
Citation
```bibtex
@software{bitgen2025,
  title={BitGen: Tiny Vision-Language Model for Edge Devices},
  author={BitGen Team},
  year={2025},
  url={https://huggingface.co/babylm-ntust/BitGen-PreReasoning-stage1}
}
```
Model Card Contact
For questions or issues, please open an issue on the GitHub repository.
License: MIT
Model Version: Stage 1 (Vision-Language Pre-training)
Last Updated: 2025-10-16