BitGen Stage 1: Vision-Language Pre-training

Model Description

BitGen is a tiny, efficient vision-language model designed for edge devices and resource-constrained environments. This is the Stage 1 checkpoint focusing on vision-language pre-training using the COCO dataset.

Architecture

BitGen combines three powerful components (a minimal structural sketch in code follows this list):

  1. BitMar Encoder-Decoder: 1.58-bit quantized transformer (BitNet b1.58) for extreme efficiency
  2. FIBER Cross-Modal Fusion: Queue-based contrastive learning for vision-language alignment
  3. Larimar GPM: Generative Parametric Memory for episodic memory and reasoning
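
To make the composition concrete, below is a minimal structural sketch of how the three components could be wired together, assuming an encode-fuse-decode layout. The class names, module choices, and shapes are illustrative stand-ins built from plain PyTorch layers, not the repository's actual implementation.

import torch
import torch.nn as nn

class BitGenSketch(nn.Module):                # illustrative stand-in, not the real class
    def __init__(self, dim=128, heads=4, ffn=256, vocab=50257, slots=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Stand-in for the BitMar encoder (the real one uses 1.58-bit BitNet linear layers).
        enc_layer = nn.TransformerEncoderLayer(dim, heads, ffn, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        # FIBER-style cross-modal fusion, reduced here to a single cross-attention block.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Larimar GPM reduced to a learned bank of memory slots read via attention.
        self.memory = nn.Parameter(torch.randn(slots, dim))
        self.memory_read = nn.MultiheadAttention(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, ffn, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids, image_feats):                   # image_feats: [B, N, dim]
        text = self.text_encoder(self.embed(input_ids))          # [B, T, dim]
        fused, _ = self.cross_attn(text, image_feats, image_feats)   # vision-language fusion
        mem = self.memory.unsqueeze(0).expand(fused.size(0), -1, -1)
        fused, _ = self.memory_read(fused, mem, mem)             # episodic memory read
        return self.lm_head(self.decoder(fused, image_feats))    # caption logits [B, T, vocab]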

Model Size (Tiny Configuration)

  • Embedding Dimension: 128
  • Encoder Layers: 3
  • Decoder Layers: 2
  • Attention Heads: 4
  • FFN Dimension: 256
  • Vocabulary Size: 50257 (GPT-2 tokenizer)
  • Memory Slots: 32
  • Max Sequence Length: 256
  • Total Parameters: ~5-10M (tiny enough for edge devices!)
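
For reference, the settings above can be collected into a small configuration object; the class and field names are assumptions, not the checkpoint's actual config schema. The closing comment walks through why the total parameter count stays in the stated range.

from dataclasses import dataclass

@dataclass
class BitGenTinyConfig:        # hypothetical name mirroring the settings listed above
    embed_dim: int = 128
    encoder_layers: int = 3
    decoder_layers: int = 2
    num_heads: int = 4
    ffn_dim: int = 256
    vocab_size: int = 50257    # GPT-2 BPE vocabulary
    memory_slots: int = 32
    max_seq_len: int = 256

cfg = BitGenTinyConfig()
# The token embedding alone is vocab_size * embed_dim = 50257 * 128 ≈ 6.4M parameters,
# so the embedding table dominates the budget; each 128-dim transformer layer adds only
# ~0.1-0.2M parameters, which is how the total lands in the ~5-10M range.
print(cfg)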

Training Data

  • Dataset: MS-COCO Captions (validated subset)
  • Image-Caption Pairs: ~118k training samples
  • Tokenizer: GPT-2 BPE tokenizer
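
As an illustration of what the data pipeline looks like, the snippet below loads COCO image-caption pairs with torchvision's CocoCaptions dataset (which requires pycocotools); the local paths are placeholders, and the actual training pipeline for this checkpoint may differ.

from torchvision.datasets import CocoCaptions
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Placeholder paths to a local MS-COCO download
coco = CocoCaptions(root="coco/train2017",
                    annFile="coco/annotations/captions_train2017.json",
                    transform=transform)

image, captions = coco[0]       # each image comes with roughly five reference captions
print(image.shape, captions[0])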

Training Objectives

  1. Image-Text Contrastive (ITC) Loss [Weight: 1.0 - PRIMARY]
    • FIBER-style queue-based contrastive learning
    • Aligns vision and language representations
    • Hard negative mining from queue

  2. Image-Text Matching (ITM) Loss [Weight: 0.5]
    • Binary classification with hard negatives
    • Learns fine-grained image-caption associations

  3. Text Reconstruction Loss [Weight: 0.0 - AUXILIARY]
    • Decoder reconstructs captions from fused features
    • Maintains language understanding
    • Label smoothing (0.1) to prevent mode collapse

  4. Memory KL Divergence [Weight: 0.1]
    • Larimar GPM episodic memory regularization
    • Bayesian inference over memory parameters
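
Putting the four objectives together, the total training loss is the weighted sum described above. The sketch below shows one plausible way to compute it, assuming the model exposes pooled contrastive features, ITM logits, caption reconstruction logits, and a memory KL term; the tensor names and the use of in-batch negatives (rather than the 4096-slot queue) are simplifications.

import torch
import torch.nn.functional as F

def combined_loss(text_feat, image_feat, itm_logits, itm_labels,
                  recon_logits, caption_ids, memory_kl, temperature=0.5):
    # 1) ITC: symmetric InfoNCE over the batch similarity matrix
    #    (the real model also draws negatives from a FIBER-style queue).
    text_feat = F.normalize(text_feat, dim=-1)
    image_feat = F.normalize(image_feat, dim=-1)
    logits = text_feat @ image_feat.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    # 2) ITM: binary matched/unmatched classification over hard negatives.
    itm = F.cross_entropy(itm_logits, itm_labels)

    # 3) Text reconstruction with label smoothing 0.1 (weight 0.0 above, i.e. monitored only).
    recon = F.cross_entropy(recon_logits.transpose(1, 2), caption_ids, label_smoothing=0.1)

    # 4) Memory KL: regularizes the Larimar GPM posterior toward its prior.
    return 1.0 * itc + 0.5 * itm + 0.0 * recon + 0.1 * memory_kl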

Key Features

✅ Tiny Model: Suitable for edge devices (Raspberry Pi, mobile phones)
✅ 1.58-bit Quantization: Extreme efficiency via BitNet b1.58 (ternary weight scheme sketched after this list)
✅ Vision-Language Alignment: FIBER-style contrastive learning
✅ Episodic Memory: Larimar GPM for memory-augmented reasoning
✅ Hard Negative Mining: ITM loss for robust alignment
✅ DINOv2 Vision Encoder: State-of-the-art vision features (trainable)
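
For intuition on the 1.58-bit claim: BitNet b1.58 constrains each weight to the ternary set {-1, 0, +1} using a per-tensor absmean scale. The snippet below is a plain-PyTorch sketch of that published recipe; the checkpoint's actual quantized kernels, weight packing, and activation quantization may differ.

import torch

def quantize_weights_b158(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization as described for BitNet b1.58 (sketch only)."""
    gamma = w.abs().mean()                              # per-tensor absmean scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)     # ternary weights in {-1, 0, +1}
    return w_q, gamma                                   # dequantize later as w_q * gamma

w = torch.randn(256, 128)
w_q, gamma = quantize_weights_b158(w)
print(torch.unique(w_q))   # typically tensor([-1., 0., 1.])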

Usage

Note: This repository contains only the latest checkpoint. Each training epoch overwrites the previous model file (pytorch_model.bin) to save storage. Git history preserves all versions.

Loading the Model

from transformers import AutoModel
import torch

# Load model from HuggingFace Hub (always the latest checkpoint)
model = AutoModel.from_pretrained("babylm-ntust/BitGen-PreReasoning-stage1")
model.eval()

Inference Example

import torch
from transformers import GPT2Tokenizer
from PIL import Image
import torchvision.transforms as transforms

# Setup
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Load image and caption
image = Image.open("path/to/image.jpg").convert('RGB')
caption = "A cat sitting on a couch"

# Prepare inputs
image_tensor = transform(image).unsqueeze(0)
tokens = tokenizer(caption, return_tensors='pt', padding=True, truncation=True, max_length=256)
input_ids = tokens['input_ids']

# Forward pass
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        return_contrastive_features=True
    )
    
    # Get similarity
    text_feat = outputs['contrastive_features']['text_features']
    image_feat = outputs['contrastive_features']['image_features']
    similarity = (text_feat @ image_feat.T).item()
    print(f"Similarity: {similarity:.4f}")

Training Details

Hyperparameters

  • Batch Size: 128 (effective: 256)
  • Learning Rate: 0.0002
  • Optimizer: AdamW (weight_decay=0.02)
  • Gradient Accumulation: 2 steps
  • Max Gradient Norm: 1.0
  • Mixed Precision: AMP
  • Temperature: 0.5
  • Queue Size: 4096
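
These hyperparameters map onto a fairly conventional AMP training step with gradient accumulation. The skeleton below is a sketch under that assumption; the data loader and the 'loss' output key are placeholders, not the actual training script.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.02)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 2   # effective batch size 256 with per-step batch size 128

for step, batch in enumerate(train_loader):            # train_loader is a placeholder
    with torch.cuda.amp.autocast():
        loss = model(**batch)['loss'] / accum_steps     # 'loss' output key is an assumption
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)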

Training Schedule

  • Warmup Steps: 1000
  • Scheduler: Cosine decay with min LR = 0.1 × initial LR
  • Early Stopping: Patience = 5 epochs
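
The warmup-plus-cosine schedule above (decaying to 0.1 × the initial learning rate) can be expressed as a LambdaLR multiplier. The sketch reuses the optimizer from the previous snippet; total_steps is a placeholder for the real training length, and early stopping runs outside the scheduler.

import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps, min_ratio = 1000, 50_000, 0.1   # total_steps is a placeholder

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_ratio + (1.0 - min_ratio) * cosine            # cosine decay to 0.1 x initial LR

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() after each optimizer step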

Limitations and Biases

Limitations

  1. Tiny Model: Designed for efficiency, not SOTA performance
  2. English Only: Trained on English captions
  3. Stage 1 Only: Pre-training phase; reasoning module in Stage 2
  4. Limited Context: Max sequence length of 256 tokens
  5. COCO-Centric: Training data from MS-COCO

Biases

  • Dataset bias from MS-COCO (Western-centric, object-focused)
  • Vision bias from DINOv2 training data
  • Language bias from GPT-2 tokenizer

Citation

@software{bitgen2025,
  title={BitGen: Tiny Vision-Language Model for Edge Devices},
  author={BitGen Team},
  year={2025},
  url={https://huggingface.co/babylm-ntust/BitGen-PreReasoning-stage1}
}

Model Card Contact

For questions or issues, please open an issue on the GitHub repository.


License: MIT
Model Version: Stage 1 (Vision-Language Pre-training)
Last Updated: 2025-10-16
