BitGen Stage 1: Vision-Language Pre-training

Model Description

BitGen is a tiny, efficient vision-language model designed for edge devices and resource-constrained environments. This is the Stage 1 checkpoint focusing on vision-language pre-training using the COCO dataset.

Architecture

BitGen combines three powerful components (a minimal structural sketch in code follows this list):

  1. BitMar Encoder-Decoder: 1.58-bit quantized transformer (BitNet b1.58) for extreme efficiency
  2. FIBER Cross-Modal Fusion: Queue-based contrastive learning for vision-language alignment
  3. Larimar GPM: Generative Parametric Memory for episodic memory and reasoning
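
To make the composition concrete, below is a minimal structural sketch of how the three components could be wired together, assuming an encode-fuse-decode layout. The class names, module choices, and shapes are illustrative stand-ins built from plain PyTorch layers, not the repository's actual implementation.

import torch
import torch.nn as nn

class BitGenSketch(nn.Module):                # illustrative stand-in, not the real class
    def __init__(self, dim=128, heads=4, ffn=256, vocab=50257, slots=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Stand-in for the BitMar encoder (the real one uses 1.58-bit BitNet linear layers).
        enc_layer = nn.TransformerEncoderLayer(dim, heads, ffn, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        # FIBER-style cross-modal fusion, reduced here to a single cross-attention block.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Larimar GPM reduced to a learned bank of memory slots read via attention.
        self.memory = nn.Parameter(torch.randn(slots, dim))
        self.memory_read = nn.MultiheadAttention(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, ffn, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids, image_feats):                   # image_feats: [B, N, dim]
        text = self.text_encoder(self.embed(input_ids))          # [B, T, dim]
        fused, _ = self.cross_attn(text, image_feats, image_feats)   # vision-language fusion
        mem = self.memory.unsqueeze(0).expand(fused.size(0), -1, -1)
        fused, _ = self.memory_read(fused, mem, mem)             # episodic memory read
        return self.lm_head(self.decoder(fused, image_feats))    # caption logits [B, T, vocab]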

Model Size (Tiny Configuration)

  • Embedding Dimension: 128
  • Encoder Layers: 3
  • Decoder Layers: 2
  • Attention Heads: 4
  • FFN Dimension: 256
  • Vocabulary Size: 50257 (GPT-2 tokenizer)
  • Memory Slots: 32
  • Max Sequence Length: 256
  • Total Parameters: ~5-10M (tiny enough for edge devices!)
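
For reference, the settings above can be collected into a small configuration object; the class and field names are assumptions, not the checkpoint's actual config schema. The closing comment walks through why the total parameter count stays in the stated range.

from dataclasses import dataclass

@dataclass
class BitGenTinyConfig:        # hypothetical name mirroring the settings listed above
    embed_dim: int = 128
    encoder_layers: int = 3
    decoder_layers: int = 2
    num_heads: int = 4
    ffn_dim: int = 256
    vocab_size: int = 50257    # GPT-2 BPE vocabulary
    memory_slots: int = 32
    max_seq_len: int = 256

cfg = BitGenTinyConfig()
# The token embedding alone is vocab_size * embed_dim = 50257 * 128 ≈ 6.4M parameters,
# so the embedding table dominates the budget; each 128-dim transformer layer adds only
# ~0.1-0.2M parameters, which is how the total lands in the ~5-10M range.
print(cfg)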

Training Data

  • Dataset: MS-COCO Captions (validated subset)
  • Image-Caption Pairs: ~118k training samples
  • Tokenizer: GPT-2 BPE tokenizer
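
As an illustration of what the data pipeline looks like, the snippet below loads COCO image-caption pairs with torchvision's CocoCaptions dataset (which requires pycocotools); the local paths are placeholders, and the actual training pipeline for this checkpoint may differ.

from torchvision.datasets import CocoCaptions
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Placeholder paths to a local MS-COCO download
coco = CocoCaptions(root="coco/train2017",
                    annFile="coco/annotations/captions_train2017.json",
                    transform=transform)

image, captions = coco[0]       # each image comes with roughly five reference captions
print(image.shape, captions[0])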

Training Objectives

  1. Image-Text Contrastive (ITC) Loss [Weight: 1.0 - PRIMARY]
    • FIBER-style queue-based contrastive learning
    • Aligns vision and language representations
    • Hard negative mining from queue

  2. Image-Text Matching (ITM) Loss [Weight: 0.5]
    • Binary classification with hard negatives
    • Learns fine-grained image-caption associations

  3. Text Reconstruction Loss [Weight: 0.0 - AUXILIARY]
    • Decoder reconstructs captions from fused features
    • Maintains language understanding
    • Label smoothing (0.1) to prevent mode collapse

  4. Memory KL Divergence [Weight: 0.1]
    • Larimar GPM episodic memory regularization
    • Bayesian inference over memory parameters
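
Putting the four objectives together, the total training loss is the weighted sum described above. The sketch below shows one plausible way to compute it, assuming the model exposes pooled contrastive features, ITM logits, caption reconstruction logits, and a memory KL term; the tensor names and the use of in-batch negatives (rather than the 4096-slot queue) are simplifications.

import torch
import torch.nn.functional as F

def combined_loss(text_feat, image_feat, itm_logits, itm_labels,
                  recon_logits, caption_ids, memory_kl, temperature=0.5):
    # 1) ITC: symmetric InfoNCE over the batch similarity matrix
    #    (the real model also draws negatives from a FIBER-style queue).
    text_feat = F.normalize(text_feat, dim=-1)
    image_feat = F.normalize(image_feat, dim=-1)
    logits = text_feat @ image_feat.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    # 2) ITM: binary matched/unmatched classification over hard negatives.
    itm = F.cross_entropy(itm_logits, itm_labels)

    # 3) Text reconstruction with label smoothing 0.1 (weight 0.0 above, i.e. monitored only).
    recon = F.cross_entropy(recon_logits.transpose(1, 2), caption_ids, label_smoothing=0.1)

    # 4) Memory KL: regularizes the Larimar GPM posterior toward its prior.
    return 1.0 * itc + 0.5 * itm + 0.0 * recon + 0.1 * memory_kl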

Key Features

✅ Tiny Model: Suitable for edge devices (Raspberry Pi, mobile phones)
✅ 1.58-bit Quantization: Extreme efficiency via BitNet b1.58 (ternary weight scheme sketched after this list)
✅ Vision-Language Alignment: FIBER-style contrastive learning
✅ Episodic Memory: Larimar GPM for memory-augmented reasoning
✅ Hard Negative Mining: ITM loss for robust alignment
✅ DINOv2 Vision Encoder: State-of-the-art vision features (trainable)
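
For intuition on the 1.58-bit claim: BitNet b1.58 constrains each weight to the ternary set {-1, 0, +1} using a per-tensor absmean scale. The snippet below is a plain-PyTorch sketch of that published recipe; the checkpoint's actual quantized kernels, weight packing, and activation quantization may differ.

import torch

def quantize_weights_b158(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization as described for BitNet b1.58 (sketch only)."""
    gamma = w.abs().mean()                              # per-tensor absmean scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)     # ternary weights in {-1, 0, +1}
    return w_q, gamma                                   # dequantize later as w_q * gamma

w = torch.randn(256, 128)
w_q, gamma = quantize_weights_b158(w)
print(torch.unique(w_q))   # typically tensor([-1., 0., 1.])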

Usage

Note: This repository contains only the latest checkpoint. Each training epoch overwrites the previous model file (pytorch_model.bin) to save storage. Git history preserves all versions.

Loading the Model

from transformers import AutoModel
import torch

# Load model from HuggingFace Hub (always the latest checkpoint)
model = AutoModel.from_pretrained("babylm-ntust/BitGen-PreReasoning-stage1")
model.eval()

Inference Example

import torch
from transformers import GPT2Tokenizer
from PIL import Image
import torchvision.transforms as transforms

# Setup
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Load image and caption
image = Image.open("path/to/image.jpg").convert('RGB')
caption = "A cat sitting on a couch"

# Prepare inputs
image_tensor = transform(image).unsqueeze(0)
tokens = tokenizer(caption, return_tensors='pt', padding=True, truncation=True, max_length=256)
input_ids = tokens['input_ids']

# Forward pass
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        return_contrastive_features=True
    )
    
    # Get similarity
    text_feat = outputs['contrastive_features']['text_features']
    image_feat = outputs['contrastive_features']['image_features']
    similarity = (text_feat @ image_feat.T).item()
    print(f"Similarity: {similarity:.4f}")

Training Details

Hyperparameters

  • Batch Size: 128 (effective: 256)
  • Learning Rate: 0.0002
  • Optimizer: AdamW (weight_decay=0.02)
  • Gradient Accumulation: 2 steps
  • Max Gradient Norm: 1.0
  • Mixed Precision: AMP
  • Temperature: 0.5
  • Queue Size: 4096
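
These hyperparameters map onto a fairly conventional AMP training step with gradient accumulation. The skeleton below is a sketch under that assumption; the data loader and the 'loss' output key are placeholders, not the actual training script.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.02)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 2   # effective batch size 256 with per-step batch size 128

for step, batch in enumerate(train_loader):            # train_loader is a placeholder
    with torch.cuda.amp.autocast():
        loss = model(**batch)['loss'] / accum_steps     # 'loss' output key is an assumption
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)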

Training Schedule

  • Warmup Steps: 1000
  • Scheduler: Cosine decay with min LR = 0.1 × initial LR
  • Early Stopping: Patience = 5 epochs
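
The warmup-plus-cosine schedule above (decaying to 0.1 × the initial learning rate) can be expressed as a LambdaLR multiplier. The sketch reuses the optimizer from the previous snippet; total_steps is a placeholder for the real training length, and early stopping runs outside the scheduler.

import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps, min_ratio = 1000, 50_000, 0.1   # total_steps is a placeholder

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_ratio + (1.0 - min_ratio) * cosine            # cosine decay to 0.1 x initial LR

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() after each optimizer step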

Limitations and Biases

Limitations

  1. Tiny Model: Designed for efficiency, not SOTA performance
  2. English Only: Trained on English captions
  3. Stage 1 Only: Pre-training phase; reasoning module in Stage 2
  4. Limited Context: Max sequence length of 256 tokens
  5. COCO-Centric: Training data from MS-COCO

Biases

  • Dataset bias from MS-COCO (Western-centric, object-focused)
  • Vision bias from DINOv2 training data
  • Language bias from GPT-2 tokenizer

Citation

@software{bitgen2025,
  title={BitGen: Tiny Vision-Language Model for Edge Devices},
  author={BitGen Team},
  year={2025},
  url={https://huggingface.co/babylm-ntust/BitGen-PreReasoning-stage1}
}

Model Card Contact

For questions or issues, please open an issue on the GitHub repository.


License: MIT
Model Version: Stage 1 (Vision-Language Pre-training)
Last Updated: 2025-10-16
