Qwerky Optimized Llama3.2 Mamba Hybrid - 3B Instruct

This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is a 3B parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.

Model Developer: Qwerky AI

Model Details

  • Model Type: QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
  • Architecture: QwerkyLlamaMambaHybridForCausalLM
  • Base Model: Llama-3.1-8B
  • Mamba Type: MAMBA

Model Configuration

  • Vocabulary Size: 128256
  • Hidden Size: 3072
  • Number of Layers: 28
  • Number of Attention Heads: 24
  • Intermediate Size: 8192

How to Use

This model can be loaded using HuggingFace Transformers with AutoTokenizer and AutoModelForCausalLM. The model uses custom configuration and modeling files that are automatically loaded via the auto_map in config.json.
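Because the custom classes are resolved through auto_map, you can confirm that remote-code loading works before downloading the full weights. A minimal sketch (it fetches only the configuration and assumes the repo id shown below):

from transformers import AutoConfig

# Loading the config with trust_remote_code resolves the custom classes
# referenced in auto_map from the repository's modeling files.
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    trust_remote_code=True
)
print(type(config).__name__)              # the custom configuration class
print(getattr(config, "auto_map", None))  # Auto* class -> custom module mapping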

⚠️ Important Requirements

CUDA is required to run this model. It needs a CUDA-compatible GPU and cannot run on CPU-only systems. Make sure you have the following (a quick verification snippet is shown after this list):

  • A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
  • CUDA toolkit installed
  • PyTorch with CUDA support
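
Before installing the kernel packages below, you can verify that PyTorch sees your GPU. A quick check using plain PyTorch, no model required:

import torch

# The Mamba and flash-attention kernels need a CUDA device.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this model."
print(torch.cuda.get_device_name(0))                        # installed NVIDIA GPU
print("CUDA version used by PyTorch:", torch.version.cuda)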

Installation

First, install the required dependencies:

pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation

Note: flash-attn compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

MAX_JOBS=1 pip install flash-attn --no-build-isolation

Or set it as an environment variable:

export MAX_JOBS=1
pip install flash-attn --no-build-isolation
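
Once installation finishes, a quick import check confirms that the compiled extensions load correctly (the module names below correspond to the packages installed above):

# Import names for the flash-attn, mamba-ssm, and causal-conv1d packages.
import flash_attn
import mamba_ssm
import causal_conv1d

print("Custom CUDA kernels imported successfully.")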

Loading the Model

From HuggingFace Hub

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True
)
model.cuda()

From Local Directory

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True
)
model.cuda()
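
If you prefer to manage the download yourself, one option is to snapshot the repository with huggingface_hub and point the local path above at it. A sketch, assuming huggingface_hub is installed and using ./qwerky-3b-instruct purely as an example path:

from huggingface_hub import snapshot_download

# Download all repository files once and reuse the local copy for loading.
local_dir = snapshot_download(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    local_dir="./qwerky-3b-instruct"
)
print("Model files downloaded to:", local_dir)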

Generating Text

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move to CUDA
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,       # sampling must be enabled for temperature to take effect
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
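
For interactive use, you can stream tokens to the console as they are generated instead of waiting for the full sequence. A minimal sketch reusing the tokenizer, model, and inputs from above with transformers' TextStreamer:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer
)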

Model Files

This model repository contains:

  • config.json - Model configuration with auto_map for custom classes
  • modeling_qwerky_llama_mamba_hybrid.py - Custom modeling class
  • configuration_qwerky_llama_mamba_hybrid.py - Custom configuration class
  • model.safetensors or model-*.safetensors - Model weights (sharded if >5GB)
  • model.safetensors.index.json - Index file for sharded weights (if applicable)
  • tokenizer.json, tokenizer_config.json - Tokenizer files
  • README.md - This file

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • safetensors
  • mamba-ssm
  • causal-conv1d>=1.2.0
  • flash-attn (for optimized attention)

Performance Results

The following metrics are averaged over several prompts covering a wide array of tasks (reasoning, light coding, creativity, security, etc.), with 10 runs per task.

Tokens Per Second

| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Tokens per Second - Total | 127.51 | 99.60 |
| Tokens per Second - Output | 81.48 | 46.23 |
| Tokens per Second - Request | 1.24 | 0.87 |

Time Breakdown

| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Total Duration (s) | 9.71 | 16.47 |
| Time-to-First Token (ms) | 12.09 | 21.16 |
| End-to-End Latency (ms) | 971.46 | 1647.07 |
| Time per Output Token (ms) | 12.28 | 21.63 |
| Inter-Token Latency (ms) | 81.48 | 46.23 |

Evaluation Results

| Benchmark | Shots | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|---|
| GPQA | 0 | 25.76 | 28.52 |
| Musr | 0 | 35.19 | 35.85 |
| HellaSwag | 0 | 49.41 | 52.18 |
| ARC-Challenge | 0 | 39.51 | 43.52 |
| ARC-Easy | 0 | 74.71 | 69.60 |

Citation

If you use this model, please cite:

@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}

License

This model is licensed under the Qwerky Distilled Model License Agreement. See the LICENSE file for more details.
