Qwerky Optimized Llama3.2 Mamba Hybrid - 3B Instruct

This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is a 3B parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.

Model Developer: Qwerky AI

Model Details

  • Model Type: QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
  • Architecture: QwerkyLlamaMambaHybridForCausalLM
  • Base Model: Llama-3.1-8B
  • Mamba Type: MAMBA

Model Configuration

  • Vocabulary Size: 128256
  • Hidden Size: 3072
  • Number of Layers: 28
  • Number of Attention Heads: 24
  • Intermediate Size: 8192

How to Use

This model can be loaded using HuggingFace Transformers with AutoTokenizer and AutoModelForCausalLM. The model uses custom configuration and modeling files that are automatically loaded via the auto_map in config.json.
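Because the custom classes are resolved through auto_map, you can confirm that remote-code loading works before downloading the full weights. A minimal sketch (it fetches only the configuration and assumes the repo id shown below):

from transformers import AutoConfig

# Loading the config with trust_remote_code resolves the custom classes
# referenced in auto_map from the repository's modeling files.
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    trust_remote_code=True
)
print(type(config).__name__)              # the custom configuration class
print(getattr(config, "auto_map", None))  # Auto* class -> custom module mapping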

⚠️ Important Requirements

CUDA is required to run this model. It needs a CUDA-compatible GPU and cannot run on CPU-only systems. Make sure you have the following (a quick verification snippet is shown after this list):

  • A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
  • CUDA toolkit installed
  • PyTorch with CUDA support
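
Before installing the kernel packages below, you can verify that PyTorch sees your GPU. A quick check using plain PyTorch, no model required:

import torch

# The Mamba and flash-attention kernels need a CUDA device.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this model."
print(torch.cuda.get_device_name(0))                        # installed NVIDIA GPU
print("CUDA version used by PyTorch:", torch.version.cuda)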

Installation

First, install the required dependencies:

pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation

Note: flash-attn compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

MAX_JOBS=1 pip install flash-attn --no-build-isolation

Or set it as an environment variable:

export MAX_JOBS=1
pip install flash-attn --no-build-isolation
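
Once installation finishes, a quick import check confirms that the compiled extensions load correctly (the module names below correspond to the packages installed above):

# Import names for the flash-attn, mamba-ssm, and causal-conv1d packages.
import flash_attn
import mamba_ssm
import causal_conv1d

print("Custom CUDA kernels imported successfully.")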

Loading the Model

From HuggingFace Hub

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True
)
model.cuda()

From Local Directory

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True
)
model.cuda()
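
If you prefer to manage the download yourself, one option is to snapshot the repository with huggingface_hub and point the local path above at it. A sketch, assuming huggingface_hub is installed and using ./qwerky-3b-instruct purely as an example path:

from huggingface_hub import snapshot_download

# Download all repository files once and reuse the local copy for loading.
local_dir = snapshot_download(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    local_dir="./qwerky-3b-instruct"
)
print("Model files downloaded to:", local_dir)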

Generating Text

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move to CUDA
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,       # sampling must be enabled for temperature to take effect
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
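
For interactive use, you can stream tokens to the console as they are generated instead of waiting for the full sequence. A minimal sketch reusing the tokenizer, model, and inputs from above with transformers' TextStreamer:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer
)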

Model Files

This model repository contains:

  • config.json - Model configuration with auto_map for custom classes
  • modeling_qwerky_llama_mamba_hybrid.py - Custom modeling class
  • configuration_qwerky_llama_mamba_hybrid.py - Custom configuration class
  • model.safetensors or model-*.safetensors - Model weights (sharded if >5GB)
  • model.safetensors.index.json - Index file for sharded weights (if applicable)
  • tokenizer.json, tokenizer_config.json - Tokenizer files
  • README.md - This file

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • safetensors
  • mamba-ssm
  • causal-conv1d>=1.2.0
  • flash-attn (for optimized attention)

Performance Results

The following metrics are averaged over several prompts covering a wide array of tasks (reasoning, light coding, creativity, security, etc.), with 10 runs per task.

Tokens Per Second

| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Tokens per Second - Total | 127.51 | 99.60 |
| Tokens per Second - Output | 81.48 | 46.23 |
| Tokens per Second - Request | 1.24 | 0.87 |

Time Breakdown

| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Total Duration (s) | 9.71 | 16.47 |
| Time-to-First Token (ms) | 12.09 | 21.16 |
| End-to-End Latency (ms) | 971.46 | 1647.07 |
| Time per Output Token (ms) | 12.28 | 21.63 |
| Inter-Token Latency (ms) | 81.48 | 46.23 |

Evaluation Results

| Benchmark | Shots | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|---|
| GPQA | 0 | 25.76 | 28.52 |
| Musr | 0 | 35.19 | 35.85 |
| HellaSwag | 0 | 49.41 | 52.18 |
| ARC-Challenge | 0 | 39.51 | 43.52 |
| ARC-Easy | 0 | 74.71 | 69.60 |

Citation

If you use this model, please cite:

@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}

License

This model is licensed under the Qwerky Distilled Model License Agreement. See the LICENSE file for more details.
