Qwerky Optimized Llama3.2 Mamba Hybrid - 3B Instruct
This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is a 3B parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.
Model Developer: Qwerky AI
Model Details
- Model Type: QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
- Architecture: QwerkyLlamaMambaHybridForCausalLM
- Base Model: Llama-3.1-8B
- Mamba Type: MAMBA
Model Configuration
- Vocabulary Size: 128256
- Hidden Size: 3072
- Number of Layers: 28
- Number of Attention Heads: 24
- Intermediate Size: 8192
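As a quick sanity check, these values can be read back from the published configuration without downloading the weights. The snippet below is a minimal sketch using the standard `AutoConfig` API; the attribute names follow the usual Llama-style naming and are an assumption about this custom configuration class.

```python
from transformers import AutoConfig

# trust_remote_code is required so the custom configuration class referenced in auto_map is loaded
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    trust_remote_code=True,
)

# Attribute names assume the usual Llama-style config fields
print(config.vocab_size)           # expected: 128256
print(config.hidden_size)          # expected: 3072
print(config.num_hidden_layers)    # expected: 28
print(config.num_attention_heads)  # expected: 24
print(config.intermediate_size)    # expected: 8192
```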
How to Use
This model can be loaded using HuggingFace Transformers with AutoTokenizer and AutoModelForCausalLM. The model uses custom configuration and modeling files that are automatically loaded via the auto_map in config.json.
⚠️ Important Requirements
CUDA is required to run this model. It needs a CUDA-compatible NVIDIA GPU and cannot run on CPU-only systems. Make sure you have:
- An NVIDIA GPU with CUDA support
- The CUDA toolkit installed
- PyTorch built with CUDA support
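Before installing the heavier dependencies, it helps to confirm that PyTorch can actually see a CUDA device. This is a generic check, not specific to this model:

```python
import torch

# The model cannot run without a CUDA-capable GPU
assert torch.cuda.is_available(), "No CUDA device found; this model requires an NVIDIA GPU."
print(torch.cuda.get_device_name(0))  # name of the detected GPU
print(torch.version.cuda)             # CUDA version PyTorch was built against
```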
Installation
First, install the required dependencies:
```bash
pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation
```
The version specifier for `causal-conv1d` is quoted so the shell does not interpret `>=` as a redirection.
Note: flash-attn compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:
```bash
MAX_JOBS=1 pip install flash-attn --no-build-isolation
```
Or set it as an environment variable:
```bash
export MAX_JOBS=1
pip install flash-attn --no-build-isolation
```
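Once installation finishes, a quick import check confirms that the compiled extensions are usable. This is a generic sanity check rather than part of the official setup:

```python
# If any of these imports fail, the corresponding package (or its CUDA build) is missing or broken
import flash_attn
import mamba_ssm
import causal_conv1d

print("flash-attn, mamba-ssm and causal-conv1d imported successfully")
```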
Loading the Model
From HuggingFace Hub
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True,
)
model.cuda()
```
From Local Directory
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    trust_remote_code=True,
)
model.cuda()
```
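If you do not yet have a local copy, one way to create one is with `snapshot_download` from `huggingface_hub` (the target directory below is just an example path):

```python
from huggingface_hub import snapshot_download

# Downloads all repository files (weights, tokenizer, custom code) into a local folder
local_dir = snapshot_download(
    repo_id="QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-3B-Instruct",
    local_dir="./qwerky-llama-mamba-3b",  # example path, change as needed
)
print(local_dir)
```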
Generating Text
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize and move to CUDA
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response (do_sample=True so temperature takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
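Because the model targets fast decoding, you may want to print tokens as they are produced instead of waiting for the full response. Below is a minimal streaming variant of the example above using Transformers' built-in `TextStreamer`; it assumes `model`, `tokenizer`, and `inputs` are set up as shown:

```python
from transformers import TextStreamer

# Streams decoded tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
```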
Model Files
This model repository contains:
- `config.json` - Model configuration with `auto_map` for the custom classes
- `modeling_qwerky_llama_mamba_hybrid.py` - Custom modeling class
- `configuration_qwerky_llama_mamba_hybrid.py` - Custom configuration class
- `model.safetensors` or `model-*.safetensors` - Model weights (sharded if >5GB)
- `model.safetensors.index.json` - Index file for sharded weights (if applicable)
- `tokenizer.json`, `tokenizer_config.json` - Tokenizer files
- `README.md` - This file
Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- safetensors
- mamba-ssm
- causal-conv1d>=1.2.0
- flash-attn (for optimized attention)
Performance Results
The following metrics are an overall average over several prompts covering a wide array of tasks (reasoning, light coding, creativity, security, etc.), with 10 runs per task.
Tokens Per Second
| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Tokens per Second - Total | 127.51 | 99.60 |
| Tokens per Second - Output | 81.48 | 46.23 |
| Tokens per Second - Request | 1.24 | 0.87 |
Time Breakdown
| Metric | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|
| Total Duration (s) | 9.71 | 16.47 |
| Time-to-First Token (ms) | 12.09 | 21.16 |
| End-to-End Latency (ms) | 971.46 | 1647.07 |
| Time per Output Token (ms) | 12.28 | 21.63 |
| Inter-Token Latency (ms) | 81.48 | 46.23 |
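The output tokens-per-second figure can be reproduced approximately with a simple timing loop like the one below. This is only a rough sketch of the measurement, not the benchmarking harness used to produce the tables above, and it assumes the model and tokenizer have been loaded as shown earlier:

```python
import time
import torch

# Assumes `model` and `tokenizer` are already loaded and the model is on CUDA
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain what a state-space model is."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.2f} output tokens/s")
print(f"{elapsed * 1000 / new_tokens:.2f} ms per output token")
```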
Evaluation Results
| Benchmark | Shots | Qwerky Optimized Llama 3B | Llama 3.2 3B Inst |
|---|---|---|---|
| GPQA | 0 | 25.76 | 28.52 |
| MuSR | 0 | 35.19 | 35.85 |
| HellaSwag | 0 | 49.41 | 52.18 |
| ARC-Challenge | 0 | 39.51 | 43.52 |
| ARC-Easy | 0 | 74.71 | 69.60 |
Citation
If you use this model, please cite:
```bibtex
@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={{Qwerky AI, Inc.}},
  year={2025},
  publisher={HuggingFace}
}
```
License
This model is licensed under the Qwerky Distilled Model License Agreement. See the LICENSE file for more details.