---
library_name: vllm
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B
tags:
- neuralmagic
- redhat
- llmcompressor
- int4
- w4a16
- quantized
---
## Model Overview
- **Model Architecture:** SmolLM3-3B
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
- **Weight quantization:** INT4
- **Activation quantization:** None
- **Release Date:** 07/31/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** Red Hat (Neural Magic)
### Model Optimizations
This model was obtained by quantizing the weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the INT4 data type.
This optimization reduces the number of bits used to represent each weight from 16 to 4, reducing GPU memory requirements by approximately 75%.
Weight quantization also reduces disk size requirements by approximately 75%.
Only the weights of the linear operators within transformer blocks are quantized.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
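As a rough, back-of-the-envelope sketch of where the ~75% figure comes from (the parameter count and per-group overhead below are illustrative assumptions, not an exact accounting of this checkpoint):
```python
# Illustrative estimate of weight storage for a nominal 3B-parameter model.
params = 3.0e9             # assumed parameter count (~3B); not an exact figure
bf16_bytes = params * 2    # 16-bit weights: 2 bytes per parameter

# INT4 weights: 0.5 bytes per parameter, plus a 16-bit scale and zero point
# per group of 128 weights (group-wise, asymmetric quantization).
group_size = 128
int4_bytes = params * 0.5 + (params / group_size) * 2 * 2

print(f"BF16 weights: {bf16_bytes / 1e9:.1f} GB")          # ~6.0 GB
print(f"INT4 weights: {int4_bytes / 1e9:.1f} GB")          # ~1.6 GB
print(f"Reduction:    {1 - int4_bytes / bf16_bytes:.0%}")  # ~73%
```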
## Deployment
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
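For example, the model can be served with `vllm serve` and then queried through any OpenAI-compatible client; the sketch below assumes the server's default host and port and uses the `openai` Python package.
```python
# Start the server first, for example:
#   vllm serve RedHatAI/SmolLM3-3B-quantized.w4a16
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default host/port assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/SmolLM3-3B-quantized.w4a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```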
## Creation
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the command below:
```bash
python int4.py --model_path HuggingFaceTB/SmolLM3-3B --calib_size 1024 --dampening_frac 0.1 --observer minmax --actorder group --sym false
```
where `int4.py` is as follows:
```python
import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from compressed_tensors.quantization import (
QuantizationScheme,
QuantizationArgs,
QuantizationType,
QuantizationStrategy,
)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
# Constants
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
MAX_SEQ_LENGTH = 8192
IGNORE_MODULES = ["lm_head"]
# Argument Parsing Utilities
def parse_actorder(value: str):
value_lower = value.lower()
if value_lower == "false":
return False
if value_lower in {"weight", "group"}:
return value_lower
raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")
def parse_sym(value: str):
value_lower = value.lower()
if value_lower in {"true", "false"}:
return value_lower == "true"
raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")
# Argument Parser
def get_args():
parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
parser.add_argument('--actorder', type=parse_actorder, default=False,
help="Activation order: 'group', 'weight', or 'false'.")
return parser.parse_args()
def main():
args = get_args()
model = AutoModelForCausalLM.from_pretrained(
args.model_path,
device_map="auto",
torch_dtype="auto",
use_cache=False,
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
# Load and preprocess dataset
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(args.calib_size))
ds = ds.map(lambda x: {"text": x["text"]})
ds = ds.map(
lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
remove_columns=ds.column_names
)
# Build Quantization Scheme
quant_scheme = QuantizationScheme(
targets=["Linear"],
weights=QuantizationArgs(
num_bits=4,
type=QuantizationType.INT,
symmetric=args.sym,
group_size=128,
strategy=QuantizationStrategy.GROUP,
observer=args.observer,
actorder=args.actorder
),
input_activations=None,
output_activations=None,
)
# Define compression recipe
recipe = [
GPTQModifier(
targets=["Linear"],
ignore=IGNORE_MODULES,
dampening_frac=args.dampening_frac,
config_groups={"group_0": quant_scheme},
)
]
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
num_calibration_samples=args.calib_size,
max_seq_length=MAX_SEQ_LENGTH,
)
# Save the quantized model
save_path = f"{args.model_path}-quantized.w4a16"
model.save_pretrained(save_path, save_compressed=True)
tokenizer.save_pretrained(save_path)
if __name__ == "__main__":
main()
```
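As an optional sanity check (a minimal sketch, assuming the output directory written by `int4.py` is available locally), the saved checkpoint's `config.json` should contain a `quantization_config` entry describing the compressed-tensors scheme:
```python
import json
import os

# Hypothetical local output path written by int4.py above.
save_path = "HuggingFaceTB/SmolLM3-3B-quantized.w4a16"

with open(os.path.join(save_path, "config.json")) as f:
    config = json.load(f)

# llm-compressor / compressed-tensors record the scheme under "quantization_config".
print(json.dumps(config.get("quantization_config", {}), indent=2))
```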
## Evaluation
The model was evaluated with [lighteval](https://github.com/huggingface/lighteval) using the vLLM backend, via the following commands:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
export TASK=aime24 # {aime24, math_500, gpqa:diamond}
lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
--use-chat-template \
--output-dir out_dir
```
| Category | Benchmark | HuggingFaceTB/SmolLM3-3B | RedHatAI/SmolLM3-3B-quantized.w4a16 (this model) | Recovery |
|---|---|---|---|---|
| Reasoning | AIME24 (pass@1:64) | 45.31 | 39.27 | 86.67% |
| | MATH-500 (pass@1:4) | 89.30 | 87.55 | 98.04% |
| | GPQA-Diamond (pass@1:8) | 41.22 | 41.86 | 101.55% |
| | Average | 58.61 | 56.23 | 95.94% |
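The recovery column is the quantized model's score expressed as a percentage of the baseline score, e.g.:
```python
# Recovery = quantized score / baseline score, using the table values above.
scores = {
    "AIME24":       (45.31, 39.27),
    "MATH-500":     (89.30, 87.55),
    "GPQA-Diamond": (41.22, 41.86),
}
for task, (baseline, quantized) in scores.items():
    print(f"{task}: {quantized / baseline:.2%}")  # 86.67%, 98.04%, 101.55%
```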