---
library_name: vllm
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B
tags:
- neuralmagic
- redhat
- llmcompressor
- int4
- w4a16
- quantized
---

## Model Overview
- **Model Architecture:** SmolLM3-3B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** None
- **Release Date:** 07/31/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** RedHat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%; for a 3-billion-parameter model, roughly 6 GB of 16-bit weights shrink to roughly 1.5 GB. Only the weights of the linear operators within the transformer blocks are quantized. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
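As a minimal sketch of that workflow, assuming the server has been started with `vllm serve RedHatAI/SmolLM3-3B-quantized.w4a16` on the default port 8000 and the `openai` Python client is installed, a request could look like the following:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default base URL: http://localhost:8000/v1).
# vLLM does not check the API key by default; any placeholder value works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/SmolLM3-3B-quantized.w4a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=256,
)
print(response.choices[0].message.content)
```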
## Creation

<details>
  <summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below with:

```bash
python int4.py --model_path HuggingFaceTB/SmolLM3-3B --calib_size 1024 --dampening_frac 0.1 --observer minmax --actorder group --sym false
```

where `int4.py` is as follows:

```python
import argparse

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Constants
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
MAX_SEQ_LENGTH = 8192
IGNORE_MODULES = ["lm_head"]


# Argument Parsing Utilities
def parse_actorder(value: str):
    value_lower = value.lower()
    if value_lower == "false":
        return False
    if value_lower in {"weight", "group"}:
        return value_lower
    raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")


def parse_sym(value: str):
    value_lower = value.lower()
    if value_lower in {"true", "false"}:
        return value_lower == "true"
    raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")


# Argument Parser
def get_args():
    parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
    parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
    parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
    parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
    parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
    parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
    parser.add_argument('--actorder', type=parse_actorder, default=False, help="Activation order: 'group', 'weight', or 'false'.")
    return parser.parse_args()


def main():
    args = get_args()

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        use_cache=False,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)

    # Load and preprocess dataset
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(args.calib_size))
    ds = ds.map(lambda x: {"text": x["text"]})
    ds = ds.map(
        lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
        remove_columns=ds.column_names,
    )

    # Build Quantization Scheme
    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            symmetric=args.sym,
            group_size=128,
            strategy=QuantizationStrategy.GROUP,
            observer=args.observer,
            actorder=args.actorder,
        ),
        input_activations=None,
        output_activations=None,
    )

    # Define compression recipe
    recipe = [
        GPTQModifier(
            targets=["Linear"],
            ignore=IGNORE_MODULES,
            dampening_frac=args.dampening_frac,
            config_groups={"group_0": quant_scheme},
        )
    ]

    # Apply quantization
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        num_calibration_samples=args.calib_size,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    # Save the quantized model
    save_path = f"{args.model_path}-quantized.w4a16"
    model.save_pretrained(save_path, save_compressed=True)
    tokenizer.save_pretrained(save_path)


if __name__ == "__main__":
    main()
```

</details>
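For intuition, the weight scheme configured above (W4A16: 4-bit integer weights with per-group scales over groups of 128 values, asymmetric because `--sym false` is passed) can be illustrated with a minimal round-to-nearest sketch. This is only an illustration of the storage format with a hypothetical helper function; the actual `GPTQModifier` additionally uses the calibration data to compensate quantization error layer by layer, and the real inference kernels keep the weights packed in INT4 and dequantize on the fly.

```python
# Illustrative sketch only (hypothetical helper, not llm-compressor internals):
# round-to-nearest, asymmetric, group-wise INT4 weight quantization with group_size=128.
import torch


def quantize_w4a16_group(weight: torch.Tensor, group_size: int = 128):
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)            # per-group minimum ("minmax" observer)
    w_max = w.amax(dim=-1, keepdim=True)            # per-group maximum
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 4-bit unsigned range is 0..15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 15)  # stored as INT4
    w_hat = (q - zero_point) * scale                # values the 16-bit matmul effectively sees
    return q.reshape_as(weight), w_hat.reshape_as(weight)


# Example: mean quantization error on a random linear-layer weight
w = torch.randn(256, 512)
q, w_hat = quantize_w4a16_group(w)
print((w - w_hat).abs().mean())
```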
## Evaluation

This model was evaluated on well-known reasoning benchmarks: AIME24, MATH-500, and GPQA-Diamond. In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and evaluations were collected with the [LightEval](https://github.com/huggingface/lighteval) library.
<details>
  <summary>Evaluation details</summary>

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
export TASK=aime24  # one of {aime24, math_500, gpqa:diamond}

lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
    --use-chat-template \
    --output-dir out_dir
```

</details>
### Accuracy
| Category | Benchmark | HuggingFaceTB/SmolLM3-3B | RedHatAI/SmolLM3-3B-quantized.w4a16<br>(this model) | Recovery |
|----------|-----------|:------------------------:|:---------------------------------------------------:|:--------:|
| **Reasoning** | AIME24 (pass@1:64) | 45.31 | 39.27 | 86.67% |
| | MATH-500 (pass@1:4) | 89.30 | 87.55 | 98.04% |
| | GPQA-Diamond (pass@1:8) | 41.22 | 41.86 | 101.55% |
| | **Average** | **58.61** | **56.23** | **95.94%** |
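Recovery is the quantized model's score divided by the baseline model's score, expressed as a percentage. A small illustrative script (not part of the evaluation harness) reproduces the Recovery column from the per-benchmark scores above:

```python
# Recovery = quantized score / baseline score, as a percentage.
baseline  = {"AIME24": 45.31, "MATH-500": 89.30, "GPQA-Diamond": 41.22}
quantized = {"AIME24": 39.27, "MATH-500": 87.55, "GPQA-Diamond": 41.86}

for task in baseline:
    print(f"{task}: {100 * quantized[task] / baseline[task]:.2f}% recovery")

avg_base  = sum(baseline.values()) / len(baseline)    # 58.61
avg_quant = sum(quantized.values()) / len(quantized)  # ~56.23
print(f"Average recovery: {100 * avg_quant / avg_base:.2f}%")  # ~95.9%
```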