ISTA-DASLab
/

Llama-3.2-3B-Instruct-FPQuant-QAT-NVFP4

8-bit precision

Model card Files Files and versions

Llama-3.2-3B-Instruct-FPQuant-QAT-NVFP4 / README.md

BlackSamorez's picture

Upload README.md with huggingface_hub

d99071c verified 28 days ago

|

history blame contribute delete

1.13 kB

	This is the official QAT FP-Quant checkpoint of `meta-llama/Llama-3.2-3B-Instruct`, produced as described in the ["Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization"](https://arxiv.org/abs/2509.23202) paper.

	This model can be run on Blackwell-generation NVIDIA GPUs via [QuTLASS](https://github.com/IST-DASLab/qutlass) and [FP-Quant](https://github.com/IST-DASLab/FP-Quant) in either [transformers](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) or [vLLM](https://github.com/vllm-project/vllm/pull/24440).

	The approximate recipe for training this model (up to local batch size and LR) is available [here](https://github.com/IST-DASLab/nanochat-qat/blob/qat/transformers_distill.py).

	This checkpoint has the following performance relative to the original model and the RTN quantization:

	\| Model \| MMLU \| GSM8k \| Hellaswag \| Winogrande \| Avg \|
	\|-------\|------\|-------\|-----------\|------------\|-----\|
	\| `meta-llama/Llama-3.2-3B-Instruct` \| 64.4 \| 78.0 \| 73.4 \| 70.1 \| 71.5 \|
	\| RTN \| 59.9 \| 64.8 \| 69.8 \| 65.6 \| 65.0 \|
	\| QAT (THIS) \| 62.0 \| 72.9 \| 71.0 \| 66.5 \| 68.1 \|