This is the official quantization-aware training (QAT) FP-Quant checkpoint of `meta-llama/Llama-3.2-3B-Instruct`, produced as described in the paper [**"Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization"**](https://arxiv.org/abs/2509.23202).

This model can be run on Blackwell-generation NVIDIA GPUs via [QuTLASS](https://github.com/IST-DASLab/qutlass) and [FP-Quant](https://github.com/IST-DASLab/FP-Quant) in either [transformers](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) or [vLLM](https://github.com/vllm-project/vllm/pull/24440).

The approximate recipe for training this model (up to the local batch size and learning rate) is available [here](https://github.com/IST-DASLab/nanochat-qat/blob/qat/transformers_distill.py).

The table below compares this checkpoint with the original model and with round-to-nearest (RTN) quantization:
| Model | MMLU | GSM8K | HellaSwag | WinoGrande | Avg |
|-------|------|-------|-----------|------------|-----|
| `meta-llama/Llama-3.2-3B-Instruct` | 64.4 | 78.0 | 73.4 | 70.1 | 71.5 |
| RTN | 59.9 | 64.8 | 69.8 | 65.6 | 65.0 |
| QAT (this checkpoint) | 62.0 | 72.9 | 71.0 | 66.5 | 68.1 |
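
As a quick sanity check on the table above, the following sketch recomputes the `Avg` column and estimates how much of the accuracy lost to RTN is won back by QAT. The "gap recovered" ratio and the helper names (`average`, `gap_recovered`) are illustrative assumptions introduced here, not metrics or code from the paper; the scores are copied from the table.

```python
# Scores copied from the table above, in the order listed in BENCHMARKS.
BENCHMARKS = ["MMLU", "GSM8K", "HellaSwag", "WinoGrande"]
SCORES = {
    "baseline": [64.4, 78.0, 73.4, 70.1],  # meta-llama/Llama-3.2-3B-Instruct
    "rtn": [59.9, 64.8, 69.8, 65.6],       # round-to-nearest quantization
    "qat": [62.0, 72.9, 71.0, 66.5],       # this checkpoint
}


def average(values):
    """Plain arithmetic mean of a list of scores."""
    return sum(values) / len(values)


def gap_recovered(baseline, rtn, qat):
    """Fraction of the baseline-to-RTN drop that QAT wins back.

    Hypothetical helper for illustration: 0.0 means no better than RTN,
    1.0 means the baseline score is fully restored.
    """
    return (qat - rtn) / (baseline - rtn)


# Recompute the Avg column, rounded to one decimal as in the table.
averages = {name: round(average(vals), 1) for name, vals in SCORES.items()}

# Per-benchmark recovery ratios.
per_benchmark = {
    bench: gap_recovered(SCORES["baseline"][i], SCORES["rtn"][i], SCORES["qat"][i])
    for i, bench in enumerate(BENCHMARKS)
}

# Recovery ratio on the unrounded averages.
overall = gap_recovered(
    average(SCORES["baseline"]), average(SCORES["rtn"]), average(SCORES["qat"])
)

if __name__ == "__main__":
    print("Averages:", averages)
    for bench, ratio in per_benchmark.items():
        print(f"{bench}: {ratio:.0%} of the RTN drop recovered")
    print(f"Overall: {overall:.0%} of the average RTN drop recovered")
```

Under this definition, QAT recovers a little under half of the average drop caused by RTN, with the largest recovery on GSM8K and the smallest on WinoGrande.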