---
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- nvfp4
- quantized
- vllm
- hopper
- dgx
license: llama3.2
---

# Llama-3.2-1B-Instruct-NVFP4

NVFP4-quantized version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), optimized for NVIDIA DGX/Hopper+ architectures (H100, H200, GB10, etc.).

## Quantization Details

- **Format:** NVFP4 (4-bit floating point)
- **Quantized using:** NVIDIA TensorRT Model Optimizer 0.35.0
- **Hardware:** 2× NVIDIA H200 SXM (141 GB VRAM each)
- **Original precision:** BF16/FP16
- **Compatible with:** vLLM 0.10+, NVIDIA NGC containers

## Usage with vLLM

```bash
vllm serve tbhot3ww/Llama-3.2-1B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```

A sample client request and a containerized alternative are shown at the end of this card.

## Requirements

- NVIDIA GPU with Hopper architecture or newer (compute capability ≥ 9.0)
- CUDA 12.0+
- vLLM 0.10 or newer with ModelOpt support

A quick way to verify the compute-capability requirement is shown at the end of this card.

## Original Model

For architecture details, training data, intended use, and limitations, see the [original model card](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).

## Quantization Notes

This model uses NVIDIA's NVFP4 format, which stores weights in 4-bit floating point with minimal quality degradation relative to the BF16/FP16 original. Best performance is achieved on GPUs with native FP4 tensor cores, which arrive with the Blackwell generation (e.g. GB10); Hopper GPUs (H100, H200) are also supported, though without native FP4 tensor cores.
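## Example Request

Once the server from the Usage section is up, you can send a chat completion to vLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the default bind address and port (`localhost:8000`):

```bash
# Chat completion against the vLLM OpenAI-compatible server.
# localhost:8000 is vLLM's default; adjust if you passed --host/--port.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tbhot3ww/Llama-3.2-1B-Instruct-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 64
  }'
```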
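## Running in a Container

The compatibility list above mentions NVIDIA NGC containers; as one concrete route, the `vllm/vllm-openai` image on Docker Hub wraps the same serving entrypoint. A sketch assuming that image (not the NGC build) and a local Hugging Face cache:

```bash
# Serve the model from the vLLM OpenAI-compatible image.
# Mounting the HF cache avoids re-downloading weights on each run.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model tbhot3ww/Llama-3.2-1B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192
```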
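## Checking GPU Compatibility

To verify the compute-capability requirement from the Requirements section, recent NVIDIA drivers expose it directly through `nvidia-smi` (the `compute_cap` query field; may be absent on older drivers):

```bash
# Hopper (H100/H200) reports 9.0; any value >= 9.0 meets the requirement.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```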