---
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- nvfp4
- quantized
- vllm
- hopper
- dgx
license: llama3.2
---

# Llama-3.2-1B-Instruct-NVFP4

NVFP4-quantized version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), optimized for NVIDIA DGX/Hopper+ architectures (H100, H200, GB10, etc.).

## Quantization Details

- **Format:** NVFP4 (4-bit floating point)
- **Quantized using:** NVIDIA TensorRT Model Optimizer 0.35.0
- **Hardware:** 2× NVIDIA H200 SXM (141 GB VRAM each)
- **Original precision:** BF16/FP16
- **Compatible with:** vLLM 0.10+, NVIDIA NGC containers

## Usage with vLLM

```bash
vllm serve tbhot3ww/Llama-3.2-1B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```

A sample client request and a containerized alternative are shown at the end of this card.

## Requirements

- NVIDIA GPU with Hopper architecture or newer (compute capability ≥ 9.0)
- CUDA 12.0+
- vLLM 0.10 or newer with ModelOpt support

A quick way to verify the compute-capability requirement is shown at the end of this card.

## Original Model

For architecture details, training data, intended use, and limitations, see the [original model card](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).

## Quantization Notes

This model uses NVIDIA's NVFP4 format, which stores weights in 4-bit floating point with minimal quality degradation relative to the BF16/FP16 original. Best performance is achieved on GPUs with native FP4 tensor cores, which arrive with the Blackwell generation (e.g. GB10); Hopper GPUs (H100, H200) are also supported, though without native FP4 tensor cores.
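## Example Request

Once the server from the Usage section is up, you can send a chat completion to vLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the default bind address and port (`localhost:8000`):

```bash
# Chat completion against the vLLM OpenAI-compatible server.
# localhost:8000 is vLLM's default; adjust if you passed --host/--port.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tbhot3ww/Llama-3.2-1B-Instruct-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 64
  }'
```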
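## Running in a Container

The compatibility list above mentions NVIDIA NGC containers; as one concrete route, the `vllm/vllm-openai` image on Docker Hub wraps the same serving entrypoint. A sketch assuming that image (not the NGC build) and a local Hugging Face cache:

```bash
# Serve the model from the vLLM OpenAI-compatible image.
# Mounting the HF cache avoids re-downloading weights on each run.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model tbhot3ww/Llama-3.2-1B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192
```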
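## Checking GPU Compatibility

To verify the compute-capability requirement from the Requirements section, recent NVIDIA drivers expose it directly through `nvidia-smi` (the `compute_cap` query field; may be absent on older drivers):

```bash
# Hopper (H100/H200) reports 9.0; any value >= 9.0 meets the requirement.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```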