---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
tags:
- nvfp4
- quantized
- vllm
- hopper
- dgx
license: apache-2.0
---

# Qwen2.5-Coder-32B-Instruct-NVFP4

An NVFP4-quantized version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct), optimized for NVIDIA DGX/Hopper+ architectures (H100, H200, GB10, etc.).

## Quantization Details

- **Format:** NVFP4 (4-bit floating point)
- **Quantized using:** NVIDIA TensorRT Model Optimizer 0.35.0
- **Hardware:** 2× NVIDIA H200 SXM (188GB VRAM each)
- **Original precision:** BF16/FP16
- **Compatible with:** vLLM 0.10+, NVIDIA NGC containers

## Usage with vLLM

```bash
vllm serve tbhot3ww/Qwen2.5-Coder-32B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```

## Requirements

- NVIDIA GPU with Hopper architecture or newer (compute capability ≥ 9.0)
- CUDA 12.0+
- vLLM 0.10 or newer with ModelOpt support

## Original Model

For architecture details, training data, intended use, and limitations, see the [original model card](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).

## Quantization Notes

This model uses NVIDIA's NVFP4 format, which provides 4-bit quantization with minimal quality degradation. Best performance is achieved on NVIDIA Hopper+ GPUs with native FP4 tensor core support.
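To get a feel for the footprint reduction, here is a rough back-of-the-envelope estimate of weight memory. The parameter count (~32.5B, inferred from the model name) and the NVFP4 scale overhead (approximated as one FP8 scale per 16-element block) are assumptions for illustration, not measured values:

```python
# Rough weight-memory estimate (assumptions: ~32.5B parameters,
# one FP8 block scale per 16 weights for NVFP4).
PARAMS = 32.5e9

def weight_gib(bits_per_param: float) -> float:
    """Weight memory in GiB for a given average bits per parameter."""
    return PARAMS * bits_per_param / 8 / 2**30

bf16 = weight_gib(16)            # original BF16/FP16 weights
nvfp4 = weight_gib(4 + 8 / 16)   # 4-bit values + amortized FP8 scales

print(f"BF16 : {bf16:.1f} GiB")
print(f"NVFP4: {nvfp4:.1f} GiB ({bf16 / nvfp4:.1f}x smaller)")
```

Actual usage will be somewhat higher at runtime (KV cache, activations, CUDA graphs), which is why the serve command above still benefits from `--gpu-memory-utilization 0.95`.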