---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
tags:
- nvfp4
- quantized
- vllm
- hopper
- dgx
license: apache-2.0
---

# Qwen2.5-Coder-32B-Instruct-NVFP4

An NVFP4-quantized version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct), optimized for NVIDIA DGX/Hopper+ architectures (H100, H200, GB10, etc.).

## Quantization Details

- **Format:** NVFP4 (4-bit floating point)
- **Quantized using:** NVIDIA TensorRT Model Optimizer 0.35.0
- **Hardware:** 2× NVIDIA H200 SXM (188GB VRAM each)
- **Original precision:** BF16/FP16
- **Compatible with:** vLLM 0.10+, NVIDIA NGC containers

## Usage with vLLM

```bash
vllm serve tbhot3ww/Qwen2.5-Coder-32B-Instruct-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```

## Requirements

- NVIDIA GPU with Hopper architecture or newer (compute capability ≥ 9.0)
- CUDA 12.0+
- vLLM 0.10 or newer with ModelOpt support

## Original Model

For architecture details, training data, intended use, and limitations, see the [original model card](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).

## Quantization Notes

This model uses NVIDIA's NVFP4 format, which provides 4-bit quantization with minimal quality degradation. Best performance is achieved on NVIDIA Hopper+ GPUs with native FP4 tensor core support.
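To get a feel for the footprint reduction, here is a rough back-of-the-envelope estimate of weight memory. The parameter count (~32.5B, inferred from the model name) and the NVFP4 scale overhead (approximated as one FP8 scale per 16-element block) are assumptions for illustration, not measured values:

```python
# Rough weight-memory estimate (assumptions: ~32.5B parameters,
# one FP8 block scale per 16 weights for NVFP4).
PARAMS = 32.5e9

def weight_gib(bits_per_param: float) -> float:
    """Weight memory in GiB for a given average bits per parameter."""
    return PARAMS * bits_per_param / 8 / 2**30

bf16 = weight_gib(16)            # original BF16/FP16 weights
nvfp4 = weight_gib(4 + 8 / 16)   # 4-bit values + amortized FP8 scales

print(f"BF16 : {bf16:.1f} GiB")
print(f"NVFP4: {nvfp4:.1f} GiB ({bf16 / nvfp4:.1f}x smaller)")
```

Actual usage will be somewhat higher at runtime (KV cache, activations, CUDA graphs), which is why the serve command above still benefits from `--gpu-memory-utilization 0.95`.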