Model Card for Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4
This repository contains a 4-bit integer (INT4) quantized version of the Qwen/Qwen3-Next-80B-A3B-Instruct model, optimized using the GPTQ method.
The primary goal of this quantization is to enable high-performance inference on AMD Instinct MI100 GPUs and other accelerators that may lack native bfloat16 support.
To preserve the model's high accuracy, a selective quantization strategy was employed. Critical layers, including attention mechanisms, layer norms, and specific MLP components, were intentionally excluded from quantization and remain in their original float16 precision.
This targeted approach balances computational efficiency against accuracy, making the model accessible on a wider range of hardware without a significant drop in quality.
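As a usage sketch, the checkpoint can be loaded with vLLM's offline API. The snippet below is illustrative, not a verified recipe: it assumes a vLLM build that supports the Qwen3-Next architecture and the compressed-tensors/GPTQ format produced by llmcompressor on your hardware (e.g. ROCm for MI100); adjust `dtype` and other flags as needed.

```python
# Sketch: offline inference with vLLM (flags are illustrative; verify that
# your vLLM + ROCm build supports Qwen3-Next and compressed-tensors INT4).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, dtype="float16", trust_remote_code=True)

# Apply the chat template so the instruct model sees a properly formatted prompt.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "In one sentence, what does GPTQ do?"}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```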
Evaluation
The quantized model's performance was validated on the MMLU Pro benchmark. The results confirm a negligible performance difference compared to the original unquantized model, highlighting the effectiveness of the selective quantization strategy.
Performance Summary
| Model | MMLU Pro Score (exact_match) |
|---|---|
| This Quantized Model (GPTQ-Int4A16) | 0.7649 |
| jart25 Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ | 0.7635 |
| Intel AutoRound (int4-mixed) | 0.7630 |
| Original Qwen3-Next-80B-A3B-Instruct (FP16) | 0.7621 |
Full MMLU Pro breakdown:
Evaluation was run with lm-evaluation-harness using the following command:

```bash
lm_eval \
  --model local-completions \
  --tasks mmlu_pro \
  --batch_size 1 \
  --model_args model=Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenizer=Qwen/Qwen3-Next-80B-A3B-Instruct,max_length=8192,max_gen_toks=4096 \
  -w -s \
  --output_path Qwen3-Next-80B-A3B-Instruct-AWQ-4bit-res
```
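The command above targets an OpenAI-compatible completions server on `localhost:8000` (e.g. one started with vLLM). As a quick sanity check before launching the harness, the endpoint can be probed with the `openai` Python client; this is a sketch, and the model name must match whatever name the server registered (here it mirrors the `--model_args` entry).

```python
# Probe the same /v1/completions endpoint that lm_eval will query.
# Assumes an OpenAI-compatible server is already running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16",
    prompt="The capital of France is",
    max_tokens=8,
    temperature=0.0,
)
print(resp.choices[0].text)
```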
Results:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu_pro | 2.0 | custom-extract | | exact_match | ↑ | 0.7649 | ± | 0.0037 |
| - biology | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8898 | ± | 0.0117 |
| - business | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8162 | ± | 0.0138 |
| - chemistry | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.7915 | ± | 0.0121 |
| - computer_science | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.7976 | ± | 0.0199 |
| - economics | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8507 | ± | 0.0123 |
| - engineering | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.5583 | ± | 0.0160 |
| - health | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.7628 | ± | 0.0149 |
| - history | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.6850 | ± | 0.0238 |
| - law | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.5595 | ± | 0.0150 |
| - math | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8712 | ± | 0.0091 |
| - other | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.7522 | ± | 0.0142 |
| - philosophy | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.7315 | ± | 0.0199 |
| - physics | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8122 | ± | 0.0108 |
| - psychology | 2.1 | custom-extract | 5 | exact_match | ↑ | 0.8095 | ± | 0.0139 |
Quantization Script
The model was quantized using the llmcompressor library. The following script details the exact configuration used, including the layers that were ignored to maintain model quality.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from pathlib import Path
import os

MODEL_ID = "/root/llamamodels/Qwen3-Next-80B-A3B-Instruct"
SAVE_DIR = "/root/llamamodels/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16"

# Load the full-precision model and tokenizer in float16.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration settings.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 1024
DATASET_ID = "wikitext"
DATASET_NAME = "wikitext-2-raw-v1"
DATASET_SPLIT = "validation"

# Build the calibration set: drop empty lines, then sample.
ds = load_dataset(DATASET_ID, DATASET_NAME, split=DATASET_SPLIT)
ds = ds.filter(lambda ex: ex.get("text", "").strip() != "")
n = min(NUM_CALIBRATION_SAMPLES, len(ds))
ds = ds.shuffle(seed=42).select(range(n))

# Render to chat-style text (batch)
def preprocess(batch):
    rendered = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": t}],
            tokenize=False,
        )
        for t in batch["text"]
    ]
    return {"text": rendered}

ds = ds.map(preprocess, batched=True, num_proc=4)

# Tokenize in batches
ds = ds.map(
    lambda batch: tokenizer(
        batch["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    batched=True,
    remove_columns=ds.column_names,
    num_proc=4,
)

# GPTQ recipe: 4-bit symmetric integer weights, group size 32, applied to Linear layers.
recipe = [
    GPTQModifier(
        block_size=128,
        dampening_frac=0.01,
        config_groups={
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 4,
                    "type": "int",
                    "symmetric": True,
                    "strategy": "group",
                    "group_size": 32,
                },
            }
        },
        # Keep sensitive modules (embeddings, norms, attention, routers,
        # shared experts, lm_head) unquantized in float16.
        ignore=[
            "model.embed_tokens",
            "re:.*input_layernorm$",
            "re:.*linear_attn.*",
            "re:.*norm.*",
            "re:.*RMSNorm.*",
            "re:.*rotary.*",
            "re:.*shared_expert.*",
            "re:.*shared_expert_gate$",
            "re:.*mlp[.]gate$",
            "re:.*router.*",
            "re:.*post_attention_layernorm$",
            "re:.*self_attn.*",
            "re:mtp.*",
            "lm_head",
        ],
    )
]

# Run one-shot calibration and write the compressed checkpoint.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
    output_dir=SAVE_DIR,
)
print("Saved to:", SAVE_DIR)
```