# Model Card for Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4

This repository contains a 4-bit integer (INT4) quantized version of the Qwen/Qwen3-Next-80B-A3B-Instruct model, optimized using the GPTQ method.

The primary goal of this quantization is to enable high-performance inference on AMD Instinct MI100 GPUs and other accelerators that lack native bfloat16 support.

To preserve the model's high accuracy, a selective quantization strategy was employed. Critical layers, including attention mechanisms, layer norms, and specific MLP components, were intentionally excluded from quantization and remain in their original float16 precision.

This targeted approach balances computational efficiency against output quality, making the model accessible on a wider range of hardware without a significant accuracy drop.
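The card does not prescribe an inference stack. Since the checkpoint keeps weights in compressed-tensors GPTQ format with float16 activations, one option is vLLM's offline API. A minimal sketch, assuming a recent vLLM build (ROCm or CUDA) with Qwen3-Next and compressed-tensors support, and a hypothetical 4-GPU node:

```python
# Minimal sketch, not part of the original card: assumes vLLM supports this
# checkpoint's architecture and quantization format on your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16",
    dtype="float16",          # keep activations in fp16 on GPUs without bfloat16
    tensor_parallel_size=4,   # assumption: adjust to the number of available GPUs
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["Briefly explain what GPTQ quantization does."], params)
print(outputs[0].outputs[0].text)
```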

## Evaluation

The quantized model's performance was validated on the MMLU Pro benchmark. The results confirm a negligible performance difference compared to the original unquantized model, highlighting the effectiveness of the selective quantization strategy.

### Performance Summary

| Model | MMLU Pro Score (exact_match) |
|---|---|
| This quantized model (GPTQ-Int4A16) | 0.7649 |
| jart25 Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ | 0.7635 |
| Intel AutoRound (int4-mixed) | 0.7630 |
| Original Qwen3-Next-80B-A3B-Instruct (FP16) | 0.7621 |

### Full MMLU Pro breakdown

Evaluation was run with lm-evaluation-harness against a locally served OpenAI-compatible completions endpoint, using the following command:

```bash
lm_eval \
  --model local-completions \
  --tasks mmlu_pro \
  --batch_size 1 \
  --model_args model=Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenizer=Qwen/Qwen3-Next-80B-A3B-Instruct,max_length=8192,max_gen_toks=4096 \
  -w -s --output_path Qwen3-Next-80B-A3B-Instruct-AWQ-4bit-res
```
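The command assumes an OpenAI-compatible /v1/completions server is already running on localhost:8000. A hedged sanity check of that endpoint (the prompt and served model name below are illustrative) might look like:

```python
# Hypothetical smoke test, not part of the original card: verifies that the
# completions endpoint lm_eval points at is reachable and returns text.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16",
        "prompt": "The capital of France is",
        "max_tokens": 8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```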

Results:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu_pro | 2.0 | custom-extract | | exact_match ↑ | 0.7649 | ± 0.0037 |
| - biology | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8898 | ± 0.0117 |
| - business | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8162 | ± 0.0138 |
| - chemistry | 2.1 | custom-extract | 5 | exact_match ↑ | 0.7915 | ± 0.0121 |
| - computer_science | 2.1 | custom-extract | 5 | exact_match ↑ | 0.7976 | ± 0.0199 |
| - economics | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8507 | ± 0.0123 |
| - engineering | 2.1 | custom-extract | 5 | exact_match ↑ | 0.5583 | ± 0.0160 |
| - health | 2.1 | custom-extract | 5 | exact_match ↑ | 0.7628 | ± 0.0149 |
| - history | 2.1 | custom-extract | 5 | exact_match ↑ | 0.6850 | ± 0.0238 |
| - law | 2.1 | custom-extract | 5 | exact_match ↑ | 0.5595 | ± 0.0150 |
| - math | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8712 | ± 0.0091 |
| - other | 2.1 | custom-extract | 5 | exact_match ↑ | 0.7522 | ± 0.0142 |
| - philosophy | 2.1 | custom-extract | 5 | exact_match ↑ | 0.7315 | ± 0.0199 |
| - physics | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8122 | ± 0.0108 |
| - psychology | 2.1 | custom-extract | 5 | exact_match ↑ | 0.8095 | ± 0.0139 |

## Quantization Script

The model was quantized using the llmcompressor library. The following script details the exact configuration used, including the layers that were ignored to maintain model quality.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

from pathlib import Path
import os


MODEL_ID = "/root/llamamodels/Qwen3-Next-80B-A3B-Instruct"
SAVE_DIR = "/root/llamamodels/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16"


model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 1024

DATASET_ID = "wikitext"
DATASET_NAME = "wikitext-2-raw-v1"
DATASET_SPLIT = "validation"

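# Load a small generic text corpus (wikitext-2 validation split) for GPTQ calibration.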
ds = load_dataset(DATASET_ID, DATASET_NAME, split=DATASET_SPLIT)

ds = ds.filter(lambda ex: ex.get("text", "").strip() != "")

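# Sample up to NUM_CALIBRATION_SAMPLES non-empty rows, with a fixed seed for reproducibility.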
n = min(NUM_CALIBRATION_SAMPLES, len(ds))
ds = ds.shuffle(seed=42).select(range(n))

# Render to chat-style text (batch)
def preprocess(batch):
    rendered = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": t}],
            tokenize=False,
        )
        for t in batch["text"]
    ]
    return {"text": rendered}

ds = ds.map(preprocess, batched=True, num_proc=4)

# Tokenize in batches
ds = ds.map(
    lambda batch: tokenizer(
        batch["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    batched=True,
    remove_columns=ds.column_names,
    num_proc=4,
)

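# GPTQ recipe: weight-only INT4 (W4A16), symmetric, group-wise quantization with
# group_size=32, applied to Linear layers except those listed in `ignore` below.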
recipe = [
    GPTQModifier(
        block_size=128,
        dampening_frac=0.01,
        config_groups={
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 4,
                    "type": "int",
                    "symmetric": True,
                    "strategy": "group",
                    "group_size": 32, 
                },
            }
        },

        # Keep these modules in their original FP16 precision: embeddings, attention,
        # norms, routers/gates, shared experts, MTP layers, and the LM head.
        ignore=[
            "model.embed_tokens",
            "re:.*input_layernorm$",
            "re:.*linear_attn.*",
            "re:.*norm.*",
            "re:.*RMSNorm.*",
            "re:.*rotary.*",
            "re:.*shared_expert.*",
            "re:.*shared_expert_gate$",
            "re:.*mlp[.]gate$",
            "re:.*router.*",
            "re:.*post_attention_layernorm$",
            "re:.*self_attn.*",
            "re:mtp.*",
            "lm_head"
        ],
    )
]


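# Run one-shot GPTQ calibration over the tokenized samples and save the compressed
# checkpoint to SAVE_DIR. calibrate_moe_context=True enables MoE-aware calibration
# so that expert layers also receive calibration activations.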
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
    output_dir=SAVE_DIR,
)

print("Saved to:", SAVE_DIR)
```
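After quantization, SAVE_DIR contains a compressed-tensors checkpoint. A minimal reload sketch, assuming a transformers version with the compressed-tensors package installed (the path and device mapping are illustrative):

```python
# Minimal sketch, not part of the original card: assumes `compressed-tensors`
# is installed alongside transformers; SAVE_DIR matches the script above.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

SAVE_DIR = "/root/llamamodels/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16"

model = AutoModelForCausalLM.from_pretrained(
    SAVE_DIR,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(SAVE_DIR, trust_remote_code=True)

inputs = tokenizer("Hello, world.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```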