Int4 quantization broken

#28
by TheBigBlockPC - opened

Due to some technical difficulties, loading the model in any int4/nf4 quantization doesn't work: it fills up all 48 GB of my VRAM on load even when quantized with bitsandbytes. Can anyone help here?
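
For reference, here is a minimal sketch of the kind of 4-bit bitsandbytes load being described, assuming the base repo is Qwen/Qwen3-Next-80B-A3B-Instruct (the unquantized counterpart of the exl3 link below) and that both GPUs are visible:

```python
# Minimal sketch of a bitsandbytes 4-bit load with transformers.
# Assumption: the base repo is Qwen/Qwen3-Next-80B-A3B-Instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4; "fp4" is the other option
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # second quantization pass on the scales
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard across both 24 GB GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```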

Try loading an exl3 in TabbyAPI if you want to squeeze it in 48GB: https://huggingface.co/turboderp/Qwen3-Next-80B-A3B-Instruct-exl3

The quantization loss is way lower than with bitsandbytes anyway, basically negligible at ~4.06 bpw.

How do I do that, and does it work on multi-GPU? Those 48 GB are two 24 GB GPUs. With transformers, the device map splits the model across the GPUs.

Yep, it works on multi-GPU, just install:

https://github.com/theroyallab/tabbyAPI

You can use the raw underlying library, exllamav3, via Python (like transformers) if that's what you'd prefer.
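
If you go the TabbyAPI route, once the exl3 model is loaded you talk to it over its OpenAI-compatible HTTP API. A minimal sketch with the openai Python client, assuming TabbyAPI's default port 5000 and a placeholder API key (both are assumptions, check your own TabbyAPI config):

```python
# Minimal sketch: querying a running TabbyAPI instance over its OpenAI-compatible API.
# Assumptions: TabbyAPI is listening on its default port 5000, the exl3 model is loaded,
# and the API key comes from your TabbyAPI token config (any string if auth is disabled).
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",  # assumed default TabbyAPI address
    api_key="YOUR_TABBY_API_KEY",         # assumed placeholder
)

resp = client.chat.completions.create(
    model="Qwen3-Next-80B-A3B-Instruct-exl3",  # whatever model name TabbyAPI reports
    messages=[{"role": "user", "content": "Hello from two 24 GB GPUs!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

TabbyAPI handles splitting the model across the two GPUs on its side (see the GPU split options in its config), so there is no device_map to set yourself.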
