Int4 quantization broken

#28
by TheBigBlockPC - opened

Due to some technical difficulties, loading the model in any int4/nf4 quantization doesn't work: it fills up all 48 GB of my VRAM on load even when quantized with bitsandbytes. Can anyone help here?
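
For reference, here is a minimal sketch of the kind of 4-bit bitsandbytes load being described, assuming the base repo is Qwen/Qwen3-Next-80B-A3B-Instruct (the unquantized counterpart of the exl3 link below) and that both GPUs are visible:

```python
# Minimal sketch of a bitsandbytes 4-bit load with transformers.
# Assumption: the base repo is Qwen/Qwen3-Next-80B-A3B-Instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4; "fp4" is the other option
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # second quantization pass on the scales
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard across both 24 GB GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```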

Try loading an exl3 in TabbyAPI if you want to squeeze it in 48GB: https://huggingface.co/turboderp/Qwen3-Next-80B-A3B-Instruct-exl3

The quantization loss is way lower than with bitsandbytes anyway, basically negligible at ~4.06 bpw.

How do I do that, and does it work on multi-GPU? Those 48 GB are two 24 GB GPUs. With transformers, the device map splits the model across the GPUs.

Yep, it works on multi-GPU, just install:

https://github.com/theroyallab/tabbyAPI

You can use the raw underlying library, exllamav3, via Python (like transformers) if that's what you'd prefer.
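
If you go the TabbyAPI route, once the exl3 model is loaded you talk to it over its OpenAI-compatible HTTP API. A minimal sketch with the openai Python client, assuming TabbyAPI's default port 5000 and a placeholder API key (both are assumptions, check your own TabbyAPI config):

```python
# Minimal sketch: querying a running TabbyAPI instance over its OpenAI-compatible API.
# Assumptions: TabbyAPI is listening on its default port 5000, the exl3 model is loaded,
# and the API key comes from your TabbyAPI token config (any string if auth is disabled).
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",  # assumed default TabbyAPI address
    api_key="YOUR_TABBY_API_KEY",         # assumed placeholder
)

resp = client.chat.completions.create(
    model="Qwen3-Next-80B-A3B-Instruct-exl3",  # whatever model name TabbyAPI reports
    messages=[{"role": "user", "content": "Hello from two 24 GB GPUs!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

TabbyAPI handles splitting the model across the two GPUs on its side (see the GPU split options in its config), so there is no device_map to set yourself.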
