Int4 quantization broken
Due to some technical difficulties, loading the model in any int4/nf4 quantization doesn't work: it fills up all 48 GB of my VRAM on load, even when quantized with bitsandbytes. Can anyone help here?
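For reference, this is roughly the kind of load being described here, a minimal sketch assuming the standard transformers + bitsandbytes path (the model id and the exact 4-bit settings are guesses, since the original call isn't shown):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantize-on-load via bitsandbytes (NF4 settings are an assumption).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",  # base model implied by the exl3 repo linked below
    quantization_config=bnb_config,
    device_map="auto",  # shards the layers across both 24 GB GPUs
)
```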
Try loading an exl3 in TabbyAPI if you want to squeeze it in 48GB: https://huggingface.co/turboderp/Qwen3-Next-80B-A3B-Instruct-exl3
The quantization loss is way lower than with bitsandbytes anyway, basically negligible at ~4.06 bpw.
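If it helps, the quant can be pulled straight from that repo with huggingface_hub. A rough sketch, assuming the ~4.06 bpw variant sits on its own branch (check the repo for the exact revision name before copying this):

```python
from huggingface_hub import snapshot_download

# Download one exl3 bitrate; the revision/branch name here is an assumption,
# so check the repo's branch list for the exact label.
snapshot_download(
    repo_id="turboderp/Qwen3-Next-80B-A3B-Instruct-exl3",
    revision="4.06bpw",
    local_dir="models/Qwen3-Next-80B-A3B-Instruct-exl3",
)
```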
How do I do that, and does it work on multi-GPU? Those 48 GB are two 24 GB GPUs; on transformers the device map splits the model across the GPUs.
Yep, it works on multi-GPU, just install:
https://github.com/theroyallab/tabbyAPI
You can use the raw underlying library, exllamav3, via Python (like transformers) if that's what you'd prefer.
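Once TabbyAPI is running with the exl3 model loaded (how it's split across the two GPUs is set in its config.yml), it exposes an OpenAI-compatible endpoint, so you can call it from Python. A rough sketch, assuming the default local port of 5000 and whatever model name and API key your config ends up using:

```python
from openai import OpenAI

# TabbyAPI serves an OpenAI-compatible API; the port, key, and model name
# below are assumptions based on a default-style setup.
client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",
    api_key="YOUR_TABBY_API_KEY",
)

resp = client.chat.completions.create(
    model="Qwen3-Next-80B-A3B-Instruct-exl3",  # hypothetical: whatever name Tabby loaded
    messages=[{"role": "user", "content": "Hello from a 2x24 GB split!"}],
)
print(resp.choices[0].message.content)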