---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2
license: mit
base_model_relation: quantized
tags:
- conversational
---

# Model Card

High-quality quantization of **MiniMax-M2**, produced without using an imatrix.

# Run

Currently `llama.cpp` does not return the `<think>` token for this model. If you know how to fix that, please share in the "Community" section!

As a workaround, to inject the `<think>` token in OpenWebUI, you can use [inject_think_token_filter.txt](https://huggingface.co/anikifoss/DeepSeek-V3.1-HQ4_K/blob/main/inject_think_token_filter.txt). You can add filters via `Admin Panel` -> `Functions` -> `Filter` -> `+ button on the right`.
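
If you want to confirm the behavior outside OpenWebUI, you can query `llama-server`'s OpenAI-compatible endpoint directly and look at the start of the reply. This is only a quick check, assuming one of the server commands below is running on `127.0.0.1:8090`:

```bash
# Print the first part of the model's reply; with the issue described above,
# it starts without the opening <think> tag.
curl -s http://127.0.0.1:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 2 + 2?"}]}' \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"][:200])'
```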

## llama.cpp - CPU experts offload

```
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=CPU" \
    -ot "blk\.([1-6][0-9])\.ffn_.*_exps.*=CPU" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
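
The `-ot`/`--override-tensor` flags each take a `regex=backend` pair and pin any tensor whose name matches the regular expression to that backend: here the attention tensors stay on `CUDA0`, the first few expert layers also go to `CUDA0`, and the remaining expert layers are kept on the CPU. As a rough way to see which block indices a pattern claims, you can replay the same expressions against generated tensor names (the index range 0-69 is just for illustration):

```bash
# Which expert blocks does the first CUDA0 override claim? (blocks 0-4)
for i in $(seq 0 69); do echo "blk.$i.ffn_gate_exps.weight"; done \
    | grep -E 'blk\.([0-4])\.ffn_.*_exps.*'
```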

## llama.cpp - MI50 experts offload

```
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(4[0-3])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(4[4-7])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(4[8-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[0-1])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[2-5])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(5[6-9])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.(6[0-9])\.ffn_.*_exps.*=CUDA0" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
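
With either configuration, once the model finishes loading you can run a quick smoke test against the local endpoint (host and port as configured above):

```bash
# Server health, plus the model list, which should show the --alias set above.
curl -s http://127.0.0.1:8090/health
curl -s http://127.0.0.1:8090/v1/models
```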

## Quantization Recipe

Quantized with [llama.cpp](https://github.com/ggml-org/llama.cpp). See the `Custom Quants` section in [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for all the quantization steps.
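
The command below expects a BF16 GGUF conversion of the base model as its input. If you are starting from the original safetensors release, a conversion step roughly like the following comes first (a sketch only: the paths are placeholders, and the sharded `-00001-of-00010` naming of the file referenced below also involves splitting the result):

```bash
# Hypothetical conversion of the HF checkpoint to a BF16 GGUF; adjust paths
# to your layout.
python3 convert_hf_to_gguf.py /mnt/data/Models/MiniMaxAI/MiniMax-M2 \
    --outtype bf16 \
    --outfile /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-BF16.gguf
```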

```bash
TARGET_MODEL="MiniMax-M2-HQ4_K"
mkdir -p ~/Env/models/anikifoss/$TARGET_MODEL
./build/bin/llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    --tensor-type attn_q=Q8_0 \
    --tensor-type attn_k=Q8_0 \
    --tensor-type attn_v=Q8_0 \
    --tensor-type ffn_down_exps=Q6_K \
    --tensor-type ffn_gate_exps=Q4_K \
    --tensor-type ffn_up_exps=Q4_K \
    /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-256x4.9B-BF16-00001-of-00010.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    Q8_0 \
    32
```
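
The quantize step writes a single GGUF, while the run commands above point at a `-00001-of-00004` sharded file. If you want the same layout, the output can be split afterwards; a sketch, with the shard size cap chosen arbitrarily:

```bash
# Hypothetical split into ~45 GB shards, producing
# MiniMax-M2-HQ4_K-00001-of-0000N.gguf style file names.
./build/bin/llama-gguf-split --split --split-max-size 45G \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL
```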