---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2
license: mit
base_model_relation: quantized
tags:
- conversational
---
# Model Card
High-quality quantization of **MiniMax-M2** without using an imatrix.
# Run
Currently, `llama.cpp` does not return the opening `<think>` token for this model. If you know how to fix that, please share in the "Community" section!
As a workaround, you can inject the `<think>` token in OpenWebUI using the [inject_think_token_filter.txt](https://huggingface.co/anikifoss/DeepSeek-V3.1-HQ4_K/blob/main/inject_think_token_filter.txt) filter. Filters can be added via `Admin Panel` -> `Functions` -> `Filter` -> the `+` button on the right.
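For reference, the idea behind such a filter is small enough to sketch. The code below is only an illustration, assuming OpenWebUI's standard `Filter` function API (an `outlet` hook that runs on the finished response); use the linked file for the actual filter.
```python
from pydantic import BaseModel


class Filter:
    """Illustrative sketch only; use the linked filter file in OpenWebUI."""

    class Valves(BaseModel):
        priority: int = 0

    def __init__(self):
        self.valves = self.Valves()

    def outlet(self, body: dict, __user__: dict | None = None) -> dict:
        # llama.cpp drops the opening <think>, so the reasoning arrives as
        # "...reasoning...</think>answer" and the UI cannot fold it away.
        # Prepend <think> whenever a closing tag appears without an opener.
        for message in body.get("messages", []):
            if message.get("role") != "assistant":
                continue
            content = message.get("content") or ""
            if "</think>" in content and not content.lstrip().startswith("<think>"):
                message["content"] = "<think>" + content
        return body
```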
## llama.cpp - CPU experts offload
```bash
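# Attention tensors for all blocks, plus the first five blocks' experts,
# stay on CUDA0; the remaining expert tensors are routed to system RAM (CPU).
# Tensors not matched by any -ot pattern follow the -ngl 99 placement.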
./build/bin/llama-server \
--alias anikifoss/MiniMax-M2-HQ4_K \
--model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
--temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
--repeat-penalty 1.04 --repeat-last-n 256 \
--ctx-size 95000 \
-ctk q8_0 -ctv q8_0 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-4])\.ffn_.*_exps.*=CUDA0" \
-ot "blk\.([5-9])\.ffn_.*_exps.*=CPU" \
-ot "blk\.([1-6][0-9])\.ffn_.*_exps.*=CPU" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
```
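The `-ot` (`--override-tensor`) flags pin tensors to devices by matching regexes against GGUF tensor names (`blk.<n>.<name>.weight`). A quick way to sanity-check a pattern set is to replay it in Python; the sketch below assumes a first-match-wins reading of the override list:
```python
import re

# The -ot flags from the command above, in order. Assumption: llama.cpp
# checks overrides in the given order and the first regex that matches a
# tensor name wins; unmatched tensors follow the usual -ngl placement.
overrides = [
    (r"blk\.([0-9])\.attn_.*",            "CUDA0"),
    (r"blk\.([1-6][0-9])\.attn_.*",       "CUDA0"),
    (r"blk\.([0-4])\.ffn_.*_exps.*",      "CUDA0"),
    (r"blk\.([5-9])\.ffn_.*_exps.*",      "CPU"),
    (r"blk\.([1-6][0-9])\.ffn_.*_exps.*", "CPU"),
]

def placement(tensor_name: str) -> str:
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return "default (-ngl)"

# Spot-check a few GGUF tensor names.
for name in [
    "blk.3.attn_q.weight",         # attention        -> CUDA0
    "blk.3.ffn_gate_exps.weight",  # early experts    -> CUDA0
    "blk.42.ffn_up_exps.weight",   # later experts    -> CPU
    "blk.42.ffn_norm.weight",      # not expert FFN   -> default (-ngl)
]:
    print(f"{name:27} -> {placement(name)}")
```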
## llama.cpp - MI50 experts offload
```bash
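# All attention tensors go to CUDA0 (--tensor-split 1,0,0,0,0 also keeps
# non-overridden weights there); expert tensors are spread across the four
# MI50s (ROCm0-ROCm3), with the last blocks' experts back on CUDA0.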
./build/bin/llama-server \
--alias anikifoss/MiniMax-M2-HQ4_K \
--model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
--temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
--repeat-penalty 1.04 --repeat-last-n 256 \
--ctx-size 95000 \
-ctk q8_0 -ctv q8_0 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(4[0-3])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(4[4-7])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(4[8-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(5[0-1])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(5[2-5])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(5[6-9])\.ffn_.*_exps.*=CUDA0" \
-ot "blk\.(6[0-9])\.ffn_.*_exps.*=CUDA0" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
```
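With this many overlapping ranges it is easy to leave a gap, so it is worth replaying the expert mapping the same way. The sketch below scans block indices 0-69 (an upper bound that covers every pattern above; the real block count comes from the GGUF metadata) and reports where each block's experts land:
```python
import re
from collections import defaultdict

# Expert overrides from the command above, in order. Patterns that refer to
# blocks past the model's real count simply never match anything.
expert_overrides = [
    (r"blk\.([0-9])\.ffn_.*_exps.*",  "ROCm0"),
    (r"blk\.(1[0-9])\.ffn_.*_exps.*", "ROCm1"),
    (r"blk\.(2[0-9])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(3[0-9])\.ffn_.*_exps.*", "ROCm3"),
    (r"blk\.(4[0-3])\.ffn_.*_exps.*", "ROCm0"),
    (r"blk\.(4[4-7])\.ffn_.*_exps.*", "ROCm1"),
    (r"blk\.(4[8-9])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(5[0-1])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(5[2-5])\.ffn_.*_exps.*", "ROCm3"),
    (r"blk\.(5[6-9])\.ffn_.*_exps.*", "CUDA0"),
    (r"blk\.(6[0-9])\.ffn_.*_exps.*", "CUDA0"),
]

placement = defaultdict(list)
for i in range(70):
    name = f"blk.{i}.ffn_gate_exps.weight"
    device = next((d for p, d in expert_overrides if re.search(p, name)),
                  "UNMATCHED")
    placement[device].append(i)

for device in sorted(placement):
    blocks = placement[device]
    print(f"{device}: {len(blocks)} expert blocks -> {blocks}")
```
If all 70 block indices were present, each of the five devices would hold 14 expert blocks; any `UNMATCHED` entry would flag a gap in the ranges.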
## Quantization Recipe
Quantized with [llama.cpp](https://github.com/ggml-org/llama.cpp). See the `Custom Quants` section in [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for all the quantization steps.
```bash
TARGET_MODEL="MiniMax-M2-HQ4_K"
mkdir -p ~/Env/models/anikifoss/$TARGET_MODEL
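# Output/embedding and attention tensors are pinned to Q8_0, and the expert
# FFN tensors to Q6_K (down) / Q4_K (gate/up); the trailing positional
# arguments are the base quantization type (Q8_0) and the thread count (32).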
./build/bin/llama-quantize \
--output-tensor-type Q8_0 \
--token-embedding-type Q8_0 \
--tensor-type attn_q=Q8_0 \
--tensor-type attn_k=Q8_0 \
--tensor-type attn_v=Q8_0 \
--tensor-type ffn_down_exps=Q6_K \
--tensor-type ffn_gate_exps=Q4_K \
--tensor-type ffn_up_exps=Q4_K \
/mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-256x4.9B-BF16-00001-of-00010.gguf \
~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
Q8_0 \
32
```
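To confirm the recipe took effect, the per-tensor types in the output GGUF can be inspected with the `gguf` Python package that ships with llama.cpp (`pip install gguf`). A minimal sketch, with an illustrative local path:
```python
from collections import Counter
from gguf import GGUFReader

# Read the quantized model produced by the llama-quantize step above.
reader = GGUFReader("MiniMax-M2-HQ4_K.gguf")

# Tally tensors by quantization type: expect Q8_0 attention/embeddings,
# Q6_K ffn_down_exps, and Q4_K ffn_gate_exps / ffn_up_exps.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype}: {n} tensors")

# Spot-check an expert tensor against the recipe.
for t in reader.tensors:
    if t.name.endswith("ffn_down_exps.weight"):
        print(t.name, "->", t.tensor_type.name)  # should be Q6_K
        break
```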