---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2
license: mit
base_model_relation: quantized
tags:
- conversational
---

# Model Card

High-quality quantization of **MiniMax-M2** without using an imatrix.

# Run

Currently `llama.cpp` does not return the `<think>` token for this model. If you know how to fix that, please share in the "Community" section!

As a workaround, to inject the `<think>` token in OpenWebUI, you can use [inject_think_token_filter.txt](https://huggingface.co/anikifoss/DeepSeek-V3.1-HQ4_K/blob/main/inject_think_token_filter.txt). You can add filters via `Admin Panel` -> `Functions` -> `Filter` -> the `+` button on the right.
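
For reference, the workaround boils down to prepending the missing token on the response path. Below is a minimal Python sketch of that idea, using OpenWebUI's filter plugin structure (a `Filter` class with an `outlet` hook); it is an illustration only, not the linked filter itself:

```python
from typing import Optional

from pydantic import BaseModel


class Filter:
    """Sketch only: prepend <think> to the reply if the server omitted it."""

    class Valves(BaseModel):
        priority: int = 0  # controls filter ordering in OpenWebUI

    def __init__(self):
        self.valves = self.Valves()

    def outlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        # Walk messages from the end and patch the latest assistant reply.
        for message in reversed(body.get("messages", [])):
            if message.get("role") == "assistant":
                content = message.get("content") or ""
                if not content.lstrip().startswith("<think>"):
                    message["content"] = "<think>" + content
                break
        return body
```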


## llama.cpp - CPU experts offload

```bash
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=CPU" \
    -ot "blk\.([1-6][0-9])\.ffn_.*_exps.*=CPU" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
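
Once the server is up, you can sanity-check it through llama-server's OpenAI-compatible API. A minimal Python sketch, assuming the host, port, and alias from the command above:

```python
import requests

# Query the OpenAI-compatible chat endpoint exposed by llama-server.
resp = requests.post(
    "http://127.0.0.1:8090/v1/chat/completions",
    json={
        "model": "anikifoss/MiniMax-M2-HQ4_K",  # matches the --alias above
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```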

## llama.cpp - MI50 experts offload

```bash
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(4[0-3])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(4[4-7])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(4[8-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[0-1])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[2-5])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(5[6-9])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.(6[0-9])\.ffn_.*_exps.*=CUDA0" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
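
The expert overrides above split blocks across the MI50s and the CUDA card by block index. If you need to re-balance for a different GPU mix, a small script can expand the regexes into an explicit block-to-device map. A sketch (patterns copied from the command above; the layer count is an assumption, adjust it to the model):

```python
import re

# Expand the -ot expert overrides into a block-to-device map.
OVERRIDES = [
    (r"blk\.([0-9])\.ffn_.*_exps.*", "ROCm0"),
    (r"blk\.(1[0-9])\.ffn_.*_exps.*", "ROCm1"),
    (r"blk\.(2[0-9])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(3[0-9])\.ffn_.*_exps.*", "ROCm3"),
    (r"blk\.(4[0-3])\.ffn_.*_exps.*", "ROCm0"),
    (r"blk\.(4[4-7])\.ffn_.*_exps.*", "ROCm1"),
    (r"blk\.(4[8-9])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(5[0-1])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(5[2-5])\.ffn_.*_exps.*", "ROCm3"),
    (r"blk\.(5[6-9])\.ffn_.*_exps.*", "CUDA0"),
    (r"blk\.(6[0-9])\.ffn_.*_exps.*", "CUDA0"),
]

for blk in range(70):  # assumed layer count, adjust to the actual model
    name = f"blk.{blk}.ffn_down_exps.weight"
    device = next((dev for pat, dev in OVERRIDES if re.match(pat, name)), "unassigned")
    print(f"block {blk:2d} -> {device}")
```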

## Quantization Recipe

Quantized with [llama.cpp](https://github.com/ggml-org/llama.cpp). See the `Custom Quants` section in [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for all the quantization steps.

```bash
TARGET_MODEL="MiniMax-M2-HQ4_K"
mkdir -p ~/Env/models/anikifoss/$TARGET_MODEL
./build/bin/llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    --tensor-type attn_q=Q8_0 \
    --tensor-type attn_k=Q8_0 \
    --tensor-type attn_v=Q8_0 \
    --tensor-type ffn_down_exps=Q6_K \
    --tensor-type ffn_gate_exps=Q4_K \
    --tensor-type ffn_up_exps=Q4_K \
    /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-256x4.9B-BF16-00001-of-00010.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    Q8_0 \
    32
```
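
To double-check that the per-tensor overrides took effect, one option is to inspect the output file with the `gguf` Python package from llama.cpp's `gguf-py`. A minimal sketch, assuming that package is installed and the path is adjusted to your output file:

```python
from collections import Counter

from gguf import GGUFReader  # pip install gguf

# Tally quantization types per tensor suffix to confirm the overrides above.
reader = GGUFReader("MiniMax-M2-HQ4_K.gguf")  # adjust to your output path
counts = Counter()
for tensor in reader.tensors:
    suffix = tensor.name.split(".")[-2] if "." in tensor.name else tensor.name
    counts[(suffix, tensor.tensor_type.name)] += 1

for (suffix, qtype), n in sorted(counts.items()):
    print(f"{suffix:20s} {qtype:6s} {n}")
```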