---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2
license: mit
base_model_relation: quantized
tags:
- conversational
---

# Model Card

High-quality quantization of **MiniMax-M2**, produced without using an imatrix.

# Run

Currently `llama.cpp` does not return the `<think>` token for this model. If you know how to fix that, please share in the "Community" section!

As a workaround, to inject the `<think>` token in OpenWebUI, you can use [inject_think_token_filter.txt](https://huggingface.co/anikifoss/DeepSeek-V3.1-HQ4_K/blob/main/inject_think_token_filter.txt). You can add filters via `Admin Panel` -> `Functions` -> `Filter` -> `+ button on the right`.
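
If you want to confirm the behavior outside OpenWebUI, you can query `llama-server`'s OpenAI-compatible endpoint directly and look at the start of the reply. This is only a quick check, assuming one of the server commands below is running on `127.0.0.1:8090`:

```bash
# Print the first part of the model's reply; with the issue described above,
# it starts without the opening <think> tag.
curl -s http://127.0.0.1:8090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 2 + 2?"}]}' \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"][:200])'
```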

## llama.cpp - CPU experts offload

```
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=CPU" \
    -ot "blk\.([1-6][0-9])\.ffn_.*_exps.*=CPU" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
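
The `-ot`/`--override-tensor` flags each take a `regex=backend` pair and pin any tensor whose name matches the regular expression to that backend: here the attention tensors stay on `CUDA0`, the first few expert layers also go to `CUDA0`, and the remaining expert layers are kept on the CPU. As a rough way to see which block indices a pattern claims, you can replay the same expressions against generated tensor names (the index range 0-69 is just for illustration):

```bash
# Which expert blocks does the first CUDA0 override claim? (blocks 0-4)
for i in $(seq 0 69); do echo "blk.$i.ffn_gate_exps.weight"; done \
    | grep -E 'blk\.([0-4])\.ffn_.*_exps.*'
```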

## llama.cpp - MI50 experts offload

```
./build/bin/llama-server \
    --alias anikifoss/MiniMax-M2-HQ4_K \
    --model ~/Env/models/anikifoss/MiniMax-M2-HQ4_K/MiniMax-M2-HQ4_K-00001-of-00004.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.02 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --ctx-size 95000 \
    -ctk q8_0 -ctv q8_0 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-6][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(4[0-3])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(4[4-7])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(4[8-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[0-1])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[2-5])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(5[6-9])\.ffn_.*_exps.*=CUDA0" \
    -ot "blk\.(6[0-9])\.ffn_.*_exps.*=CUDA0" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090
```
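
With either configuration, once the model finishes loading you can run a quick smoke test against the local endpoint (host and port as configured above):

```bash
# Server health, plus the model list, which should show the --alias set above.
curl -s http://127.0.0.1:8090/health
curl -s http://127.0.0.1:8090/v1/models
```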

## Quantization Recipe

Quantized with [llama.cpp](https://github.com/ggml-org/llama.cpp). See the `Custom Quants` section in [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for all the quantization steps.
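
The command below expects a BF16 GGUF conversion of the base model as its input. If you are starting from the original safetensors release, a conversion step roughly like the following comes first (a sketch only: the paths are placeholders, and the sharded `-00001-of-00010` naming of the file referenced below also involves splitting the result):

```bash
# Hypothetical conversion of the HF checkpoint to a BF16 GGUF; adjust paths
# to your layout.
python3 convert_hf_to_gguf.py /mnt/data/Models/MiniMaxAI/MiniMax-M2 \
    --outtype bf16 \
    --outfile /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-BF16.gguf
```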

```bash
TARGET_MODEL="MiniMax-M2-HQ4_K"
mkdir -p ~/Env/models/anikifoss/$TARGET_MODEL
./build/bin/llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    --tensor-type attn_q=Q8_0 \
    --tensor-type attn_k=Q8_0 \
    --tensor-type attn_v=Q8_0 \
    --tensor-type ffn_down_exps=Q6_K \
    --tensor-type ffn_gate_exps=Q4_K \
    --tensor-type ffn_up_exps=Q4_K \
    /mnt/data/Models/MiniMaxAI/MiniMax-M2-GGUF/MiniMax-M2-256x4.9B-BF16-00001-of-00010.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    Q8_0 \
    32
```
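
The quantize step writes a single GGUF, while the run commands above point at a `-00001-of-00004` sharded file. If you want the same layout, the output can be split afterwards; a sketch, with the shard size cap chosen arbitrarily:

```bash
# Hypothetical split into ~45 GB shards, producing
# MiniMax-M2-HQ4_K-00001-of-0000N.gguf style file names.
./build/bin/llama-gguf-split --split --split-max-size 45G \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL.gguf \
    ~/Env/models/anikifoss/$TARGET_MODEL/$TARGET_MODEL
```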