---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: tencent/Hunyuan-A13B-Instruct
license: other
license_name: tencent-hunyuan-a13b
license_link: https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- conversational
- ik_llama.cpp
---
## `ik_llama.cpp` imatrix Quantizations of Hunyuan-A13B-Instruct
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants (see the quick sketch below).

Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP.

These quants provide best in class perplexity for the given memory footprint.
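
To try `ik_llama.cpp` with a GGUF you already have, a minimal sketch is below; the model path is just a placeholder for whatever third-party quant is already on your disk.

```bash
# Placeholder path: point this at any existing GGUF you already downloaded.
./build/bin/llama-server \
    --model /path/to/existing/Some-Model-Q4_K_M.gguf \
    -fa \
    --host 127.0.0.1 \
    --port 8080
```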
## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://discord.com/channels/1238219753324281886/1238239819017097246/1238676202357784650) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!
## Quants
#### `IQ3_KS` 34.088 GiB (3.642 BPW)
Special mix of `IQ4_KS` `ffn_down` and all-new `IQ3_KS` `ffn_(up|gate)` routed experts, with `iq6_k`/`iq5_k` for attention and the shared expert as shown in the recipe below. Test out `-rtr` to run-time-repack tensors to their `_r4` variants when running on CPU/RAM, which is likely faster at default ubatch sizes.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed!

It can even run on just 4GB VRAM with lower context and no extra offloaded layers, given enough system RAM (~32GiB); see the sketch just below.

With extra VRAM you can run more context or offload additional layers.
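
A rough sketch of that low-VRAM setup, reusing the flags from the Quick Start below but with a smaller context and no extra `ffn` layers pinned to the GPU; the context size and thread count are just examples to tune for your hardware.

```bash
# Low-VRAM launch: routed experts stay on CPU, only attn/shexp and a small KV cache on GPU.
./build/bin/llama-server \
    --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -fa -fmoe \
    -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 8192 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16
```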
<details>
<summary>👈 Secret Recipe</summary>

```bash
custom="
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k
# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k
# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks
# Token Embedding
token_embd\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
IQ3_KS \
24
```
</details>
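
The recipe consumes a pre-computed imatrix file. If you want to generate your own, something like the sketch below should work with the fork's `llama-imatrix` tool; the calibration text file is a placeholder, and this is not necessarily the exact command used for the `.dat` file referenced above.

```bash
# Compute an importance matrix from the BF16 GGUF.
# calibration.txt is a placeholder for your own calibration corpus.
./build/bin/llama-imatrix \
    -m /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    -f calibration.txt \
    -o imatrix-Hunyuan-A13B-Instruct-BF16.dat
```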
## Quick Start
#### 16GB VRAM + 24GB RAM Hybrid GPU+CPU Inference
```bash
# Trade off VRAM between longer context and more speed to suit your configuration.
./build/bin/llama-server \
--model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
--alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8083
```
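
Once the server is up, you can hit its OpenAI-compatible chat endpoint, for example like this (assuming the host, port, and alias from the command above):

```bash
# Simple smoke test against the running llama-server.
curl http://127.0.0.1:8083/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS",
          "messages": [{"role": "user", "content": "Hello, what can you do?"}],
          "temperature": 0.6
        }'
```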
## Perplexity
The perplexity on these Hunyuan-A13B-Instruct models seems really high compared to stuff I've seen before. Check out the mainline llama.cpp [PR14425](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3024357323) for more details.
* `IQ3_KS` 34.088 GiB (3.642 BPW) `Final estimate: PPL = 522.7473 +/- 5.68072`
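
If you want to measure perplexity yourself, a rough sketch with the fork's `llama-perplexity` tool is below; the evaluation text file is a placeholder, and I'm not claiming these are the exact settings behind the number above.

```bash
# Placeholder corpus: swap wiki.test.raw for whatever evaluation text you use.
./build/bin/llama-perplexity \
    -m /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -f wiki.test.raw \
    -fa \
    --threads 16
```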
## Speed
Used the built-in `llama-sweep-bench` tool to get example speeds across a variety of context lengths simulating multi-turn chats (N_KV is the kv-cache depth used for generation).
![llama-sweep-bench graph](images/sweep-bench.png "Chart showing how speed slows down as kv-cache size grows simulating longer multi-turn chats.")
## llama-sweep-bench
```bash
# Offload 15 total layers and increase ubatch from default of -ub 512 up to -ub 2048 for big PP!
export model=/mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
-ub 2048 -b 2048 \
-ot exps=CPU \
--threads 16 \
--warmup-batch
```
## *NOTE* Building Experimental PRs
This branch is based on currently unreleased PRs, so it is quite experimental. To build it before the PRs are merged, try something like this:
```bash
# get the code setup
cd projects
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git remote add ubergarm https://github.com/ubergarm/ik_llama.cpp
git fetch ubergarm
git checkout ug/hunyuan-moe-2
# build for CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# clean up later once things get merged into main
git checkout main
git branch -D ug/hunyuan-moe-2
```
## VRAM Estimations
Approximate total VRAM use by context length:
* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
## RoPE Considerations
The `--rope-freq-base` defaults to about 11 million (`11158840`), but it can be adjusted down to possibly better match shorter-context applications.
```bash
# adjust to 3 million
--rope-freq-base 3000000
```
Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), which you can experiment with.
## References
* [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
* [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
* [ik_llama.cpp IQ3_KS PR](https://github.com/ikawrakow/ik_llama.cpp/pull/566)