---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: tencent/Hunyuan-A13B-Instruct
license: other
license_name: tencent-hunyuan-a13b
license_link: https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- conversational
- ik_llama.cpp
---
## `ik_llama.cpp` imatrix Quantizations of Hunyuan-A13B-Instruct
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants (see the quick sketch below).

Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP.

These quants provide best in class perplexity for the given memory footprint.
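
To try `ik_llama.cpp` with a GGUF you already have, a minimal sketch is below; the model path is just a placeholder for whatever third-party quant is already on your disk.

```bash
# Placeholder path: point this at any existing GGUF you already downloaded.
./build/bin/llama-server \
    --model /path/to/existing/Some-Model-Q4_K_M.gguf \
    -fa \
    --host 127.0.0.1 \
    --port 8080
```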
## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://discord.com/channels/1238219753324281886/1238239819017097246/1238676202357784650) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!
## Quants
#### `IQ3_KS` 34.088 GiB (3.642 BPW)
Special mix of `IQ4_KS` `ffn_down` and all-new `IQ3_KS` `ffn_(up|gate)` routed experts, with `iq6_k`/`iq5_k` for attention and the shared expert as shown in the recipe below. Test out `-rtr` to run-time-repack tensors to their `_r4` variants when running on CPU/RAM, which is likely faster at default ubatch sizes.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed!

It can even run on just 4GB VRAM with lower context and no extra offloaded layers, given enough system RAM (~32GiB); see the sketch just below.

With extra VRAM you can run more context or offload additional layers.
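
A rough sketch of that low-VRAM setup, reusing the flags from the Quick Start below but with a smaller context and no extra `ffn` layers pinned to the GPU; the context size and thread count are just examples to tune for your hardware.

```bash
# Low-VRAM launch: routed experts stay on CPU, only attn/shexp and a small KV cache on GPU.
./build/bin/llama-server \
    --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -fa -fmoe \
    -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 8192 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16
```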
<details>
<summary>👈 Secret Recipe</summary>

```bash
custom="
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k
# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k
# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks
# Token Embedding
token_embd\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
IQ3_KS \
24
```
</details>
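
The recipe consumes a pre-computed imatrix file. If you want to generate your own, something like the sketch below should work with the fork's `llama-imatrix` tool; the calibration text file is a placeholder, and this is not necessarily the exact command used for the `.dat` file referenced above.

```bash
# Compute an importance matrix from the BF16 GGUF.
# calibration.txt is a placeholder for your own calibration corpus.
./build/bin/llama-imatrix \
    -m /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    -f calibration.txt \
    -o imatrix-Hunyuan-A13B-Instruct-BF16.dat
```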
## Quick Start
#### 16GB VRAM + 24GB RAM Hybrid GPU+CPU Inference
```bash
# Trade off VRAM between longer context and more speed to suit your configuration.
./build/bin/llama-server \
--model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
--alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8083
```
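
Once the server is up, you can hit its OpenAI-compatible chat endpoint, for example like this (assuming the host, port, and alias from the command above):

```bash
# Simple smoke test against the running llama-server.
curl http://127.0.0.1:8083/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS",
          "messages": [{"role": "user", "content": "Hello, what can you do?"}],
          "temperature": 0.6
        }'
```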
## Perplexity
The perplexity on these Hunyuan-A13B-Instruct models seems really high compared to stuff I've seen before. Check out the mainline llama.cpp [PR14425](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3024357323) for more details.
* `IQ3_KS` 34.088 GiB (3.642 BPW) `Final estimate: PPL = 522.7473 +/- 5.68072`
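
If you want to measure perplexity yourself, a rough sketch with the fork's `llama-perplexity` tool is below; the evaluation text file is a placeholder, and I'm not claiming these are the exact settings behind the number above.

```bash
# Placeholder corpus: swap wiki.test.raw for whatever evaluation text you use.
./build/bin/llama-perplexity \
    -m /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -f wiki.test.raw \
    -fa \
    --threads 16
```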
## Speed
Used the built-in `llama-sweep-bench` tool to get example speeds across a variety of context lengths simulating multi-turn chats (N_KV is the kv-cache depth used for generation).
![llama-sweep-bench graph](images/sweep-bench.png "Chart showing how speed slows down as kv-cache size grows simulating longer multi-turn chats.")
## llama-sweep-bench
```bash
# Offload 15 total layers and increase ubatch from default of -ub 512 up to -ub 2048 for big PP!
export model=/mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
-ub 2048 -b 2048 \
-ot exps=CPU \
--threads 16 \
--warmup-batch
```
## *NOTE* Building Experimental PRs
This branch is based on currently unreleased PRs, so it is quite experimental. To build it before the PRs are merged, try something like this:
```bash
# get the code setup
cd projects
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git remote add ubergarm https://github.com/ubergarm/ik_llama.cpp
git fetch ubergarm
git checkout ug/hunyuan-moe-2
# build for CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# clean up later once things get merged into main
git checkout main
git branch -D ug/hunyuan-moe-2
```
## VRAM Estimations
Approximate total VRAM use by context length:
* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
## RoPE Considerations
The `--rope-freq-base` defaults to about 11 million (`11158840`), but it can be adjusted down to possibly better match shorter-context applications.
```bash
# adjust to 3 million
--rope-freq-base 3000000
```
Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), which you can experiment with.
## References
* [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
* [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
* [ik_llama.cpp IQ3_KS PR](https://github.com/ikawrakow/ik_llama.cpp/pull/566)