Add `--rope-freq-base` note and VRAM usage notes
README.md
Special mix of `IQ4_KS` `ffn_down` and all new `IQ3_KS` `ffn_(up|gate)` routed experts.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed!

It can even run on just 4GB VRAM with lower context and no extra offload layers, given enough system RAM (~32GiB).

With extra VRAM you can run more context or offload additional layers.
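As a concrete sketch of the offload strategy described above: the model filename, device name, tensor-name regex, and layer indices below are illustrative assumptions, not taken from this repo. The idea is to offload all layers with `-ngl`, pin a few routed-expert (`exps`) layers back onto the GPU with specific `--override-tensor` (`-ot`) patterns, and leave the remaining experts on CPU.

```shell
# Hypothetical launch sketch -- adjust model path, layer indices, and
# context size to your hardware; flags follow llama.cpp conventions.
./llama-server \
    --model Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -c 32768 \
    -ngl 99 \
    -ot "blk\.(0|1|2|3)\.ffn_.*_exps=CUDA0" \
    -ot exps=CPU
```

Earlier `-ot` patterns take precedence, so the specific `CUDA0` overrides must come before the catch-all `exps=CPU`.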
<details>
```
git checkout main
git branch -D merge-stuff-here
```

## VRAM Estimations

* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
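The q8_0 KV self size in the table above scales linearly with context (544 MiB per 8k tokens), so intermediate context sizes can be estimated with a one-liner. The helper name here is just illustrative:

```shell
# Estimate q8_0 KV cache size (MiB) for a given context length,
# assuming the linear scaling implied by the table: 544 MiB per 8192 tokens.
kv_cache_mib() {
    echo $(( $1 * 544 / 8192 ))
}

kv_cache_mib 32768    # -> 2176
kv_cache_mib 131072   # -> 8704
```

Note this covers only the KV cache; total VRAM in the table also includes weights and compute buffers, which do not scale the same way.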

## ROPE Considerations

The rope frequency base defaults to about 11 million (`11158840`), but it can be adjusted down, possibly to better match shorter-context applications.

```
# adjust down to 3 million
--rope-freq-base 3000000
```

Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), with which you can experiment.

## References

* [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
* [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)