Add `--rope-freq-base` note and VRAM usage notes
README.md
Special mix of `IQ4_KS` `ffn_down` and all new `IQ3_KS` `ffn_(up|gate)` routed experts.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed!

It can even run on just 4GB VRAM with lower context and no extra offload layers, given enough system RAM (~32GiB).

With extra VRAM you can run more context or offload additional layers.
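As a concrete sketch of the offload strategy described above: the model filename, device name, tensor-name regex, and layer indices below are illustrative assumptions, not taken from this repo. The idea is to offload all layers with `-ngl`, pin a few routed-expert (`exps`) layers back onto the GPU with specific `--override-tensor` (`-ot`) patterns, and leave the remaining experts on CPU.

```shell
# Hypothetical launch sketch -- adjust model path, layer indices, and
# context size to your hardware; flags follow llama.cpp conventions.
./llama-server \
    --model Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -c 32768 \
    -ngl 99 \
    -ot "blk\.(0|1|2|3)\.ffn_.*_exps=CUDA0" \
    -ot exps=CPU
```

Earlier `-ot` patterns take precedence, so the specific `CUDA0` overrides must come before the catch-all `exps=CPU`.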
<details>
```
git checkout main
git branch -D merge-stuff-here
```

## VRAM Estimations

* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
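The q8_0 KV self size in the table above scales linearly with context (544 MiB per 8k tokens), so intermediate context sizes can be estimated with a one-liner. The helper name here is just illustrative:

```shell
# Estimate q8_0 KV cache size (MiB) for a given context length,
# assuming the linear scaling implied by the table: 544 MiB per 8192 tokens.
kv_cache_mib() {
    echo $(( $1 * 544 / 8192 ))
}

kv_cache_mib 32768    # -> 2176
kv_cache_mib 131072   # -> 8704
```

Note this covers only the KV cache; total VRAM in the table also includes weights and compute buffers, which do not scale the same way.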

## ROPE Considerations

The rope frequency base defaults to about 11 million (`11158840`), but it can be adjusted down, possibly to better match shorter-context applications.

```
# adjust down to 3 million
--rope-freq-base 3000000
```

Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), with which you can experiment.

## References

* [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
* [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)