ubergarm committed
Commit d1f372b · 1 Parent(s): df3140a

Add `--rope-freq-base` note and VRAM usage notes

Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -30,6 +30,8 @@ Special mix `IQ4_KS` `ffn_down` and all new `IQ3_KS` `ffn_(up|gate)` routed expe
 
 With under 16GB VRAM and ~24GB RAM, fit 32k context and still offload 10 extra exps layers onto GPU for extra TG speed!
 
+Can even run on just 4GB VRAM with lower context and no extra offload layers, given enough system RAM (~32GiB).
+
 More context or offload additional layers with extra VRAM.
 
 <details>
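The "10 extra exps layers" mentioned in the hunk above are offloaded with tensor overrides. Below is a minimal sketch of such a launch, not taken from this model card: the model filename and thread count are placeholders, and the `-ot` regex assumes the usual llama.cpp/ik_llama.cpp tensor naming (`blk.N.ffn_*_exps`), so double-check it against your build.

```bash
# Sketch of a low-VRAM launch (names/paths are placeholders).
# -ngl 99 offloads every layer's non-expert tensors to the GPU;
# the first -ot additionally keeps the routed experts of layers 0-9
# on the GPU (the "10 extra exps layers"); the second -ot leaves the
# remaining routed experts in system RAM. q8_0 KV cache matches the
# VRAM estimates later in this commit.
./build/bin/llama-server \
    --model Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -fa -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "blk\.[0-9]\.ffn_.*_exps=CUDA0" \
    -ot exps=CPU \
    --threads 8
```

Dropping the first `-ot` and shrinking `-c` is the 4GB-VRAM direction; widening the layer range in that regex is the more-VRAM direction.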
 
@@ -150,6 +152,22 @@ git checkout main
 git branch -D merge-stuff-here
 ```
 
+## VRAM Estimations
+
+* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
+* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
+* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
+* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
+
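These KV figures scale linearly with context, which makes sizes not listed easy to interpolate. A worked check using only the numbers above: q8_0 K+V costs 544 MiB / 8192 tokens = 68 KiB per token, so

$$
68\,\mathrm{KiB/token} \times 32768\,\mathrm{tokens} = 2176\,\mathrm{MiB},
\qquad
68\,\mathrm{KiB/token} \times 65536\,\mathrm{tokens} = 4352\,\mathrm{MiB},
$$

matching the 32k and 64k rows. The non-KV remainder stays comparatively flat across the table (≈3246MiB at 8k up to ≈3754MiB at 256k).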
+## RoPE Considerations
+The `--rope-freq-base` defaults to about 11 million (`11158840`) but can be adjusted down to possibly better match shorter-context applications:
+```
+# adjust to 3 million
+--rope-freq-base 3000000
+```
+
+Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), with which you can experiment.
+
 ## References
 * [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
 * [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
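If you experiment with a lower base, it simply rides along on the normal launch command. A sketch (the flag itself is confirmed by the PRs above; the model path and context size are placeholders):

```bash
# Hypothetical short-context launch pairing a reduced RoPE base with
# a 32k window; tune the value per @kooshi's tip linked above.
./build/bin/llama-server \
    --model Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -c 32768 \
    --rope-freq-base 3000000
```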