Commit
·
7cc5b4b
1
Parent(s):
a22f244
Add GGUF models + tokenizer with LFS
Browse filesThis view is limited to 50 files because it contains too many changes.
See raw diff
- .gitattributes +2 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_code.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_general.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_math.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
*.gguf filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| granitehybrid 1B F16 | 2.72 GiB | 1.46 B | CUDA | 35 | pp8 | 358.01 ± 11.81 |
|
| 9 |
+
| granitehybrid 1B F16 | 2.72 GiB | 1.46 B | CUDA | 35 | tg128 | 48.37 ± 1.13 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_code.txt
ADDED
|
@@ -0,0 +1,188 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/granite-4.0-h-350m-unsloth-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: general.file_type u32 = 1
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.vocab_size u32 = 100352
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.rope.dimension_count u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.attention.scale f32 = 0.007812
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.embedding_scale f32 = 12.000000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.residual_scale f32 = 0.220000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.logit_scale f32 = 6.000000
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.conv_kernel u32 = 4
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.state_size u32 = 128
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.group_count u32 = 1
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.inner_size u32 = 3072
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.ssm.time_step_rank u32 = 48
|
| 46 |
+
llama_model_loader: - kv 35: granitehybrid.rope.scaling.finetuned bool = false
|
| 47 |
+
llama_model_loader: - kv 36: general.quantization_version u32 = 2
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.pre str = dbrx
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 100257
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 100257
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 100256
|
| 57 |
+
llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = false
|
| 58 |
+
llama_model_loader: - kv 47: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = F16
|
| 63 |
+
print_info: file size = 2.72 GiB (16.01 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 2789.26 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 623.82 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 623.82 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 110.8 ms
|
| 175 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.24 seconds per pass - ETA 0.90 minutes
|
| 177 |
+
[1]3.5997,[2]2.9321,[3]2.0553,[4]1.9020,[5]2.0632,[6]2.2706,[7]2.1718,[8]2.0536,[9]1.9211,[10]1.8087,[11]1.8035,[12]1.8198,[13]1.7581,[14]1.7486,[15]1.7831,[16]1.7376,[17]1.7184,[18]1.7342,[19]1.7069,[20]1.6839,[21]1.6613,[22]1.6510,[23]1.6746,[24]1.6583,[25]1.6721,[26]1.6479,[27]1.6362,[28]1.6319,[29]1.6657,[30]1.6743,[31]1.6716,[32]1.6553,[33]1.6739,[34]1.6701,[35]1.6574,[36]1.6809,[37]1.6863,[38]1.6850,[39]1.7030,[40]1.7017,[41]1.6967,[42]1.7128,[43]1.7178,[44]1.7102,
|
| 178 |
+
Final estimate: PPL = 1.7102 +/- 0.01357
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 567.13 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 50036.19 ms / 90112 tokens ( 0.56 ms per token, 1800.94 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 50953.47 ms / 90113 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20389 + (1142 = 623 + 17 + 500) + 2575 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22812 + ( 682 = 623 + 17 + 41) + 629 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 2856 = 2789 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_general.txt
ADDED
|
@@ -0,0 +1,188 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21576 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/granite-4.0-h-350m-unsloth-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: general.file_type u32 = 1
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.vocab_size u32 = 100352
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.rope.dimension_count u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.attention.scale f32 = 0.007812
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.embedding_scale f32 = 12.000000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.residual_scale f32 = 0.220000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.logit_scale f32 = 6.000000
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.conv_kernel u32 = 4
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.state_size u32 = 128
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.group_count u32 = 1
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.inner_size u32 = 3072
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.ssm.time_step_rank u32 = 48
|
| 46 |
+
llama_model_loader: - kv 35: granitehybrid.rope.scaling.finetuned bool = false
|
| 47 |
+
llama_model_loader: - kv 36: general.quantization_version u32 = 2
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.pre str = dbrx
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 100257
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 100257
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 100256
|
| 57 |
+
llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = false
|
| 58 |
+
llama_model_loader: - kv 47: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = F16
|
| 63 |
+
print_info: file size = 2.72 GiB (16.01 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 2789.26 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 623.82 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 623.82 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 40.179 ms
|
| 175 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.30 seconds per pass - ETA 0.30 minutes
|
| 177 |
+
[1]9.3539,[2]11.7830,[3]12.3936,[4]11.2324,[5]10.8874,[6]9.6544,[7]9.4520,[8]9.4373,[9]9.6404,[10]9.6361,[11]9.6130,[12]9.6803,[13]9.7402,[14]9.7879,
|
| 178 |
+
Final estimate: PPL = 9.7879 +/- 0.22713
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 575.85 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 15988.83 ms / 28672 tokens ( 0.56 ms per token, 1793.25 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 16255.53 ms / 28673 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20290 + (1142 = 623 + 17 + 500) + 2674 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22812 + ( 682 = 623 + 17 + 41) + 629 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 2856 = 2789 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_math.txt
ADDED
|
@@ -0,0 +1,188 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21675 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/granite-4.0-h-350m-unsloth-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: general.file_type u32 = 1
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.vocab_size u32 = 100352
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.rope.dimension_count u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.attention.scale f32 = 0.007812
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.embedding_scale f32 = 12.000000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.residual_scale f32 = 0.220000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.logit_scale f32 = 6.000000
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.conv_kernel u32 = 4
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.state_size u32 = 128
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.group_count u32 = 1
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.inner_size u32 = 3072
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.ssm.time_step_rank u32 = 48
|
| 46 |
+
llama_model_loader: - kv 35: granitehybrid.rope.scaling.finetuned bool = false
|
| 47 |
+
llama_model_loader: - kv 36: general.quantization_version u32 = 2
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.pre str = dbrx
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 100257
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 100257
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 100256
|
| 57 |
+
llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = false
|
| 58 |
+
llama_model_loader: - kv 47: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = F16
|
| 63 |
+
print_info: file size = 2.72 GiB (16.01 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 2789.26 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 623.82 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 623.82 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 33.106 ms
|
| 175 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.39 seconds per pass - ETA 0.33 minutes
|
| 177 |
+
[1]6.3437,[2]7.2979,[3]7.1882,[4]7.2946,[5]7.5740,[6]7.5546,[7]7.6173,[8]7.4384,[9]7.4731,[10]7.4798,[11]7.6959,[12]7.7258,[13]7.8297,[14]7.8256,[15]7.7678,
|
| 178 |
+
Final estimate: PPL = 7.7678 +/- 0.17135
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 551.04 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 17250.55 ms / 30720 tokens ( 0.56 ms per token, 1780.81 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 17665.29 ms / 30721 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20389 + (1142 = 623 + 17 + 500) + 2575 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22812 + ( 682 = 623 + 17 + 41) + 629 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 2856 = 2789 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| granitehybrid 1B MXFP4 MoE | 1.59 GiB | 1.46 B | CUDA | 35 | pp8 | 388.19 ± 7.36 |
|
| 9 |
+
| granitehybrid 1B MXFP4 MoE | 1.59 GiB | 1.46 B | CUDA | 35 | tg128 | 68.55 ± 1.11 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_code.txt
ADDED
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21418 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 5 tensors
|
| 61 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.59 GiB (9.35 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 961.78 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 333.89 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 333.89 MiB
|
| 140 |
+
...................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 89.525 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.09 seconds per pass - ETA 0.78 minutes
|
| 178 |
+
[1]3.5890,[2]2.9295,[3]2.0542,[4]1.9017,[5]2.0687,[6]2.2753,[7]2.1755,[8]2.0568,[9]1.9245,[10]1.8114,[11]1.8057,[12]1.8221,[13]1.7603,[14]1.7506,[15]1.7852,[16]1.7397,[17]1.7204,[18]1.7359,[19]1.7084,[20]1.6852,[21]1.6625,[22]1.6521,[23]1.6758,[24]1.6594,[25]1.6731,[26]1.6490,[27]1.6372,[28]1.6329,[29]1.6666,[30]1.6752,[31]1.6725,[32]1.6562,[33]1.6748,[34]1.6710,[35]1.6583,[36]1.6818,[37]1.6872,[38]1.6859,[39]1.7038,[40]1.7027,[41]1.6977,[42]1.7138,[43]1.7191,[44]1.7115,
|
| 179 |
+
Final estimate: PPL = 1.7115 +/- 0.01358
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 402.19 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 38904.77 ms / 90112 tokens ( 0.43 ms per token, 2316.22 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 39701.12 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20435 + ( 856 = 333 + 17 + 504) + 2815 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23102 + ( 392 = 333 + 17 + 41) + 628 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 1028 = 961 + 35 + 31 |
|
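Note on reproducing the perplexity figures above: the log does not record the command line that produced it, so the sketch below is an assumption reconstructed from the reported settings (n_ctx=2048, batch_size=2048, 20 offloaded layers); the model and corpus paths are illustrative placeholders, not paths recorded in the log.

    # hypothetical llama-perplexity invocation inferred from the logged parameters
    llama-perplexity \
      -m granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf \
      -f ppl_corpus_code.txt \
      -c 2048 -b 2048 -ngl 20
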
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_general.txt
ADDED
|
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21418 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 5 tensors
|
| 61 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.59 GiB (9.35 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 961.78 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 333.89 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 333.89 MiB
|
| 140 |
+
...................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 40.196 ms
|
| 176 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.09 seconds per pass - ETA 0.25 minutes
|
| 178 |
+
[1]9.3789,[2]11.7858,[3]12.3883,[4]11.2259,[5]10.8862,[6]9.6528,[7]9.4502,[8]9.4391,[9]9.6503,[10]9.6470,[11]9.6221,[12]9.6917,[13]9.7523,[14]9.7995,
|
| 179 |
+
Final estimate: PPL = 9.7995 +/- 0.22718
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 404.36 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 12463.41 ms / 28672 tokens ( 0.43 ms per token, 2300.49 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 12727.57 ms / 28673 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20418 + ( 856 = 333 + 17 + 504) + 2832 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23102 + ( 392 = 333 + 17 + 41) + 628 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 1028 = 961 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_math.txt
ADDED
|
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21427 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type f16: 5 tensors
|
| 61 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.59 GiB (9.35 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 961.78 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 333.89 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 333.89 MiB
|
| 140 |
+
...................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 493.00 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 34.715 ms
|
| 176 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.11 seconds per pass - ETA 0.27 minutes
|
| 178 |
+
[1]6.3835,[2]7.3279,[3]7.2145,[4]7.3284,[5]7.5988,[6]7.5840,[7]7.6419,[8]7.4622,[9]7.4973,[10]7.5028,[11]7.7198,[12]7.7505,[13]7.8540,[14]7.8509,[15]7.7957,
|
| 179 |
+
Final estimate: PPL = 7.7957 +/- 0.17223
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 406.93 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 13376.69 ms / 30720 tokens ( 0.44 ms per token, 2296.53 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 13652.77 ms / 30721 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20435 + ( 856 = 333 + 17 + 504) + 2815 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23102 + ( 392 = 333 + 17 + 41) + 628 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 1028 = 961 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 1B MXFP4 MoE | 1.37 GiB | 1.46 B | CUDA | 35 | pp8 | 525.20 ± 20.65 |
| granitehybrid 1B MXFP4 MoE | 1.37 GiB | 1.46 B | CUDA | 35 | tg128 | 99.16 ± 1.61 |

build: 92bb442ad (7040)
|
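Note on the table above: the pp8 row measures prompt processing over 8 tokens and the tg128 row measures generation of 128 tokens, with 35 layers offloaded. The exact llama-bench command is not recorded in the file; the line below is a sketch assuming the defaults were overridden to match the logged tests, with the model path as a placeholder.

    # hypothetical llama-bench invocation matching the pp8 / tg128 rows
    llama-bench -m granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf -p 8 -n 128 -ngl 35
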
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21435 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.37 GiB (8.07 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 744.00 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.65 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.65 MiB
|
| 140 |
+
................................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 89.308 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.02 seconds per pass - ETA 0.73 minutes
|
| 178 |
+
[1]3.6487,[2]2.9870,[3]2.0807,[4]1.9271,[5]2.0945,[6]2.3102,[7]2.2087,[8]2.0863,[9]1.9492,[10]1.8326,[11]1.8281,[12]1.8456,[13]1.7815,[14]1.7710,[15]1.8062,[16]1.7590,[17]1.7381,[18]1.7546,[19]1.7259,[20]1.7015,[21]1.6778,[22]1.6669,[23]1.6906,[24]1.6734,[25]1.6872,[26]1.6626,[27]1.6508,[28]1.6461,[29]1.6816,[30]1.6908,[31]1.6875,[32]1.6708,[33]1.6897,[34]1.6858,[35]1.6728,[36]1.6975,[37]1.7029,[38]1.7012,[39]1.7197,[40]1.7188,[41]1.7136,[42]1.7304,[43]1.7356,[44]1.7277,
|
| 179 |
+
Final estimate: PPL = 1.7277 +/- 0.01387
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 387.50 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 36913.11 ms / 90112 tokens ( 0.41 ms per token, 2441.19 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 37675.93 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20721 + ( 641 = 330 + 17 + 293) + 2743 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21435 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.37 GiB (8.07 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 744.00 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.65 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.65 MiB
|
| 140 |
+
................................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 39.237 ms
|
| 176 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.00 seconds per pass - ETA 0.23 minutes
|
| 178 |
+
[1]9.9811,[2]12.4003,[3]13.0837,[4]11.8937,[5]11.4955,[6]10.1422,[7]9.9378,[8]9.9357,[9]10.1743,[10]10.1919,[11]10.1823,[12]10.2650,[13]10.3273,[14]10.3755,
|
| 179 |
+
Final estimate: PPL = 10.3755 +/- 0.24363
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 382.48 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 11909.45 ms / 28672 tokens ( 0.42 ms per token, 2407.50 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 12162.50 ms / 28673 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20717 + ( 641 = 330 + 17 + 293) + 2747 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
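Perplexity is the exponential of the mean negative log-likelihood per token, so the "Final estimate" values in these logs can be converted to bits per token with a base-2 logarithm. A small sketch, using the 10.3755 estimate from the run above:

```bash
# Convert the logged PPL to bits per token: bits = log2(PPL).
awk 'BEGIN { ppl = 10.3755; print log(ppl) / log(2) }'   # ~3.38 bits/token
```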
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,189 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21439 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.37 GiB (8.07 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 744.00 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.65 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.65 MiB
|
| 140 |
+
................................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 35.938 ms
|
| 176 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.01 seconds per pass - ETA 0.25 minutes
|
| 178 |
+
[1]6.5861,[2]7.6075,[3]7.4918,[4]7.6401,[5]7.9208,[6]7.9382,[7]8.0244,[8]7.8266,[9]7.8840,[10]7.8881,[11]8.1037,[12]8.1331,[13]8.2330,[14]8.2185,[15]8.1524,
|
| 179 |
+
Final estimate: PPL = 8.1524 +/- 0.18323
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 380.14 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 12690.85 ms / 30720 tokens ( 0.41 ms per token, 2420.64 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 12956.49 ms / 30721 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20725 + ( 641 = 330 + 17 + 293) + 2739 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| granitehybrid 1B MXFP4 MoE | 1.39 GiB | 1.46 B | CUDA | 35 | pp8 | 482.34 ± 12.43 |
|
| 9 |
+
| granitehybrid 1B MXFP4 MoE | 1.39 GiB | 1.46 B | CUDA | 35 | tg128 | 97.99 ± 0.31 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
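The llamabench.txt table above (pp8 and tg128 rows at ngl 35) is the standard llama.cpp `llama-bench` output. A hypothetical invocation that would produce a table like it; the model path is a placeholder, and the flag values are taken from the table:

```bash
# pp8 = prompt processing with 8 tokens, tg128 = generation of 128 tokens, 35 GPU layers.
./llama-bench \
  -m granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf \
  -ngl 35 -p 8 -n 128
```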
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,189 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21443 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q5_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.39 GiB (8.18 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 762.93 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.93 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.93 MiB
|
| 140 |
+
..............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 300.06 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 89.095 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.00 seconds per pass - ETA 0.73 minutes
|
| 178 |
+
[1]3.6325,[2]2.9501,[3]2.0634,[4]1.9092,[5]2.0760,[6]2.2843,[7]2.1821,[8]2.0628,[9]1.9290,[10]1.8156,[11]1.8098,[12]1.8264,[13]1.7642,[14]1.7540,[15]1.7894,[16]1.7436,[17]1.7240,[18]1.7393,[19]1.7117,[20]1.6882,[21]1.6656,[22]1.6553,[23]1.6791,[24]1.6625,[25]1.6764,[26]1.6520,[27]1.6403,[28]1.6359,[29]1.6698,[30]1.6784,[31]1.6758,[32]1.6594,[33]1.6782,[34]1.6745,[35]1.6618,[36]1.6854,[37]1.6909,[38]1.6896,[39]1.7077,[40]1.7063,[41]1.7013,[42]1.7175,[43]1.7228,[44]1.7150,
|
| 179 |
+
Final estimate: PPL = 1.7150 +/- 0.01363
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 376.86 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 37442.82 ms / 90112 tokens ( 0.42 ms per token, 2406.66 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 38211.38 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20707 + ( 660 = 330 + 17 + 311) + 2739 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 617 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 829 = 762 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,189 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21443 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q5_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.39 GiB (8.18 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 762.93 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.93 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.93 MiB
|
| 140 |
+
..............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 8.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 300.06 MiB
llama_context: CUDA1 compute buffer size = 41.18 MiB
llama_context: CUDA_Host compute buffer size = 31.08 MiB
llama_context: graph nodes = 2295
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 37.695 ms
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.98 seconds per pass - ETA 0.22 minutes
[1]9.4100,[2]11.8769,[3]12.5023,[4]11.3636,[5]11.0278,[6]9.7925,[7]9.5719,[8]9.5587,[9]9.7727,[10]9.7651,[11]9.7437,[12]9.8114,[13]9.8784,[14]9.9180,
Final estimate: PPL = 9.9180 +/- 0.22995

llama_perf_context_print: load time = 388.94 ms
llama_perf_context_print: prompt eval time = 11862.45 ms / 28672 tokens ( 0.41 ms per token, 2417.04 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12114.22 ms / 28673 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20707 + ( 660 = 330 + 17 + 311) + 2739 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 617 |
llama_memory_breakdown_print: | - Host | 829 = 762 + 35 + 31 |

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt
ADDED
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21443 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q5_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.39 GiB (8.18 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 762.93 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 330.93 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 330.93 MiB
|
| 140 |
+
..............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 300.06 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 35.626 ms
|
| 176 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.00 seconds per pass - ETA 0.25 minutes
|
| 178 |
+
[1]6.4498,[2]7.3697,[3]7.2987,[4]7.4109,[5]7.6803,[6]7.6592,[7]7.7217,[8]7.5238,[9]7.5607,[10]7.5703,[11]7.7810,[12]7.8095,[13]7.9146,[14]7.9078,[15]7.8456,
|
| 179 |
+
Final estimate: PPL = 7.8456 +/- 0.17315
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 381.31 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 12840.41 ms / 30720 tokens ( 0.42 ms per token, 2392.45 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 13104.54 ms / 30721 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20705 + ( 660 = 330 + 17 + 311) + 2741 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 617 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 829 = 762 + 35 + 31 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 1B MXFP4 MoE | 1.41 GiB | 1.46 B | CUDA | 35 | pp8 | 461.17 ± 15.69 |
| granitehybrid 1B MXFP4 MoE | 1.41 GiB | 1.46 B | CUDA | 35 | tg128 | 94.60 ± 0.45 |

build: 92bb442ad (7040)

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q6_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.41 GiB (8.30 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 783.05 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 331.23 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 331.23 MiB
|
| 140 |
+
.............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 319.59 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 90.601 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.01 seconds per pass - ETA 0.73 minutes
|
| 178 |
+
[1]3.6055,[2]2.9400,[3]2.0590,[4]1.9064,[5]2.0699,[6]2.2777,[7]2.1777,[8]2.0587,[9]1.9261,[10]1.8128,[11]1.8080,[12]1.8250,[13]1.7629,[14]1.7530,[15]1.7874,[16]1.7418,[17]1.7224,[18]1.7381,[19]1.7105,[20]1.6871,[21]1.6643,[22]1.6541,[23]1.6778,[24]1.6614,[25]1.6753,[26]1.6511,[27]1.6394,[28]1.6351,[29]1.6690,[30]1.6777,[31]1.6749,[32]1.6584,[33]1.6770,[34]1.6731,[35]1.6604,[36]1.6839,[37]1.6892,[38]1.6877,[39]1.7058,[40]1.7045,[41]1.6995,[42]1.7157,[43]1.7206,[44]1.7129,
|
| 179 |
+
Final estimate: PPL = 1.7129 +/- 0.01361
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 376.36 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 37306.03 ms / 90112 tokens ( 0.41 ms per token, 2415.48 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 38066.99 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20685 + ( 680 = 331 + 17 + 330) + 2741 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 849 = 783 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q6_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.41 GiB (8.30 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 783.05 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 331.23 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 331.23 MiB
|
| 140 |
+
.............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 319.59 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 36.962 ms
|
| 176 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.01 seconds per pass - ETA 0.23 minutes
|
| 178 |
+
[1]9.4081,[2]11.8378,[3]12.4548,[4]11.3003,[5]10.9564,[6]9.7067,[7]9.5093,[8]9.4925,[9]9.7066,[10]9.7021,[11]9.6703,[12]9.7426,[13]9.8070,[14]9.8548,
|
| 179 |
+
Final estimate: PPL = 9.8548 +/- 0.22889
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 386.00 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 12052.86 ms / 28672 tokens ( 0.42 ms per token, 2378.85 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 12303.19 ms / 28673 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20685 + ( 680 = 331 + 17 + 330) + 2741 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 849 = 783 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q6_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 1.41 GiB (8.30 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 1536
|
| 74 |
+
print_info: n_embd_inp = 1536
|
| 75 |
+
print_info: n_layer = 40
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 128
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 128
|
| 82 |
+
print_info: n_embd_head_v = 128
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 6.0e+00
|
| 91 |
+
print_info: f_attn_scale = 7.8e-03
|
| 92 |
+
print_info: n_ff = 4096
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 3072
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 1B
|
| 112 |
+
print_info: model params = 1.46 B
|
| 113 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.220000
|
| 116 |
+
print_info: f_attention_scale = 0.007812
|
| 117 |
+
print_info: n_ff_shexp = 4096
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 783.05 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 331.23 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 331.23 MiB
|
| 140 |
+
.............................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 319.59 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 166 |
+
llama_context: graph nodes = 2295
|
| 167 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 34.185 ms
|
| 176 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 1.02 seconds per pass - ETA 0.25 minutes
|
| 178 |
+
[1]6.3807,[2]7.3648,[3]7.2555,[4]7.3656,[5]7.6303,[6]7.6066,[7]7.6630,[8]7.4811,[9]7.5137,[10]7.5220,[11]7.7362,[12]7.7656,[13]7.8704,[14]7.8679,[15]7.8134,
|
| 179 |
+
Final estimate: PPL = 7.8134 +/- 0.17245
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 385.18 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 13035.61 ms / 30720 tokens ( 0.42 ms per token, 2356.62 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 13301.03 ms / 30721 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20685 + ( 680 = 331 + 17 + 330) + 2741 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 849 = 783 + 35 + 31 |
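The bracketed values [1]6.3807 … [15]7.8134 above are cumulative perplexity estimates after each 2048-token chunk; the "Final estimate" line is simply the value after the last chunk, and lower is better. A minimal sketch of that arithmetic, assuming you already have per-token natural-log probabilities (the token_logprobs input here is hypothetical, not something the tool exposes directly):

```python
import math

def running_perplexity(token_logprobs, chunk_size=2048):
    """Cumulative perplexity after each chunk, mirroring the [i]x.xxxx log lines."""
    estimates = []
    total_nll = 0.0
    total_tokens = 0
    for start in range(0, len(token_logprobs), chunk_size):
        chunk = token_logprobs[start:start + chunk_size]
        total_nll += -sum(chunk)          # negative log-likelihood of this chunk
        total_tokens += len(chunk)
        estimates.append(math.exp(total_nll / total_tokens))
    return estimates                      # estimates[-1] ~ "Final estimate: PPL"
```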
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| granitehybrid 1B MXFP4 MoE | 1.45 GiB | 1.46 B | CUDA | 35 | pp8 | 465.33 ± 28.15 |
|
| 9 |
+
| granitehybrid 1B MXFP4 MoE | 1.45 GiB | 1.46 B | CUDA | 35 | tg128 | 86.79 ± 1.53 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
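To compare quants without eyeballing the tables, the two data rows llama-bench prints above can be scraped into a dict; a rough sketch (the pp*/tg* row format is taken from the output above, the path is a placeholder):

```python
import re
from pathlib import Path

# matches rows like: "| ... |  35 |   pp8 |   465.33 ± 28.15 |"
ROW = re.compile(r"\|\s*(?P<test>pp\d+|tg\d+)\s*\|\s*(?P<mean>[\d.]+)\s*±\s*(?P<std>[\d.]+)\s*\|\s*$")

def parse_llamabench(path):
    """Extract {test: (tokens/s mean, std dev)} from a llamabench.txt dump."""
    results = {}
    for line in Path(path).read_text().splitlines():
        m = ROW.search(line)
        if m:
            results[m.group("test")] = (float(m.group("mean")), float(m.group("std")))
    return results

# parse_llamabench(".../MXFP4_MOE-Q8/llamabench.txt")
# -> {"pp8": (465.33, 28.15), "tg128": (86.79, 1.53)}
```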
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_code.txt
ADDED
|
@@ -0,0 +1,188 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = MXFP4 MoE
|
| 63 |
+
print_info: file size = 1.45 GiB (8.51 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 819.75 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 331.78 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 331.78 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 355.19 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 89.074 ms
|
| 175 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.01 seconds per pass - ETA 0.73 minutes
|
| 177 |
+
[1]3.5958,[2]2.9356,[3]2.0568,[4]1.9046,[5]2.0700,[6]2.2774,[7]2.1770,[8]2.0582,[9]1.9256,[10]1.8125,[11]1.8067,[12]1.8233,[13]1.7615,[14]1.7517,[15]1.7864,[16]1.7407,[17]1.7212,[18]1.7365,[19]1.7090,[20]1.6858,[21]1.6630,[22]1.6528,[23]1.6765,[24]1.6601,[25]1.6738,[26]1.6497,[27]1.6380,[28]1.6337,[29]1.6676,[30]1.6764,[31]1.6736,[32]1.6572,[33]1.6758,[34]1.6720,[35]1.6593,[36]1.6827,[37]1.6881,[38]1.6867,[39]1.7048,[40]1.7036,[41]1.6986,[42]1.7149,[43]1.7200,[44]1.7124,
|
| 178 |
+
Final estimate: PPL = 1.7124 +/- 0.01360
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 388.43 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 37417.13 ms / 90112 tokens ( 0.42 ms per token, 2408.31 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 38182.48 ms / 90113 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20649 + ( 716 = 331 + 17 + 366) + 2741 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 886 = 819 + 35 + 31 |
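As a quick consistency check, the chunk count ties out with the perf counters in this log: 44 chunks at n_ctx = 2048 is exactly the 90112 tokens reported in the prompt-eval line. Trivially, with the numbers from this run:

```python
n_chunks, n_ctx = 44, 2048
assert n_chunks * n_ctx == 90112  # "prompt eval time ... / 90112 tokens"
```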
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_general.txt
ADDED
|
@@ -0,0 +1,188 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = MXFP4 MoE
|
| 63 |
+
print_info: file size = 1.45 GiB (8.51 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 819.75 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 331.78 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 331.78 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 355.19 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 40.656 ms
|
| 175 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.01 seconds per pass - ETA 0.23 minutes
|
| 177 |
+
[1]9.3811,[2]11.7839,[3]12.3785,[4]11.2284,[5]10.8948,[6]9.6655,[7]9.4622,[8]9.4428,[9]9.6568,[10]9.6556,[11]9.6279,[12]9.6975,[13]9.7591,[14]9.8094,
|
| 178 |
+
Final estimate: PPL = 9.8094 +/- 0.22771
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 394.84 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 12035.98 ms / 28672 tokens ( 0.42 ms per token, 2382.19 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 12290.61 ms / 28673 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20649 + ( 716 = 331 + 17 + 366) + 2741 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 886 = 819 + 35 + 31 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_math.txt
ADDED
|
@@ -0,0 +1,188 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21441 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 209 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = MXFP4 MoE
|
| 63 |
+
print_info: file size = 1.45 GiB (8.51 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 1536
|
| 73 |
+
print_info: n_embd_inp = 1536
|
| 74 |
+
print_info: n_layer = 40
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 128
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 128
|
| 81 |
+
print_info: n_embd_head_v = 128
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 6.0e+00
|
| 90 |
+
print_info: f_attn_scale = 7.8e-03
|
| 91 |
+
print_info: n_ff = 4096
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 3072
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 1B
|
| 111 |
+
print_info: model params = 1.46 B
|
| 112 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 113 |
+
print_info: f_embedding_scale = 12.000000
|
| 114 |
+
print_info: f_residual_scale = 0.220000
|
| 115 |
+
print_info: f_attention_scale = 0.007812
|
| 116 |
+
print_info: n_ff_shexp = 4096
|
| 117 |
+
print_info: vocab type = BPE
|
| 118 |
+
print_info: n_vocab = 100352
|
| 119 |
+
print_info: n_merges = 100000
|
| 120 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 121 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 124 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 125 |
+
print_info: LF token = 198 'Ċ'
|
| 126 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 127 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 128 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 129 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 130 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 131 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: max token length = 256
|
| 133 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 134 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 135 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 136 |
+
load_tensors: CPU_Mapped model buffer size = 819.75 MiB
|
| 137 |
+
load_tensors: CUDA0 model buffer size = 331.78 MiB
|
| 138 |
+
load_tensors: CUDA1 model buffer size = 331.78 MiB
|
| 139 |
+
...........................................................................................
|
| 140 |
+
llama_context: constructing llama_context
|
| 141 |
+
llama_context: n_seq_max = 1
|
| 142 |
+
llama_context: n_ctx = 2048
|
| 143 |
+
llama_context: n_ctx_seq = 2048
|
| 144 |
+
llama_context: n_batch = 2048
|
| 145 |
+
llama_context: n_ubatch = 512
|
| 146 |
+
llama_context: causal_attn = 1
|
| 147 |
+
llama_context: flash_attn = auto
|
| 148 |
+
llama_context: kv_unified = false
|
| 149 |
+
llama_context: freq_base = 10000.0
|
| 150 |
+
llama_context: freq_scale = 1
|
| 151 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 152 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 153 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 154 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 157 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 158 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 160 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 161 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 162 |
+
llama_context: CUDA0 compute buffer size = 355.19 MiB
|
| 163 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 164 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 165 |
+
llama_context: graph nodes = 2295
|
| 166 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 167 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 168 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 169 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 170 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 171 |
+
|
| 172 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 173 |
+
perplexity: tokenizing the input ..
|
| 174 |
+
perplexity: tokenization took 33.485 ms
|
| 175 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 176 |
+
perplexity: 1.02 seconds per pass - ETA 0.25 minutes
|
| 177 |
+
[1]6.3830,[2]7.3477,[3]7.2474,[4]7.3577,[5]7.6232,[6]7.6053,[7]7.6579,[8]7.4769,[9]7.5110,[10]7.5193,[11]7.7285,[12]7.7605,[13]7.8664,[14]7.8635,[15]7.8061,
|
| 178 |
+
Final estimate: PPL = 7.8061 +/- 0.17253
|
| 179 |
+
|
| 180 |
+
llama_perf_context_print: load time = 391.84 ms
|
| 181 |
+
llama_perf_context_print: prompt eval time = 12901.65 ms / 30720 tokens ( 0.42 ms per token, 2381.09 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 13167.74 ms / 30721 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20649 + ( 716 = 331 + 17 + 366) + 2741 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 390 = 331 + 17 + 41) + 617 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 886 = 819 + 35 + 31 |
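Whether two of these final estimates really differ is easiest to judge from the quoted +/- terms. Treating each +/- as roughly one standard error (an assumption about how the tool reports it, not something stated in the logs), the two math-corpus runs in this commit are statistically indistinguishable:

```python
import math

def ppl_gap_in_sigmas(ppl_a, err_a, ppl_b, err_b):
    """Rough significance of a perplexity difference, assuming independent error bars."""
    return abs(ppl_a - ppl_b) / math.sqrt(err_a ** 2 + err_b ** 2)

# 7.8061 +/- 0.17253 vs 7.8134 +/- 0.17245 (math-corpus logs above) -> ~0.03 sigma
print(ppl_gap_in_sigmas(7.8061, 0.17253, 7.8134, 0.17245))
```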
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| granitehybrid 1B MXFP4 MoE | 1.37 GiB | 1.46 B | CUDA | 35 | pp8 | 543.30 ± 9.48 |
|
| 9 |
+
| granitehybrid 1B MXFP4 MoE | 1.37 GiB | 1.46 B | CUDA | 35 | tg128 | 101.51 ± 1.08 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
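Against the Q8 run earlier in this commit (465.33 t/s pp8, 86.79 t/s tg128), this smaller 1.37 GiB variant is noticeably faster; the percentage change, using the numbers from the two llamabench.txt files:

```python
def speedup_pct(new, old):
    """Percent throughput change between two llama-bench t/s readings."""
    return 100.0 * (new - old) / old

print(f"pp8:   {speedup_pct(543.30, 465.33):+.1f}%")   # about +17%
print(f"tg128: {speedup_pct(101.51, 86.79):+.1f}%")    # about +17%
```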
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,190 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21439 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 62 |
+
llama_model_loader: - type mxfp4: 4 tensors
|
| 63 |
+
print_info: file format = GGUF V3 (latest)
|
| 64 |
+
print_info: file type = MXFP4 MoE
|
| 65 |
+
print_info: file size = 1.37 GiB (8.06 BPW)
|
| 66 |
+
load: printing all EOG tokens:
|
| 67 |
+
load: - 100257 ('<|end_of_text|>')
|
| 68 |
+
load: - 100261 ('<|fim_pad|>')
|
| 69 |
+
load: special tokens cache size = 96
|
| 70 |
+
load: token to piece cache size = 0.6152 MB
|
| 71 |
+
print_info: arch = granitehybrid
|
| 72 |
+
print_info: vocab_only = 0
|
| 73 |
+
print_info: n_ctx_train = 1048576
|
| 74 |
+
print_info: n_embd = 1536
|
| 75 |
+
print_info: n_embd_inp = 1536
|
| 76 |
+
print_info: n_layer = 40
|
| 77 |
+
print_info: n_head = 12
|
| 78 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 79 |
+
print_info: n_rot = 128
|
| 80 |
+
print_info: n_swa = 0
|
| 81 |
+
print_info: is_swa_any = 0
|
| 82 |
+
print_info: n_embd_head_k = 128
|
| 83 |
+
print_info: n_embd_head_v = 128
|
| 84 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 87 |
+
print_info: f_norm_eps = 0.0e+00
|
| 88 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 89 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 90 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 91 |
+
print_info: f_logit_scale = 6.0e+00
|
| 92 |
+
print_info: f_attn_scale = 7.8e-03
|
| 93 |
+
print_info: n_ff = 4096
|
| 94 |
+
print_info: n_expert = 0
|
| 95 |
+
print_info: n_expert_used = 0
|
| 96 |
+
print_info: n_expert_groups = 0
|
| 97 |
+
print_info: n_group_used = 0
|
| 98 |
+
print_info: causal attn = 1
|
| 99 |
+
print_info: pooling type = 0
|
| 100 |
+
print_info: rope type = 0
|
| 101 |
+
print_info: rope scaling = linear
|
| 102 |
+
print_info: freq_base_train = 10000.0
|
| 103 |
+
print_info: freq_scale_train = 1
|
| 104 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 105 |
+
print_info: rope_finetuned = unknown
|
| 106 |
+
print_info: ssm_d_conv = 4
|
| 107 |
+
print_info: ssm_d_inner = 3072
|
| 108 |
+
print_info: ssm_d_state = 128
|
| 109 |
+
print_info: ssm_dt_rank = 48
|
| 110 |
+
print_info: ssm_n_group = 1
|
| 111 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 112 |
+
print_info: model type = 1B
|
| 113 |
+
print_info: model params = 1.46 B
|
| 114 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 115 |
+
print_info: f_embedding_scale = 12.000000
|
| 116 |
+
print_info: f_residual_scale = 0.220000
|
| 117 |
+
print_info: f_attention_scale = 0.007812
|
| 118 |
+
print_info: n_ff_shexp = 4096
|
| 119 |
+
print_info: vocab type = BPE
|
| 120 |
+
print_info: n_vocab = 100352
|
| 121 |
+
print_info: n_merges = 100000
|
| 122 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 125 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 126 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 127 |
+
print_info: LF token = 198 'Ċ'
|
| 128 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 129 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 130 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 131 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 133 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 134 |
+
print_info: max token length = 256
|
| 135 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 136 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 137 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 138 |
+
load_tensors: CPU_Mapped model buffer size = 743.85 MiB
|
| 139 |
+
load_tensors: CUDA0 model buffer size = 330.58 MiB
|
| 140 |
+
load_tensors: CUDA1 model buffer size = 330.58 MiB
|
| 141 |
+
................................................................................................
|
| 142 |
+
llama_context: constructing llama_context
|
| 143 |
+
llama_context: n_seq_max = 1
|
| 144 |
+
llama_context: n_ctx = 2048
|
| 145 |
+
llama_context: n_ctx_seq = 2048
|
| 146 |
+
llama_context: n_batch = 2048
|
| 147 |
+
llama_context: n_ubatch = 512
|
| 148 |
+
llama_context: causal_attn = 1
|
| 149 |
+
llama_context: flash_attn = auto
|
| 150 |
+
llama_context: kv_unified = false
|
| 151 |
+
llama_context: freq_base = 10000.0
|
| 152 |
+
llama_context: freq_scale = 1
|
| 153 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 154 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 155 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 158 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 159 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 162 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 163 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 164 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 165 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 166 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 167 |
+
llama_context: graph nodes = 2295
|
| 168 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 169 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 170 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 171 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 172 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 173 |
+
|
| 174 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 175 |
+
perplexity: tokenizing the input ..
|
| 176 |
+
perplexity: tokenization took 89.229 ms
|
| 177 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 178 |
+
perplexity: 1.01 seconds per pass - ETA 0.73 minutes
|
| 179 |
+
[1]3.6742,[2]3.0031,[3]2.0894,[4]1.9272,[5]2.0924,[6]2.3091,[7]2.2082,[8]2.0858,[9]1.9490,[10]1.8330,[11]1.8288,[12]1.8458,[13]1.7816,[14]1.7707,[15]1.8059,[16]1.7588,[17]1.7382,[18]1.7548,[19]1.7262,[20]1.7016,[21]1.6783,[22]1.6675,[23]1.6913,[24]1.6744,[25]1.6881,[26]1.6638,[27]1.6522,[28]1.6475,[29]1.6829,[30]1.6921,[31]1.6886,[32]1.6718,[33]1.6908,[34]1.6868,[35]1.6736,[36]1.6982,[37]1.7038,[38]1.7019,[39]1.7205,[40]1.7197,[41]1.7147,[42]1.7318,[43]1.7370,[44]1.7291,
|
| 180 |
+
Final estimate: PPL = 1.7291 +/- 0.01387
|
| 181 |
+
|
| 182 |
+
llama_perf_context_print: load time = 373.08 ms
|
| 183 |
+
llama_perf_context_print: prompt eval time = 37165.47 ms / 90112 tokens ( 0.41 ms per token, 2424.62 tokens per second)
|
| 184 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 185 |
+
llama_perf_context_print: total time = 37931.58 ms / 90113 tokens
|
| 186 |
+
llama_perf_context_print: graphs reused = 0
|
| 187 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20838 + ( 641 = 330 + 17 + 293) + 2626 |
|
| 189 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 190 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
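Note on reading the run above: the bracketed values ([1]…[44]) are running perplexity estimates printed after each 2048-token chunk, and the last one is what the tool reports as the final estimate. The following is a minimal sketch, not part of llama.cpp or this repo, assuming each printed value is the cumulative exp(mean negative log-likelihood) over the chunks processed so far; it converts a few of the running values back into NLL and bits per token:

import math

# Last three running estimates copied from the log above ([42]-[44]).
running_ppl = [1.7318, 1.7370, 1.7291]

for chunk, ppl in enumerate(running_ppl, start=42):
    mean_nll = math.log(ppl)                 # average NLL in nats per token
    bits_per_token = mean_nll / math.log(2)  # same quantity in bits per token
    print(f"[{chunk}] PPL={ppl:.4f}  NLL={mean_nll:.4f} nats/token  ({bits_per_token:.4f} bits/token)")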
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,190 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21439 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 62 |
+
llama_model_loader: - type mxfp4: 4 tensors
|
| 63 |
+
print_info: file format = GGUF V3 (latest)
|
| 64 |
+
print_info: file type = MXFP4 MoE
|
| 65 |
+
print_info: file size = 1.37 GiB (8.06 BPW)
|
| 66 |
+
load: printing all EOG tokens:
|
| 67 |
+
load: - 100257 ('<|end_of_text|>')
|
| 68 |
+
load: - 100261 ('<|fim_pad|>')
|
| 69 |
+
load: special tokens cache size = 96
|
| 70 |
+
load: token to piece cache size = 0.6152 MB
|
| 71 |
+
print_info: arch = granitehybrid
|
| 72 |
+
print_info: vocab_only = 0
|
| 73 |
+
print_info: n_ctx_train = 1048576
|
| 74 |
+
print_info: n_embd = 1536
|
| 75 |
+
print_info: n_embd_inp = 1536
|
| 76 |
+
print_info: n_layer = 40
|
| 77 |
+
print_info: n_head = 12
|
| 78 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 79 |
+
print_info: n_rot = 128
|
| 80 |
+
print_info: n_swa = 0
|
| 81 |
+
print_info: is_swa_any = 0
|
| 82 |
+
print_info: n_embd_head_k = 128
|
| 83 |
+
print_info: n_embd_head_v = 128
|
| 84 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 87 |
+
print_info: f_norm_eps = 0.0e+00
|
| 88 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 89 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 90 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 91 |
+
print_info: f_logit_scale = 6.0e+00
|
| 92 |
+
print_info: f_attn_scale = 7.8e-03
|
| 93 |
+
print_info: n_ff = 4096
|
| 94 |
+
print_info: n_expert = 0
|
| 95 |
+
print_info: n_expert_used = 0
|
| 96 |
+
print_info: n_expert_groups = 0
|
| 97 |
+
print_info: n_group_used = 0
|
| 98 |
+
print_info: causal attn = 1
|
| 99 |
+
print_info: pooling type = 0
|
| 100 |
+
print_info: rope type = 0
|
| 101 |
+
print_info: rope scaling = linear
|
| 102 |
+
print_info: freq_base_train = 10000.0
|
| 103 |
+
print_info: freq_scale_train = 1
|
| 104 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 105 |
+
print_info: rope_finetuned = unknown
|
| 106 |
+
print_info: ssm_d_conv = 4
|
| 107 |
+
print_info: ssm_d_inner = 3072
|
| 108 |
+
print_info: ssm_d_state = 128
|
| 109 |
+
print_info: ssm_dt_rank = 48
|
| 110 |
+
print_info: ssm_n_group = 1
|
| 111 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 112 |
+
print_info: model type = 1B
|
| 113 |
+
print_info: model params = 1.46 B
|
| 114 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 115 |
+
print_info: f_embedding_scale = 12.000000
|
| 116 |
+
print_info: f_residual_scale = 0.220000
|
| 117 |
+
print_info: f_attention_scale = 0.007812
|
| 118 |
+
print_info: n_ff_shexp = 4096
|
| 119 |
+
print_info: vocab type = BPE
|
| 120 |
+
print_info: n_vocab = 100352
|
| 121 |
+
print_info: n_merges = 100000
|
| 122 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 125 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 126 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 127 |
+
print_info: LF token = 198 'Ċ'
|
| 128 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 129 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 130 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 131 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 133 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 134 |
+
print_info: max token length = 256
|
| 135 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 136 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 137 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 138 |
+
load_tensors: CPU_Mapped model buffer size = 743.85 MiB
|
| 139 |
+
load_tensors: CUDA0 model buffer size = 330.58 MiB
|
| 140 |
+
load_tensors: CUDA1 model buffer size = 330.58 MiB
|
| 141 |
+
................................................................................................
|
| 142 |
+
llama_context: constructing llama_context
|
| 143 |
+
llama_context: n_seq_max = 1
|
| 144 |
+
llama_context: n_ctx = 2048
|
| 145 |
+
llama_context: n_ctx_seq = 2048
|
| 146 |
+
llama_context: n_batch = 2048
|
| 147 |
+
llama_context: n_ubatch = 512
|
| 148 |
+
llama_context: causal_attn = 1
|
| 149 |
+
llama_context: flash_attn = auto
|
| 150 |
+
llama_context: kv_unified = false
|
| 151 |
+
llama_context: freq_base = 10000.0
|
| 152 |
+
llama_context: freq_scale = 1
|
| 153 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 154 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 155 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 158 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 159 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 162 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 163 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 164 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 165 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 166 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 167 |
+
llama_context: graph nodes = 2295
|
| 168 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 169 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 170 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 171 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 172 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 173 |
+
|
| 174 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 175 |
+
perplexity: tokenizing the input ..
|
| 176 |
+
perplexity: tokenization took 38.248 ms
|
| 177 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 178 |
+
perplexity: 1.02 seconds per pass - ETA 0.23 minutes
|
| 179 |
+
[1]10.0468,[2]12.4641,[3]13.1197,[4]11.8947,[5]11.4919,[6]10.1432,[7]9.9371,[8]9.9438,[9]10.1838,[10]10.2043,[11]10.1841,[12]10.2615,[13]10.3193,[14]10.3658,
|
| 180 |
+
Final estimate: PPL = 10.3658 +/- 0.24215
|
| 181 |
+
|
| 182 |
+
llama_perf_context_print: load time = 376.78 ms
|
| 183 |
+
llama_perf_context_print: prompt eval time = 11904.87 ms / 28672 tokens ( 0.42 ms per token, 2408.43 tokens per second)
|
| 184 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 185 |
+
llama_perf_context_print: total time = 12182.23 ms / 28673 tokens
|
| 186 |
+
llama_perf_context_print: graphs reused = 0
|
| 187 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20721 + ( 641 = 330 + 17 + 293) + 2743 |
|
| 189 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 190 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
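Each perplexity_*.txt log added for this variant ends with a "Final estimate" line like the one above. A small helper such as the following could tabulate them side by side; this is a hypothetical script, not something shipped in this repo or in llama.cpp, and the path simply mirrors the file names added in this commit:

import re
from pathlib import Path

# Matches lines such as: "Final estimate: PPL = 10.3658 +/- 0.24215"
PPL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path):
    """Return (ppl, uncertainty) from a llama-perplexity log, or None if not found."""
    for line in log_path.read_text(errors="ignore").splitlines():
        m = PPL_RE.search(line)
        if m:
            return float(m.group(1)), float(m.group(2))
    return None

variant = Path("Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K")
for corpus in ("code", "general", "math"):
    result = read_final_ppl(variant / f"perplexity_{corpus}.txt")
    if result is not None:
        ppl, err = result
        print(f"{corpus:8s} PPL = {ppl:.4f} +/- {err:.4f}")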
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,190 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21556 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 506 tensors from /mnt/world8/AI/Models/granite-4.0-h-1b-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 1b Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 1B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 1b
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 40
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 1536
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 4096
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,40] = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.007812
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.220000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 6.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 4096
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 3072
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 297 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 204 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 62 |
+
llama_model_loader: - type mxfp4: 4 tensors
|
| 63 |
+
print_info: file format = GGUF V3 (latest)
|
| 64 |
+
print_info: file type = MXFP4 MoE
|
| 65 |
+
print_info: file size = 1.37 GiB (8.06 BPW)
|
| 66 |
+
load: printing all EOG tokens:
|
| 67 |
+
load: - 100257 ('<|end_of_text|>')
|
| 68 |
+
load: - 100261 ('<|fim_pad|>')
|
| 69 |
+
load: special tokens cache size = 96
|
| 70 |
+
load: token to piece cache size = 0.6152 MB
|
| 71 |
+
print_info: arch = granitehybrid
|
| 72 |
+
print_info: vocab_only = 0
|
| 73 |
+
print_info: n_ctx_train = 1048576
|
| 74 |
+
print_info: n_embd = 1536
|
| 75 |
+
print_info: n_embd_inp = 1536
|
| 76 |
+
print_info: n_layer = 40
|
| 77 |
+
print_info: n_head = 12
|
| 78 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 79 |
+
print_info: n_rot = 128
|
| 80 |
+
print_info: n_swa = 0
|
| 81 |
+
print_info: is_swa_any = 0
|
| 82 |
+
print_info: n_embd_head_k = 128
|
| 83 |
+
print_info: n_embd_head_v = 128
|
| 84 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 86 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
|
| 87 |
+
print_info: f_norm_eps = 0.0e+00
|
| 88 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 89 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 90 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 91 |
+
print_info: f_logit_scale = 6.0e+00
|
| 92 |
+
print_info: f_attn_scale = 7.8e-03
|
| 93 |
+
print_info: n_ff = 4096
|
| 94 |
+
print_info: n_expert = 0
|
| 95 |
+
print_info: n_expert_used = 0
|
| 96 |
+
print_info: n_expert_groups = 0
|
| 97 |
+
print_info: n_group_used = 0
|
| 98 |
+
print_info: causal attn = 1
|
| 99 |
+
print_info: pooling type = 0
|
| 100 |
+
print_info: rope type = 0
|
| 101 |
+
print_info: rope scaling = linear
|
| 102 |
+
print_info: freq_base_train = 10000.0
|
| 103 |
+
print_info: freq_scale_train = 1
|
| 104 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 105 |
+
print_info: rope_finetuned = unknown
|
| 106 |
+
print_info: ssm_d_conv = 4
|
| 107 |
+
print_info: ssm_d_inner = 3072
|
| 108 |
+
print_info: ssm_d_state = 128
|
| 109 |
+
print_info: ssm_dt_rank = 48
|
| 110 |
+
print_info: ssm_n_group = 1
|
| 111 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 112 |
+
print_info: model type = 1B
|
| 113 |
+
print_info: model params = 1.46 B
|
| 114 |
+
print_info: general.name = Granite 4.0 H 1b Unsloth
|
| 115 |
+
print_info: f_embedding_scale = 12.000000
|
| 116 |
+
print_info: f_residual_scale = 0.220000
|
| 117 |
+
print_info: f_attention_scale = 0.007812
|
| 118 |
+
print_info: n_ff_shexp = 4096
|
| 119 |
+
print_info: vocab type = BPE
|
| 120 |
+
print_info: n_vocab = 100352
|
| 121 |
+
print_info: n_merges = 100000
|
| 122 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 125 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 126 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 127 |
+
print_info: LF token = 198 'Ċ'
|
| 128 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 129 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 130 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 131 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 133 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 134 |
+
print_info: max token length = 256
|
| 135 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 136 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 137 |
+
load_tensors: offloaded 20/41 layers to GPU
|
| 138 |
+
load_tensors: CPU_Mapped model buffer size = 743.85 MiB
|
| 139 |
+
load_tensors: CUDA0 model buffer size = 330.58 MiB
|
| 140 |
+
load_tensors: CUDA1 model buffer size = 330.58 MiB
|
| 141 |
+
................................................................................................
|
| 142 |
+
llama_context: constructing llama_context
|
| 143 |
+
llama_context: n_seq_max = 1
|
| 144 |
+
llama_context: n_ctx = 2048
|
| 145 |
+
llama_context: n_ctx_seq = 2048
|
| 146 |
+
llama_context: n_batch = 2048
|
| 147 |
+
llama_context: n_ubatch = 512
|
| 148 |
+
llama_context: causal_attn = 1
|
| 149 |
+
llama_context: flash_attn = auto
|
| 150 |
+
llama_context: kv_unified = false
|
| 151 |
+
llama_context: freq_base = 10000.0
|
| 152 |
+
llama_context: freq_scale = 1
|
| 153 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 154 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 155 |
+
llama_kv_cache: CPU KV buffer size = 8.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 157 |
+
llama_kv_cache: CUDA1 KV buffer size = 4.00 MiB
|
| 158 |
+
llama_kv_cache: size = 16.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 8.00 MiB, V (f16): 8.00 MiB
|
| 159 |
+
llama_memory_recurrent: CPU RS buffer size = 27.69 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 13.84 MiB
|
| 161 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 13.84 MiB
|
| 162 |
+
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
|
| 163 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 164 |
+
llama_context: CUDA0 compute buffer size = 281.69 MiB
|
| 165 |
+
llama_context: CUDA1 compute buffer size = 41.18 MiB
|
| 166 |
+
llama_context: CUDA_Host compute buffer size = 31.08 MiB
|
| 167 |
+
llama_context: graph nodes = 2295
|
| 168 |
+
llama_context: graph splits = 298 (with bs=512), 62 (with bs=1)
|
| 169 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 170 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 171 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 172 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 173 |
+
|
| 174 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 175 |
+
perplexity: tokenizing the input ..
|
| 176 |
+
perplexity: tokenization took 34.381 ms
|
| 177 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 178 |
+
perplexity: 0.99 seconds per pass - ETA 0.23 minutes
|
| 179 |
+
[1]6.5515,[2]7.5631,[3]7.4391,[4]7.5837,[5]7.8514,[6]7.8606,[7]7.9615,[8]7.7570,[9]7.8192,[10]7.8225,[11]8.0331,[12]8.0714,[13]8.1684,[14]8.1518,[15]8.0855,
|
| 180 |
+
Final estimate: PPL = 8.0855 +/- 0.18065
|
| 181 |
+
|
| 182 |
+
llama_perf_context_print: load time = 389.79 ms
|
| 183 |
+
llama_perf_context_print: prompt eval time = 12956.51 ms / 30720 tokens ( 0.42 ms per token, 2371.01 tokens per second)
|
| 184 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 185 |
+
llama_perf_context_print: total time = 13228.97 ms / 30721 tokens
|
| 186 |
+
llama_perf_context_print: graphs reused = 0
|
| 187 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20856 + ( 641 = 330 + 17 + 293) + 2608 |
|
| 189 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23116 + ( 389 = 330 + 17 + 41) + 618 |
|
| 190 |
+
llama_memory_breakdown_print: | - Host | 810 = 743 + 35 + 31 |
|
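For reference, the quantity these logs estimate is standard token-level perplexity over the N tokens evaluated in the 2048-token chunks; the formula below is the usual definition (not quoted from llama.cpp documentation), and the +/- value printed above appears to be the tool's uncertainty estimate on the final figure:

    \mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right),
    \qquad \text{bits/token} \;=\; \log_2 \mathrm{PPL}.

So the estimate of 8.0855 in this run corresponds to roughly \log_2 8.0855 \approx 3.02 bits per token.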
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|